# help
c
Any idea why I might be observing different behavior between the GUI and
lakectl
when performing multiple imports? Let's say I have S3 bucket 1 with cat.jpg at

s3://cat_bucket/cat.jpg

and S3 bucket 2 with dog.jpg at

s3://dog_bucket/dog.jpg

In GUI:
• I import cat.jpg to the main branch, merge _main_imported
• Then I import dog.jpg to the main branch, merge _main_imported
Output (expected): cat.jpg and dog.jpg are now both in main

In CLI:
• lakectl import --from s3://cat_bucket/ --to lakefs://repo/main/ and manually merge _main_imported
• lakectl import --from s3://dog_bucket/ --to lakefs://repo/main/
Now, when I go to compare branches, I can see that cat.jpg would be removed if I merge.
Output (not expected): only dog.jpg is now in main
a
@Conor Simmons What you experienced with the CLI is the correct behavior. I was testing this myself last week. Let me test the GUI again.
@Conor Simmons What version of lakeFS are you using?
c
I'm using version 0.92.0
If the CLI is the correct behavior... is there any way to do what I want, like in the GUI? I want to be able to import from multiple S3 buckets/paths/prefixes and combine those imports into a central repository
a
Have you tried multiple CLI imports without merging first? Once you have imported all the files from the multiple S3 buckets/paths/prefixes, then merge to main
c
I just tried it and it has the same issue
Do you intend to support this?
I want to be able to import from multiple S3 buckets/paths/prefixes and combine those imports into a central repository
a
Let me check a few things. I was able to accomplish this requirement using our Python API.
c
Ok, thanks. I didn't know there was import functionality in the Python API
a
There is no Import API in Python currently (we are working on that), but I wrote some extra code to make it work.
Or you can use the stage_object API in Python or lakectl fs stage if you know the file size (or get the file size from S3), which might be an easier route.
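If you do need to look the size up from S3 first, here is a minimal sketch using boto3 (get_object_size is just an illustrative helper name, not part of any lakeFS client; the bucket and key are the ones from the example above):

import boto3

def get_object_size(bucket, key):
    # HEAD the object and return its size in bytes.
    s3 = boto3.client("s3")
    return s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

size_bytes = get_object_size("cat_bucket", "cat.jpg")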
c
Ah ok, that could work; I didn't know about that. It's very similar to import, right?
a
Yes. If you need Python code, I can provide an example using the stage_object API.
c
I do have access to the size in bytes - that shouldn't be an issue
Sure, it wouldn't hurt
a
Let me test it again to make sure that it will meet your requirements. Give me 15-30 minutes to test.
c
Thanks, no rush
a
@Conor Simmons I tested with stage_object and it worked. You can stage a single file or a full folder. Here is the code (make sure to pass the real file size in bytes to the object_stage function):
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient
from lakefs_client.model.object_stage_creation import ObjectStageCreation
from lakefs_client.model.object_user_metadata import ObjectUserMetadata

LAKEFS_ENDPOINT = "lakefs endpoint"
LAKEFS_ACCESS_KEY = "xxxxx"
LAKEFS_SECRET_KEY = "xxxxxx"

configuration = lakefs_client.Configuration()
configuration.username = LAKEFS_ACCESS_KEY
configuration.password = LAKEFS_SECRET_KEY
configuration.host = LAKEFS_ENDPOINT
client = LakeFSClient(configuration)

def list_files(repo_name, branch):
    # Print the paths of all objects currently on the given branch.
    api_response = client.objects.list_objects(repo_name, branch)
    print([obj["path"] for obj in api_response.results])

def object_stage(source_uri, size_bytes):
    # Build the staging request for one object. size_bytes is the
    # object's real size in bytes (e.g. fetched from S3).
    return ObjectStageCreation(
        physical_address=source_uri,
        checksum="",
        size_bytes=size_bytes,
        mtime=1,  # placeholder modification time
        metadata=ObjectUserMetadata(  # optional
            key="version: 1.0",
        ),
        content_type="",
    )

def stage_objects(repo_name, import_branch, source_uri, path, size_bytes):
    # Stage a single object on the import branch, then commit it.
    object_stage_creation = object_stage(source_uri, size_bytes)
    print(object_stage_creation)
    try:
        api_response = client.objects.stage_object(
            repo_name, import_branch, path, object_stage_creation)
        print(api_response)
    except lakefs_client.ApiException as e:
        print("Exception when calling objects->stage_object: %s\n" % e)

    client.commits.commit(
        repository=repo_name,
        branch=import_branch,
        commit_creation=models.CommitCreation(
            message='v1.0',
            metadata={'version': '1.0'}))

repo = "my-repo"
import_branch = "_main_imported"
source_branch = "main"

client.branches.create_branch(
    repository=repo,
    branch_creation=models.BranchCreation(
        name=import_branch,
        source=source_branch))

# A URI on the object store to import from.
source_uri = "s3://sample-dog-images/n02085620-Chihuahua/n02085620_10074.jpg"
path = "dogs/n02085620_10074.jpg"

stage_objects(repo, import_branch, source_uri, path, size_bytes=1)  # pass the real size in bytes
list_files(repo, import_branch)

# A second object, staged onto the same import branch.
source_uri = "s3://sample-dog-images/n02085620-Chihuahua/n02085620_10131.jpg"
path = "dogs/n02085620_10131.jpg"

stage_objects(repo, import_branch, source_uri, path, size_bytes=1)  # pass the real size in bytes
list_files(repo, import_branch)

# Merge the import branch back into main.
client.refs.merge_into_branch(
    repository=repo,
    source_ref=import_branch,
    destination_branch=source_branch)
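The script above stages individual files. For the "full folder" case mentioned earlier, one possible approach (a sketch, assuming boto3 and the stage_objects helper above; stage_prefix and dest_prefix are hypothetical names) is to list the objects under an S3 prefix and stage each one:

import boto3

def stage_prefix(repo_name, import_branch, bucket, prefix, dest_prefix):
    # Stage every object under an S3 prefix, reusing stage_objects from above.
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            stage_objects(
                repo_name,
                import_branch,
                f"s3://{bucket}/{key}",
                dest_prefix + key[len(prefix):],
                obj["Size"])  # real size in bytes, straight from the listing

stage_prefix(repo, import_branch, "sample-dog-images", "n02085620-Chihuahua/", "dogs/")

Note that stage_objects as written commits after every object; for a large folder you would probably stage everything first and commit once at the end.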
c
Thank you @Amit Kesarwani! I think this will work
a
👍 Good luck