# help
u
Hi, I recently started using lakeFS with the Python client, and I am a little confused about how to do change detection. My use case is a set of data transformation tasks. Each task processes some files in a directory and should commit only the changed files. I tried globbing all files in a dir, uploading them, and committing. Even though some files are unchanged, the commit still lists those files as part of the commit, which affects the downstream tasks. Is there a way to detect which files have changed in lakeFS, so that the last commit only includes changed files?
u
Hi Yaphet, welcome to the lake! Two questions in order to understand your case: 1) Do you commit the files to the same branch? 2) Which command did you run in order to see the changes in your commit? List objects or diff commits?
u
Hi Idan, 1. Yep, files are committed to the same branch. 2. My approach was to upload all files in the local dir (these also exist in the remote branch), of which only a few have changed. Calling LakeFSClient.branches.diff_branch(repository=repo, branch=branch).results returns only those that have changed. After this I committed to the same branch, but when I go to that commit's files, all the files appear, including the unchanged ones.
u
files = get_all_files_list("./datasets/input-1", "*")
client = init_client()
repo = 'repo-1'
branch = 'develop'
diff = upload(files, client, repo, branch)
print(diff)
commit_result = commit(client, repo, branch, 'no files should change ?')
print(commit_result)
the console result
[] --- > print(diff)
{'committer': 'admin',
 'creation_date': 1658163909,
 'id': '18766368e6fb24a4cf91fdef13f18551de2d0c1850b2fa653895fb795d4b343e',
 'message': 'no files should change ?',
 'meta_range_id': '',
 'metadata': {'using': 'python_api'},
 'parents': ['7f245c5efa55ae9fb71489d8ddc398b9af94bc27488522c29d1177e72423b0ad']} ----> print(commit_result)
Upload method snippet :
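from lakefs_client.client import LakeFSClient

def upload(files: list, client: LakeFSClient, repo: str, branch: str):
    # Upload every local file, changed or not, under its local path.
    for f in files:
        with open(f, 'rb') as stream:
            client.objects.upload_object(repository=repo, branch=branch, path=f, content=stream)
    # Uncommitted changes on the branch; empty if nothing actually changed.
    return client.branches.diff_branch(repository=repo, branch=branch).results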
commit method snippet:
from lakefs_client import models

def commit(client: LakeFSClient, repo, branch, message):
    commit_creation = models.CommitCreation(message=message, metadata={'using': 'python_api'})
    return client.commits.commit(
        repository=repo,
        branch=branch,
        commit_creation=commit_creation,
    )
u
So maybe this is OK and it's supposed to work like this, since I issue an upload of all files before the commit? But is there a way to see whether what I am trying to upload differs before committing, or some way to make sure a commit contains only the files that have changed?
u
Hey Yaphet, If you see files as part of a commit, it means they didn't exist in the branch your commit was committed to. You can see if your upload contained any changes by getting the uncommitted changes your branch contains (which is the result of your upload function)
u
    for f in files:
        with open(f, 'rb') as stream:
            client.objects.upload_object(repository=repo, branch=branch, path=f, content=stream)
    return client.branches.diff_branch(repository=repo, branch=branch).results
this guy right ?
u
Yep, that returns empty as expected, but the commit has the files.
u
I see now what you meant. lakeFS commits are like git commits: they work like "pointers" to a state of the repository. Therefore, a commit points to all the files that were in the repository when it was committed, not only to the ones that changed. To get what you are after, you have to use the output of the diff_branch command. Hope that makes sense.
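A minimal sketch of that suggestion: collect the diff_branch output before committing (after the commit the branch has no uncommitted changes left) and skip the commit when nothing changed. The helper names `changed_paths` and `commit_if_changed` are hypothetical, and the sketch assumes each diff entry exposes `type` and `path` attributes and that `client` is the same LakeFSClient used in the snippets above.

```python
from collections import defaultdict


def changed_paths(client, repo, branch):
    """Group the branch's uncommitted changes by change type.

    Assumes each entry in diff_branch(...).results has `type` and `path`.
    """
    grouped = defaultdict(list)
    results = client.branches.diff_branch(repository=repo, branch=branch).results
    for entry in results:
        grouped[entry.type].append(entry.path)
    return dict(grouped)


def commit_if_changed(client, repo, branch, commit_creation):
    """Commit only if there are uncommitted changes.

    Returns (changes, commit_result); commit_result is None when the
    commit was skipped because the diff was empty.
    """
    # Read the diff first: once committed, the branch diff is empty again.
    changes = changed_paths(client, repo, branch)
    if not changes:
        return changes, None
    result = client.commits.commit(
        repository=repo,
        branch=branch,
        commit_creation=commit_creation,
    )
    return changes, result
```

The returned `changes` mapping could then be handed to downstream tasks instead of having them list the commit's files.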
u
BTW, you can see the difference between refs (commits/tags/branches) on the Compare tab in the UI.
u
Got it. Does it accept commit IDs or just branch names?
u
It accepts both 🙂 In git, tags and branches are just another representation of commits, so a branch is basically a pointer to the last commit you pushed to it. I found this playlist very interesting to watch; it helped me learn a lot about how git works.
u
So basically you can compare any two refs: tag to commit, commit to commit, commit to branch, etc.
u
Neat, yep. Doing something like:
ref_1 = "05bc970748c06490ef9fc3c9571d1164ffb846a4d8d933be547ee3c7fdf7ea44"
ref_2 = "99eaf85194401e3e948f32f66a244e1ef1667becc56271a59f115c11bae20b80"
diff = client.refs.diff_refs(repo, ref_2, ref_1)
this worked
u
🤘🏽
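Building on the diff_refs call above, here is a hedged sketch of how a downstream task might list only what a given commit changed, by diffing the commit against its first parent (the `parents` field shown in the commit output earlier in the thread). The function name `files_changed_in_commit` is hypothetical; the sketch assumes the client's `commits.get_commit` and `refs.diff_refs` accept the keyword arguments shown, and it reads only the first page of results, so very large diffs may need paging.

```python
def files_changed_in_commit(client, repo, commit_id):
    """Return (type, path) pairs for what a commit changed vs. its first parent.

    Returns None when the commit has no parent, since there is nothing to
    diff against and the caller has to decide how to treat that case.
    """
    commit = client.commits.get_commit(repository=repo, commit_id=commit_id)
    if not commit.parents:
        return None
    diff = client.refs.diff_refs(
        repository=repo,
        left_ref=commit.parents[0],  # state before the commit
        right_ref=commit_id,         # state after the commit
    )
    return [(entry.type, entry.path) for entry in diff.results]
```

For the commit in the thread whose uncommitted diff was empty, this would be expected to return an empty list, even though listing the commit's objects shows every file.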