# help
m
Good morning everyone, I have some questions about the diff concept in lakeFS. With git, if you run
git diff sha1 sha2
you might get, for example, +3 added. If you do it the other way around
git diff sha2 sha1
you get -3 removed. I was expecting the same behavior with lakeFS, but this is not the case. I tagged an old commit, and then I ran a diff against the head (the UI and the Python SDK give the same results). One way gives me +3 added, the other way around gives 0 changes. Is this expected behavior? I’m posting the two screenshots in the 🧵. Have an excellent day!
One way
the other way
If it’s not expected behavior, feel free to tell me, and I will create a bug ticket. I just didn’t want to spam the issue tracker on GitHub if it’s expected.
a
Hi @Mickaël Lacour, Welcome to the lake! You may be comparing apples to oranges in this diff example (please excuse the pun). git has 2 kinds of diff, sometimes known as "two dot" and "three dot" diffs. This web UI shows you the three-dot diff: what you'd get if you merged compare into base. It shows all the changes from commits that are in compare but not in base. And I think you're probably after the two-dot diff: all the differences between the two tags.
So the UI is really saying what will happen when you click the "Merge" button.
In lakectl the two-dot diff is called "two way" (it ignores the common base); the default is the three-dot diff, which is computed relative to the common base. So you can try lakectl to see those diffs:
❯ lakectl diff --two-way lakefs://quickstart/main lakefs://quickstart/branch
Left ref: lakefs://quickstart/main
Right ref: lakefs://quickstart/branch
+ added foo

❯ lakectl diff --two-way lakefs://quickstart/branch lakefs://quickstart/main
Left ref: lakefs://quickstart/branch
Right ref: lakefs://quickstart/main
- removed foo
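To make the difference between the two kinds of diff concrete, here is a toy model in Python. It is only a sketch and not how lakeFS computes diffs: snapshots are plain dicts, and the names `two_way_diff`, `merge_diff`, `tag` and `head` are hypothetical. It reproduces the situation from the question, where an old tag is an ancestor of the head.

```python
def diff(old, new):
    """Describe how to turn snapshot `old` into snapshot `new` as {path: change}."""
    changes = {}
    for path in sorted(set(old) | set(new)):
        if path not in old:
            changes[path] = "added"
        elif path not in new:
            changes[path] = "removed"
        elif old[path] != new[path]:
            changes[path] = "changed"
    return changes


def two_way_diff(left, right):
    # Direct comparison of the two refs, ignoring any common ancestor.
    # Swapping the arguments flips "added" and "removed".
    return diff(left, right)


def merge_diff(base, compare, common_ancestor):
    # What merging `compare` into `base` would bring in: only the changes
    # made on `compare` since the common ancestor. `base` itself is not
    # inspected here; it only matters through the choice of ancestor.
    return diff(common_ancestor, compare)


# Hypothetical history: `tag` is an old commit and `head` adds 3 files on top of it,
# so the common ancestor of the two is the tag itself.
tag = {"a.jpg": "v1"}
head = {"a.jpg": "v1", "b.jpg": "v1", "c.jpg": "v1", "d.jpg": "v1"}

tag_to_head = two_way_diff(tag, head)    # 3 entries, all "added"
head_to_tag = two_way_diff(head, tag)    # 3 entries, all "removed"
merge_head_into_tag = merge_diff(tag, head, common_ancestor=tag)  # 3 "added"
merge_tag_into_head = merge_diff(head, tag, common_ancestor=tag)  # empty
```

In this model the merge-style diff is asymmetric by design: merging an ancestor into its descendant brings nothing new, which matches the "0 changes" result seen in one direction.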
If it helps: the lakeFS web UI is a lot like the GitHub web UI. For instance I'm looking at branch prfd-pr-details right now, here. It says:
This branch is 9 commits ahead of, 4 commits behind master.
It's a bit hard (for me 👓 ) to see, but there are TWO different links in there, one in each direction.
m
Oh OK, I see, I see... Shame on me 😄 You are right with the pun! 😄 Don’t worry. Thanks a lot for your fast and complete answer! I really appreciate the team support! I will see how to get the two-way diff with the lakeFS Python SDK! Thank you again Ariel!!!
a
Sure thing! GUIs are always hard, need to find a balance between usefulness and usability. If the commits included were listed prominently, would that have helped at all?
m
Depending on the use case it could help. I would have to compute it myself (not difficult :D) to get my info, but that is totally doable. But don’t worry, I will find another way of getting the info. My use case:
• We have an external process that automatically deletes PII inside datasets (which must comply with a schema) on the main branch (other branches have a TTL and are automatically deleted after 30 days).
• We have a DAG that automatically runs training, but the first step of the DAG is to make sure that the dataset is still useful (number of media, for example). If not, the DAG triggers a rebuild of a new version of the dataset; otherwise it continues as before.
I wanted to explore the diff provided by lakeFS to avoid scanning the media folder. The diff between my head and the main branch would have given me, for example, -234 media. So it’s not a huge loss if I don’t have it, because today we already have this metric, but tomorrow we will also be checking that the dataset is still not biased (even if the number of deletions is low). So I will need to parse the remaining data anyway. Don’t worry! Thanks a lot for your input, I really appreciate it!
And I don’t want to use a hook to trigger the DAG to rebuild the dataset and relaunch the training (at least for now).
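The check described above could be sketched roughly like this in Python. Everything here is an assumption for illustration: the diff entries are modeled as plain dicts with `type` and `path` keys, and the `media/` prefix, the function names and the thresholds are hypothetical.

```python
def count_removed(diff_entries, prefix="media/"):
    """Count diff entries marked as removed whose path lies under `prefix`."""
    return sum(
        1
        for entry in diff_entries
        if entry["type"] == "removed" and entry["path"].startswith(prefix)
    )


def needs_rebuild(media_before, removed, min_media):
    """Return True when the dataset has fallen below the minimum number of media."""
    return media_before - removed < min_media


# Hypothetical diff between the dataset's head and the main branch.
entries = [
    {"type": "removed", "path": "media/0001.jpg"},
    {"type": "removed", "path": "media/0002.jpg"},
    {"type": "added", "path": "media/0003.jpg"},
    {"type": "removed", "path": "schema/old.json"},
]

removed = count_removed(entries)
rebuild = needs_rebuild(media_before=10, removed=removed, min_media=9)
```

Only removals under the media prefix are counted, so the first DAG step can decide without scanning the media folder. Any later bias check would still need to read the remaining data.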
a
If I understand correctly, the issue is that you need to see this information in the lakeFS UI. I can see the value and believe this is a worthwhile feature request. (I cannot commit to it, we will need to consider the added GUI complexity.)
m
No, I’m not using the GUI for that, I’m using the lakeFS Python package. I wanted to automate something depending on a diff between two versions.
a
Okay, then you should have that API available, for committed data at least. Look for the type parameter of diffRefs
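As a rough illustration of where that `type` parameter goes, the sketch below only builds the request URL for diffRefs and does not call any server. The endpoint path and the accepted values `two_dot` and `three_dot` are written from memory of the lakeFS OpenAPI spec and should be checked against the API reference for your version, including which value gives the diff you want. The helper name `diff_refs_url` and the base URL are hypothetical.

```python
from urllib.parse import quote, urlencode

# Values assumed to be accepted by the `type` query parameter of diffRefs.
ALLOWED_TYPES = ("two_dot", "three_dot")


def diff_refs_url(base_url, repository, left_ref, right_ref, diff_type):
    """Build the (assumed) diffRefs URL for two refs with an explicit diff type."""
    if diff_type not in ALLOWED_TYPES:
        raise ValueError(f"diff_type must be one of {ALLOWED_TYPES}, got {diff_type!r}")
    # Percent-encode each path segment so refs containing "/" stay in one segment.
    path = "/repositories/{}/refs/{}/diff/{}".format(
        quote(repository, safe=""),
        quote(left_ref, safe=""),
        quote(right_ref, safe=""),
    )
    return base_url.rstrip("/") + path + "?" + urlencode({"type": diff_type})


# Hypothetical local installation and the quickstart refs used earlier in the thread.
url = diff_refs_url("http://localhost:8000/api/v1", "quickstart", "main", "branch", "two_dot")
```

The generated Python client should expose the same parameter on its diff call, so in practice you would pass `type` there and not build URLs by hand.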
m
yeap, I’m doing that right now, looking at changes. Thx !