
Manoj Babu

04/17/2023, 3:59 PM
Hi all, just thinking about how lakeFS handles merges. At the moment, lakeFS has predefined merge strategies (source wins or dest wins) that don't require the user's involvement. Is it possible to provide a merge request feature in the UI, like the one the Git CLI or hosted services provide? This might be needed if the user wants to pick some changes from the source and some from the dest, at least for protected branches. I understand that showing a diff on huge data files (TABLES) is a harder problem than on non-data files (FILES). However, we have collections of both data files and non-data files versioned in lakeFS. So, getting to the basics, here's my perspective: when diffing FILES, showing the diff as a file clearly works. When diffing TABLES, maybe we should show the diff as a table itself, so users can interactively select the changes from source or destination using SQL and add them to the merge commit. I hope data volume would not be much of a problem; we could leverage Apache Arrow or any in-memory data storage system to address it. Let me know your thoughts on this. Thanks
Correct me if I'm wrong.

Oz Katz

04/17/2023, 7:03 PM
Thanks for the feedback @Manoj Babu! Insightful as always 🙂 As lakeFS goes deeper into the meaning of objects and begins supporting table formats (as done for Delta Lake and as discussed on #iceberg-integration), we'll indeed need a way to introduce more flexible merge behaviors as well. As a starting point, I believe we'll introduce something along the lines of a dataset primitive to lakeFS (see the original proposal for Delta support under "defining datasets"). This would allow adding custom behaviors to different types of data. Adding full SQL support is tricky, however, since it involves rewriting compressed, possibly very large sets of data, which could become a scalability bottleneck. We'd probably have to integrate with some external execution engine (Spark or similar) to avoid reinventing the wheel.

Manoj Babu

04/17/2023, 7:58 PM
You're right. Measuring the diff between two refs of a dataset would invite an execution engine to the party when we want row-level granularity and need to see exactly what changed. Just wondering about use cases where an engineer would need to introspect at that level. At least one case that comes to mind is some sort of logic change. But this can still be handled with the existing strategies, as long as the engineer is clear on which ref to choose. In that case there would not be a situation where one needs to pick changes from both sides. Even if a situation arises where one needs to hand-pick changes from both refs, we can compute the diff and store it in an orphaned branch (to avoid recomputation) as a diff dataset, something like data-diff (https://github.com/datafold/data-diff). And on the UX side, one can leverage SQL (maybe DataFusion, https://github.com/apache/arrow-datafusion, can come to the rescue) to pick the changes.
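A minimal sketch of the "diff dataset" idea above, assuming small in-memory tables with a unique key column. It uses plain Python rather than data-diff, DataFusion, or any lakeFS API, and the names `row_diff` and `apply_picks` are hypothetical. It is a two-way comparison only, so without a common base it cannot tell an addition on one side from a deletion on the other.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def row_diff(source_rows: List[Row], dest_rows: List[Row], key: str) -> List[Dict[str, Any]]:
    """Compare two versions of a table row by row, matching rows on `key`."""
    src = {row[key]: row for row in source_rows}
    dst = {row[key]: row for row in dest_rows}
    diff = []
    for k in sorted(src.keys() | dst.keys()):
        s: Optional[Row] = src.get(k)
        d: Optional[Row] = dst.get(k)
        if s == d:
            continue  # identical on both refs, nothing to record
        if d is None:
            change = "only_in_source"
        elif s is None:
            change = "only_in_dest"
        else:
            change = "modified"
        diff.append({"key": k, "change": change, "source": s, "dest": d})
    return diff


def apply_picks(dest_rows: List[Row], diff: List[Dict[str, Any]],
                picks: Dict[Any, str], key: str) -> List[Row]:
    """Build the merged table. `picks` maps a row key to "source" or "dest";
    rows that are not picked keep the dest version (an assumption of this sketch)."""
    result = {row[key]: row for row in dest_rows}
    for entry in diff:
        side = picks.get(entry["key"], "dest")
        chosen = entry[side]
        if chosen is None:
            result.pop(entry["key"], None)  # the picked side has no such row
        else:
            result[entry["key"]] = chosen
    return [result[k] for k in sorted(result)]


# Toy data: two refs of the same table, keyed by "id".
source = [{"id": 1, "price": 10}, {"id": 2, "price": 25}, {"id": 4, "price": 40}]
dest = [{"id": 1, "price": 10}, {"id": 2, "price": 20}, {"id": 3, "price": 30}]

diff = row_diff(source, dest, key="id")
merged = apply_picks(dest, diff, picks={2: "source", 4: "source"}, key="id")
```

Here `diff` plays the role of the stored diff dataset, and `picks` stands in for whatever the user would select through SQL or a UI.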

Ariel Shaqed (Scolnicov)

04/17/2023, 8:10 PM
My 2 agorot... I agree that interactivity is misleading here: I would expect to be able to use a tool to resolve my conflicts. And I don't want to have to define how "interactive" that is. This is even orthogonal to table formats. Now Git itself has very limited support for conflict resolution. What I get in fact during conflict resolution is a staging area (for successfully merged portions) and a working area that is in a "conflicted" state. Everything else (edit manually, git add, git merge --continue) is just porcelain. As a result, Git doesn't care about what kind of conflict I had. GitHub and GitLab, in this comparison, do care about conflict resolution. The interactive workflow that they offer for conflict resolution in PRs is actually quite limited, and I don't know anyone who uses it. Perhaps I don't know enough people. So on lakeFS-as-Git, I see 2 options, and each will allow us to get a conflict resolution flow.
Option A: I think we could start by being Git here. It's a bit harder because lakeFS has only a staging area and cannot have a working tree. But we could still offer this partial merge result as a commit (and do whatever with the conflicts, just mark them)! A Git staging area is just a commit that is not linked to any other commit. We could do the same: a failed merge could still generate a result into an unlinked commit (or just a metarange, but commits are supported by the lakeFS API), and then return the digest of that commit. We could even give that commit a staging area if we wanted to work there! This commit would magically mark conflicts, say as metadata. Resolving a merge is a process on top of lakeFS that somehow re-merges the objects from the merge source and destination, and writes to the staging area. Eventually, all conflicts are resolved, and we can "complete" the merge. It's actually just a re-merge, using the staging area as the source.
Option B: Don't bother 🙂. Say my merge from S to D has a conflict. I branch out of S to S2, and merge D into S2 with a strategy that makes D win. Now I look at S versus S2, and wherever there's a conflict I do "something". I commit the result into S2, and merge S2 into D.
I am not sure how option A is more powerful than option B. Perhaps we need to automate option B as part of lakeFS-as-GitLab...?
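The Option B flow can be sketched with a toy in-memory model, assuming a branch is just a mapping from path to content digest. Nothing here is the lakeFS API: `three_way_merge`, `option_b`, and the `resolve` callback are hypothetical names, and the merge logic is a simplified three-way merge against a known common base.

```python
from typing import Callable, Dict, Optional, Set, Tuple

Tree = Dict[str, str]  # path -> content digest (toy stand-in for a branch head)


def three_way_merge(base: Tree, source: Tree, dest: Tree,
                    strategy: str) -> Tuple[Tree, Set[str]]:
    """Merge `source` into `dest` relative to `base`.
    `strategy` is "source-wins" or "dest-wins" and only applies to conflicts.
    Returns the merged tree and the set of paths that conflicted."""
    merged: Tree = {}
    conflicts: Set[str] = set()
    for path in sorted(base.keys() | source.keys() | dest.keys()):
        b, s, d = base.get(path), source.get(path), dest.get(path)
        if s == d:
            value = s            # both sides agree
        elif s == b:
            value = d            # only dest changed
        elif d == b:
            value = s            # only source changed
        else:
            conflicts.add(path)  # both changed, differently
            value = s if strategy == "source-wins" else d
        if value is not None:    # None means the path was deleted
            merged[path] = value
    return merged, conflicts


def option_b(base: Tree, s: Tree, d: Tree,
             resolve: Callable[[str, Optional[str], Optional[str]], Optional[str]]
             ) -> Tuple[Tree, Set[str]]:
    """Option B: branch S2 from S, merge D into S2 with D winning,
    fix up the conflicted paths on S2, then merge S2 into D."""
    s2 = dict(s)  # branch S2 out of S
    s2, conflicts = three_way_merge(base, source=d, dest=s2, strategy="source-wins")
    for path in sorted(conflicts):
        choice = resolve(path, s.get(path), d.get(path))  # the "something" step
        if choice is None:
            s2.pop(path, None)
        else:
            s2[path] = choice
    # S2 already contains D, so D is now the base and this merge cannot conflict.
    new_d, leftover = three_way_merge(base=d, source=s2, dest=d, strategy="source-wins")
    assert not leftover
    return new_d, conflicts


base = {"a.csv": "a1", "b.csv": "b1", "c.csv": "c1"}
branch_s = {"a.csv": "a2", "b.csv": "b2", "c.csv": "c1"}  # S changed a and b
branch_d = {"a.csv": "a3", "b.csv": "b1", "c.csv": "c2"}  # D changed a and c

# Hypothetical resolver: keep S's version for every conflicted path.
new_d, conflicted = option_b(base, branch_s, branch_d,
                             resolve=lambda path, s_val, d_val: s_val)
```

In this model the set of conflicted paths comes back from the first merge. Recording that same set as metadata on an unlinked commit would be the toy equivalent of Option A.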

Manoj Babu

04/17/2023, 9:11 PM
@Ariel Shaqed (Scolnicov) Thanks for the great explanation. Yes, I agree with what you've said about the GitHub/GitLab PR merge feature. Not sure if people really use it, though. From what you've said, maybe lakeFS can provide a difftool and a mergetool, but I'm not so sure it needs to be aware of the underlying table formats or data file formats. Of the options you've mentioned, Option B seems simpler, as Option A can go several levels deep in recursive resolution and mess up the staging area. Option B, on the other hand, is something innovative, and automating it can be of great advantage to the user (subject to the number of requests), while keeping the options open, limited only by the creativity of the user.

Ariel Shaqed (Scolnicov)

04/18/2023, 6:54 AM
Thanks for the great points! I really hope we'll be advancing on these matters in 2023. Buckle up, I hope it gets interesting...! 🎢