# help
a
Hey all, we're trying to build a system where the `.lakefs_ref.yml` file generated by `lakectl local` is part of a git repository, and the git repo is the source of truth. This results in a 3-step commit process:
1. commit `.lakefs_ref.yml` from feature branch to main
2. update `.lakefs_ref.yml` with a commit from main
3. merge the git feature branch to git main
A couple of solutions we're thinking about:
1. Merge to git main while `.lakefs_ref.yml` still points to a feature branch. Essentially forget that lakefs has a main branch, and treat it as a chain of commits.
2. Do the above, but have a post-merge action on git main that will merge the lakefs feature branch to the lakefs main branch.
This would be much easier to reason about if lakefs had a fast-forward merge strategy and we didn't need a merge commit. Are there any other workflows recommended for this? Open to all alternatives, thanks!
a
Hi @Aayush Bhasin, either of your options will work, of course, as will many other possibilities. FF merges are #3316 and have so far seen few requests from users; could you add your use-case on that issue, please, to help us track actual requirements for this feature? I guess what you want to do depends on your end goal. I think we had this blog post, and also this howto.
a
Thanks Ariel! Our end goal is to have the state of our data tracked alongside git commits, so the code <> data pairing can be reproduced at any point. We're already using lakefs local; I used the links you provided to set it up initially. I think the confusion is about what to do with lakefs branches (the guide ends with a commit to a lakefs feature branch recorded in the `.lakefs_ref.yml` file in the git repo). We'd like to reference lakefs main in the git repo, and we'd like to merge the lakefs feature branch to main when the git feature branch merges to git main. Alternatively, if this is an anti-pattern and we're using lakefs incorrectly or sub-optimally, that would also be great to know!
a
What you're doing is great! I would additionally incorporate the lakeFS reference inside the commit directly, using a format that you control. Personally, I would never reference a branch name in a long-lived entity. Branch names are mutable, and the same commit belongs to multiple branches. I would use the commit digest, as you do. Because I really like normalised data sources, I would probably keep the entire lake_ref mapping only as an optimization but not as the source of truth. My source of truth would be a commit digest. (There is not universal agreement here, probably not even within the lakeFS core team. Test accordingly.)
lakeFS lets you attach user metadata to a commit. The web UI shows this, and you can even get clickable links. For instance, our Airflow integration uses this to link DAG runs to lakeFS commits.
Unfortunately I don't know how to do this in Git. AFAIK best practice is to use a specially formatted comment, or hold the data in a file with a known name.
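A minimal sketch of the "specially formatted comment" idea on the Git side: record the lakeFS commit digest as a trailer line at the end of the Git commit message, and read it back later. The trailer key `lakeFS-Commit` and both helper names are hypothetical, and the pattern assumes lakeFS commit IDs are 64-character lowercase hex digests.

```python
import re
from typing import Optional

# Hypothetical trailer key; any stable, greppable name would work.
TRAILER_KEY = "lakeFS-Commit"

# Assumption: a lakeFS commit ID is a 64-character lowercase hex digest.
_TRAILER_RE = re.compile(
    rf"^{re.escape(TRAILER_KEY)}: ([0-9a-f]{{64}})$", re.MULTILINE
)


def add_lakefs_trailer(message: str, lakefs_commit: str) -> str:
    """Append the lakeFS commit digest to a Git commit message as a trailer."""
    return f"{message.rstrip()}\n\n{TRAILER_KEY}: {lakefs_commit}\n"


def read_lakefs_trailer(message: str) -> Optional[str]:
    """Return the lakeFS digest recorded in a Git commit message, or None."""
    match = _TRAILER_RE.search(message)
    return match.group(1) if match else None
```

The same digest could also go the other way, as user metadata on the lakeFS commit, so that each side points at the other.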
a
Thanks for the responses! Just to connect the dots here, I had asked about this previously in https://lakefs.slack.com/archives/C016726JLJW/p1725749506199239 and had a conversation with Oz about how we might approach this. I'm not sure I fully understand your latest suggestion, Ariel. Is the implication that we should not worry too much about the LakeFS branches, and primarily focus on the LakeFS commit hashes?
> best practice is to use a specially formatted comment, or hold the data in a file with a known name
Could the `.lakefs_ref.yaml` file serve this purpose?
> I would probably keep the entire lake_ref mapping only as an optimization but not as the source of truth. My source of truth would be a commit digest.
Not sure I fully understand this. Are you referring to a git commit digest or a LakeFS commit digest? Our current thinking had been to consider the LakeFS commit hash in the lakefs_ref file as the "source of truth" for what the code at that point in the git history produced.
a
• lakeFS branches never identify a commit or a version of data. Exactly like Git branches.
• Use a digest to identify a version of data, or an immutable (in the sense of your business logic) tag. Again, exactly like you would do in Git.
• So... tie Git and lakeFS by storing digests from the one in the other.
• lake_ref is great! But it is necessarily tied to working in a particular way: using lakectl local. This is your call to make! If you ever decide to use something else, say Databricks or Spark or fsspec / lakefs-spec or anything else, probably distributed, you might want to step away.
Coming from dvc, you're obviously closest to lakectl local. It is definitely your closest fit, and provides nearly a drop-in replacement! It might very well fit your business requirements for the foreseeable future. You can always migrate further into distributed versioning in future.
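To make the "digest, not branch" point concrete, here is a small sketch that rewrites a `lakefs://<repo>/<ref>/<path>` URI so that it points at an immutable commit instead of a mutable branch. The helper names are hypothetical, and the digest check assumes lakeFS commit IDs are 64-character lowercase hex strings.

```python
import re

PREFIX = "lakefs://"


def is_commit_digest(ref: str) -> bool:
    """Assumption: a lakeFS commit ID is a 64-character lowercase hex digest."""
    return re.fullmatch(r"[0-9a-f]{64}", ref) is not None


def pin_to_commit(uri: str, commit_id: str) -> str:
    """Replace the ref in lakefs://<repo>/<ref>/<path> with a commit digest."""
    if not is_commit_digest(commit_id):
        raise ValueError(f"not a commit digest: {commit_id!r}")
    if not uri.startswith(PREFIX):
        raise ValueError(f"not a lakeFS URI: {uri!r}")
    repo, _, rest = uri[len(PREFIX):].partition("/")
    ref, sep, path = rest.partition("/")
    if not repo or not ref:
        raise ValueError(f"URI must include a repository and a ref: {uri!r}")
    # The old ref (for example a branch name) is dropped; only the digest is stored.
    return f"{PREFIX}{repo}/{commit_id}{sep}{path}"
```

Storing the pinned URI in the Git repo keeps the reference stable even if the branch it came from later moves or is deleted.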
a
Just fyi (if it's helpful) - we settled on a technique to merge non-main branches to git. We have a required test on each PR that checks the new commit forms a linear history from main (i.e. main is in the commit's chain), and a pre-merge hook that merges the commit to lakefs main before merging the PR in git. We have tooling that ensures that new branches are created from lakefs main (this last bit wouldn't be necessary if we had fast forward merges). I added some short explanation to the feature request ticket you linked earlier, will also link this slack thread so there is more context. Thanks for your help and pointers, very much appreciated and we look forward to using lakefs for more use cases!!
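The "main is in the commit's chain" test described above can be sketched as an ancestor check over the commit graph. This is only an illustration: the `parents` mapping (commit ID to parent IDs) is a hypothetical stand-in for whatever the commit log returns, and the function name is made up. On the Git side the equivalent check is `git merge-base --is-ancestor`.

```python
from typing import Dict, List


def is_ancestor(parents: Dict[str, List[str]], ancestor: str, commit: str) -> bool:
    """Return True if `ancestor` is reachable from `commit` via parent links."""
    stack = [commit]
    seen = set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        # Commits missing from the mapping are treated as having no parents.
        stack.extend(parents.get(current, []))
    return False
```

A PR check would then pass only when the current head of main is an ancestor of the commit recorded in the PR, which means the merge to main cannot pull in unrelated history.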