# help
Dan McGreal:
Hi all. We're using lakeFS as part of a reinforcement learning application. One thing we'd like to improve is optimising our processing so that we're not re-processing the same data through the same code versions. For example, when a new branch is created with some changed/new data, we have an action that generates data but shouldn't re-process everything that has already been processed before, unless the processing code has changed. I wonder if there are any features of lakeFS, integrations, or patterns/tools in the lakeFS ecosystem we can use to build this, as we'd like to keep the code/maintenance burden on ourselves as small as possible.
Hey Dan, welcome to lakeFS! If I understand you correctly, you want to link your code and data versions: the code should only run on a branch if the data in that branch wasn't created by the same code version. The best way to link a code version and a data version is with `lakectl local`. The command stores the `git_commit_id` in lakeFS, so I guess you'd run the code if (pseudo):

```
lakefs_branch.commit_metadata.git_commit_id != git_branch.commit_id
```

Or, to be more precise, if the `diff` between `lakefs_branch.commit_metadata.git_commit_id` and `git_branch.commit_id` on the specific code path is 0 lines (assuming a git commit can contain many changes for different, irrelevant actions). Does that make sense?
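That decision can be sketched as a small helper. A minimal sketch, assuming you've already read the stored `git_commit_id` from the lakeFS branch's commit metadata and collected the changed paths from `git diff --name-only <stored>..<current>` — the function and parameter names here are made up for illustration:

```python
def should_reprocess(stored_git_commit, current_git_commit,
                     changed_paths, code_prefix):
    """Decide whether a lakeFS branch's data needs re-processing.

    stored_git_commit:  git_commit_id found in the lakeFS commit metadata
                        (None if the branch was never processed).
    current_git_commit: output of `git rev-parse HEAD`.
    changed_paths:      output of `git diff --name-only stored..current`.
    code_prefix:        path of the processing code inside the git repo.
    """
    if stored_git_commit is None:
        return True   # no record of a previous run: process everything
    if stored_git_commit == current_git_commit:
        return False  # exact same code version: skip
    # Commits differ, but only re-run if the processing code itself changed.
    return any(path.startswith(code_prefix) for path in changed_paths)
```

If `should_reprocess` returns False, your action can exit early and leave the branch's generated data untouched.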
btw, if working locally with the data isn't required, you can just add the `git_commit_id` to lakeFS when you commit data there, and the rest of the flow remains the same
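Attaching the commit id as metadata can be done with `lakectl commit --meta`. A small sketch that just builds the command to run (the repo/branch names and message are placeholders):

```python
def build_lakectl_commit(repo, branch, message, git_commit_id):
    # Attach the current git commit id as commit metadata in lakeFS,
    # so later runs can compare it against the checked-out code version.
    return [
        "lakectl", "commit", f"lakefs://{repo}/{branch}",
        "-m", message,
        "--meta", f"git_commit_id={git_commit_id}",
    ]

# git_commit_id would typically come from something like:
#   subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
cmd = build_lakectl_commit("my-repo", "main", "ingest new episodes", "abc123")
```

You'd pass the resulting list to `subprocess.run` (or just run the equivalent `lakectl` one-liner in your CI script).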
@Dan McGreal, you might also be interested in lakeFS actions, specifically the pre/post-create-branch hooks
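Such a hook is declared in a YAML file under `_lakefs_actions/` in the repository. A rough sketch of a `post-create-branch` action that calls out to a processing service — the action name, branch pattern, and webhook URL are placeholders, so check the lakeFS hooks docs for the exact schema:

```yaml
name: trigger-processing-on-new-branch
on:
  post-create-branch:
    branches: ["experiment-*"]
hooks:
  - id: start_processing
    type: webhook
    properties:
      url: https://processing.example.com/run
```

The webhook receiver could then apply the "has this code version already processed this data?" check before doing any work.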