
Vino

11/10/2022, 6:14 PM
Hi, I'm trying to understand if lakeFS has "git add"-like functionality that lets me choose which changes go into a commit. That way I can discard some changes and commit only a portion of the changes in the data.

Oz Katz

11/10/2022, 6:16 PM
Ha! @Itai Admi @Barak Amar and I were discussing this exact use case earlier today 🙂
is there a specific use case in mind? a scenario where you believe this would be really useful?

Vino

11/10/2022, 6:25 PM
Ahh, that's interesting. Here is what I'm doing: while using lakeFS for a typical data-ingestion use case, I copy the raw data into a staging area (staging-branch/raw/dt=2022-10-11/sample-data.json), do data exploration, run transformations and aggregations, and write the aggregated data to the staging branch (staging-branch/analytics/sampledata-by-country-parquet). Now I run tests on /analytics and need to merge only the /analytics dir into my prod. After the merge, I'd delete the staging-branch. In the data teams I've worked with, the best practice was not to work on data in the ingress location directly; we always copied the data to be processed into a staging area. So I tried to simulate the same for a lakeFS demo and ran into this requirement.

Oz Katz

11/10/2022, 7:05 PM
Ah I see. That makes sense! would you mind adding this context to the issue?
👍 1

Ariel Shaqed (Scolnicov)

11/10/2022, 7:45 PM
My current preferred workaround is to check out another branch, write to that one, commit, and merge back. This lets me modify some objects, check them, and commit them all at once. But this method is probably more suitable for an automated "data engineering" workflow than for any human-in-the-loop "data science" workflow. In any case I certainly see the appeal! I think that by now "why don't we do this?" is a FAQ with no good answer. I'd really like a resolution of 2512 one way or the other!
:heart_lakefs: 1
👍 1
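
To make the branch-based workaround from this thread concrete, here is a minimal sketch applied to the /analytics case. It is a toy in-memory model written for illustration, not the lakeFS API: `ToyRepo`, `promote_prefix`, the temporary branch name `promote-tmp`, and the file name `part-0.parquet` are all hypothetical. Branches are plain dictionaries, so writing and committing are collapsed into one step, and the merge has no conflict detection.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Toy stand-in for a versioned object store. NOT the lakeFS API."""
    # branch name -> {object path -> object content}
    branches: dict = field(default_factory=dict)

    def create_branch(self, name, source):
        # A new branch starts as a copy of its source branch.
        self.branches[name] = dict(self.branches[source])

    def put(self, branch, path, data):
        # Assumption: a write is committed immediately (no staging step).
        self.branches[branch][path] = data

    def merge(self, source, dest):
        # Simplified merge: source objects overwrite dest; deletions and
        # conflicts are ignored.
        self.branches[dest].update(self.branches[source])

    def delete_branch(self, name):
        del self.branches[name]


def promote_prefix(repo, staging, prod, prefix, tmp="promote-tmp"):
    """Branch from prod, copy only objects under `prefix`, merge back."""
    repo.create_branch(tmp, source=prod)
    for path, data in repo.branches[staging].items():
        if path.startswith(prefix):
            repo.put(tmp, path, data)
    repo.merge(tmp, prod)
    repo.delete_branch(tmp)


# Usage: only analytics/ reaches prod; raw/ stays behind on staging.
repo = ToyRepo({"prod": {}})
repo.create_branch("staging-branch", source="prod")
repo.put("staging-branch", "raw/dt=2022-10-11/sample-data.json", b"{}")
repo.put("staging-branch",
         "analytics/sampledata-by-country-parquet/part-0.parquet", b"rows")
promote_prefix(repo, "staging-branch", "prod", "analytics/")
repo.delete_branch("staging-branch")
print(sorted(repo.branches["prod"]))
```

The design point is that the selection happens when choosing what to write to the temporary branch, so the merge itself stays an ordinary whole-branch merge.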