# help
y
Hey @Angela Bovo, as of today, lakeFS does not offer data deduplication in the sense you are describing, so this is the expected result. The deduplication offered by lakeFS happens at the branch level. That is, creating a branch is a metadata-only operation, which causes no data to be copied.
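For illustration, here's a minimal sketch of what that looks like with the high-level `lakefs` Python SDK (repository and branch names are placeholders, and credentials are assumed to come from the usual lakectl/environment configuration):

```python
import lakefs  # high-level lakeFS Python SDK (pip install lakefs)

# "my-repo" is a placeholder repository name.
repo = lakefs.repository("my-repo")

# Creating a branch is a metadata-only operation: no objects are copied,
# the new branch simply points at the same commit as its source reference.
dev_branch = repo.branch("dev").create(source_reference="main")
```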
a
Thank you for the confirmation @Yoni Augarten. Does this already appear in the lakeFS roadmap? If not, may I suggest adding it?
y
As far as I know this is currently not in the roadmap, but @Oz Katz and @Tal Sofer can share more details about our future plans. That being said, I will try to come up with a solution that may get you closer to what you need. I will update here later today.
o
Hi @Angela Bovo! Indeed, there are no concrete plans to add object-level deduplication support to lakeFS. The reasoning for this:
1. Concurrency control - doing this safely is actually harder than it seems. I tried to illustrate an example in the diagram below (imagine steps 8 & 9 happening at the same time). While there are obviously ways to do this safely, they don't come for free and would impact both complexity and performance.
2. Product scope - lakeFS' goal is to allow users to version data at scale and implement best practices such as dev/test isolation from production, production safety, reproducibility and more. Deduplication, while great 🙂, is not in itself a goal of lakeFS.
3. Usefulness - in most scenarios, having the same object stored multiple times in lakeFS is redundant. The typical use case for copying the same exact data is isolation ("let me work on my own copy") - with lakeFS this is done using branches instead of copying, which is faster, more efficient - and yes, also inherently deduplicated.
That said - if there's a concrete use case that you believe we're missing, and that aligns with the goals of lakeFS, we'd be more than happy to reconsider!
i
Hi @Angela Bovo. As you might have noticed, the difference between my example and yours is that in my example, the same exact file was uploaded twice. In yours, the name of the file is the same, but the file content changed. lakeFS takes advantage of the fact that the object store is immutable and we do copy-on-write. So when a new version of the file is created, the next commit will point to that new version. This way, you can still benefit from the Git-like actions, plus the design advantages Oz mentioned. However, you don't have to store all these duplicated files forever. lakeFS GC (garbage collection) allows you to remove older versions from the object store. It's powerful since the GC is repository- and branch-specific, allowing you a lot of flexibility for configuring the retention policy for different data sets.
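As a rough sketch of what such retention rules look like (the field names follow my reading of the lakeFS GC rules format and should be checked against the docs for your version; branch names and day counts are placeholders):

```python
# Hedged sketch of repository GC retention rules: a default retention period
# plus per-branch overrides. These would be applied via lakectl or the lakeFS
# API; see the GC docs for the exact command in your version.
gc_rules = {
    "default_retention_days": 21,
    "branches": [
        {"branch_id": "main", "retention_days": 28},
        {"branch_id": "landing-zone", "retention_days": 7},
    ],
}
```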
a
No, that's not what happened in my example. As you can see in the screenshot, the file upload is called twice in the same cell. The file doesn't change in between these two calls. The content is the same.
i
Thanks for the clarification @Angela Bovo! The short answer is: a GC run will delete this extra copy as part of the uncommitted garbage collection process. A bit more context: since we can't tell, until the file is already uploaded, that it is in fact the exact same file (with the exact same hash), a copy of the newly uploaded file is stored in the bucket (that's why you see those output messages in your notebook). Since the file did not change, we won't actually commit it, and it will stay in the bucket until the next relevant GC run. (I think you need to wait 24 hours, since we want to make sure we don't delete files mid-process.) Does this help?
a
Thank you Iddo. I understand that this makes sense in the lakeFS workflow. Unfortunately, this is further from our use case than we'd like (a landing zone for raw data where we don't plan on deleting data, like, ever), so I'm not sure that we will end up using lakeFS after all.
i
Happy to jump on a call and discuss. You don’t need to delete any data. If you like, the GC can delete only the uncommitted files and you will achieve the deduplication you are looking for.
(I might be missing something, so feel free to explain).
a
Our wish is the following: scheduled data ingestion. At each run of the scheduler, obtain the current version of a file and store it in our landing zone. Commit regardless of whether the file has changed or not, so we can have date metadata about the state of our file at this point in time. Which leaves us with two possibilities:
• use lakeFS upload
  ◦ if the file has changed, everything goes fine
  ◦ if it hasn't:
    ▪︎ if we don't check anything, we upload it and therefore duplicate it, then we commit, so it stays duplicated
    ▪︎ if we check that it's the same, we could upload nothing, but then maybe we can't commit because nothing else has changed
• use lakeFS import
  ◦ would this mean re-importing the whole bucket at each run of our scheduled ingestion to know about the new files? If so, it sounds awfully inefficient
  ◦ if the file hasn't changed, maybe we can't commit because nothing else has changed
Now maybe I missed something, in which case I'd be glad for a correction! But so far, I don't think we can use lakeFS commits for our desired date metadata.
i
Hi Angela, this is not a correct description 🙂 Again, maybe a call would be better. If you upload with lakeFS, this will work OOTB for you (once you have garbage collection running).
Use lakeFS upload:
• If the file has changed, everything goes fine.
• If it hasn't:
  ◦ A duplicate version of the file is placed on the object store.
  ◦ When you commit, since the file is identical, lakeFS will NOT commit the new file (there is nothing you need to do on your side for this to happen).
  ◦ At the next GC cycle, the duplicated file (which was never committed, since it is the same file) will be deleted from the object store.
So, de facto, you will never have two copies of the same identical file after the GC run.
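A minimal sketch of that flow, assuming the high-level `lakefs` Python SDK (repository name and paths are placeholders):

```python
import lakefs

branch = lakefs.repository("my-repo").branch("main")

# First ingestion run: upload and commit as usual.
branch.object("landing/data.csv").upload(data=b"id,value\n1,42\n")
branch.commit(message="ingest data.csv")

# Second run with identical content: a new physical copy lands in the object
# store, but the committed state is unchanged, so that copy stays uncommitted
# and is removed by a later GC run.
branch.object("landing/data.csv").upload(data=b"id,value\n1,42\n")
for change in branch.uncommitted():
    print(change)  # expected: no effective change left to commit
```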
a
OK, so commits don’t look at the physical address at all, only at the file contents? There remains the problem of not being able to commit though 😞
i
You are able to commit…
Let me record a quick video and share it with you?
a
I mean if there aren’t any changes.
i
You will commit regularly. lakeFS will identify on its own that there are no changes.
a
Let’s say that my ingestion is scheduled every hour. Therefore, I would like a commit every hour as well. But if at one point no data has changed, lakeFS won’t let me commit, right? Then I’d have a time point with no metadata, same as if my ingestion had failed to run.
i
You can run a commit in your code, and it will come back with an output saying there are no uncommitted changes. If it is important for you to have a point-in-time snapshot per hour, you can add a tiny txt file that includes the date and time (which will obviously change in every commit).
So the scheduled ingestion will: 1. Create a timestamp file. 2. Upload. 3. Commit.
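A sketch of that scheduled run, again assuming the high-level `lakefs` Python SDK (repository name and object paths are placeholders):

```python
import datetime
import lakefs

def scheduled_ingestion(raw_bytes: bytes) -> None:
    branch = lakefs.repository("my-repo").branch("landing-zone")

    # 1. Create a timestamp file so there is always at least one change.
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    branch.object("landing/_ingested_at.txt").upload(data=now.encode())

    # 2. Upload the raw data (deduplicated at commit time if unchanged).
    branch.object("landing/data.csv").upload(data=raw_bytes)

    # 3. Commit; this always succeeds because the timestamp file changed.
    branch.commit(message=f"scheduled ingestion at {now}")
```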
a
Hi @Angela Bovo, would adding an "allow empty *commit*" flag to CommitCreation work for you? I think that makes a lot of sense: `git commit` supports both `--allow-empty` and `--allow-empty-message`; we only support the latter.
a
That would make sense to me as well, if there is no technical reason that kept you from doing it so far.
o
@Ariel Shaqed (Scolnicov) good idea! Can you open an issue?
sunglasses lakefs 1
a
Opened #7042. If you can give a thumbs up 👍 on the issue or comment on it, it may help.
🤘 1
👍 1
l
I wanted to let you know that this is now in progress. We'll keep you updated once it's done 🙂
👍 1
sunglasses lakefs 1
i
@Angela Bovo
👍 1
e
@Angela Bovo The empty commit option was released as part of v1.6 a couple of weeks ago.
🤘 1
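For anyone landing here later, a hedged sketch of using that option with the generated `lakefs_sdk` Python client (endpoint and credentials are placeholders, and the exact field name, assumed here to be `allow_empty`, should be checked against the API docs for your version):

```python
import lakefs_sdk
from lakefs_sdk.client import LakeFSClient

# Endpoint and credentials are placeholders.
configuration = lakefs_sdk.Configuration(
    host="https://lakefs.example.com/api/v1",
    username="<access-key-id>",
    password="<secret-access-key>",
)
client = LakeFSClient(configuration)

# Assumption: CommitCreation gained an allow_empty flag in the v1.6 release,
# so the commit goes through even when nothing changed since the last one.
client.commits_api.commit(
    repository="my-repo",
    branch="landing-zone",
    commit_creation=lakefs_sdk.CommitCreation(
        message="scheduled ingestion (no data changes)",
        allow_empty=True,
    ),
)
```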