# help
y
Hey @Angela Bovo, as of today, lakeFS does not offer data deduplication in the sense you are describing, so this is the expected result. The deduplication offered by lakeFS happens at the branch level. That is, creating a branch is a metadata-only operation, which causes no data to be copied.
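For illustration, here's a minimal sketch of what that looks like with the high-level `lakefs` Python SDK (repository and branch names are placeholders, and credentials are assumed to come from the usual lakectl/environment configuration):

```python
import lakefs  # high-level lakeFS Python SDK (pip install lakefs)

# "my-repo" is a placeholder repository name.
repo = lakefs.repository("my-repo")

# Creating a branch is a metadata-only operation: no objects are copied,
# the new branch simply points at the same commit as its source reference.
dev_branch = repo.branch("dev").create(source_reference="main")
```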
a
Thank you for the confirmation @Yoni Augarten. Does this already appear in the lakeFS roadmap? If not, may I suggest adding it?
y
As far as I know this is currently not in the roadmap, but @Oz Katz and @Tal Sofer can share more details about our future plans. That being said, I will try to come up with a solution that may get you closer to what you need. I will update here later today.
o
Hi @Angela Bovo! Indeed, there are no concrete plans to add object-level deduplication support to lakeFS. The reasoning for this:
1. Concurrency control - doing this safely is actually harder than it seems. I tried to illustrate an example in the diagram below (imagine steps 8 & 9 happening at the same time). While there are obviously ways to do this safely, they don't come for free and would impact both complexity and performance.
2. Product scope - lakeFS' goal is to allow users to version data at scale and implement best practices such as dev/test isolation from production, production safety, reproducibility and more. Deduplication, while great 🙂, is not in itself a goal of lakeFS.
3. Usefulness - in most scenarios, having the same object stored multiple times in lakeFS is redundant. The typical use case for copying the same exact data is isolation ("let me work on my own copy") - with lakeFS this is done using branches instead of copying, which is faster, more efficient - and yes, also inherently deduplicated.
That said - if there's a concrete use case that you believe we're missing, and that aligns with the goals of lakeFS, we'd be more than happy to reconsider!
i
Hi @Angela Bovo. As you might have noticed, the difference between my example and yours is that in my example, the same exact file was uploaded twice. In yours, the name of the file is the same, but the file content changed. lakeFS takes advantage of the fact that the object store is immutable and we do copy-on-write. So when a new version of the file is created, the next commit will point to that new version. This way, you can still benefit from the Git-like actions, plus the design advantages Oz mentioned. However, you don't have to store all these duplicated files forever. lakeFS GC (garbage collection) allows you to remove older versions from the object store. It's powerful since the GC is repository- and branch-specific, allowing you a lot of flexibility for configuring the retention policy for different data sets.
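As a rough sketch of what such retention rules look like (the field names follow my reading of the lakeFS GC rules format and should be checked against the docs for your version; branch names and day counts are placeholders):

```python
# Hedged sketch of repository GC retention rules: a default retention period
# plus per-branch overrides. These would be applied via lakectl or the lakeFS
# API; see the GC docs for the exact command in your version.
gc_rules = {
    "default_retention_days": 21,
    "branches": [
        {"branch_id": "main", "retention_days": 28},
        {"branch_id": "landing-zone", "retention_days": 7},
    ],
}
```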
a
No, that's not what happened in my example. As you can see in the screenshot, the file upload is called twice in the same cell. The file doesn't change in between these two calls. The content is the same.
i
Thanks for the clarification @Angela Bovo! The short answer is: a GC run will delete this extra copy as part of the uncommitted garbage collection process. A bit more context: since we can't tell, until the file is already uploaded, that it is in fact the exact same file (with the exact same hash), a copy of the newly uploaded file is stored in the bucket (that's why you see those output messages in your notebook). Since the file did not change, we won't actually commit it, and it will stay in the bucket until the next relevant GC run. (I think you need to wait 24 hours, since we want to make sure we don't delete files mid-process.) Does this help?
a
Thank you Iddo. I understand that this makes sense in the lakeFS workflow. Unfortunately, this is further from our use case than we'd like (a landing zone for raw data where we don't plan on deleting data, like, ever), so I'm not sure that we will end up using lakeFS after all.
i
Happy to jump on a call and discuss. You don’t need to delete any data. If you like, the GC can delete only the uncommitted files and you will achieve the deduplication you are looking for.
(I might be missing something, so feel free to explain).
a
Our wish is the following: scheduled data ingestion. At each run of the scheduler, obtain the current version of a file and store it in our landing zone. Commit regardless of whether the file has changed or not, so we can have date metadata about the state of our file at this point in time. Which leaves us with two possibilities:
• use lakeFS upload
  ◦ if the file has changed, everything goes fine
  ◦ if it hasn't:
    ▪︎ if we don't check anything, we upload it and therefore duplicate it, then we commit, so it stays duplicated
    ▪︎ if we check that it's the same, we could upload nothing, but then maybe we can't commit because nothing else has changed
• use lakeFS import
  ◦ would this mean re-importing the whole bucket at each run of our scheduled ingestion to know about the new files? If so, it sounds awfully inefficient
  ◦ if the file hasn't changed, maybe we can't commit because nothing else has changed
Now maybe I missed something, in which case I'd be glad for a correction! But so far, I don't think we can use lakeFS commits for our desired date metadata.
i
Hi Angela, this is not a correct description 🙂 Again, maybe a call would be better. If you upload with lakeFS, this will work OOTB for you (once you have garbage collection running).
Use lakeFS upload:
• If the file has changed, everything goes fine.
• If it hasn't:
  ◦ A duplicate version of the file is placed on the object store.
  ◦ When you commit, since the file is identical, lakeFS will NOT commit the new file (there is nothing you need to do on your side for this to happen).
  ◦ At the next GC cycle, the duplicated file (which was never committed, since it is the same file) will be deleted from the object store.
So, de facto, you will never have two copies of the same identical file after the GC run.
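A minimal sketch of that flow, assuming the high-level `lakefs` Python SDK (repository name and paths are placeholders):

```python
import lakefs

branch = lakefs.repository("my-repo").branch("main")

# First ingestion run: upload and commit as usual.
branch.object("landing/data.csv").upload(data=b"id,value\n1,42\n")
branch.commit(message="ingest data.csv")

# Second run with identical content: a new physical copy lands in the object
# store, but the committed state is unchanged, so that copy stays uncommitted
# and is removed by a later GC run.
branch.object("landing/data.csv").upload(data=b"id,value\n1,42\n")
for change in branch.uncommitted():
    print(change)  # expected: no effective change left to commit
```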
a
OK, so commits don’t look at the physical address at all, only at the file contents? There remains the problem of not being able to commit though 😞
i
You are able to commit…
Let me record a quick video and share it with you?
a
I mean if there aren’t any changes.
i
You will commit regularly. lakeFS will identify on its own that there are no changes.
a
Let’s say that my ingestion is scheduled every hour. Therefore, I would like a commit every hour as well. But if at one point no data has changed, lakeFS won’t let me commit, right? Then I’d have a time point with no metadata, same as if my ingestion had failed to run.
i
You can run a commit in your code, and it will come back with an output saying there are no uncommitted changes. If it is important for you to have a point-in-time snapshot per hour, you can add a tiny txt file that includes the date and time (which will obviously change in every commit).
So the scheduled ingestion will: 1. Create a timestamp file. 2. Upload. 3. Commit.
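A sketch of that scheduled run, again assuming the high-level `lakefs` Python SDK (repository name and object paths are placeholders):

```python
import datetime
import lakefs

def scheduled_ingestion(raw_bytes: bytes) -> None:
    branch = lakefs.repository("my-repo").branch("landing-zone")

    # 1. Create a timestamp file so there is always at least one change.
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    branch.object("landing/_ingested_at.txt").upload(data=now.encode())

    # 2. Upload the raw data (deduplicated at commit time if unchanged).
    branch.object("landing/data.csv").upload(data=raw_bytes)

    # 3. Commit; this always succeeds because the timestamp file changed.
    branch.commit(message=f"scheduled ingestion at {now}")
```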
a
Hi @Angela Bovo, would adding an "allow empty *commit*" flag to CommitCreation work for you? I think that makes a lot of sense: `git commit` supports both `--allow-empty` and `--allow-empty-message`; we only support the latter.
a
That would make sense to me as well, if there is no technical reason that kept you from doing it so far.
o
@Ariel Shaqed (Scolnicov) good idea! Can you open an issue?
sunglasses lakefs 1
a
Opened #7042. If you can give a thumbs up 👍 on the issue or comment on it, it may help.
🤘 1
👍 1
l
I wanted to let you know that this is now in progress. We'll keep you updated once it's done 🙂
👍 1
sunglasses lakefs 1
i
@Angela Bovo
👍 1
e
@Angela Bovo The empty commit option was released as part of v1.6 a couple of weeks ago.
🤘 1
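For anyone landing here later, a hedged sketch of using that option with the generated `lakefs_sdk` Python client (endpoint and credentials are placeholders, and the exact field name, assumed here to be `allow_empty`, should be checked against the API docs for your version):

```python
import lakefs_sdk
from lakefs_sdk.client import LakeFSClient

# Endpoint and credentials are placeholders.
configuration = lakefs_sdk.Configuration(
    host="https://lakefs.example.com/api/v1",
    username="<access-key-id>",
    password="<secret-access-key>",
)
client = LakeFSClient(configuration)

# Assumption: CommitCreation gained an allow_empty flag in the v1.6 release,
# so the commit goes through even when nothing changed since the last one.
client.commits_api.commit(
    repository="my-repo",
    branch="landing-zone",
    commit_creation=lakefs_sdk.CommitCreation(
        message="scheduled ingestion (no data changes)",
        allow_empty=True,
    ),
)
```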