John Zielke

03/17/2023, 12:33 PM
Hello, has there been any discussion on how content-based data deduplication across unrelated branches could be achieved? I.e., if a user uploads a file that is identical to the current one, or to one that already exists in a different branch/repository, the file would not get stored twice; the upload would instead just reference the existing object. I know this would introduce some issues, such as needing the whole file to be uploaded before its hash can be determined (or you would have to implement a protocol that prevents users from retrieving files in other repositories simply by claiming to upload a file with the same hash), so this is why I'm asking here

Iddo Avneri

03/17/2023, 1:36 PM
Hi @John Zielke - thanks for bringing up the discussion. A few notes / clarifications: If a file is changed or uploaded in more than one branch, there can be a conflict on merge. Conflict resolution can follow a simple strategy (source or destination wins), or you might prefer to fail the merge and fix things manually. Having said that, a common practice is to have a protected branch, with any type of transformation (ingestion / ETL) executed in separate branches. While this doesn't avoid all chances of a conflict, it typically helps a lot, especially if you design your system so that one dataset changes per branch (for example, one Delta table changes per branch, and then everything is merged together); see the sketch below. Finally, regarding the data "deduplication" (although in this case it is more of a "cleanup"): once you run Garbage Collection (GC), it will take care of any leftover files that are no longer referenced by any commit, according to your GC configuration. HTH
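
(For illustration, a minimal sketch of the "one dataset per branch" pattern described above. `client` is a hypothetical wrapper and the method names are placeholders, not the actual lakeFS SDK API.)

```python
# Sketch only: `client` is a hypothetical wrapper around a lakeFS-like API;
# create_branch / write_dataset / commit / merge are placeholder names.
DATASETS = ["orders", "customers", "inventory"]

def run_pipeline(client, repo: str):
    for dataset in DATASETS:
        # One dataset is changed per branch, so merges back into the
        # protected branch rarely touch the same paths and rarely conflict.
        branch = f"etl-{dataset}"
        client.create_branch(repo, branch, source="main")
        client.write_dataset(repo, branch, path=f"tables/{dataset}/")
        client.commit(repo, branch, message=f"refresh {dataset}")
        # If a conflict does happen, a simple strategy (source wins /
        # destination wins) or a manual fix resolves it at merge time.
        client.merge(repo, source=branch, destination="main")
```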

Oz Katz

03/17/2023, 3:34 PM
Thanks @Iddo Avneri, but I think perhaps the question is "could lakeFS implement transparent deduplication by storing objects by their hash?" - if I understood @John Zielke correctly 🙂 If so, the answer is that very early releases of lakeFS actually worked that way! This, however, introduced 2 very hard problems:
1. Writers had to know the hash of the object before writing (or alternatively, lakeFS had to compute the hash server-side while buffering the write) - both resulting in scalability issues.
2. Garbage collection was much more complicated: lakeFS would have to reference-count every individual object and only delete objects that are no longer referenced, which is complex, slow, and introduces a whole number of race conditions that could otherwise be avoided.
After analyzing the repository structure of several early installations, we came to the conclusion that for the most part, copy-on-write branches (as implemented in current versions of lakeFS) were a great solution that resulted in less copying of data, and the remaining duplications were scarce - rendering this additional complexity not worth it. That being said, if you have a concrete use case or idea where this makes sense, I'd love to hear it! 🤩
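
(For illustration only: a toy, in-memory sketch of "store objects by their hash with reference counting". This is not lakeFS code; it just shows the two pain points above - the full hash must be known before the object can be addressed, and deletion needs a check-then-delete that races with concurrent writers.)

```python
import hashlib


class ContentAddressedStore:
    """Toy content-addressed store: objects are keyed by their content hash,
    so identical content is stored only once."""

    def __init__(self):
        self._blobs = {}      # hash -> bytes
        self._refcount = {}   # hash -> number of references across branches

    def put(self, data: bytes) -> str:
        # The writer (or the server, while buffering the upload) must hash
        # the entire object before it can be addressed - the scalability
        # problem mentioned above.
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._blobs:
            self._blobs[digest] = data
        self._refcount[digest] = self._refcount.get(digest, 0) + 1
        return digest

    def unref(self, digest: str) -> None:
        # Deleting is only safe once nothing references the blob anymore.
        # With concurrent writers, this check-then-delete is a classic race:
        # a new reference can appear between the check and the delete.
        self._refcount[digest] -= 1
        if self._refcount[digest] == 0:
            del self._blobs[digest]
            del self._refcount[digest]
```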

John Zielke

03/21/2023, 9:47 AM
Thank you for the reply @Oz Katz, this is about what I expected 🙂 The hash calculation during upload in particular is a challenge. Would this be easier if it were just about "overwriting" an existing file? AFAIK, if you upload the same file twice (say you have a data pipeline that reproducibly creates files), it will be stored as a change even though the file has not changed, right? I was wondering if that would be a useful use case. When uploading a file, the hash calculation could either happen in memory or, what would probably be more general, after the file is uploaded. If the checksum of the new file matches the current one, the "new" object could simply be deleted and the reference kept the same.
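
(To make the idea concrete, a hedged sketch: compare the checksum of the data about to be written with the checksum of the object already at that path, and skip the write if they match. `client.stat_object` and `client.upload_object` are hypothetical placeholder calls, and the sketch assumes the stored checksum is a plain SHA-256 of the content, which a real backend does not necessarily guarantee.)

```python
import hashlib


def upload_if_changed(client, repo: str, branch: str, path: str, data: bytes) -> bool:
    """Upload `data` only if it differs from what is already stored at `path`.

    `client` is a hypothetical wrapper; `stat_object` is assumed to return
    an object whose `.checksum` is the SHA-256 of the stored content, or
    None if the path does not exist yet.
    """
    new_checksum = hashlib.sha256(data).hexdigest()
    existing = client.stat_object(repo, branch, path)
    if existing is not None and existing.checksum == new_checksum:
        # Identical content: keep the existing reference, store nothing new.
        return False
    client.upload_object(repo, branch, path, data)
    return True
```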

Oz Katz

03/21/2023, 10:41 AM
That's a fair point @John Zielke! I wonder how common of a scenario this is. Apache Spark (and I assume other Hadoop-y systems as well) will add a unique job ID to the file names that will be different each time, so actually overwriting the same object is pretty rare for those cases. Do you have a concrete use case in mind where this will be helpful?