# help
**Mohamed Azghari**
Hello there, I've been exploring lakeFS for a few days and I like many of its features. I'm using it with MinIO as the S3 backend. When I commit a new file, let's call it "data.csv", lakeFS stores it as an object in MinIO with a size of 100 MiB. But when I update the CSV file, adding 30 MiB of data for example, and commit again, it creates another object of 130 MiB and keeps the old one in MinIO. I'm wondering if lakeFS has a mechanism to store only the delta in the second object and read from both objects at once (incremental restore)? Or is there any other mechanism to optimize storage in the backend? Thank you in advance.
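For reference, a minimal sketch of roughly what I'm doing (assuming lakeFS's S3 gateway with boto3; the endpoint, credentials, repo, and branch names here are placeholders):

```python
import boto3

# lakeFS exposes an S3-compatible gateway, so boto3 can point at it.
# Endpoint, credentials, repo ("my-repo") and branch ("main") are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",       # lakeFS server, not MinIO directly
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",   # lakeFS access key
    aws_secret_access_key="wJalrXUtnFEMI...",   # lakeFS secret key
)

# First version: upload ~100 MiB data.csv to the "main" branch,
# then commit, e.g. with: lakectl commit lakefs://my-repo/main -m "add data.csv"
s3.upload_file("data.csv", "my-repo", "main/data.csv")

# Later: append ~30 MiB locally, re-upload, commit again.
# After this commit, MinIO holds a second, full 130 MiB object.
s3.upload_file("data.csv", "my-repo", "main/data.csv")
# lakectl commit lakefs://my-repo/main -m "update data.csv"
```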
**e**
lakeFS has no such capability. The lakeFS versioning engine implements a copy-on-write mechanism: it creates a new, complete object for each new version. You can run GC (garbage collection) to delete versions you no longer need. If you are looking to manage deltas, you can do that using the Delta Lake or Iceberg table formats, and you can still use lakeFS on top of them to version a repository with many datasets.
You can check this blog post for more details:
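If it helps, here is a minimal sketch of the Delta Lake approach (assuming the delta-rs `deltalake` Python package writing through lakeFS's S3 gateway; the endpoint, credentials, repo, branch, and table path are placeholders, not a definitive setup):

```python
import pandas as pd
from deltalake import write_deltalake

# Storage options routing delta-rs through the lakeFS S3 gateway
# (placeholder endpoint and credentials).
storage_options = {
    "AWS_ENDPOINT_URL": "http://localhost:8000",
    "AWS_ACCESS_KEY_ID": "AKIAIOSFODNN7EXAMPLE",
    "AWS_SECRET_ACCESS_KEY": "wJalrXUtnFEMI...",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",  # needed by delta-rs without a locking provider
}

# In the S3 gateway, the bucket is the repo and the first path segment is the branch.
table_uri = "s3://my-repo/main/events"

# Initial load: Delta writes Parquet data files plus a transaction log.
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
write_deltalake(table_uri, df, storage_options=storage_options)

# Later update: append mode only adds new data files next to the old ones,
# so the backend stores the delta rather than a second full copy.
more = pd.DataFrame({"id": [4, 5], "value": ["d", "e"]})
write_deltalake(table_uri, more, mode="append", storage_options=storage_options)
```

Appends add new Parquet files instead of rewriting the whole object, and lakeFS branches and commits still apply to the table as a whole.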
**Mohamed Azghari**
Thank you for your guidance! Appreciate it!
**Iddo Avneri**
@Mohamed Azghari - This blog explains the underlying file representation in lakeFS.
**Mohamed Azghari**
Thank you @Iddo Avneri!!