# help
Hi, We are evaluating LakeFS for a new Data Lake on GCP. However, reading the docs I've noticed that garbage collection is not ready yet for this platform/storage.
Hi @Juarez Rudsatz 🙂 Yes, you are correct, but it’s definitely on our roadmap. Would you be willing to share your use case for using lakeFS? (e.g. for dev env, CI/CD for data, something else?) And of course, if you need any further assistance please let us know…
I've hit send before finishing the full message (below): We are evaluating LakeFS for a new Data Lake on GCP. However, reading the docs I've noticed that garbage collection is not ready yet for this platform/storage. So I couldn't find more information about: 1. Is there any workaround for dealing with storage growth until GC is ready? 2. Would deleting the work/merged branches solve or minimize the size growth? 3. By how much will it increase the size of the data lake?
So there’s a workaround to do GC: you can use GC’s mark-only mode, which will generate a Parquet file under
STORAGE_NAMESPACE/_lakefs/retention/gc/addresses/mark_id=<MARK_ID>/
Then you can use your own Spark job to read the Parquet file and delete the objects at the marked addresses specified in it (they will be relative to your storage namespace, i.e. the location that you initialized your repo at).
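For context, here's a minimal PySpark sketch of what that "sweep" job could look like. The bucket path, mark ID, and the `address` column name are assumptions for illustration; check the GC output schema for your lakeFS version before running anything like this against production data.

```python
# Sketch of the sweep half of the mark-only workaround, assuming:
#   - the mark-only run wrote Parquet with an `address` column holding
#     object keys relative to the storage namespace (verify against your
#     lakeFS version's docs), and
#   - your Spark cluster has the GCS connector configured for gs:// paths.
from pyspark.sql import SparkSession
from google.cloud import storage

STORAGE_NAMESPACE = "gs://example-bucket/my-repo"  # hypothetical repo location
MARK_ID = "example-mark-id"                        # from the mark-only GC run

spark = SparkSession.builder.appName("lakefs-gc-sweep").getOrCreate()

# Read the addresses that the mark-only run flagged for deletion.
marked = spark.read.parquet(
    f"{STORAGE_NAMESPACE}/_lakefs/retention/gc/addresses/mark_id={MARK_ID}/"
).select("address")

bucket_name, prefix = STORAGE_NAMESPACE.removeprefix("gs://").split("/", 1)

def delete_partition(rows):
    # One GCS client per partition; addresses are relative to the namespace.
    bucket = storage.Client().bucket(bucket_name)
    for row in rows:
        bucket.blob(f"{prefix}/{row['address']}").delete()

# Delete in parallel on the executors instead of collecting to the driver.
marked.foreachPartition(delete_partition)
```

Using foreachPartition keeps the address list distributed, which matters if the marked set is large.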
By how much will it increase the size of the data lake?
I’m not too sure of that as it depends on the amount of data you write and change…
Adding a few points regarding question number 3: since lakeFS deduplicates the data over time, the increase in storage is typically not significant. Furthermore, if you are also planning to use lakeFS for its rollback capabilities, it actually reduces storage use (you don’t have to snapshot the same objects over and over again). For those reasons, it is common for our users to start looking into GC only six months to a year after adopting the solution in production.