# help
p
Good evening! I have been trying to deploy lakeFS with a GCS bucket and am dealing with an issue where files that I uploaded but subsequently deleted (without committing) aren't being hard deleted from the repo. I can't seem to trigger a delete through GC either. Is there a way to generate a list of underlying objects that aren't associated with any existing commits? I'm new to the tool, so apologies if I am misunderstanding the architecture here. Thank you!
i
Hi @Pratheek Rebala, welcome to lakeFS! Garbage collection in lakeFS is hard - with a file being part of many commits and branches, understanding what can be deleted is tricky. We do have a garbage collection tool for you; unfortunately, it only works with S3 at this time. Moreover, uncommitted garbage collection is yet to be implemented. What are your plans for lakeFS? If you'll expand more on the use case, the community may provide you with different ways to tackle it 🙂
p
Thanks for the detailed explanation! Some background... we're an investigative news org and I am evaluating whether we can use lakeFS for our data lake. We work with a lot of unstructured data (e.g. PDFs, images, etc.), and our processing jobs usually generate a ton of intermediate data which we do not want to store. With our old workflow, we just write the staging files to the bucket and issue deletes after the run is complete. Is there a workflow with lakeFS where this is possible? Also, is there any way to fetch a list of dangling objects in the underlying file system?
i
When deleting an uncommitted object from lakeFS, the object will still remain in the physical object storage, but it will never be reachable from lakeFS itself. In order to delete it from the storage itself ("hard delete"), you would have to fetch that list of dangling objects and delete them, which is exactly the functionality described in issue 1933. Unfortunately, I can't think of an easy win for fetching that list.
p
This makes sense! So the delete is basically a metadata-only operation at the moment?
i
Exactly
p
Got it! Do you think I'd be able to just iterate through the commits to get a list of objects and subsequently delete objects in the underlying storage that aren't in that list?
i
It all depends on your scale. Every cleanup would have to iterate through all the commits, each one containing all the objects in your repo at a certain point in time, so it scales pretty badly. I wonder about the importance of the deletes - is it compliance that you're worried about?
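For the record, the commit-scan approach discussed above boils down to a set difference: collect every object key referenced by any commit, then subtract that set from a listing of the physical bucket. The sketch below uses toy in-memory data; in practice the two listings would come from the lakeFS API and the GCS bucket respectively, and producing them is the expensive, poorly-scaling part.

```python
def find_dangling(bucket_keys, commits):
    """Return bucket keys not referenced by any commit.

    bucket_keys: iterable of physical object keys found in the storage bucket
    commits: iterable of commits, each an iterable of keys that commit references
    """
    referenced = set()
    for commit in commits:  # O(commits x objects per commit) -- this is why it scales badly
        referenced.update(commit)
    return sorted(set(bucket_keys) - referenced)

# Toy example (stand-ins for real lakeFS/GCS listings):
commits = [
    {"data/a.pdf", "data/b.pdf"},  # commit 1
    {"data/a.pdf", "data/c.pdf"},  # commit 2
]
bucket = {"data/a.pdf", "data/b.pdf", "data/c.pdf", "tmp/staging.parquet"}
print(find_dangling(bucket, commits))  # ['tmp/staging.parquet']
```

Note that objects staged on an uncommitted branch would also show up as "dangling" under this definition, so a real cleanup would have to exclude keys referenced by uncommitted staging areas before deleting anything.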
p
Yep, compliance is one of the big issues for us, especially when working with voter data. Also, some of our older jobs are poorly optimized, so they generate huge amounts of intermediate data (~4-5x) which would end up incurring non-trivial costs.
I suppose another option could be to use an audit device to keep track of deletes? Maybe if I could monitor the server logs to watch for deletes?
i
It's hard - not every DELETE request should lead to data being deleted, and some PUT requests should actually delete data. Would you mind if we schedule a short meeting to see how we can best assist with your use case?
p
Absolutely, I will DM right now to schedule.
i
Awesome!
a
Hi @Pratheek Rebala! There is no current support for garbage collection on GCS, only on S3. We're actually porting it over to Azure right now. @Itai Admi team-eco would love to be in this call!
The note on https://docs.lakefs.io/reference/garbage-collection.html may need some more prominence. 😞