Question around Garbage Collection: Let say I have...
# help
h
Question around Garbage Collection: Let's say I have `f1` in commit `c1`, then I delete `f1` in commit `c2`. Further commits happen that are unrelated to `f1`. My branch head is at `c8` (and without `f1`). `f1` is a committed file but is not in the branch head. Does it mean that it will be deleted if I run garbage collection one year later (e.g. beyond any retention rule)?
a
Yes! It will be dropped if both of these conditions hold:
• it is not in any branch HEAD, and
• it is not in any branch commit that falls within a retention period.
I assume you've read the page, of course, but there is also this page on internals; its "what gets collected" and "what does not get collected" sections are still relevant even with the new internals.
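To make the two conditions concrete, here is a minimal sketch of the rule as stated above. It is a toy model, not lakeFS code: the `Commit` class, the `is_collectable` function, the dates, and the 21-day retention are all hypothetical and only mirror the `f1` / `c1` / `c2` / `c8` example from the question.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Commit:
    """Toy commit: an id, a creation time, and the set of paths it contains."""
    commit_id: str
    created: datetime
    files: frozenset


def is_collectable(path, branches, now, retention):
    """Return True if `path` meets both conditions for being dropped.

    `branches` maps a branch name to its commits, newest first, so the
    first element of each list is that branch's HEAD.
    """
    cutoff = now - retention
    for commits in branches.values():
        head = commits[0]
        if path in head.files:
            return False  # condition 1 fails: still in a branch HEAD
        for commit in commits:
            if commit.created >= cutoff and path in commit.files:
                return False  # condition 2 fails: in a commit within retention
    return True


now = datetime(2024, 1, 1)
c1 = Commit("c1", now - timedelta(days=400), frozenset({"f1", "f2"}))
c2 = Commit("c2", now - timedelta(days=390), frozenset({"f2"}))
c8 = Commit("c8", now - timedelta(days=300), frozenset({"f2"}))
branches = {"main": [c8, c2, c1]}

short_retention = is_collectable("f1", branches, now, timedelta(days=21))
long_retention = is_collectable("f1", branches, now, timedelta(days=9999))
print(short_retention, long_retention)
```

In this model `f1` is collectable with the short retention, because `c1` is older than the cutoff, and is kept with the very long one, because `c1` then falls inside the retention period.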
h
So if I only want to delete files that were uncommitted and got reset (so they were never part of the history/commits on any branch), I need to set a retention of 999 years and then run the garbage collector? In which case any file that can be "reached" or "accessed" via a commit, but is not present in any branch HEAD, will not be deleted?
a
Hang on, I think your original question confused me because it talked about committed objects. Uncommitted objects that are under the control of lakeFS are different. The reference docs do a good job of explaining: any inaccessible uncommitted object is removed by the first GC after some period of time (measured in hours). And of course we run that automatically for lakeFS Cloud customers 🙂 Objects that you uploaded to a branch but never committed are only reachable by that branch. So if you delete them from the branch and/or delete the branch, uncommitted GC will clean them up.
h
Yes. That is also my understanding of uncommitted objects + GC. I was going to run the GC in order to delete all the uncommitted stuff that is no longer accessible because of all the resets and the deleted temporary branches that never got merged. My concern is that if I don't set the retention rule properly, then during that same GC run some of my files that belong to a main branch and were committed a long time ago may also be deleted, simply because they are old and no longer appear in the branch HEAD.
a
I'm sorry, I didn't get the last part of that. Uncommitted unreachable data that lives in the storage namespace is anyway inaccessible, and will simply get removed. This is safe.
h
Uncommitted unreachable data that lives in the storage namespace is anyway inaccessible, and will simply get removed. This is safe.
Yes. That part I am not worried about.
Oh, by the way, I forgot to mention that I am talking about our self-hosted lakeFS server here,
so we need to run the GC manually. We have never run it so far. We are contemplating running it against our dev lakeFS server,
thus reading and triple-reading the docs ...
Based on this: I want 2. to happen, and it will happen if I run the GC. But 1. is the scary part, where I do not want to delete anything even if it's old. So I just wanted to confirm that, to prevent 1. from happening, I need to set a retention rule with a very long age, like 9999 days?
Oh, I think I missed this one: So by default, without a retention rule set, committed objects will not get deleted by the GC?
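For reference, a retention rule with a very long age might be written roughly like this. This is only a sketch: the field names (`default_retention_days`, `branches`, `branch_id`, `retention_days`) are how I recall the GC rules format from the lakeFS docs and should be checked against the docs for your version, and the branch name `main` and the 9999-day value are just the example from this thread.

```python
import json

# Hypothetical GC rules with a very long retention on every branch.
# Field names are an assumption based on the lakeFS GC documentation.
rules = {
    "default_retention_days": 9999,
    "branches": [
        {"branch_id": "main", "retention_days": 9999},
    ],
}

# Serialize to JSON, e.g. to save as a rules file for the repository.
rules_json = json.dumps(rules, indent=2)
print(rules_json)
```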
a
Yes. Also, we used to have a mark-only mode, which didn't actually delete anything. Let me verify that we still have that; it's a good way to see what savings you would get with a given policy.
h
The docs say that it should be available.
👍 1