# help
h
We have several folders `folder1`, `folder2`, `folder3`, each containing files. Files were added and removed in different commits. We now want to delete all the underlying file content of `folder1` across all the commit history and branches. Is it possible? I guess we will need to go through each commit. For each commit, look at all the underlying blobs that are behind files in `folder1` and delete them from the underlying storage. In some form, it's very similar to what the garbage collector does. Any suggestions?
I had a first discussion here
GC deletes files based on time, whereas here I want to delete based on "path" within the repo ....
e
Hi @HT, I'd like to make sure I understand the use-case... So you want to delete `folder1` in the underlying object store and remove any reference to it from all of the commits across all of the branches?
h
just the underlying objects, leaving all the metadata intact. So you can still "checkout" a commit, just the files in `folder1` will be "empty" while the other files are still available
e
Ok, I understand
h
the big picture: each folder contains the data of one customer. One day the customer behind `folder1` may want us to delete all their data. So we will need to go through the lakeFS history and delete all of that customer's underlying files, while leaving intact all the history and the other customers' data
e
Ok, so GC works the other way around, i.e., it removes files that aren't referenced anymore - in that case you don't know what can be removed, so you use the metadata to find the files that can be removed and remove them. Since you already know which files you want to remove, why not just delete them from the underlying storage?
h
dumb question, but: given a commit `c` and a path `p` in a repo, how do I know which file it is in the underlying storage?? When I look in `data` in the underlying storage it's all cryptic paths to me ...
@Elad Lachmi?
e
Thinking... 🙂
h
oh ... it was not such a dumb question then ...
e
Maybe not
h
I mean, in theory, it should be the way to go. For each commit:
• Look for all paths of interest
• Translate them to underlying paths
Gather all of those paths across all the commit history, then delete all of them in the underlying storage. The tricky part is translating a lakeFS path to an underlying path. Basically understanding this (rough skeleton below)
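A rough skeleton of that plan, for illustration only: every helper name here is a hypothetical placeholder, not a lakeFS API, and the lakeFS-path-to-physical-path translation is exactly the missing piece being asked about.

```python
# Skeleton of the plan above; every helper is a hypothetical placeholder,
# not a lakeFS API. The lakeFS-path -> physical-path translation is the
# open question here.

def list_all_commits(repo: str) -> list[str]:
    raise NotImplementedError("walk the commit log of every branch")

def list_paths(repo: str, commit: str, prefix: str) -> list[str]:
    raise NotImplementedError("list the repo paths under prefix at this commit")

def translate_to_physical(repo: str, commit: str, path: str) -> str:
    raise NotImplementedError("lakeFS path -> underlying storage path")

def delete_from_storage(physical_path: str) -> None:
    raise NotImplementedError("delete the blob in the underlying object store")

def wipe_folder(repo: str, prefix: str) -> None:
    physical_paths: set[str] = set()
    for commit in list_all_commits(repo):
        for path in list_paths(repo, commit, prefix):
            physical_paths.add(translate_to_physical(repo, commit, path))
    for physical in physical_paths:
        delete_from_storage(physical)
```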
e
When you call the `/repositories/{repository}/refs/{ref}/objects/ls` endpoint (and `ref` can be a branch/commit/etc.), one of the properties of the result objects is `physical_address`, which is the location in the object store
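For example, a hedged sketch of calling that endpoint directly with Python's `requests`; the host, credentials, repo, ref, and prefix are placeholders, and the `prefix`/`amount` query parameters are assumptions about the listing API, not taken from this thread.

```python
import requests

# Placeholders - substitute your own server, credentials, repo and ref.
LAKEFS = "https://lakefs.example.com/api/v1"
AUTH = ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")

resp = requests.get(
    f"{LAKEFS}/repositories/my-repo/refs/<commit-id>/objects/ls",
    params={"prefix": "folder1/", "amount": 1000},  # query params assumed here
    auth=AUTH,
)
resp.raise_for_status()

for obj in resp.json()["results"]:
    # physical_address is the object's location in the underlying store
    print(obj["path"], "->", obj["physical_address"])
```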
h
I will give it a try
Thanks
e
Sure, np
h
(by the way, are you a Treeverse employee?)
e
Yes, I am
Something else I’d like to suggest thinking about… The GC policy manages how long objects should be retained once they aren’t at the HEAD of any branch or commit. If you’re diligent about removing unused/stale branches, and most objects end up at the HEAD of the main branch, then deleting them from there, in combination with a 1-day retention policy and a daily GC run, might get you where you want to go without writing a custom process
h
In our case we do want to be able to travel back in time in order to reproduce our Deep Learning model from X months ago. Thus GC is only good for deleting "uncommitted" and unreferenced files. But otherwise, we want to keep everything !!
except for the rare and annoying case where a customer eventually wants their own data to be wiped out
e
Ok, that makes sense
h
`/repositories/{repository}/refs/{ref}/objects/ls`: what is the equivalent in the Python SDK??
e
Hi @HT, for future reference, all the Python SDK functions are mapped to the lakeFS endpoints here
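For instance, assuming the generated `lakefs_client` package, the endpoint above maps to `objects.list_objects`, and the commit log to `refs.log_commits`. A sketch under those assumptions (server, credentials, repo, and branch are placeholders; pagination and per-branch iteration are omitted):

```python
import lakefs_client
from lakefs_client.client import LakeFSClient

# Placeholders - substitute your own server and credentials.
configuration = lakefs_client.Configuration(host="https://lakefs.example.com/api/v1")
configuration.username = "ACCESS_KEY_ID"
configuration.password = "SECRET_ACCESS_KEY"
client = LakeFSClient(configuration)

physical_paths = set()
# Walk the commit log of one branch (repeat per branch; pagination omitted).
for commit in client.refs.log_commits(repository="my-repo", ref="main").results:
    listing = client.objects.list_objects(
        repository="my-repo", ref=commit.id, prefix="folder1/"
    )
    for obj in listing.results:
        physical_paths.add(obj.physical_address)

print(f"{len(physical_paths)} physical addresses to delete from the object store")
```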
h
nice ! Thanks !!
e
Sure, np
h
Hi all. I manually deleted the underlying files (aka `physical_address`) of specific paths. Then when I try to download those files via the S3 API from a commit that references those paths, I get `[Errno 121] Internal Server Error`. I was hoping for an error similar to what would happen if the files had been deleted by the garbage collector (as mentioned here): `410 Gone`. Looks like GC does something to inform the lakeFS server that those files are gone so it doesn't try to look for them. My question then becomes: when I delete the underlying file, what more do I need to do to let the lakeFS server know and handle this correctly? (self-hosted lakeFS server) @Elad Lachmi
a
This is odd. AFAIK there is no logic for sending 410 Gone other than what you described: if the object cannot be read from underlying storage, it's "Gone". This is something that would happen if metadata (ranges or metaranges) were missing, but not data. I'm sorry to have to trouble you again, but could you detail what exactly you deleted, and/or send lakeFS server logs? (As always, please go over the logs to ensure there is nothing there you don't want shown; actual paths, branch names, and repo names are probably on that list.) Feel free to send privately to @Elad Lachmi and @Ariel Shaqed (Scolnicov) if that helps!
h
ok. Thanks. I will dig into the logs. Now that I know this is not expected.
I know that I am using an old version (107) but it should not be the cause, right?
e
Probably not
h
our underlying storage is an Azure Blob with soft delete
e
This seems to be handled correctly in the Azure blob store adapter. I'll need to look into this a bit