# help
n
Hello! @Adrian Rumpold and I are wondering if the fact that a storage namespace is not emptied completely on repository deletion is intentional or a bug. In our concrete case, this breaks idempotence of repository creation calls, which we are currently exploring here: https://github.com/appliedAI-Initiative/lakefs-spec/pull/154 For a concrete problem, imagine the case “I want to run a Jupyter Notebook, in which a temporary repository gets created before start and deleted again after successful execution”. This notebook would not run twice in succession because the storage namespace is never cleaned up. Thanks!
h
I came across this issue in my pytest setup, and the solution was to make the repo name unique (timestamp or random hash). Another example that is less of a bug: deleting a branch. It takes some time for a deleted branch to clear out, so if you run create branch and delete branch consecutively with the same name, you will end up with something along the lines of "branch already exists, cannot create".
As for why the repo data is not deleted, I believe one of the principles of lakeFS is to never delete the underlying storage. To do that, I think you need the Garbage Collector.
👍 1
👍🏻 1
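(For illustration, a minimal sketch of the unique-name workaround HT describes above, written as a pytest fixture; the fixture name is made up for this example and is not part of lakeFS or lakefs-spec:)
```python
import uuid

import pytest


@pytest.fixture
def temp_repo_name() -> str:
    # A fresh suffix per test run, so a new repo never lands in a storage
    # namespace left behind by a previously deleted repository.
    return f"test-repo-{uuid.uuid4().hex[:8]}"
```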
a
@HT actually I'm not sure deleting a branch should behave like that! Branches are purely mutable objects. Could you please open an issue for that? About repository deletion: yes, this is intentional. Deleting all your data can take a lot of time and is really scary, after all... Definitely doable of course. @Tal Sofer / @Oz Katz WDYT?
h
@Ariel Shaqed (Scolnicov) Looks like I cannot reproduce it ... I will keep an eye on this next time I see it. Good to know that it's not expected. Which means something is off in our deployment
a
Just my two cents on repo deletion. I can see why you would not want this behavior by default (especially for repositories in external object storage). However, for the `local` blockstore type, I could see real value in being able to wipe the underlying data along with the metadata.
y
I can see why that makes sense, @Adrian Rumpold. As mentioned by @HT, for safety reasons lakeFS is designed to be run without delete permissions on the storage layer. I would hesitate before adding special treatment for the local blockstore type. Maybe @Oz Katz and @Tal Sofer as product managers will have something to add.
👍 1
👍🏼 1
t
Hi @Nicholas Junge, nice meeting you! ❤️ Thanks for giving this example. Can you please share the use case you have in mind for running a notebook that creates a temporary repo?
n
Hey @Tal Sofer, our use case in which this came up is a tutorial notebook for a Python project - the notebook runs on a local instance, and creates a repo if not already existing. We attempted to design all of our resource creation APIs (branches, tags, repos) idempotently, but got stopped by the backend in the latter case. If we create some immutable state e.g. on a branch in that repo, it can result in different behavior on re-execution. We can always reap the instance as long as we’re local, but in principle, it’s nice to have a hammer that does this without much question.
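(To make the pattern concrete, here is a rough sketch of the "temporary repo around a notebook run" flow; `create_repository`/`delete_repository` stand in for whichever lakeFS client calls you actually use and are not a specific API:)
```python
import uuid


def run_with_temp_repo(client, storage_root: str) -> None:
    """Create a throwaway repo, run the notebook's work, then delete the repo.

    The storage namespace is derived from a unique repo name, because lakeFS
    intentionally leaves the namespace contents in place after deletion.
    """
    name = f"tutorial-{uuid.uuid4().hex[:8]}"
    client.create_repository(name=name, storage_namespace=f"{storage_root}/{name}")
    try:
        ...  # execute the notebook cells / tutorial steps against `name`
    finally:
        client.delete_repository(name)
```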
h
Alternatively, in your case, for the tutorial's sake: after deleting the repo in lakeFS, you can manually delete the corresponding folder in your local blockstore. Like mount the Docker storage to your host (assuming you are running your lakeFS server with docker compose?). Run your tutorial, creating the repo, deleting the repo. Then issue a `shutil.rmtree()` on the host path?
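(Roughly, assuming a local blockstore whose data directory is mounted on the host; the path below is only a placeholder:)
```python
import shutil
from pathlib import Path

# Placeholder for wherever your docker-compose volume maps the `local`
# blockstore on the host, plus the repo's storage namespace directory.
namespace_dir = Path("/path/to/lakefs-data/my-tutorial-repo")

# After deleting the repo via the lakeFS API, clear out the leftover namespace.
if namespace_dir.exists():
    shutil.rmtree(namespace_dir)
```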
a
I think @Yoni Augarten is (always...) right: the price for such a feature is a much-increased scope for confusion: we will now have an optional feature (delete storage namespace) that requires additional authorization (delete stuff) to perform. Plus, implementing it across all blockstores leaves different edge cases for each blockstore, while implementing it for just one blockstore is again confusing. But I can think of two ways to address both of these!
First way:
• Add "delete object" (and similar) to the list of permissions that we say lakeFS needs. It can anyway upload an empty object on top of an existing one, which is not very different from deletion. Even on versioned stores! And document that if you don't grant deletion, then feature X won't work.
• Start with a minimal implementation of "delete storage namespace" on the block adapter: it deletes the `dummy` file and the configuration objects. Now the storage namespace is reusable. And for local, we can implement an actual delete of the subdirectory. For non-local, I think GC will be able to remove the remnants.
Alternative way:
• Don't change permissions.
• Add a flag to "create repo" that will mean "_overwrite existing repo yes this is scary_". When set, creating a repo ignores an existing `dummy` file (like we used to) and also empties out any existing configuration files. For local, things will pile up until the user `rm -r`s the namespace; for non-local, I think GC will be able to remove the remnants.
t
Thanks @Ariel Shaqed (Scolnicov)! @Nicholas Junge the way I'd suggest going about the tutorial you are building is to use Docker as a virtual environment with ephemeral storage. Would that work for you? If it doesn't, you can manually delete the corresponding folder as @HT suggested as a workaround. Beyond the safety reasons for which lakeFS does not run with delete permissions, and the time it may take to delete a storage namespace, a lakeFS repo is by nature a long-lived entity that is logically not expected to be deleted and recreated over and over again. Therefore we don't think of createRepo as an idempotent operation.