# dev
o
Hello all, is there an option for bulk delete of object paths through the lakeFS API? At the moment we use the delete-object API one by one (async) in order to delete many objects. Performance-wise, I think it would be much better to do this in parallel server side instead of issuing many HTTP requests from the client side (we can have more than 1000 files queued for deletion)
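For context, a minimal sketch of the client-side pattern described above. It uses the lakeFS delete-object REST endpoint; the server URL, repository, branch, paths, and credentials are placeholders:
```python
# Sketch of the current approach: one HTTP DELETE per object path,
# parallelized with a thread pool on the client side.
from concurrent.futures import ThreadPoolExecutor

import requests

LAKEFS = "http://localhost:8000/api/v1"
AUTH = ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")

def delete_object(path):
    resp = requests.delete(
        f"{LAKEFS}/repositories/my-repo/branches/main/objects",
        params={"path": path},
        auth=AUTH,
    )
    return path, resp.status_code

paths = [f"data/part-{i:05d}.parquet" for i in range(1000)]
with ThreadPoolExecutor(max_workers=32) as pool:
    for path, status in pool.map(delete_object, paths):
        if status != 204:  # 204 No Content is the expected success status
            print(f"failed to delete {path}: HTTP {status}")
```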
b
Currently just client side - lakectl supports recursive delete.
o
In the API it doesn’t support the recursive option (see this thread), BUT we don’t need recursive here - we need to delete specific files from a directory, not the whole directory
b
So something like a delete-objects call that takes a chunk of keys?
👍 1
just to reduce the number of calls?
Currently you can parallelize the delete-object calls - but I guess you are asking for a dedicated API.
o
Yes, we do parallelize the API calls today. I was thinking about something server side, using the API, instead of creating multiple parallel requests
It won’t just reduce the number of calls; it will also improve the performance of this operation
b
I'll open an issue to capture the above and give us a way to track this request
🙏 2
Currently the only way lakeFS can provide this is through the S3 gateway, since we support delete-objects there
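For reference, a sketch of the gateway path using boto3's `delete_objects`, which removes up to 1000 keys per request. In the lakeFS S3 gateway the repository acts as the bucket and keys are prefixed with the branch name; the endpoint, names, and credentials below are placeholders:
```python
# Bulk delete through the lakeFS S3 gateway with a single request.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",  # lakeFS S3 gateway
    aws_access_key_id="ACCESS_KEY_ID",
    aws_secret_access_key="SECRET_ACCESS_KEY",
)

keys = [f"main/data/part-{i:05d}.parquet" for i in range(1000)]
response = s3.delete_objects(
    Bucket="my-repo",  # repository name acts as the bucket
    Delete={"Objects": [{"Key": k} for k in keys], "Quiet": True},
)
# With Quiet=True the response lists only the keys that failed.
for err in response.get("Errors", []):
    print(f"failed: {err['Key']}: {err['Code']} - {err['Message']}")
```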
o
Thanks! Yes, I know, but it’s problematic to use it in our specific use case
👍 1
b
So in one request you can delete multiple keys - but having the same capability in our native API is what's needed
o
@Barak Amar @Ori Adijes let's continue this discussion on a GitHub issue? I think it's also worth exploring @Ariel Shaqed (Scolnicov)’s suggestion for range deletes, if applicable (https://github.com/treeverse/lakeFS/issues/2092#issuecomment-947434893)
👍 1
b
o
@Daniel Satubi @Lior Resisi 👆 FYI
👑 1
b
@Ori Adijes wanted to share more information about the API:
1. The number of keys per request will be limited by the server (e.g. 1000)
2. Each key will be deleted independently and an array of results/errors will be returned - it will not be a single all-or-nothing transaction
Is that aligned with what you had in mind?
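To make the proposal concrete, here is a hypothetical sketch of the request and response shapes implied by points 1 and 2 above; the field names and structure are illustrative, not a committed API:
```python
# Hypothetical request/response shapes for the proposed bulk-delete API.
request_body = {
    # Server would reject requests with more keys than its cap (e.g. 1000).
    "paths": [
        "data/part-00001.parquet",
        "data/part-00002.parquet",
        "data/part-00003.parquet",
    ]
}

# Deletion is per key, not all-or-nothing, so the response reports
# individual failures; paths absent from "errors" were deleted.
response_body = {
    "errors": [
        {
            "path": "data/part-00002.parquet",
            "status_code": 404,
            "message": "object not found",
        }
    ]
}
```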
o
Yes, that's OK, but if there was an error on one of the files, we should fail the request. Is that OK?
b
Asking to delete o1, o2, o3 where o2 fails means that o1 and o3 are gone, and you will get back an array of results along the lines of [{}, {err}, {}] (abstract example)
communicating which object wasn't deleted and the reason
a
I'd like to take range deletes off the table. While in all modesty I think I managed to steal a really great idea, it is not particularly relevant to the current implementation on top of Postgres, and, while I do believe range deletes could be implemented reasonably efficiently in this scenario, they will not necessarily map well to this use case.
o
Honestly, I think it's a very legitimate and normal use case. We will just do the same thing client side instead of server side and hurt performance, so if the use case is wrong, the client-side usage is just as wrong, no matter where it is implemented. @Barak Amar, so you mean the API will still return a success code if one of the object deletions failed? I think we should also add a query option like "failFast" to fail immediately when one of the object deletions fails
👍🏼 1
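A hypothetical illustration of the suggested failFast option; neither this endpoint path nor the parameter exists at this point in the discussion, it only visualizes the suggestion:
```python
# Hypothetical: the suggested "failFast" flag as a query parameter.
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/repositories/my-repo/branches/main/objects/delete",
    params={"failFast": "true"},
    json={"paths": ["data/a.parquet", "data/b.parquet"]},
    auth=("ACCESS_KEY_ID", "SECRET_ACCESS_KEY"),
)
# With failFast the server would stop at the first failed deletion and
# report an error, instead of continuing through the remaining keys.
```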
b
We can return 204 No Content in case of "complete" success, and 200 with a response body that indicates which ones didn't make it
You want a way to know whether you need to re-request (or build a new request) vs. having nothing to worry about
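Putting the two status codes together, a hypothetical client-side handler for the proposed contract; the endpoint and the `errors` field shape are placeholders carried over from the sketch above, not an existing lakeFS API:
```python
# Hypothetical client handling for the proposed contract:
# 204 = complete success, 200 = partial success with an error list.
import requests

def bulk_delete(paths):
    """Returns the paths that were NOT deleted (candidates for a retry)."""
    resp = requests.post(
        "http://localhost:8000/api/v1/repositories/my-repo/branches/main/objects/delete",
        json={"paths": paths},
        auth=("ACCESS_KEY_ID", "SECRET_ACCESS_KEY"),
    )
    if resp.status_code == 204:
        return []  # everything was deleted, nothing to re-request
    if resp.status_code == 200:
        errors = resp.json().get("errors", [])
        for err in errors:
            print(f"not deleted: {err['path']}: {err['message']}")
        return [err["path"] for err in errors]
    resp.raise_for_status()  # any other error status is a hard failure
    return paths  # unexpected status: assume nothing was deleted
```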
o
@Barak Amar @Ori Adijes this is an important discussion, and one that will very likely affect the design of a lakeFS feature. If we could move the discussion itself and the decision to GitHub, I think it would be easier for others to find, understand, or contribute to later.
👍🏼 1
👍 2