# help
h
Hi all, by default, `lakectl repo list` can only list a maximum of 1000 repos. Is there a way to bump this number up? Programmatically listing more than 1000 repos is a bit of a pain: I'd have to grep for
`for more results run with --amount 1000 --after "a-repo-name"`
Actually, once you run lakectl in a script, there is no
`for more results run with --amount 1000 --after "a-repo-name"`
feedback at all ....
g
Hi @HT, there is currently no way to list more than 1000. When listing objects, lakectl takes care of pagination for you; this is not the case for listing repos. Can you please expand a bit on your use case? How many repos do you have?
h
I am testing with about 4000 repos
We are thinking of having a repo per customer, so that if we get a request to delete their data, only that repo is impacted.
g
I understand. If you like, you can open an issue for listing all repos using lakectl and we will look into it soon. (If you prefer, I can open it.)
For the second thing: if I understand correctly, the pagination isn't working for you and it returns nothing for the second page. Is that correct?
h
When I list the first 1000 with `lakectl repo list --amount 1000`, I get 1000 lines on stdout... But how do I know there are more?
I end up re-running with `--after` and the last returned repo name, repeating until no repo is returned.
I can live with that, but if I could just do `--amount 10000`, problem solved :p
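(As an aside, here is a minimal sketch of that manual pagination loop in bash. The `--amount`/`--after` flags come from the hint quoted above; the idea that the repository name is the first column of `lakectl repo list` output, and the header/hint filtering, are assumptions that may need adjusting for your version.)

```bash
#!/usr/bin/env bash
# Sketch: page through all repositories 1000 at a time with lakectl.
# Assumptions: the repo name is the first whitespace-separated column of the
# output, and header/hint lines can be filtered out as below.
set -euo pipefail

after=""
while true; do
  if [ -z "$after" ]; then
    page="$(lakectl repo list --amount 1000)"
  else
    page="$(lakectl repo list --amount 1000 --after "$after")"
  fi
  # Keep only what looks like repository rows (output-format assumption).
  names="$(printf '%s\n' "$page" | awk 'NF {print $1}' | grep -v -i -e '^repository$' -e '^for$' || true)"
  [ -z "$names" ] && break
  printf '%s\n' "$names"
  after="$(printf '%s\n' "$names" | tail -n 1)"
done
```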
g
For now 😄
h
I know ... Just been lazy ;D
Originally I just wanted to do rclone with a glob pattern, but then I only get 1000 repos ...
So I now have to build the list of files manually. But the weird thing is that even with a list of 4000 files to copy, rclone only copies 1000 files.... ?!?!
g
Looking into it
Just to make sure, the 4000 files are in the same repository and when copying with rclone, only 1000 are being copied?
h
It's a single file per repo, so 4000 files in 4000 repos
```
rclone copy --files-from /dev/shm/downloadParquet.sh.tmp sandbox: /data/hieu/deleteme/demo/ --dry-run --transfers=64
```
with:
```
$ head /dev/shm/downloadParquet.sh.tmp
20180523-140551-spurderuel-clone000/main/handy/annotations.parquet
20180523-140551-spurderuel-clone001/main/handy/annotations.parquet
20180523-140551-spurderuel-clone002/main/handy/annotations.parquet
20180523-140551-spurderuel-clone003/main/handy/annotations.parquet
20180523-140551-spurderuel-clone004/main/handy/annotations.parquet
20180523-140551-spurderuel-clone005/main/handy/annotations.parquet
20180523-140551-spurderuel-clone006/main/handy/annotations.parquet
20180523-140551-spurderuel-clone007/main/handy/annotations.parquet
20180523-140551-spurderuel-clone008/main/handy/annotations.parquet
20180523-140551-spurderuel-clone009/main/handy/annotations.parquet
```
rclone does 1000 files then stops without error ...
not sure whose issue this is ... very likely mine, but ... not sure where ...
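(For reference, a small sketch of how such a `--files-from` list could be generated, assuming a `repos.txt` with one repository name per line, e.g. produced by the pagination loop earlier; the branch and file path are taken from the example listing above.)

```bash
#!/usr/bin/env bash
# Sketch: build an rclone --files-from list with one known file per repo.
# repos.txt is assumed to contain one repository name per line.
while IFS= read -r repo; do
  printf '%s/main/handy/annotations.parquet\n' "$repo"
done < repos.txt > /dev/shm/downloadParquet.sh.tmp
```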
g
I need to understand a bit more about working with multiple repositories at the same time. Can you please provide more information about your use case?
h
We have "captures" coming from different customers. I am planing to organize such that each customer have their own repo. In each repo, the file structure will be similar. The reason for spliting repo per customer is as mentioned above: in case a customer want their data to be deleted from our system, then I just delete that repo. If I gather multiple custiomer under the same repo, then deleting a single customer over all the versions/commit become : a wipe out of history for that repo ...
g
In that case, lakeFS Garbage Collection can handle that for you. If you just delete the files from lakeFS, you don't need to delete them from all versions; you just need the files to be deleted in all working branches. Using the GC configuration you can decide how far back you want to keep data for each branch. Only the data will be deleted (not the metadata), so you will be able to see that the file existed, but you will not be able to access its content because it was deleted.
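(Roughly, the GC retention policy is set per repository from a small JSON rules file; here is a hedged sketch. The field names and the `-f` flag follow the lakeFS docs as I recall them and should be checked against your version; the repository name and retention periods are made up.)

```bash
# Sketch: define retention rules and attach them to a repository.
# Field names (default_retention_days, branches[].branch_id/retention_days)
# and the -f flag are assumptions to verify against your lakeFS version.
cat > gc-rules.json <<'EOF'
{
  "default_retention_days": 21,
  "branches": [
    { "branch_id": "main", "retention_days": 7 }
  ]
}
EOF
lakectl gc set-config lakefs://customer-1234 -f gc-rules.json
```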
h
Does that mean that I will need a separate system that tracks which files still exist and which are ghosts only?
g
If deleting data is the only reason for holding many repositories, I would suggest considering lakeFS GC for that. Many lakeFS commands that provide atomicity and reproducibility are per repo and will be less useful when working across repos.
h
Ok. I will have a look at GC
g
> Does that mean that I will need a separate system that tracks which files still exist and which are ghosts only?
No, it's deleted by a retention policy. This should make it clearer: https://docs.lakefs.io/howto/gc-internals.html
Hope it works for you
h
This GC system does add a layer of complexity ...
g
Yes it does 😅
h
In the multi-repo case, I was thinking that if we need to make modifications across repos, we would split the task per repo and do all the changes repo by repo. We are not expecting this to happen often anyway.
But is it normal that rclone struggles to fetch data across repos when the server contains a large number of repos? Or am I just missing something?
g
It's not normal. I am looking into that right now; rclone should be able to paginate for you.
h
like `rclone lsd sandbox: | wc -l` will only return 1000
```
$ rclone --version
rclone v1.62.2
- os/version: opensuse-leap 15.4 (64 bit)
- os/kernel: 5.14.21-150400.24.60-default (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.20.2
- go/linking: static
- go/tags: none
```
Now it is working with `rclone --no-traverse copy --files-from /dev/shm/downloadParquet.sh.tmp sandbox: /data/hieu/deleteme/demo/ --dry-run -v`. This `--no-traverse` is the magic ...
but `rclone --no-traverse lsd sandbox: | wc -l` still only gives 1000
g
Managed to reproduce; listing repositories does have an issue, sorry about that. We will provide a fix this week.
Would you like to open the issue, or should I?
Thanks for pointing this out 🙏
h
I can open a ticket, but I am not sure what kind of issue you are expecting to solve: just the listing? Or also the copy without `--no-traverse`?
I can try to put all of them in 😜
g
Mention all of them, hopefully it will be fixed the same way
Thanks again!
h
g
The issue was fixed and will be released as part of the next lakeFS version. There is no limit in the gateway for repos now.
h
Is it seamless to upgrade from v0.100 to the next release?
i
Should be straightforward