# help
m
Hello team, I want to upload datasets to lakeFS and version them. Each dataset is a separate folder with arbitrary files. For example, I have folders (datasets) A and B:

Files in A: a1, aa1
Files in B: b1, bb1

datasets-versions.yaml:
A: v0.0.1
B: v0.0.1

I want to update dataset A, i.e. rewrite folder A's contents, so that after uploading the new dataset the state is:

Files in A: a2, aa2, aaa2

datasets-versions.yaml:
A: v0.0.2
B: v0.0.1

I can do this with the commands:
lakectl fs rm -r lakefs://repo/branch/A
lakectl fs upload -r lakefs://repo/branch/A -s A
My question is: How can I do this using Python lakefs package?
o
Hi @mpn mbn, you can find a Python example here
h
In your example, what is
datasets-versions.yaml
? In which folder is it?
m
@HT it is in the branch's root, not in any folder.
Probably I have to
branch.delete_objects([o.path for o in branch.objects(prefix="A/")])
and then walk the files in the local
A
dataset dir and upload each file individually.
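That delete-then-reupload flow can be sketched with the high-level lakefs Python SDK roughly like this. A minimal sketch, not a tested implementation: the repository and branch names are placeholders, and the `bump_patch` version helper is my own convention, not part of the SDK.

```python
from pathlib import Path


def bump_patch(version: str) -> str:
    """Bump the patch part of a 'vX.Y.Z' string, e.g. v0.0.1 -> v0.0.2 (naming convention, not an API)."""
    major, minor, patch = version.lstrip("v").split(".")
    return f"v{major}.{minor}.{int(patch) + 1}"


def replace_dataset(repo: str, branch_name: str, dataset: str, local_dir: str) -> None:
    """Delete everything under <dataset>/ on the branch, then re-upload the local folder."""
    import lakefs  # pip install lakefs; imported here so the helper above works without the SDK

    branch = lakefs.repository(repo).branch(branch_name)

    # Remove the old contents of the dataset prefix (same idea as `lakectl fs rm -r`).
    branch.delete_objects([o.path for o in branch.objects(prefix=f"{dataset}/")])

    # Upload every file from the local dataset directory, preserving relative paths.
    root = Path(local_dir)
    for path in root.rglob("*"):
        if path.is_file():
            dest = f"{dataset}/{path.relative_to(root).as_posix()}"
            with open(path, "rb") as f:
                branch.object(dest).upload(data=f.read())

    # Nothing is visible to other readers until the commit.
    branch.commit(message=f"Replace dataset {dataset}")
```

You would still need to rewrite datasets-versions.yaml with the bumped version (e.g. `bump_patch("v0.0.1")`) before committing, if you keep that file.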
h
lakectl fs rm -r lakefs://repo/branch/A
lakectl fs upload -r lakefs://repo/branch/A -s A
If you can use
rclone
, then you can just do
rclone sync
If you have to use Python, you can try
fsspec/s3fs
They have a sync function called rsync (I wrote my own 😛). Also, I'm not sure it's a good idea to have this
datasets-versions.yaml
holding version information:
1. You are versioning inside a versioning system (lakeFS). An alternative would be using lakeFS commit metadata, or maybe tags.
2. It can create conflicts if you work on 2 different branches at the same time and later merge them: you will have to resolve the overwrites of that file manually, as lakeFS does not do content merges.
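The metadata alternative mentioned above could look roughly like this with the lakefs SDK: attach the version to the commit itself (and optionally pin it with a tag) instead of tracking a shared YAML file. A hedged sketch, with assumed repo/branch names and an assumed metadata key convention:

```python
def dataset_version_metadata(dataset: str, version: str) -> dict:
    """Build a commit-metadata entry for a dataset version (key layout is a convention, not an API)."""
    return {f"dataset/{dataset}/version": version}


def commit_with_version(repo: str, branch_name: str, dataset: str, version: str):
    import lakefs  # pip install lakefs; imported here so the helper above works without the SDK

    branch = lakefs.repository(repo).branch(branch_name)
    # The version rides on the commit instead of living in datasets-versions.yaml,
    # so concurrent branches don't fight over one shared file at merge time.
    ref = branch.commit(
        message=f"Update dataset {dataset} to {version}",
        metadata=dataset_version_metadata(dataset, version),
    )
    # Optionally pin the exact state with a tag as well, e.g. "A-v0.0.2".
    lakefs.repository(repo).tag(f"{dataset}-{version}").create(ref.get_commit().id)
    return ref
```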