g
Morning guys, I am working with the lakeFS Python SDK and I have a couple of questions: 1. I have seen that one can use a boto3 client to get data from lakeFS, but I would like to know if I can also use a client instance to pull all the data inside a repo. 2. Do you have any integration with Hugging Face Datasets or PyArrow? 3. It seems that whenever I try to create a tag through the SDK, it gives me an error (provided that I insert the latest commit ID and the version number). Thanks to everyone for any answer.
n
Hi @Giacomo Matrone, glad to see you are using the new Python SDK. Please note that the library you are using is still in beta, so some of the functionality is still being developed. Regarding your questions: 1. The SDK provides a file-like interface for reading objects from lakeFS. You can use Object.reader() either as a context manager or as a normal call to get an object reader and read the contents just as you would with a file on your filesystem. 2. Can you explain what kind of integration you were expecting? Can you give a use case example? 3. Can you please paste the code that reproduces your issue? In the next couple of days we will be releasing a production-ready version with a complete set of functionality. Meanwhile, if you need some missing logic, you can access the entire API from the client object (using client.sdk_client), which gives you the full interface of our current production client. See: https://pydocs-sdk.lakefs.io/ https://docs.lakefs.io/integrations/python.html
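For example, a minimal sketch with the new SDK (repository, branch and object path are placeholders, adjust them to your setup):
import lakefs

# Point at an object on a branch; all names here are illustrative.
obj = lakefs.repository("my-repo").branch("main").object("data/sample.csv")

# Read it as a context manager...
with obj.reader(mode="r") as f:
    print(f.read())

# ...or via a plain call, like a regular file handle.
r = obj.reader(mode="rb")
data = r.read()
r.close()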
jumping lakefs 1
g
Hi Niro, thank you for the answer. Coming to the points: 2. Hugging Face's datasets.save_to_disk allows saving datasets in Arrow format. The workaround I found is to download all the files and then use load_dataset on the folder into which I downloaded them to get the dataset back. However, it would be cool to read the dataset directly from the S3 bucket. Another solution I was thinking about was through spark-submit, which seems more legitimate to me. 3. The code that is giving me issues is:
tag_creation=client.models.tag_creation.TagCreation(id='commit_id',ref='ref_example')
thread=client.tags_api.create_tag("repo-name",tag_creation)
This always gives me a conflict. I can create a tag through the web interface, however.
h
About 2: lakeFS should be file-format agnostic, so if you want to "stream" rather than download files from lakeFS, you can use a library that allows you to do that. For example, fsspec/s3fs can "open" a file in S3 and stream its content.
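A rough sketch of what that streaming could look like through lakeFS's S3 gateway (endpoint, credentials, repo and path are all made up here):
import s3fs

# Point s3fs at your lakeFS installation's S3 gateway with your lakeFS key pair.
fs = s3fs.S3FileSystem(
    key="AKIAIOSFODNN7EXAMPLE",    # lakeFS access key (placeholder)
    secret="example-secret-key",   # lakeFS secret key (placeholder)
    client_kwargs={"endpoint_url": "https://lakefs.example.com"},
)

# The "bucket" is the repo and the key starts with the branch;
# the content is streamed rather than downloaded up front.
with fs.open("my-repo/main/data/part-0000.parquet", "rb") as f:
    first_bytes = f.read(1024)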
šŸ‘ 1
n
Thanks! Regarding 2, would you mind opening a feature request for integration with Hugging Face and PyArrow? It seems like a very useful use case. About 3, the link you provided is to the new SDK, but from the code I see you are using the current one. I have not managed to reproduce your issue with that code. Usually a conflict on tag creation means the tag name already exists in the repo. Are you sure you didn't already create that tag?
šŸ‘ 1
@HT The new SDK provides file semantics for reading and writing objects from lakeFS
šŸ‘ 1
h
@Niro oh ok... so now I have an alternative to fsspec/s3fs then
šŸ‘šŸ½ 1
jumping lakefs 1
g
Hi @Niro, the problem with point 3 is that there was no tag at the time. I added the tag manually in the web client after spending a couple of hours trying to understand what happened.
n
Just to make sure: does the API now work for you, or are you still having trouble creating a tag?
a
@Giacomo Matrone id here is a unique tag name and ref is a branch/commit:
tag_creation=client.models.tag_creation.TagCreation(id='commit_id',ref='ref_example')
Here are a couple of examples:
tagV1 = datetime.datetime.now().strftime("%Y_%m_%d") + f"_{projectBranchV1}"

client.tags.create_tag(
    repository=repo,
    tag_creation=models.TagCreation(
        id=tagV1,
        ref=projectBranchV1))

lakefs.tags.create_tag(repo.id, TagCreation(id="dev-tag-01", ref="dev"))

client.tags.create_tag(
    repository=repo,
    tag_creation=models.TagCreation(
        id='delta_lake_etl_job',
        ref=deltaLakeETLBranch))
šŸ‘ 1
g
Great, but how do I get the delta_lake_etl_job?
a
It is just a string/text. You can use any name you like.
g
Hi, I tried your approach, but the conflict arises with any kind of string.
n
@Giacomo Matrone can you please attach lakeFS logs from the time of the failure?
g
Hi Niro, unfortunately I cannot.
n
Can you paste the exact steps + error message you are seeing?
@Giacomo Matrone I opened https://github.com/treeverse/lakeFS/issues/7103. Can you please add the additional information requested there?
g
@Niro sure, let's follow up in a couple of days, since I will be back at work on Tuesday.
n
Thanks a lot
Can you try one thing for me, please? Try creating a tag with the following exact id:
tag_creation=client.models.tag_creation.TagCreation(id='this_is_my_tag',ref='ref_example')
thread=client.tags_api.create_tag("repo-name",tag_creation)
g
Hi Niro. Thank you for the hint. It still says the code is 403 and there is a conflict, both when working in JupyterLab and as a script.
n
Thanks for the input. Let's continue this discussion on the open issue.
g
@Niro I have also created an implementation of sorts for PyArrow and Hugging Face Datasets. The only point I am missing is how to make the datasets.GeneratorBasedBuilder downloader work with lakeFS instead of the standard Hugging Face DownloadManager. A workaround I found was to create the dataset locally, process it through Spark, and then use Hugging Face Datasets to save the Arrow table. Finally, I use lakeFS to keep track of the different files contained in the dataset folder when saving with save_to_disk in Hugging Face.
t
Hi @Giacomo Matrone! IIUC, you would like to load and save Datasets directly from lakeFS? HF Datasets supports access to cloud storage through fsspec filesystem implementations, and lakeFS has its own fsspec implementation called lakefs-spec! You can use it to smoothly load and save datasets from your lakeFS locations. Will this work for you?
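For reference, a rough sketch of what that could look like (repo, branch and paths are made up; it assumes lakefs-spec is installed and your lakeFS credentials are configured, e.g. in ~/.lakectl.yaml):
from datasets import load_dataset, load_from_disk

# Load a CSV straight from a lakeFS location via the lakefs:// fsspec protocol.
ds = load_dataset("csv", data_files="lakefs://my-repo/main/data/labels.csv")

# Save the Arrow-backed dataset back to lakeFS and reload it later.
ds.save_to_disk("lakefs://my-repo/main/datasets/labels")
reloaded = load_from_disk("lakefs://my-repo/main/datasets/labels")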
g
Hi @Tal Sofer, this is how I have been working up to now. However, when I have a folder of objects and a CSV that labels the objects, it seems I cannot create the dataset using lakeFS.
t
@Giacomo Matrone happy that you are already working with lakefs-spec! sunglasses lakefs Can you please help me understand your use case in more detail? Do you mind sharing a code snippet of what you are trying to do?
g
The use case is the following: I have a dataset made up of several objects and a CSV. I would like to use lakeFS to store those objects (CSV included) and then create my own dataset through datasets.GeneratorBasedBuilder (Hugging Face), so that each time I call the loading script from load_dataset it reads the data from S3.
I cannot share any code snippet, unfortunately.
t
Thanks @Giacomo Matrone! šŸ™‚ I still have a few clarification questions. > so that each time I call the loading script from load_dataset it reads the data from S3 Do you mean that you want to read objects from S3 based on their lakeFS paths (i.e. "lakefs://repo/branch/path")? And, IIUC, you created your own loading script as required for custom datasets. What URLs does it include?
g
Hi Tal, the problem is that I would like to create my own loading script, as required for custom datasets, and I would like to include lakeFS in it directly.
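Conceptually it would be something like this (all names are invented, this is not my real code):
import csv
import datasets
import fsspec

# Invented layout: a repo "my-repo" with branch "main" holding a labels.csv
# that points at the other objects in the dataset folder.
_BASE = "lakefs://my-repo/main/dataset"

class MyLakeFSDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"file_name": datasets.Value("string"),
                 "label": datasets.Value("string")}
            )
        )

    def _split_generators(self, dl_manager):
        # Bypass the standard DownloadManager and hand the lakeFS path
        # straight to the generator; fsspec (via lakefs-spec) does the reading.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"labels_path": f"{_BASE}/labels.csv"},
            )
        ]

    def _generate_examples(self, labels_path):
        # Stream the CSV from lakeFS; each row references another object.
        with fsspec.open(labels_path, "rt") as f:
            for idx, row in enumerate(csv.DictReader(f)):
                yield idx, {"file_name": row["file_name"], "label": row["label"]}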
a
Hi Giacomo, it's the weekend here, so we may take a while to answer. I'll ping @Tal Sofer tomorrow to see if we have an example that you could follow.
šŸ‘ 1
heart lakefs 1
g
Hi @Ariel Shaqed (Scolnicov), no problem. As of now, I only have time on weekends. Sorry for the late reply.
šŸ‘ 1