g
Morning guys, I am working with the lakeFS Python SDK and I have a couple of questions: 1. I have seen that one can use a boto3 client to get data from lakeFS, but I would like to know if I can also use a client instance to pull all the data inside a repo. 2. Do you have any integration with Hugging Face Datasets or PyArrow? 3. It seems that whenever I try to create a tag through the SDK, it gives me an error (provided that I insert the latest commit ID and the version number). Thanks to everyone for any answer.
n
Hi @Giacomo Matrone, glad to see you are using the new Python SDK. Please note that the library you are using is still in beta, so some of the functionality is still being developed. Regarding your questions: 1. The SDK provides a file-like interface for reading objects from lakeFS. You can use Object.reader() either as a context manager or as a normal call to get an object reader and read the contents just as you would with a file on your filesystem. 2. Can you explain what kind of integration you were expecting? Can you give a use case example? 3. Can you please paste the code that reproduces your issue? In the next couple of days we will be releasing a production-ready version with a complete set of functionality. Meanwhile, if you need some missing logic, you can access the entire API from the client object (using client.sdk_client), which gives you the full interface of our current production client. See: https://pydocs-sdk.lakefs.io/ https://docs.lakefs.io/integrations/python.html
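For example, a minimal sketch with the new SDK (repository, branch and object path are placeholders, adjust them to your setup):
import lakefs

# Point at an object on a branch; all names here are illustrative.
obj = lakefs.repository("my-repo").branch("main").object("data/sample.csv")

# Read it as a context manager...
with obj.reader(mode="r") as f:
    print(f.read())

# ...or via a plain call, like a regular file handle.
r = obj.reader(mode="rb")
data = r.read()
r.close()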
jumping lakefs 1
g
Hi Niro, thank you for the answer. Coming to the points: 2. Hugging Face's datasets.save_to_disk allows saving datasets in Arrow format. The workaround I found is to download all the files and then use load_dataset on the folder into which I downloaded them to get the dataset back. However, it would be cool to read the dataset directly from the S3 bucket. Another solution I was thinking about was through spark-submit, which seems more legitimate to me. 3. The code that is giving me issues is:
tag_creation=client.models.tag_creation.TagCreation(id='commit_id',ref='ref_example')
thread=client.tags_api.create_tag("repo-name",tag_creation)
This always gives me a conflict. I can create a tag through the web interface, however.
h
About 2: lakeFS should be file-format agnostic, so if you want to "stream" rather than download files from lakeFS, you can use a library that allows you to do that. For example, fsspec/s3fs can "open" a file in S3 and stream its content.
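A rough sketch of what that streaming could look like through lakeFS's S3 gateway (endpoint, credentials, repo and path are all made up here):
import s3fs

# Point s3fs at your lakeFS installation's S3 gateway with your lakeFS key pair.
fs = s3fs.S3FileSystem(
    key="AKIAIOSFODNN7EXAMPLE",    # lakeFS access key (placeholder)
    secret="example-secret-key",   # lakeFS secret key (placeholder)
    client_kwargs={"endpoint_url": "https://lakefs.example.com"},
)

# The "bucket" is the repo and the key starts with the branch;
# the content is streamed rather than downloaded up front.
with fs.open("my-repo/main/data/part-0000.parquet", "rb") as f:
    first_bytes = f.read(1024)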
šŸ‘ 1
n
Thanks! Regarding 2, would you mind opening a feature request for integration with Hugging Face and PyArrow? It seems like a very useful use case. About 3, the link you provided is to the new SDK, but from the code I see you are using the current one. I have not managed to reproduce your issue with that code. Usually a conflict on tag creation means the tag name already exists in the repo. Are you sure you didn't already create that tag?
šŸ‘ 1
@HT The new SDK provides file semantics for reading and writing objects from lakeFS
šŸ‘ 1
h
@Niro oh ok... so now I have an alternative to fsspec/s3fs then
šŸ‘šŸ½ 1
jumping lakefs 1
g
Hi @Niro, the problem with point 3 is that there was no tag at the time. I added the tag manually in the web client after spending a couple of hours trying to understand what happened.
n
Just to make sure: does the API now work for you, or are you still having trouble creating a tag?
a
@Giacomo Matrone id here is a unique tag name and ref is a branch/commit:
tag_creation=client.models.tag_creation.TagCreation(id='commit_id',ref='ref_example')
Here are a couple of examples:
tagV1 = datetime.datetime.now().strftime("%Y_%m_%d") + f"_{projectBranchV1}"

client.tags.create_tag(
    repository=repo,
    tag_creation=models.TagCreation(
        id=tagV1,
        ref=projectBranchV1))

lakefs.tags.create_tag(repo.id, TagCreation(id="dev-tag-01", ref="dev"))

client.tags.create_tag(
    repository=repo,
    tag_creation=models.TagCreation(
        id='delta_lake_etl_job',
        ref=deltaLakeETLBranch))
šŸ‘ 1
g
Great, but how do I get the delta_lake_etl_job?
a
It is just a string/text. You can use any name you like.
g
Hi, I tried your approach, but the conflict arises with any kind of string.
n
@Giacomo Matrone can you please attach lakeFS logs from the time of the failure?
g
Hi Niro, unfortunately I cannot.
n
Can you paste the exact steps + error message you are seeing?
@Giacomo Matrone I opened https://github.com/treeverse/lakeFS/issues/7103. Can you please add the additional information requested there?
g
@Niro sure, let's follow up in a couple of days, since I will be back at work on Tuesday.
n
Thanks a lot
Can you try one thing for me, please? Try creating a tag with the following exact id:
tag_creation=client.models.tag_creation.TagCreation(id='this_is_my_tag',ref='ref_example')
thread=client.tags_api.create_tag("repo-name",tag_creation)
g
Hi Niro. Thank you for the hint. It still says the code is 403 and there is a conflict, both when working in JupyterLab and as a script.
n
Thanks for the input. Let's continue this discussion on the open issue.
g
@Niro I have also created an implementation of sorts for PyArrow and Hugging Face Datasets. The only point I am missing is how to make the datasets.GeneratorBasedBuilder downloader work with lakeFS instead of the standard Hugging Face DownloadManager. A workaround I found was to create the dataset locally, process it through Spark, and then use Hugging Face Datasets to save the Arrow table. Finally, I use lakeFS to keep track of the different files contained in the dataset folder when saving with save_to_disk in Hugging Face.
t
Hi @Giacomo Matrone! IIUC, you would like to load and save Datasets directly from lakeFS? HF Datasets supports access to cloud storage through fsspec filesystem implementations, and lakeFS has its own fsspec implementation called lakefs-spec! You can use it to smoothly load and save datasets from your lakeFS locations. Will this work for you?
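For reference, a rough sketch of what that could look like (repo, branch and paths are made up; it assumes lakefs-spec is installed and your lakeFS credentials are configured, e.g. in ~/.lakectl.yaml):
from datasets import load_dataset, load_from_disk

# Load a CSV straight from a lakeFS location via the lakefs:// fsspec protocol.
ds = load_dataset("csv", data_files="lakefs://my-repo/main/data/labels.csv")

# Save the Arrow-backed dataset back to lakeFS and reload it later.
ds.save_to_disk("lakefs://my-repo/main/datasets/labels")
reloaded = load_from_disk("lakefs://my-repo/main/datasets/labels")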
g
Hi @Tal Sofer, this is how I have been working up to now. However, when I have a folder of objects and a CSV that labels the objects, it seems I cannot create the dataset using lakeFS.
t
@Giacomo Matrone happy that you are already working with lakefs-spec! sunglasses lakefs Can you please help me understand your use case in more detail? Do you mind sharing a code snippet of what you are trying to do?
g
The use case is the following: I have a dataset made up of several objects and a CSV. I would like to use lakeFS to store those objects (CSV included) and then create my own dataset through datasets.GeneratorBasedBuilder (Hugging Face), so that each time I call the loading script from load_dataset it reads the data from S3.
I cannot share any code snippet, unfortunately.
t
Thanks @Giacomo Matrone! šŸ™‚ I still have a few clarification questions. > so that each time I call the loading script from load_dataset it reads the data from S3 Do you mean that you want to read objects from S3 based on their lakeFS paths (i.e. "lakefs://repo/branch/path")? And, IIUC, you created your own loading script as required for custom datasets. What URLs does it include?
g
Hi Tal, the problem is that I would like to create my own loading script, as required for custom datasets, and I would like to include lakeFS in it directly.
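Conceptually it would be something like this (all names are invented, this is not my real code):
import csv
import datasets
import fsspec

# Invented layout: a repo "my-repo" with branch "main" holding a labels.csv
# that points at the other objects in the dataset folder.
_BASE = "lakefs://my-repo/main/dataset"

class MyLakeFSDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"file_name": datasets.Value("string"),
                 "label": datasets.Value("string")}
            )
        )

    def _split_generators(self, dl_manager):
        # Bypass the standard DownloadManager and hand the lakeFS path
        # straight to the generator; fsspec (via lakefs-spec) does the reading.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"labels_path": f"{_BASE}/labels.csv"},
            )
        ]

    def _generate_examples(self, labels_path):
        # Stream the CSV from lakeFS; each row references another object.
        with fsspec.open(labels_path, "rt") as f:
            for idx, row in enumerate(csv.DictReader(f)):
                yield idx, {"file_name": row["file_name"], "label": row["label"]}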
a
Hi Giacomo, it's the weekend here, so we may take a while to answer. I'll ping @Tal Sofer tomorrow to see if we have an example that you could follow.
šŸ‘ 1
heart lakefs 1
g
Hi @Ariel Shaqed (Scolnicov), no problem. As of now, I only have time on weekends. Sorry for the late reply.
šŸ‘ 1