# help
t
Hi, please, is it possible to read data like a CSV or load a JSON file directly from a lakeFS repo? If yes, is there any documentation around it?
g
Hi @Temilola Onaneye, lakeFS is data-agnostic. Reading and writing CSV or JSON is done the same as for any other file.
I'd also add that lakeFS is S3-compatible, so you can read and write CSV and JSON files the same way you would using your object store.
If you'd like to share some information about your use case and how you've done it without lakeFS, I can help you with running it on lakeFS.
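(Not part of the original thread - a minimal sketch of how the S3-compatible layout looks from Python, assuming boto3, a lakeFS endpoint at http://localhost:8000, and a hypothetical repo example-repo with a main branch; credentials are placeholders.)
Copy code
import boto3

# Point an ordinary S3 client at the lakeFS S3 gateway instead of AWS.
s3 = boto3.client(
    's3',
    endpoint_url='http://localhost:8000',   # lakeFS endpoint (assumed)
    aws_access_key_id='<lakeFS access key>',
    aws_secret_access_key='<lakeFS secret key>',
)

# In the S3 gateway the repository acts as the "bucket" and the branch is the
# first path component of the key: s3://<repo>/<branch>/<path>.
resp = s3.list_objects_v2(Bucket='example-repo', Prefix='main/')
for obj in resp.get('Contents', []):
    print(obj['Key'])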
t
Oh, I didn't complete my question; I meant with Python. Is it possible to do this directly with Python?
Can I do this directly with Python, like I would with any typical filesystem?
a
@Temilola Onaneye If the above example is helpful, then you can test this example/sample: https://github.com/treeverse/lakeFS-samples/tree/main/03-multiple-samples
This is also helpful: https://github.com/treeverse/lakeFS/tree/master/clients/python It lists all the Python APIs, including get_object to read a file from a lakeFS repo.
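(Not from the thread - a rough sketch of what reading a small JSON file into memory with get_object could look like, assuming the lakefs_client package, a local lakeFS at http://localhost:8000, and hypothetical repo/branch/path names.)
Copy code
import json

import lakefs_client
from lakefs_client.client import LakeFSClient

# Placeholder credentials and endpoint - replace with your lakeFS details.
configuration = lakefs_client.Configuration()
configuration.username = '<lakeFS access key>'
configuration.password = '<lakeFS secret key>'
configuration.host = 'http://localhost:8000'

client = LakeFSClient(configuration)

# get_object returns a file-like object with the file's contents.
obj = client.objects.get_object(repository='example-repo', ref='main', path='data/settings.json')
data = json.loads(obj.read())
print(data)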
t
So apparently, what I want to achieve is using pandas to read a CSV directly from lakeFS as a pandas DataFrame. The documentation speaks to Spark.
g
Pandas can read/write remote files from S3. You could check out this URL for more information. Because lakeFS is S3-compatible, you will be able to read from lakeFS the same way you would read from S3. There are two slight changes you should consider:
1. You will need to set the S3 endpoint to be your lakeFS installation's.
2. You should take the repo and branch into consideration when accessing files.
Assuming your lakeFS is running on localhost:8000 and you want to access a file that exists on repo your-repo, branch your-branch, file path path/to/file.csv, it should look something like this:
Copy code
import pandas as pd

df = pd.read_csv(
    "s3://your-repo/your-branch/path/to/file.csv",
    storage_options={
        "key": AWS_ACCESS_KEY_ID,          # lakeFS access key
        "secret": AWS_SECRET_ACCESS_KEY,   # lakeFS secret key
        "token": AWS_SESSION_TOKEN,        # optional, usually not needed for lakeFS
        "client_kwargs": {"endpoint_url": "http://localhost:8000"},
    },
)
t
Thanks @Guy Hardonag
But unfortunately, my object storage is Azure Blob.
I was thinking of a workaround for the workflow. Since the JSON files I'm working with are small, I am considering downloading them at runtime to the working directory and reading them from there.
I don't know if direct download of files from a lakeFS repo is possible with Python. I only found file upload in the documentation shared.
b
The above code uses lakeFS's S3-compatible API - the underlying storage used by lakeFS can still be Azure.
Copy code
import pandas as pd

df = pd.read_csv(
    "s3://repo/main/data.csv",
    storage_options={
        "key": "AKIAIOSFDNN7EXAMPLEQ",
        "secret": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "client_kwargs": {
            "endpoint_url": "http://localhost:8000",
        },
    },
)

print(df.to_string())
Same as @Guy Hardonag mentioned - the call here goes to lakeFS; the example is running against a local instance.
t
Okay thanks, I will try this out
Thanks, it works.
Hi @Barak Amar, I wanted to find out if it is possible to download a data file from a lakeFS repo directly with Python?
b
Hi, @Temilola Onaneye, the above code does that. You can use Python's boto package or s3fs (like the code above) - both use lakeFS's S3-compatible API. lakeFS also provides an API + Python SDK, the lakefs_client package - https://pydocs.lakefs.io/.
t
Okay thanks, I'll check it out.
b
Both talk directly with lakeFS, which reads the data from the underlying storage.
Let me know if you need help in using any of the above.
t
I don't want to read it, I just want to download it directly to a local filesystem.
b
read == download for me 🙂
^ I think here you will find what you are looking for.
t
Okay thanks
b
Copy code
import boto3

session = boto3.session.Session()

s3_client = session.client(
    service_name='s3',
    aws_access_key_id='AKIAIOSFDNN7EXAMPLEQ',
    aws_secret_access_key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    endpoint_url='http://localhost:8000',
)

# bucket = repository, key = branch/path
s3_client.download_file('repo', 'main/userdata1.parquet', 'userdata1.parquet')
Here is an example of how to download the file using boto.
You can find an s3fs example similar to the one we wrote yesterday.
If you need one that uses the lakeFS API - let me know.
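(Not from the thread - for completeness, a rough s3fs download sketch in the same spirit, assuming the s3fs package, a local lakeFS at http://localhost:8000, and the same placeholder repo/branch/file names as above.)
Copy code
import s3fs

# s3fs also goes through the lakeFS S3-compatible endpoint.
fs = s3fs.S3FileSystem(
    key='AKIAIOSFDNN7EXAMPLEQ',
    secret='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    client_kwargs={'endpoint_url': 'http://localhost:8000'},
)

# Copy repo/branch/path from lakeFS down to a local file.
fs.get('repo/main/userdata1.parquet', 'userdata1.parquet')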
t
Yeah thanks Barak, it would be great to have the one that uses the lakeFS API.
Yeah, I need one that uses the lakeFS API.
b
Copy code
import lakefs_client
from lakefs_client.client import LakeFSClient

configuration = lakefs_client.Configuration()
configuration.username = 'AKIAIOSFDNN7EXAMPLEQ'
configuration.password = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
configuration.host = 'http://localhost:8000'

client = LakeFSClient(configuration)

# get_object returns a file-like object; write its contents to a local file
with client.objects.get_object(repository='ugc', ref='main', path='userdata1.parquet') as f, \
        open('userdata1.parquet', 'wb') as o:
    o.write(f.read())
t
Okay thanks