# help
Sanidhya Singh
Hi everyone! I’m exploring LakeFS for our data versioning use case. I followed the tutorial to create a repo -> upload file -> commit to master. However, I’m unable to actually read the file from the MinIO bucket using Pandas.
import pandas as pd
import os
import s3fs

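# Monkey-patch s3fs so every S3FileSystem created (e.g. by pandas) uses the
# endpoint and credentials taken from the environment.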
class S3FileSystemPatched(s3fs.S3FileSystem):
    def __init__(self, *k, **kw):
        super(S3FileSystemPatched, self).__init__(*k,
                                                  key = os.environ["AWS_ACCESS_KEY_ID"],
                                                  secret = os.environ["AWS_SECRET_ACCESS_KEY"],
                                                  client_kwargs={'endpoint_url': os.environ["AWS_S3_ENDPOINT"]},
                                                  **kw)
        print('S3FileSystem is patched')
s3fs.S3FileSystem = S3FileSystemPatched

data = pd.read_csv("s3://example/master/test.csv")
it throws
FileNotFoundError: example/master/test.csv
Adi Polak
Hi @Sanidhya Singh, interesting case. I wonder, can you read directly from lakeFS?
lakefs://example/master/test.csv
Jonathan Rosenberg
Hi @Sanidhya Singh, Can you share your s3a configurations (minus the secrets)?
Barak Amar
@Sanidhya Singh also check that the object is on master or main, which is the new default.
Sanidhya Singh
Hi @Adi Polak, it says
ValueError: Protocol not known: lakefs
Hi @Jonathan Rosenberg, I’m using just s3; I’ve updated my code to reflect it. The configuration is being passed through s3fs above.
Jonathan Rosenberg
and what’s the endpoint that you pass to it as the endpoint_url?
Sanidhya Singh
Hi @Barak Amar, I’m using master.
@Jonathan Rosenberg the endpoint is the URL to the MinIO API.
Jonathan Rosenberg
it should be the endpoint to your lakeFS server, and the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY should be your lakeFS key and secret
Sanidhya Singh
Ah, I see. I’m running a lakeFS container locally on my machine and using a remote MinIO instance. What would the default lakeFS endpoint be, http://localhost/api/v1?
tried with http://localhost:8000/api/v1, got
FileNotFoundError: The specified bucket does not exist
Jonathan Rosenberg
So what you would want to do is use lakeFS’s S3 gateway, which is an S3-compatible endpoint that S3 clients (like s3fs) can work with. The endpoint should be: http://localhost:8000
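For reference, here is a minimal sketch of what the original pandas call could look like once it targets the lakeFS S3 gateway instead of MinIO directly, using pandas’ storage_options passthrough to s3fs rather than the monkey-patch above (assumptions: pandas 1.2+, the environment variables hold the lakeFS key pair, and the repo/branch/file names are the ones from this thread):

import os
import pandas as pd

# Assumption: AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY hold the lakeFS key pair,
# and the lakeFS S3 gateway is listening locally on port 8000.
data = pd.read_csv(
    "s3://example/master/test.csv",  # s3://<repo>/<branch>/<path> through the gateway
    storage_options={
        "key": os.environ["AWS_ACCESS_KEY_ID"],
        "secret": os.environ["AWS_SECRET_ACCESS_KEY"],
        "client_kwargs": {"endpoint_url": "http://localhost:8000"},
    },
)
print(data.head())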
If you want to take it to the next level, you can try the lakeFS Hadoop FileSystem (instead of s3fs), which uses lakeFS for your metadata only and interacts directly with MinIO for the data itself.
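For context, a rough sketch of what that could look like from PySpark, assuming the io.lakefs:hadoop-lakefs-assembly jar is on Spark’s classpath; the MinIO address and all credentials below are placeholders, not values from this thread:

from pyspark.sql import SparkSession

# lakeFS Hadoop FileSystem: metadata goes through the lakeFS API,
# while the data itself is read/written directly against MinIO via S3A.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
    .config("spark.hadoop.fs.lakefs.endpoint", "http://localhost:8000/api/v1")
    .config("spark.hadoop.fs.lakefs.access.key", "<lakefs-access-key-id>")
    .config("spark.hadoop.fs.lakefs.secret.key", "<lakefs-secret-access-key>")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio-host:9000")
    .config("spark.hadoop.fs.s3a.access.key", "<minio-access-key-id>")
    .config("spark.hadoop.fs.s3a.secret.key", "<minio-secret-access-key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

df = spark.read.option("header", "true").csv("lakefs://example/master/test.csv")
df.show()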
Sanidhya Singh
I haven’t gotten it to work, even after passing in the LakeFS URL, Access Key and Secret. The idea is to read a CSV, versioned through LakeFS and served by MinIO. Please let me know if I’m doing something fundamentally incorrect.
Jonathan Rosenberg
If you can share your configurations (you can DM me if you prefer), it would be very helpful. Also make sure that you provided all 4 necessary configurations specified here