Hi @Pedro Vide Simoes!
Since you're using Boto, I assume that you're using Python. As @Niro says, if you use Boto3 to read data from lakeFS through its S3 gateway, then all data is transferred through the lakeFS instance, and you may need to provision it appropriately.
Is this a problem? Do you have an estimate of how much data you need to read?
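For reference, here's a minimal sketch of that Boto3-through-the-gateway pattern; the endpoint, credentials, repository, branch, and path below are placeholders for your setup:
```python
import boto3

# Boto3 pointed at the lakeFS S3 gateway: every byte read this way is
# streamed through the lakeFS server itself.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # your lakeFS endpoint (placeholder)
    aws_access_key_id="AKIAlakefs...",          # lakeFS access key ID
    aws_secret_access_key="...",                # lakeFS secret access key
)

# In the S3 gateway, the repository acts as the bucket and the key is "<branch>/<path>"
obj = s3.get_object(Bucket="my-repo", Key="main/datasets/events.parquet")
data = obj["Body"].read()
```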
Assuming it is a problem, perhaps we can sidestep it? There are several alternative ways to access lakeFS that fetch only metadata from lakeFS and read the data itself directly from S3:
• Many data science libraries (Polars, Pandas, and more!) can use lakeFS by going through a filesystem layer:
• The amazing engineers at the appliedAI Institute wrote lakefs-spec, a filesystem-spec (fsspec) implementation for lakeFS. To quote them: "Directly access lakeFS objects from popular data science libraries (including Pandas, Polars, DuckDB, Hugging Face Datasets, PyArrow) with minimal code" (there's a small sketch after this list). Or you can go down one level:
• The high-level Python SDK uses the lakeFS API directly, if you also need to perform metadata operations in your code that lakefs-spec cannot. It was written by an amazing engineer here at lakeFS 😺 (also sketched below). If you really need direct access:
• The lakeFS SDK just talks to the lakeFS OpenAPI endpoint. It is auto-generated by an amazing OpenAPI code generator from our OpenAPI spec, so it can do literally anything that lakeFS can (sketched below, too).
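Here's a small lakefs-spec sketch, assuming lakefs-spec and pandas are installed and your lakeFS credentials come from ~/.lakectl.yaml or environment variables; repository, branch, and path names are placeholders:
```python
import pandas as pd
from lakefs_spec import LakeFSFileSystem

# lakeFS host/credentials are discovered from the lakectl config
fs = LakeFSFileSystem()

# fsspec path layout is "<repository>/<ref>/<path-within-repo>"
with fs.open("my-repo/main/datasets/events.parquet", "rb") as f:
    df = pd.read_parquet(f)
```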
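And a sketch with the high-level lakefs Python SDK, again with placeholder names and assuming credentials are already configured:
```python
import lakefs

branch = lakefs.repository("my-repo").branch("main")

# Read object contents; pre_sign=True asks lakeFS for a pre-signed URL so the
# bytes come directly from the underlying object store rather than through lakeFS.
with branch.object("datasets/events.parquet").reader(mode="rb", pre_sign=True) as r:
    data = r.read()

# The same branch object also exposes metadata operations, e.g.:
# branch.commit(message="processed events")
```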
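Finally, a sketch against the auto-generated lakefs_sdk package; the endpoint and credentials below are placeholders for your installation:
```python
import lakefs_sdk
from lakefs_sdk.client import LakeFSClient

configuration = lakefs_sdk.Configuration(
    host="https://lakefs.example.com/api/v1",  # your lakeFS API endpoint (placeholder)
    username="AKIAlakefs...",                  # lakeFS access key ID
    password="...",                            # lakeFS secret access key
)
client = LakeFSClient(configuration)

# Every OpenAPI operation is exposed, e.g. fetching object metadata without reading data:
stats = client.objects_api.stat_object(
    repository="my-repo", ref="main", path="datasets/events.parquet"
)
print(stats.physical_address, stats.size_bytes)
```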
Hopefully one of these options helps; please ask for more details!