# help
p
Hey everyone! Question regarding throughput.

First question: if I have lakeFS deployed in AWS (let's say on an EC2 instance) and I make a GET request with boto3 for a given file, the lakeFS instance will then make a GET request to the S3 bucket "underneath" it, correct? In other words, once lakeFS is deployed, a single boto3 action is actually translated into two actions: boto3 <-> lakeFS EC2 and lakeFS EC2 <-> S3 bucket, whereas without lakeFS it would simply be one action: boto3 <-> S3 bucket.

Second question: if all requests go through the EC2 instance first, then its hardware specs must be scaled according to the number of requests we expect in a given timeframe. Of course this varies from user to user, but is there a reference to the advised minimum hardware in the docs?

Thanks in advance for your help 🙂
n
Hi @Pedro Vide Simoes, welcome to the lake! The answer to your question varies according to the client as well as the type of request. If you use boto3 with lakeFS, for example, a HEAD request will be directed to lakeFS, but lakeFS will not need to query S3, since the object metadata is stored in lakeFS (so in this instance you save a call to S3). On the other hand, if you want to read an object's data, then the route you described will happen.
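For concreteness, here is a minimal sketch of that two-hop route using boto3 against the lakeFS S3 gateway; the endpoint URL, repository name, branch, path, and credentials are placeholders for your own deployment:

```python
# Minimal sketch: boto3 pointed at the lakeFS S3 gateway instead of S3 itself.
# Endpoint, repository ("my-repo"), branch ("main"), path and credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",     # the lakeFS server, not S3
    aws_access_key_id="<lakeFS access key id>",
    aws_secret_access_key="<lakeFS secret access key>",
)

# HEAD: answered from lakeFS's own metadata, no call to the underlying S3 bucket.
s3.head_object(Bucket="my-repo", Key="main/datasets/file.parquet")

# GET: the object data is streamed through the lakeFS instance, which in turn
# issues a request to the S3 bucket underneath it (the two-hop route described above).
body = s3.get_object(Bucket="my-repo", Key="main/datasets/file.parquet")["Body"].read()
```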
a
Hi @Pedro Vide Simoes ! Since you're using Boto, I assume that you're using Python. Of course as @Niro says, if you use Boto3 to read data from lakeFS through its S3 gateway then all data is transferred through the lakeFS instance - and you may need to provision appropriately. Is this a problem? Do you have an estimate of how much data you need to read? Assuming there is a problem, perhaps we can sidestep it? There are numerous alternative ways to access lakeFS, that can access just metadata from lakeFS, and read data directly from S3: • Many data science libraries (Polars, Pandas, and more!) can use lakeFS by going through: • The amazing engineers at the appliedAI Institute wrote lakefs-spec. It is a filesystem-spec (fsspec) implementation for lakeFS. To quote them: "Directly access lakeFS objects from popular data science libraries (including Pandas, Polars, DuckDB, Hugging Face Datasets, PyArrow) with minimal code". Or you can go down one level: • The high-level Python SDK uses the lakeFS API directly, if you also need to perform metadata operations in your code that lakeFS-spec cannot. It was written by an amazing engineer here at lakeFS 😺 . If you really need direct access: • The lakeFS SDK just talks to the lakeFS OpenAPI endpoint. It is auto-generated by an amazing OpenAPI code generator from our OpenAPI spec. So it can do literally anything that lakeFS can. Hopefully one of these options helps; please ask for more details!
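As a taste of the first option, here is a minimal sketch assuming lakefs-spec is installed (`pip install lakefs-spec`); the repository, branch, and path below are placeholders, and credentials are taken from your lakectl config or environment:

```python
# Minimal sketch using lakefs-spec, an fsspec implementation for lakeFS.
# Repository ("my-repo"), branch ("main") and path are placeholders.
import pandas as pd
from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem()  # talks to the lakeFS API for metadata

# Listing objects on a branch is metadata only -- served by lakeFS, not S3.
print(fs.ls("my-repo/main/datasets/"))

# fsspec-aware libraries such as pandas resolve lakefs:// URLs through the same
# filesystem, so the data does not have to be funnelled through the S3-gateway
# route described earlier in the thread.
df = pd.read_parquet("lakefs://my-repo/main/datasets/file.parquet")
print(df.head())
```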
p
Thank you all for your answers, and apologies for the late reply. The amount of data is quite small, so I should be good to go without workarounds 🙂. Thanks once again!