
einat.orr

02/19/2021, 5:33 PM
Hi Barrett, yes, a client is on our roadmap to be released in a few weeks. Please share your use case; it's important for us to learn about the needs.

Barrett Strausser

02/19/2021, 5:46 PM
I primarily have an ML use case. We have a variety of data, mostly in GCS but with some S3 buckets as well. The primary interface for my company's machine learning engineers is https://www.tensorflow.org/api_docs/python/tf/data/Dataset and Dataset is used primarily through that class's list_files method. So something like
tf.data.Dataset.list_files("s3://data/training/*.tfrecords")
Files are then eagerly fetched and made locally available, and after this there is no leakage of S3/GCS into the training code. I was hoping to abstract that away into, say, a LakeFS Dataset, where a user could simply pass in the name of a dataset, i.e. a branch. So something like ...
tf.data.LakeFSDataset.get_branch("master")
However, we are extremely sensitive to latency and have many, many researchers, which combinatorially suggests the need to run many instances of the S3 Gateway. 1. I want to understand the design choice behind not simply allowing users to resolve those paths themselves, both for my own understanding and... 2. to justify the cost and complexity. We are quite savvy with both K8s and cloud in general, so it isn't that we cannot run the S3 Gateways, but we need to understand the reasoning in order to make a good comparison against other options.
The new architecture proposed in that doc and supported by SST looks amazing. If we were able to bypass the gateways, LakeFS would be a killer app for us
Perhaps until that's ready it would be possible to (mis)use something in the API to resolve paths directly
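(For readers following along, here is a rough sketch of the kind of wrapper described above. LakeFSDataset, get_branch, and resolve_branch_to_uris are hypothetical names; the only real API used is tf.data itself, and the path-resolution step is left abstract, see the OpenAPI sketch further down.)

from typing import List
import tensorflow as tf

def resolve_branch_to_uris(repo: str, branch: str) -> List[str]:
    # Hypothetical helper: ask lakeFS for the underlying S3/GCS URIs of
    # every object on the given branch (see the OpenAPI sketch below).
    raise NotImplementedError

class LakeFSDataset:
    # Hypothetical wrapper so training code only ever sees a branch name.
    @staticmethod
    def get_branch(repo: str, branch: str) -> tf.data.Dataset:
        # Resolve the branch to concrete object-store paths once, up front,
        # then hand plain URIs to tf.data just as list_files() output would be used.
        uris = resolve_branch_to_uris(repo, branch)
        return tf.data.TFRecordDataset(uris)

# dataset = LakeFSDataset.get_branch("my-repo", "master")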

Oz Katz

02/19/2021, 6:26 PM
Hi @Barrett Strausser! Trying to answer your questions: 1. lakeFS exposes 2 APIs: an S3 gateway that is API-compatible with S3, and an OpenAPI server that is fully capable of returning the addresses of underlying objects. The reason we decided to support an S3 gateway is that most data systems can already communicate with S3, so this lets us support whatever systems are already deployed without modification. The OpenAPI server fully supports your use case though, which is great. 2. From the use case you described, it is possible you won't need the S3 gateway at all and can work natively with the OpenAPI server alone. I've written a small snippet that illustrates how to use the OpenAPI server to request the S3 addresses of files under a certain path in a branch: https://gist.github.com/ozkatz/b106da34ac4c8608ccf5c1190ea69a40 Let me know if that makes sense - as you can see, it should be pretty easy to adapt it to what you initially suggested.
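(For context, here is a minimal sketch of what such a snippet might look like, using plain HTTP against the lakeFS OpenAPI server rather than the linked gist's code. The endpoint path, the physical_address field, and the pagination fields are assumptions based on the lakeFS API; the endpoint URL and credentials are placeholders.)

import requests

LAKEFS_ENDPOINT = "http://localhost:8000/api/v1"   # placeholder lakeFS address
AUTH = ("<access_key_id>", "<secret_access_key>")  # placeholder credentials

def list_physical_addresses(repo: str, ref: str, prefix: str = ""):
    # Yield the underlying S3/GCS addresses of objects under `prefix`
    # on branch (or commit) `ref`, following the API's pagination.
    after = ""
    while True:
        resp = requests.get(
            f"{LAKEFS_ENDPOINT}/repositories/{repo}/refs/{ref}/objects/ls",
            params={"prefix": prefix, "after": after},
            auth=AUTH,
        )
        resp.raise_for_status()
        body = resp.json()
        for obj in body["results"]:
            yield obj["physical_address"]  # e.g. s3://bucket/namespace/object
        if not body["pagination"]["has_more"]:
            break
        after = body["pagination"]["next_offset"]

# uris = list(list_physical_addresses("my-repo", "master", "training/"))
# dataset = tf.data.TFRecordDataset(uris)

Such a list of resolved URIs could then be handed straight to tf.data, which is what the hypothetical LakeFSDataset sketch above does.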

Barrett Strausser

02/19/2021, 6:53 PM
Awesome! I'm having a look at that now, and will come back with any issues or hiccups I run into
I appreciate the fast response.

Oz Katz

02/19/2021, 8:04 PM
Happy to help 🙂 let me know how it goes!