
王麒詳

08/11/2022, 1:32 AM
Hi there, I am a novice studying MLOps and trying to run an MLOps environment on my workstation. So far, for data version control, lakeFS works well on my server. I have integrated it with MinIO (object storage) and Label Studio (for data annotation). Something like:
data coming ->|
MinIO <--> lakeFS <--> Label Studio <-- data annotation
              |<--> user access data
However, I have several questions about data access.
1. What are the recommended ways to access data in lakeFS? Most of the data stored on my lakeFS server is image files. I'm currently using boto3 or the lakeFS Python API (client.object.get_object) to get the bytes and then convert them into image files. I'm wondering whether there is a more efficient way to access data during the development stage.
2. My other question is about cloning data from a repo (or a branch) to a local machine. I tried some tutorial examples from the official docs and ran them successfully. I'd like to know whether cloning a large data repo to a local machine is common in practice, because when the data is larger than a hundred GB, cloning all of it to a local machine before using it does not seem like the best choice.
Thanks for your kind help and your patience in reading my questions :)
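(For reference, a minimal sketch of that access pattern through the S3 gateway with boto3, reading one object straight into a PIL image; the endpoint, repository name, branch, credentials, and object path below are placeholders, not taken from the setup above.)
```python
# Minimal sketch: read one image object through the lakeFS S3 gateway with boto3
# and open it as a PIL image. Endpoint, repo, branch, keys, and path are placeholders.
import io

import boto3
from PIL import Image

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",      # lakeFS S3 gateway (placeholder)
    aws_access_key_id="<lakefs-access-key>",
    aws_secret_access_key="<lakefs-secret-key>",
)

# Through the S3 gateway, the "bucket" is the repository and the key is "<ref>/<path>".
resp = s3.get_object(Bucket="images-repo", Key="main/data/img_0001.png")
img = Image.open(io.BytesIO(resp["Body"].read()))
print(img.size)
```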

Eden Ohana

08/11/2022, 2:03 AM
Hi David, you can access data through the lakeFS API, the clients, or the S3 gateway, which accesses the data directly and seems to be what you're using. Cloning data to a local repository is not a common practice. Let me know if you have additional questions.

王麒詳

08/11/2022, 3:40 AM
Thanks for the fast reply 😀 I'd like to know more details about "Cloning data to a local repository is not a common practice." I've been working with several friends to develop deep learning models. We noticed that storing data separately on different personal machines is not a good first choice, so we use lakeFS to manage all of our data. However, we still download all of the needed data to the local machine before training a model. (I think there are some tricky parts in this process 😅) I have a rough idea, but I'm not sure it works: frequently request data from the S3 gateway during the training process. However, high-frequency requests may overload the service 🤔. Would you please share some directions on how to use the data in lakeFS during model training without cloning or copying all of it to a local machine?
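(A rough sketch of that "fetch on demand" idea, assuming a PyTorch Dataset that pulls each image from the lakeFS S3 gateway lazily instead of copying the whole repo locally; the repository, branch, endpoint, and key list are placeholders, and a real setup would likely add caching and retries.)
```python
# Rough sketch: a PyTorch Dataset that fetches each image from the lakeFS S3 gateway
# on demand instead of cloning the data locally. Repository, branch, endpoint, and
# keys are placeholders; credentials are assumed to come from the environment
# (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY).
import io

import boto3
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms


class LakeFSImageDataset(Dataset):
    def __init__(self, repo, ref, keys, endpoint_url, transform=None):
        self.repo, self.ref, self.keys = repo, ref, keys
        self.endpoint_url = endpoint_url
        self.transform = transform
        self._s3 = None  # created lazily so each DataLoader worker builds its own client

    def _client(self):
        if self._s3 is None:
            self._s3 = boto3.client("s3", endpoint_url=self.endpoint_url)
        return self._s3

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        # One GET per item: bucket = repository, key = "<ref>/<path>".
        obj = self._client().get_object(Bucket=self.repo,
                                        Key=f"{self.ref}/{self.keys[idx]}")
        img = Image.open(io.BytesIO(obj["Body"].read())).convert("RGB")
        return self.transform(img) if self.transform else img


# keys would normally come from listing the branch (e.g. list_objects_v2 on "main/").
keys = ["data/img_0001.png", "data/img_0002.png"]
ds = LakeFSImageDataset("images-repo", "main", keys,
                        endpoint_url="http://localhost:8000",
                        transform=transforms.Compose([transforms.Resize((224, 224)),
                                                      transforms.ToTensor()]))
loader = DataLoader(ds, batch_size=2, num_workers=2)
```
Whether per-batch fetching like this is fast enough depends on image size, batch size, and network latency, which is exactly the concern discussed below.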

Eden Ohana

08/11/2022, 3:51 AM
Let me check and get back to you 😀

王麒詳

08/11/2022, 3:52 AM
Thank you so much!! 😭

einat.orr

08/11/2022, 6:25 AM
@王麒詳 Out of curiosity, what are you using to build/ train the models?

王麒詳

08/11/2022, 6:27 AM
Hi @einat.orr, so far I'm using PyTorch to build/train models, either in a container or on my personal desktop 🙂

einat.orr

08/11/2022, 6:30 AM
Cool 😎. And you are concerned that fetching the data from S3 on every PyTorch run will hurt performance dramatically?

王麒詳

08/11/2022, 6:40 AM
Yes, I'm concerned that it will hurt the service (lakeFS or MinIO) performance when I use a data generator to fetch data from S3 on every training batch 🥲

einat.orr

08/11/2022, 6:50 AM
I assume you have the same concern for the case of using just MinIO, correct?
Because with the right provisioning, lakeFS should not create a significant overhead.
(The same is true for MinIO 🤔)

王麒詳

08/11/2022, 7:16 AM
> I assume you have the same concern for the case of using just MinIO, correct?
You're right! Currently, I only fetch data through the S3 gateway from lakeFS.
> Because with the right provisioning, lakeFS should not create a significant overhead.
> (The same is true for MinIO 🤔)
Wow, really 🤠!? I will try some toy examples in the next few days to test the scenario I mentioned above. I'd like to share the test results with all of you 😀
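(One way such a toy test could look, assuming boto3 and placeholder endpoints for the lakeFS S3 gateway and MinIO; it simply times repeated GETs of the same keys against each endpoint.)
```python
# Tiny sketch of the toy test mentioned above: time repeated GETs of the same keys
# against two S3 endpoints. Endpoints, bucket/repo names, and keys are placeholders;
# credentials are assumed to come from the environment.
import time

import boto3


def mean_get_seconds(endpoint_url, bucket, keys, rounds=5):
    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    start = time.perf_counter()
    for _ in range(rounds):
        for key in keys:
            s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return (time.perf_counter() - start) / (rounds * len(keys))


keys = ["data/img_0001.png", "data/img_0002.png"]
print("lakeFS gateway:", mean_get_seconds("http://localhost:8000", "images-repo",
                                          ["main/" + k for k in keys]))
print("MinIO direct:  ", mean_get_seconds("http://localhost:9000", "raw-images", keys))
```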

einat.orr

08/11/2022, 7:18 AM
That would be great! There are some benchmarks in the documentation, but on a different architecture...
Network latency may be a factor here... just saying ☺️

王麒詳

08/11/2022, 7:28 AM
Thank you for the info 😃 (I agree with you about the network latency XD)