Narendra Nath

03/24/2023, 5:02 PM
Hello, I am trying to manage dataset versioning for our training. I'm currently working with DVC and running into issues with the large number of objects (~10-50M) for a CV-related use case. The data can be around 200TB at full size. I want to evaluate lakeFS for this reason and switch from DVC to lakeFS. However, with DVC I am able to keep a cache of the dataset in FSx/EFS for faster operations, for example downloading datasets. Is that possible with lakeFS? If so, can someone point me to the documentation?

Lynn Rozen

03/24/2023, 5:18 PM
Hi @Narendra Nath and welcome! Can you please elaborate on your use case?

Narendra Nath

03/24/2023, 5:24 PM
Hi @Lynn Rozen, we are training a CV deep learning model on a dataset that is expected to grow to 200TB. The dataset will have about 50M objects. We store the dataset in S3, but downloading such a large dataset from S3 takes a long time, so we plan to keep a clone in a cache, something like EFS/FSx, which will be mounted to a directory during training on Kubernetes. Multiple users will kick off training from Kubernetes, so the cache should support different versions for different training runs.

Itai Admi

03/24/2023, 6:18 PM
Hey @Narendra Nath, it sounds very interesting and I can definitely see the value for other lakeFS users. I wonder how you normally go about syncing your data from S3 to efs/fsx? The reason I’m asking is that lakeFS supports the S3 protocol, so you might be able to use that to read/write the cache.
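As a rough illustration of the S3 protocol support mentioned above: when reading through the lakeFS S3 gateway, the repository name takes the place of the bucket, and the object key starts with a branch, tag, or commit ID. Below is a minimal sketch using boto3. The endpoint, repository name, paths, and credentials are hypothetical placeholders, not values from this thread.

```python
def lakefs_s3_location(repo, ref, path):
    """Map a lakeFS object to the (bucket, key) pair used via the S3 gateway.

    repo: lakeFS repository name (acts as the bucket)
    ref:  branch name, tag, or commit ID (first component of the key)
    path: object path inside the repository
    """
    return repo, f"{ref}/{path.lstrip('/')}"


def read_object(endpoint_url, access_key, secret_key, repo, ref, path):
    """Read one object at a given version through the S3 gateway."""
    import boto3  # third-party; imported here so the helper above has no dependencies

    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint_url,  # assumption: URL of your lakeFS installation
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    bucket, key = lakefs_s3_location(repo, ref, path)
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()


# Two training runs could pin different versions just by changing `ref`:
print(lakefs_s3_location("cv-datasets", "main", "images/0001.jpg"))
print(lakefs_s3_location("cv-datasets", "experiment-1", "images/0001.jpg"))
```

Any tool that accepts a custom S3 endpoint could address versions the same way, which is why the choice of caching tool matters later in the thread.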

Narendra Nath

03/24/2023, 6:33 PM
Hi @Itai Admi, DVC does this out of the box, but that process is a bottleneck, which is why I am exploring alternatives. Does lakeFS need to download the data every time one needs to access it, let's say for a distributed training run using Python or the S3 API?

Itai Admi

03/24/2023, 6:40 PM
Yeah. Unlike git, lakeFS is designed to scale to billions of objects and petabytes of storage. Therefore lakeFS doesn’t have the concept of local and upstream. However, training on data stored in lakeFS is very common. It all depends on the use case, but it’s easy to sync data from lakeFS to your local FS and vice versa. It means you won’t have to download/upload the entire dataset each time, just the objects that were modified since the last sync.
Rclone is one way to achieve this.
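To picture what "just the objects that were modified" means, here is a small conceptual sketch. It compares a listing of the source with a listing of the cache and works out what to copy and what to delete. The listings and checksums are made-up illustration data, and a real sync tool makes this comparison itself (typically by size, modification time, or checksum, depending on configuration).

```python
def plan_sync(source, cache):
    """Compare two {path: checksum} listings.

    Returns (to_copy, to_delete): paths that are new or changed in the
    source, and paths that exist only in the cache.
    """
    to_copy = sorted(p for p, checksum in source.items() if cache.get(p) != checksum)
    to_delete = sorted(p for p in cache if p not in source)
    return to_copy, to_delete


# Hypothetical listings: the version to train on vs. what the cache holds.
source = {"images/a.jpg": "c1", "images/b.jpg": "c2-new", "images/d.jpg": "c4"}
cache = {"images/a.jpg": "c1", "images/b.jpg": "c2-old", "images/c.jpg": "c3"}

to_copy, to_delete = plan_sync(source, cache)
print(to_copy)    # ['images/b.jpg', 'images/d.jpg']
print(to_delete)  # ['images/c.jpg']
```

Only two of the three source objects would be transferred here; the unchanged `images/a.jpg` stays in place.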

Narendra Nath

03/24/2023, 6:53 PM
Thank you Itai. So what would my workflow look like from the lakeFS point of view using Rclone? Should I sync my S3 and FSx? What would distributed training (different runs) look like in that case?

einat.orr

03/24/2023, 6:58 PM
Sorry for barging in @Narendra Nath, but just to clarify, the most natural way to use lakeFS would be to run your training application directly over S3, while using the lakeFS Python SDK, which allows you to access different versions of the data. Would that be useful for you, or do you need the data to be local?

Narendra Nath

03/24/2023, 7:00 PM
No worries, thanks for the response Einat! There is no requirement for the data to be local, but having the data in S3 might be too slow for training use cases. We are looking for low latency, in the EBS or FSx range of performance.

einat.orr

03/24/2023, 7:02 PM
Yes, that is common for deep learning training.

Narendra Nath

03/24/2023, 7:06 PM
Yes, exactly. I am dealing with a large dataset with a large number of objects, and deep learning training has latency requirements. A cache/local copy on FSx is what I was looking for, because downloading the dataset every time or running directly on top of S3 is not an option.

Itai Admi

03/25/2023, 11:37 AM
Hey @Narendra Nath, I looked at this some more and it seems like FSx doesn’t support S3-compatible sources, like MinIO or lakeFS. If you want to expose your datasets stored in lakeFS to your consumers through FSx, you would need to export the needed version of the data stored in lakeFS to the S3 bucket/path that is configured to integrate with FSx. There are several ways to achieve that with lakeFS; Rclone and DistCp are two options. If there is some other tool that can be used instead of FSx/EFS and has integration with S3-compatible sources, that would be easier.
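A minimal sketch of the Rclone export route described above, assuming two Rclone remotes are already configured: one named `lakefs` that points at the lakeFS S3 endpoint and one named `s3` for the bucket linked to FSx. The remote names, repository, bucket, and prefixes are hypothetical. The export path includes the ref so that several versions can sit side by side for different training runs.

```python
import shlex


def _join(*parts):
    """Join path parts with '/', skipping empty ones."""
    return "/".join(p.strip("/") for p in parts if p.strip("/"))


def rclone_export_command(repo, ref, prefix, dest_bucket, dest_prefix,
                          lakefs_remote="lakefs", s3_remote="s3", dry_run=False):
    """Build an `rclone sync` command that exports one lakeFS version to S3.

    Source:      <lakefs_remote>:<repo>/<ref>/<prefix>
    Destination: <s3_remote>:<dest_bucket>/<dest_prefix>/<ref>/<prefix>
    """
    source = f"{lakefs_remote}:{_join(repo, ref, prefix)}"
    destination = f"{s3_remote}:{_join(dest_bucket, dest_prefix, ref, prefix)}"
    cmd = ["rclone", "sync", source, destination]
    if dry_run:
        cmd.append("--dry-run")  # report what would change without copying
    return cmd


cmd = rclone_export_command("cv-datasets", "v1.0", "images",
                            "fsx-import-bucket", "exports", dry_run=True)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Using a tag or commit ID as `ref`, rather than a branch name, keeps an exported version fixed while the branch moves on.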

Narendra Nath

03/25/2023, 6:42 PM
got it. let me look into it more