Ants Young

01/17/2023, 9:05 AM
Hi, is there a roadmap to support a POSIX interface for mounting a ref of a dataset on the local host? It would be helpful to read a specific version of a dataset rather than download it.
Can s3fs work for this?

Adi Polak

01/17/2023, 9:55 AM
Hi Ants, that's interesting! Could you walk me through the use case and what you would like to achieve?

Ants Young

01/17/2023, 12:00 PM
I'm developing an MLOps platform based on Docker/Kubernetes that offers machine learning engineers a cloud environment to train models. We mount dataset files into the container for the user rather than copy them in, because copying takes a long time to finish initialization before training, especially for large datasets, and downloading files consumes a lot of storage, which is not acceptable when the dataset is very large. For these and other reasons, we mount instead of copy. The platform is meant to let MLEs train models in the container just as they would on their own PC. So when users run
ls /path/to/data
in the container, they see the files just as they would on their local host; that's what we want to implement. Requiring our users to use a specific data version management tool or Python SDK to download files from the repo would be intrusive to their code, so it's better to leave that decision to the user. Mounting files into the container is the best way, but having the platform download first and then mount is not an elegant solution. Performance is acceptable in most cases.
There are some challenges in mounting a version (branch, tag, commit) of a dataset, but I found some implementations, such as
dagshub mount
pachctl mount
, they both use FUSE. s3fs-fuse and JuiceFS also use FUSE, so FUSE may be one of the best ways to achieve this. By the way, Pachyderm supports mounting a dataset version in Jupyter; please see Pachyderm's documentation. JuiceFS has a very friendly POSIX interface for reading files based on FUSE. It's a good example, but it does not support data versioning.

Amit Kesarwani

01/17/2023, 6:52 PM
@Ants Young I reviewed Pachyderm’s “Mount a Repo Locally” document that you shared. The first paragraph in the document says:
This command uses the Filesystem in Userspace (FUSE) user interface to export a Pachyderm File System (PFS) to a Unix computer system. This functionality is useful when you want to pull data locally to experiment.
So I think the data will be copied locally in this case, which may not meet your requirements.

Ants Young

01/18/2023, 1:20 AM
@Amit Kesarwani Thank you, Amit. In that case, I think users could download just the files they want rather than the whole dataset, but I haven't tried it. Either way, the POSIX interface that s3fs and JuiceFS offer is what my use case needs.

Or Tzabary

01/19/2023, 7:22 AM
@Ants Young I haven’t personally tried that, but lakeFS supports the S3 protocol. If you point the application you use to mount S3 buckets (e.g. s3fuse) at the lakeFS S3 gateway, it should work.

Amit Kesarwani

01/20/2023, 5:32 AM
@Ants Young I tested FUSE to mount a lakeFS repository/branch on my laptop and it worked fine. Please see the instructions below, and let me know if you have any questions. Install s3fs-fuse, then run:
echo <lakefs access key>:<lakefs secret key> > ${HOME}/.passwd-s3fs

chmod 600 ${HOME}/.passwd-s3fs

sudo mkdir /path/to/mountpoint

# If you want to mount a lakeFS repository
sudo s3fs <lakefs repo name> /path/to/mountpoint -o passwd_file=${HOME}/.passwd-s3fs -o url=<lakefs endpoint url> -o use_path_request_style

sudo ls -lh /path/to/mountpoint/
sudo cat /path/to/mountpoint/<lakefs branch name>/<file name>
sudo touch /path/to/mountpoint/<lakefs branch name>/test1.txt

# If you want to mount a particular branch of a lakeFS repository
sudo s3fs <lakefs repo name>:/<lakefs branch name> /path/to/mountpoint -o passwd_file=${HOME}/.passwd-s3fs -o url=<lakefs endpoint url> -o use_path_request_style

sudo cat /path/to/mountpoint/<file name>
sudo touch /path/to/mountpoint/test1.txt
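
If it helps, the two variants above could be wrapped in a small helper. This is only a sketch, not something I have run against lakeFS: the function name build_s3fs_cmd and the example values are made up, and it prints the command rather than running it. Adding -o ro (a read-only mount) is an assumption on my part, and so is passing a tag or commit ID as the ref.

```shell
#!/usr/bin/env bash
# Sketch only: prints the s3fs command instead of running it, so it can be
# reviewed first. Function name and example values are hypothetical.

# build_s3fs_cmd REPO REF MOUNTPOINT ENDPOINT
# REF may be empty (mount the whole repository) or a branch name.
# Assumption: a tag or commit ID is also accepted as REF for read-only access.
build_s3fs_cmd() {
  local repo="$1" ref="$2" mountpoint="$3" endpoint="$4"
  local bucket="$repo"
  if [ -n "$ref" ]; then
    # s3fs takes "bucket:/path" to mount a sub-path of a bucket
    bucket="${repo}:/${ref}"
  fi
  # -o ro is an assumption: mount read-only, since the goal is reading a fixed version
  printf 's3fs %s %s -o passwd_file=%s -o url=%s -o use_path_request_style -o ro\n' \
    "$bucket" "$mountpoint" "${HOME}/.passwd-s3fs" "$endpoint"
}

# Example with placeholder values; review the printed command, then run it with sudo
build_s3fs_cmd my-repo main /mnt/data https://lakefs.example.com
```

Dropping -o ro from the printed command gives the same writable mount as in the steps above.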

Ants Young

01/21/2023, 7:24 AM
Thank you so much, @Or Tzabary and @Amit Kesarwani. It is really helpful!!!