# help
Andrij:
I looked at the capabilities of lakeFS, and it is very impressive for data storage, addressing exactly what we need in an MLOps pipeline. However, out of curiosity, and knowing that lakeFS is not specifically built for this purpose, will it still be performant if we store and version ML and LLM models through lakeFS (such as safetensors, pickle files, or binaries, each around 4-5GB)? For example, a modern LLM can reach about 160GB, meaning the repository would consist of around 40 files of ~4GB each.
lakeFS team:
Hi Andrij! Welcome to the lake! This is definitely something that lakeFS can handle. We have many users with the same use case, and lakeFS has configurations that leave the lakeFS server out of the actual data path (such as pre-signed URLs and the lakeFS Hadoop FileSystem), which maintain the same performance as working directly with the object store.
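For concreteness, here is a minimal sketch of that upload flow using the high-level lakeFS Python SDK (`pip install lakefs`). The repository, branch, and file names are hypothetical, and the `pre_sign` flag and exact method signatures may differ across SDK versions, so treat this as an illustration of the pattern rather than a verbatim recipe; credentials are assumed to come from the environment or `~/.lakectl.yaml`:

```python
import shutil

import lakefs  # high-level SDK; assumed configured via env vars or ~/.lakectl.yaml

repo = lakefs.repository("models")  # hypothetical repository name
branch = repo.branch("main")

# Stream one ~4-5GB shard into lakeFS. With pre_sign=True the bytes are
# written directly to the backing object store via a pre-signed URL, so the
# lakeFS server stays out of the data path.
obj = branch.object("llm-160b/model-00001-of-00040.safetensors")
with open("model-00001-of-00040.safetensors", "rb") as src, \
        obj.writer(mode="wb", pre_sign=True) as dst:
    shutil.copyfileobj(src, dst)

# Commit so the shard becomes part of an immutable, addressable version.
branch.commit(message="Add shard 1/40 of llm-160b")
```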
Andrij:
Thank you for your prompt answer! I will take a look at the options you mentioned. It looks very promising indeed. Does it also retain the ability to merge, diff, and delta-diff on those types of files (safetensors, pickle files, binaries)?
lakeFS team:
lakeFS continues to maintain its core versioning capabilities regardless of the configuration; branching, committing, diffing, and merging all work the same no matter which data-path option you use.
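As a sketch of what that looks like in practice, again with the hypothetical `models` repository and the high-level Python SDK (method names such as `merge_into` and `diff` may vary by SDK version, so verify against your installed release). One caveat worth knowing: for binary artifacts like safetensors or pickle files, lakeFS reports diffs at the object level (which files changed between refs), not as a byte-level content delta:

```python
import lakefs

repo = lakefs.repository("models")  # hypothetical repository name
main = repo.branch("main")

# Branch off main to try a fine-tuned variant. Branches are metadata-only,
# so no part of the ~160GB of shards is copied.
exp = repo.branch("exp-finetune").create(source_reference="main")

# ... upload re-trained shards to `exp` and commit them ...

# Diff lists objects added/changed/removed between the two refs.
for change in main.diff(other_ref=exp):
    print(change.type, change.path)

# Merge the experiment back into main; conflicts surface per object path.
exp.merge_into(main)
```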