# help
m
Hi everyone! I am new to LakeFS and I'm trying to wrap my head around this tool. For context, I'm working in the field of deep learning with images. I think the use and advantages of LakeFS for ETL pipelines with ingestion branches for various data levels (e.g., the Medallion hierarchy) are quite straightforward. However, I'm not sure I grasp what the workflow would look like for machine learning projects, even after reading your docs/blog. In a use case where:
• Several domain-specific models and customer-specific models must be trained.
• Each model uses its own subset of data from one or several data tiers. The data subsets may overlap across models, i.e., some data is used in several different models.
• Each data subset might (or might not) need additional processing specific to the model.
I ask myself the following questions:
• What would a workflow look like for such a use case (with lots of different models to track)?
◦ Would you have a LakeFS repo per model?
◦ Would you have a LakeFS branch per model?
• Would metadata in commits need to be used heavily to version everything together? It looks like this could become quite complex very rapidly.
• Would the right place to track most of this information be an experiment tracking tool (such as MLflow, WandB, etc.), where information about data commits in LakeFS would be added?
Additional general question:
• What's the advantage of copying data into LakeFS instead of just managing versioning with LakeFS? Basically, upload vs. import.
◦ I think I read in a doc/blog post that copying data should be favored if data will be frequently modified (updated or deleted), while import might be better when data is only frequently appended.
Thanks in advance for your help!
h
We do deep learning for images. lakeFS provides a generic versioning system that lets you snapshot your whole storage quickly and easily. It's kind of up to you to manage it the way you want, a bit like how git versions your code. How many repos, how many branches, how to link a release to a commit on which repo, is up to the user...
For us, we manage all datasets via code: we have all the data in lakeFS. For a specific model version, we pull a specific subset of data of interest (defined by code) from lakeFS, at a specific commit. The lakeFS commit is written in the code, so it is tracked by git. In short: model v4 => git commit xyz => lakeFS commit abc.
The main advantage of lakeFS vs. your own versioning (folder v1, folder v2, etc.) is that you can do diffs and you don't duplicate data between v1 and v2. You can also create a branch of new data where you can QC before merging to main, while other people can still work on main.
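As a minimal sketch of what "the lakeFS commit is written in the code" can look like in practice, here's one way to pin a commit and read data at that exact snapshot through lakeFS's S3-compatible gateway with boto3. The endpoint, credentials, repo name, commit ID, and object path below are placeholders, and the details depend on your deployment:
```python
# Pin the data version in code so git tracks it alongside the model code.
# All names below (endpoint, credentials, repo, commit, key) are placeholders.
import boto3

LAKEFS_ENDPOINT = "https://lakefs.example.com"  # your lakeFS server
LAKEFS_REPO = "images"                          # repository name
DATA_COMMIT = "abc1234567890"                   # lakeFS commit ID for this model version

# lakeFS exposes an S3-compatible gateway: bucket = repo, key = <ref>/<path>.
s3 = boto3.client(
    "s3",
    endpoint_url=LAKEFS_ENDPOINT,
    aws_access_key_id="AKIA...",                # lakeFS access key
    aws_secret_access_key="...",                # lakeFS secret key
)

# Read an object exactly as it existed at the pinned commit.
obj = s3.get_object(Bucket=LAKEFS_REPO, Key=f"{DATA_COMMIT}/train/cats/0001.png")
image_bytes = obj["Body"].read()
```
Because `DATA_COMMIT` lives in the code, checking out an old git commit also tells you exactly which data snapshot that model version was trained on.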
Copy vs. import: I think import is only interesting when you have a really large amount of data (PBs) that you don't want to duplicate/copy. In the import scenario, from my understanding, you end up having two storages to keep alive: the original one and the one used as the backend for your lakeFS. Plus, you need to make sure that no one ever touches the original storage once you import it into lakeFS.
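For the "copy" side of that trade-off, here's a hedged sketch of uploading an object into the storage lakeFS manages, again through the S3 gateway (all names are placeholders). Import, by contrast, only registers pointers to objects that stay in the original bucket, which is why that bucket then has to remain untouched:
```python
# Upload = copy the bytes into the storage namespace that lakeFS manages.
# All names (endpoint, credentials, repo, paths) are hypothetical placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="AKIA...",
    aws_secret_access_key="...",
)

with open("local/cats/0001.png", "rb") as f:
    s3.put_object(
        Bucket="images",                # repository name
        Key="main/raw/cats/0001.png",   # <branch>/<path> being written
        Body=f,
    )
# The object is now staged on the `main` branch until it is committed
# (via the UI, lakectl, or the commits API). An import would instead leave
# the bytes in the original bucket and only record metadata in lakeFS.
```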
m
Hi @HT! Thank you for your prompt response! It clarifies things a bit 🙂. From your experience working with LakeFS, would you have any advice about how to organize repos and branches in that regard? I suppose it's highly use-case-specific, as you suggested, but I would be curious to know if you were able to draw some basic principles that work well in general.
h
We don't do ETL-type processing (or at least not yet), so we only have the main branch as the source of truth. All production models use data from the main branch. We have random branches mainly for testing code, so they never really get merged... The only time we would expect merges to happen is when we scale up and multiple people QC and merge new data to main.
In the early stage, we thought about one repo per "customer", but then found out that with 1000s of repos... lakeFS doesn't really like it... and then you have a nightmare of: at a given dataset version, which commit of which repo is required? So we ended up splitting repos between data that are very unlikely to end up in the same dataset. You can also do a mono repo, but I don't really like monolithic setups... I prefer to decouple stuff when I can.
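To illustrate that "QC on a branch, then merge to main" flow, here is a rough sketch against the lakeFS REST API (endpoint paths as I recall them from the OpenAPI spec, so verify against your server version; the URL, credentials, repo and branch names are placeholders, and lakectl or the Python SDK can do the same thing):
```python
# Sketch of a QC workflow: branch off main, add data, commit, merge back.
# Endpoints per my reading of the lakeFS OpenAPI spec; names are placeholders.
import requests

BASE = "https://lakefs.example.com/api/v1"
AUTH = ("AKIA...", "...")          # lakeFS access key / secret key
REPO = "images"

# 1. Create a QC branch from main.
requests.post(
    f"{BASE}/repositories/{REPO}/branches",
    auth=AUTH,
    json={"name": "qc-new-batch", "source": "main"},
).raise_for_status()

# 2. ...upload the new data to lakefs://images/qc-new-batch/... here...

# 3. Commit the staged changes on the QC branch.
requests.post(
    f"{BASE}/repositories/{REPO}/branches/qc-new-batch/commits",
    auth=AUTH,
    json={"message": "New batch after QC"},
).raise_for_status()

# 4. Merge the QC branch into main once it passes review.
requests.post(
    f"{BASE}/repositories/{REPO}/refs/qc-new-batch/merge/main",
    auth=AUTH,
    json={"message": "Merge QC'd batch into main"},
).raise_for_status()
```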
My background is software engineering, so I am used to the git mental model, and lakeFS is an easy transition. In the software world you have: code == compile ==> binary. You version the code, then you can reproduce the binary. In deep learning: code + data == train (the compile equivalent) ==> model (the binary equivalent).
git cannot version data properly. Yes, you have LFS, but it is still clunky... and it was never really designed for deep-learning-type data. lakeFS solves the "git for data" problem. Plus, you now have decoupled versioning of code and versioning of data, which are relatively independent both in concept and in format, while with LFS everything goes into the same git repo...
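Tying this back to the experiment-tracker question above: one hedged way to keep the code-version/data-version pair queryable is to log both IDs as tags on the training run, e.g. with MLflow (the experiment name and commit IDs below are placeholders, and `DATA_COMMIT` is the same pinned constant as in the earlier snippet):
```python
# Record the git commit (code version) and the lakeFS commit (data version)
# as run tags, so every trained model maps back to both.
import subprocess
import mlflow

DATA_COMMIT = "abc1234567890"  # lakeFS commit ID the training data was read from
git_commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()

mlflow.set_experiment("domain-model-v4")  # hypothetical experiment name
with mlflow.start_run():
    mlflow.set_tag("git_commit", git_commit)
    mlflow.set_tag("lakefs_commit", DATA_COMMIT)
    # ... training and mlflow.log_metric(...) calls go here ...
```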
m
Thank you for all this information! Having lots of repos can indeed add quite a bit of management complexity. When you say that LakeFS doesn't really like it, do you mean that it starts to get slow, both UI-wise and API-query-wise? I am also more from a software engineering background, and I like your mental model of the deep learning equivalents in software 🙂. We are redesigning our data pipeline, so we will explore different architectures.
h
> do you mean that it starts to get slow, both UI-wise and API-query-wise?
Yeah, the web UI will not like your 100s of repos... I came across an issue where the API did not expect more than 1000s of repos. That has long been fixed (and very quickly 😉). I don't really remember the exact issue, but most of the time people would not expect 1000s of repos...
m
I see! Thanks for referencing your issue 🙂. It's probably not the path we will follow, but it's always good to know, as this is still a possibility. Thank you very much for your help and time, it's much appreciated!
o
@MetaH Welcome to the lake! Please let us know if you have any further open questions we can help you with.