Hi everyone!
I am new to LakeFS and I'm trying to wrap my head around this tool.
For context, I'm working in the field of deep learning with images.
I think the use and advantages of LakeFS for ETL pipelines with ingestion branches for the various data tiers (e.g., a medallion architecture) are quite straightforward. However, I'm struggling to grasp what the workflow would look like for machine learning projects, even after reading your docs/blog posts.
In a use case where:
• Several domain-specific models and customer-specific models must be trained
• Each model uses its own subset of data from one or several data tiers. The subsets may overlap across models, i.e., some data is used in several different models.
• Each data subset may or may not need additional processing specific to the model
I have the following questions:
• What would a workflow look like for such a use case (with lots of different models to track)?
◦ Would you have a LakeFS repo per model?
◦ Would you have a LakeFS branch per model?
• Commit metadata would probably need to be used heavily to version everything together? It looks like this could get quite complex very quickly.
• Would the right place to track most of this information be an experiment tracking tool (such as MLflow, WandB, etc.), with references to the LakeFS data commits added there? (Rough sketch of what I imagine below.)
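To make the branch-per-model and metadata questions concrete, here is roughly what I imagine the workflow could look like with the lakeFS Python SDK and MLflow. The repo name, branch name, paths, and metadata keys are all made up, and I'm not sure I'm using the SDK correctly, so please correct me if this is the wrong mental model:

```python
import lakefs   # high-level lakeFS Python SDK (hypothetical usage, all names made up)
import mlflow

# Hypothetical repo holding the curated image data
repo = lakefs.repository("image-datasets")

# One branch per model, created from the curated branch
branch = repo.branch("model-customer-a").create(source_reference="main", exist_ok=True)

# ... write the model-specific subset / extra preprocessing output to the branch here ...

# Snapshot the data with commit metadata describing which model it belongs to
ref = branch.commit(
    message="Training snapshot for customer-a detector",
    metadata={"model": "customer-a-detector", "source_tier": "gold"},
)

# Record the exact data version next to the training run in MLflow
with mlflow.start_run():
    mlflow.set_tags({
        "lakefs_repo": "image-datasets",
        "lakefs_branch": "model-customer-a",
        "lakefs_commit": ref.get_commit().id,
    })
    # ... training code ...
```

Is this the kind of setup you would recommend, or would you rather go with a repo per model (or per customer)?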
Additional general question:
• What's the advantage of copying data into LakeFS rather than just having LakeFS version data that stays in its original location? Basically, upload vs. import.
◦ I think I read in a doc/blog post that copying data (upload) should be favored when data is frequently modified (updated or deleted), while import might be better when data is mostly append-only. A rough sketch of how I understand the two options is below.
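Here is how I currently picture the two options in code, again with made-up names; I'm especially unsure about the import part of the SDK, so treat it as pseudocode:

```python
import lakefs

branch = lakefs.repository("image-datasets").branch("main")

# Option 1: upload -- the bytes are copied into the repository's own storage namespace
with open("img_0001.png", "rb") as f:
    branch.object("raw/images/img_0001.png").upload(data=f.read(), mode="wb")
branch.commit(message="Add image via upload")

# Option 2: import -- lakeFS only records pointers to objects that stay where they are
# in the existing bucket (zero-copy); this is my guess at the import interface
branch.import_data(commit_message="Import existing images") \
      .prefix("s3://my-existing-bucket/images/", destination="raw/images/") \
      .run()
```

Is that the right way to think about the trade-off?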
Thanks in advance for your help!