What is the recommended way to handle multiple data sources that update at different frequencies? Say we have datasets D1, D2, D3, etc., each updating on a different schedule (say monthly, weekly, yearly). Each dataset may contain multiple CSV files.
Should each of them have its own branch, should they just be folders within a single branch, or is there some better way to handle them?
Ultimately, we would like the branch to always point to the latest version of all these datasets. What is the best way to organize this with LakeFS?
11/02/2023, 12:09 AM
The best practice is to ingest any dataset via a new “ingestion” branch (you can append a timestamp to the branch name to make it unique) and merge that branch into “main” once all data quality checks pass.
Meanwhile, datasets D1, D2, and D3 can simply be folders within the “main” branch.
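The ingestion-branch flow above can be sketched with the lakectl CLI, assuming a monthly drop of new CSVs for D1. The repository name (`example-repo`) and local path (`./incoming/D1/`) are placeholders, and the commands assume a configured lakectl pointing at a running lakeFS server:

```shell
# Unique ingestion branch per run, e.g. ingest-D1-20231102-0009
BRANCH="ingest-D1-$(date +%Y%m%d-%H%M)"

# 1. Create the ingestion branch from main
lakectl branch create "lakefs://example-repo/${BRANCH}" \
    --source "lakefs://example-repo/main"

# 2. Upload the new CSV files into the dataset's folder on that branch
lakectl fs upload --recursive \
    --source ./incoming/D1/ \
    "lakefs://example-repo/${BRANCH}/D1/"

# 3. Commit the changes on the ingestion branch
lakectl commit "lakefs://example-repo/${BRANCH}" \
    -m "Ingest D1 monthly drop"

# 4. After data quality checks pass, merge the ingestion branch into main
lakectl merge "lakefs://example-repo/${BRANCH}" \
    "lakefs://example-repo/main"
```

Because the merge is atomic, “main” always points to the latest validated version of all datasets, which matches the goal stated above.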
11/03/2023, 2:36 AM
Thank you @Amit Kesarwani
If there are daily ingestions, would creating a new branch each day (or maybe more often) be OK performance-wise? Can LakeFS handle a high number of branches?