# help
u
Hi, I just joined this channel and have a question on how to best use lakeFS for a specific use case. I am looking for some kind of 'best practice' workflow for reprocessing data. So the situation would be that you have some pipeline logic (code) that you changed and want to apply to your historical production data. I guess that you would have a Git branch with your modified pipeline code, and could have a lakeFS branch to test the changes on in isolation. But if you want to release those changes to production, how would you go about it? Also taking into account that during the development and testing of your new pipeline code, new data might have been ingested and processed by the current pipeline logic.
u
Hi @Jelle De Jong, welcome 🙂 There is a blog post about building a data development environment - https://lakefs.io/building-a-data-development-environment-with-lakefs/ - that you can look into. It is also related to https://lakefs.io/ensuring-data-quality-in-a-data-lake-environment/. I assume some release practices apply to the change in the ingested data you described. Using versioning and lakeFS to work on a feature branch lets you develop and experiment with a new ingested data format, or test newly ingested data on a separate branch, before you roll out changes to production. Separately from data versioning, if your data pipeline needs to support more than one version of the ingested data, your code will have to support that (as tested in the isolated environment). If you have additional information you can share about the specific challenge, maybe others can jump in and pour in more insights.
u
Hi @Jelle De Jong! That's a great question! Just to make sure I understand: you're talking about merging in the changes that were applied to the historical data? If so, as long as the main branch didn't apply any changes to those, this should be safe and won't result in a conflict: the result of merging would be that the historical data is atomically replaced with the changed data from your branch, while the new data (that was updated on main in the meantime) would be unchanged because it's not part of the diff between those branches. Does this make sense? I'd be happy to elaborate if needed! :)
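A minimal sketch of that branch-and-merge flow with the lakeFS Python API client (`lakefs_client`); the repository name, branch name, endpoint and credentials below are placeholders, not details from the thread:

```python
import lakefs_client
from lakefs_client.client import LakeFSClient

# Placeholder connection details.
conf = lakefs_client.Configuration()
conf.host = "http://localhost:8000"
conf.username = "<access-key-id>"
conf.password = "<secret-access-key>"
client = LakeFSClient(conf)

repo = "sales"  # hypothetical repository

# Inspect what a merge would carry: only objects that differ between the
# branches show up, so data ingested on main after branching is not touched.
diff = client.refs.diff_refs(repository=repo, left_ref="main", right_ref="reprocessing")
for entry in diff.results:
    print(entry.type, entry.path)

# Merging atomically replaces the reprocessed historical objects on main.
client.refs.merge_into_branch(repository=repo,
                              source_ref="reprocessing",
                              destination_branch="main")
```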
u
Hi @Oz Katz and @Barak Amar, thanks for both your swift answers. I guess my question is how to 'atomically' apply code changes together with data changes. Let's say you have a table with sales figures you are ingesting and two derived tables: table 1 with aggregate sales per day and table 2 with the moving average of the current day and n previous days. In addition to these tables there is the code that does the processing (let's say on some periodic schedule). Now you discover that there is a mistake in the code that calculates the daily averages. To fix this you have to change the code but also reprocess all historical data. So you make a lakeFS 'reprocessing' branch from main and a Git branch for your code, testing your code and applying the new logic to the snapshot of the data lake. In the meantime new sales data gets ingested and processed by the current code. Now you want to merge the data changes and code changes to main, as well as process the newly arrived data according to the new logic. I think you could merge the changes from the ingest branch into the reprocessing branch, then merge that into main and at the same time deploy the new version of the code. I still think you have to be careful here to time your deployment (e.g. pause the pipeline that does the processing on main, perhaps automatically as part of your CD pipeline?). Hopefully this makes sense and makes my use case a bit clearer.
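A rough sketch of that sequence with the lakeFS Python API client; the repository, branch names, and commit message are hypothetical:

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

conf = lakefs_client.Configuration()
conf.host = "http://localhost:8000"
conf.username = "<access-key-id>"
conf.password = "<secret-access-key>"
client = LakeFSClient(conf)

repo = "sales"  # hypothetical repository

# 1. Bring the sales data that was ingested on main during development into
#    the reprocessing branch. If main also rewrote the same derived-table
#    objects, this merge will report a conflict that has to be resolved first.
client.refs.merge_into_branch(repository=repo,
                              source_ref="main",
                              destination_branch="reprocessing")

# 2. Re-run the fixed pipeline on lakefs://sales/reprocessing/ so the derived
#    tables also cover the newly arrived data, then commit the results.
client.commits.commit(
    repository=repo, branch="reprocessing",
    commit_creation=models.CommitCreation(message="Reprocess with fixed daily-average logic"))

# 3. Pause the production pipeline, merge the reprocessed data back into main,
#    and deploy the new code version in the same release step.
client.refs.merge_into_branch(repository=repo,
                              source_ref="reprocessing",
                              destination_branch="main")
```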
u
Hey @Jelle De Jong, in the case you described the currently calculated average is wrong, so I would suggest to:
• stop the current process
• fix the code
• test it on a side branch
• merge your code and run it on production
• backfill to fix the previous data
The backfilling could be done on a side branch, with the fixed data merged back. lakeFS will help you in two aspects:
1. You could use lakeFS CI/CD capabilities to add quality tests for your code/data before merging it to production (that way we wouldn't have wrong data in production in the first place)
2. Using lakeFS branches you could test your changes on your production data without affecting it
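As a rough illustration of points 1 and 2, here is a client-side stand-in for such a quality gate (lakeFS pre-merge hooks can run checks like this on the server side as part of CI/CD); the repository, branch and object paths are hypothetical:

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

conf = lakefs_client.Configuration()
conf.host = "http://localhost:8000"
conf.username = "<access-key-id>"
conf.password = "<secret-access-key>"
client = LakeFSClient(conf)

repo = "sales"  # hypothetical repository

# Test the fix on a side branch, without affecting production data on main.
client.branches.create_branch(
    repository=repo,
    branch_creation=models.BranchCreation(name="fix-daily-average", source="main"))

# ... run the fixed pipeline against lakefs://sales/fix-daily-average/ ...

client.commits.commit(
    repository=repo, branch="fix-daily-average",
    commit_creation=models.CommitCreation(message="Recompute daily averages"))

# Stand-in quality test: raise if the recomputed table is missing from the
# branch, so the merge below never runs on bad output.
client.objects.stat_object(repository=repo, ref="fix-daily-average",
                           path="tables/daily_sales/_SUCCESS")

# Only now does the fixed data reach production.
client.refs.merge_into_branch(repository=repo,
                              source_ref="fix-daily-average",
                              destination_branch="main")
```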
u
Thanks @Guy Hardonag. The mistake could also be an 'improvement' (e.g. a better method for imputing missing values or something), so you might still want the downstream consumers to have access to the (most up-to-date) data that is processed according to the current logic. Even in that case you would have to stop the processing at some stage, though maybe only just before you deploy the reprocessed data (which should be based on the whole history up to the point of deployment) together with the code changes.
u
@Oz Katz Do you have anything to add to this? I think reprocessing is quite a typical task, so I guess you guys have thought about how to facilitate such workflows with lakeFS?
u
Hey @Jelle De Jong for what you describe I'd suggest going with a simpler approach: run the improvement on an isolated branch, in parallel to the "production" process that takes place on "main". Once you've reprocessed history on your branch and you are generating data on a regular basis with the improvement, you can expose this branch to consumers or run acceptance tests to ensure you're satisfied with the change. Once you are - create a "deploy" branch from main, drop the existing data, merge your branch into it - and merge that into main. This will atomically replace the old data with the improved version. From this point on, run the improved code directly on main.
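A sketch of this deploy-branch flow with the lakeFS Python API client; the repository name, branch names, and the `tables/` prefix are assumptions for illustration:

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

conf = lakefs_client.Configuration()
conf.host = "http://localhost:8000"
conf.username = "<access-key-id>"
conf.password = "<secret-access-key>"
client = LakeFSClient(conf)

repo = "sales"  # hypothetical repository

# 1. Create a short-lived "deploy" branch from main.
client.branches.create_branch(
    repository=repo,
    branch_creation=models.BranchCreation(name="deploy", source="main"))

# 2. Drop the existing derived data on the deploy branch and commit the deletion
#    (a single listing page is shown; paginate for large prefixes).
listing = client.objects.list_objects(repository=repo, ref="deploy", prefix="tables/")
for obj in listing.results:
    client.objects.delete_object(repository=repo, branch="deploy", path=obj.path)
client.commits.commit(
    repository=repo, branch="deploy",
    commit_creation=models.CommitCreation(message="Drop old derived tables"))

# 3. Merge the improvement branch into deploy, then deploy into main.
#    On main this lands as one atomic swap of the old data for the improved data.
client.refs.merge_into_branch(repository=repo,
                              source_ref="improved-pipeline",
                              destination_branch="deploy")
client.refs.merge_into_branch(repository=repo,
                              source_ref="deploy",
                              destination_branch="main")
```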
u
Thanks @Oz Katz, that sounds like a good approach. @Iddo Avneri planned a meeting in April so I will continue the discussion then, and discuss how to move towards a potential PoC for a project I am working on.