
Seungchan Lee

04/24/2023, 10:15 PM
Hi, I’m just getting started with lakeFS and am a little confused about the workflow. In Git, you usually commit -> push. In lakeFS there’s no push, so if I want to add objects to my dataset in S3, how do I do that?

Iddo Avneri

04/24/2023, 10:18 PM
Welcome to the lake @Seungchan Lee!
Typically, with lakeFS you have a production branch (main), and any transformation or ingestion is done on a separate branch and then merged into main.
For example, let’s say you want to run an ETL against your data:
1. You create a branch from prod (main), which is a zero-copy clone operation.
2. You make the changes on your isolated branch.
3. Once you are ready, you merge the changes back into main, promoting the data to production.
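The three steps above, plus the explicit commit that comes up later in the thread, can be sketched with a toy in-memory model. This is only an illustration of the semantics, not the lakeFS API: `ToyRepo`, its methods, the branch name, and the object path are all hypothetical.

```python
# Toy in-memory model of branch -> change -> commit -> merge.
# NOT the lakeFS API: every class, method, and name here is hypothetical.


class ToyRepo:
    def __init__(self):
        # Committed snapshot per branch: {object path: content}.
        self.committed = {"main": {}}
        # Uncommitted (staged) writes per branch.
        self.staged = {"main": {}}

    def create_branch(self, name, source="main"):
        # Copy only the path -> content mapping (metadata). The content
        # objects are shared, which mimics a zero-copy branch.
        self.committed[name] = dict(self.committed[source])
        self.staged[name] = {}

    def upload(self, branch, path, content):
        # Writes stay staged on the branch until an explicit commit.
        self.staged[branch][path] = content

    def commit(self, branch):
        # Fold the staged writes into a new committed snapshot.
        self.committed[branch] = {**self.committed[branch], **self.staged[branch]}
        self.staged[branch] = {}

    def merge(self, source, destination="main"):
        # Simplified merge: the source's committed objects win.
        # Real merges also detect conflicts; this sketch does not.
        self.committed[destination] = {
            **self.committed[destination],
            **self.committed[source],
        }


repo = ToyRepo()
repo.create_branch("ingest-example")                        # 1. branch from main
repo.upload("ingest-example", "datasets/new.csv", "a,b\n")  # 2. change the branch
in_commit_before = "datasets/new.csv" in repo.committed["ingest-example"]
repo.commit("ingest-example")                               #    explicit commit
on_main_before_merge = "datasets/new.csv" in repo.committed["main"]
repo.merge("ingest-example", "main")                        # 3. merge into main
on_main_after_merge = "datasets/new.csv" in repo.committed["main"]
```

In this model, main never sees the new object until the merge, which is the isolation property the workflow relies on.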

Seungchan Lee

04/24/2023, 10:22 PM
OK great thanks - so when does commit happen?

Iddo Avneri

04/24/2023, 10:22 PM
When you execute it. Commits are explicit, so nothing is committed on a branch until you run the commit.

Seungchan Lee

04/24/2023, 10:22 PM
You branch out -> upload new data -> commit -> merge back to main if required?

Iddo Avneri

04/24/2023, 10:22 PM
For example, yes.
That’s a good ingest example.
Or for transformation, what I mentioned above.

Seungchan Lee

04/24/2023, 10:24 PM
Wait - I don’t see the difference? Is the transformation workflow different? Isn’t it still branch out -> transform -> commit -> merge back?

Iddo Avneri

04/24/2023, 10:24 PM
(I only mentioned it because you said “upload new data”.) You are correct that it is the same concept.

Seungchan Lee

04/24/2023, 10:24 PM
Ah ok
What about if you’re running an automated workflow that does ETL? Is it basically the same process?

Iddo Avneri

04/24/2023, 10:26 PM
Yup
Do you use Airflow / Dagster / Kubeflow / another orchestrator?

Seungchan Lee

04/24/2023, 10:26 PM
Yes, Flyte
Which is just another workflow engine, but for Kubernetes

Iddo Avneri

04/24/2023, 10:26 PM
You will have the orchestrator run the ETL in a branch.
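A minimal sketch of what "run the ETL in a branch" could look like from inside an orchestrator task. The `VersionedStore` interface, `run_etl_on_branch`, and the branch names are assumptions made for illustration; they are not the lakeFS client or any orchestrator's API. The choice to leave a failed run's branch unmerged is also just one option.

```python
from typing import Callable, List, Protocol, Tuple


class VersionedStore(Protocol):
    """Hypothetical interface for illustration; not the lakeFS client API."""

    def create_branch(self, name: str, source: str) -> None: ...
    def commit(self, branch: str, message: str) -> None: ...
    def merge(self, source: str, destination: str) -> None: ...


def run_etl_on_branch(
    store: VersionedStore,
    branch: str,
    etl: Callable[[str], None],
    production: str = "main",
) -> bool:
    """Run `etl` on an isolated branch and merge only if it succeeds."""
    store.create_branch(branch, production)
    try:
        etl(branch)  # the ETL reads and writes on the branch only
    except Exception:
        # Leave the branch unmerged so production never sees partial output.
        return False
    store.commit(branch, f"ETL output from {branch}")
    store.merge(branch, production)
    return True


class RecordingStore:
    """In-memory stand-in that only records which calls were made."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, ...]] = []

    def create_branch(self, name: str, source: str) -> None:
        self.calls.append(("create_branch", name, source))

    def commit(self, branch: str, message: str) -> None:
        self.calls.append(("commit", branch))

    def merge(self, source: str, destination: str) -> None:
        self.calls.append(("merge", source, destination))


def failing_etl(branch: str) -> None:
    raise RuntimeError("transform failed")


ok_store = RecordingStore()
ok = run_etl_on_branch(ok_store, "etl-run-1", lambda branch: None)

bad_store = RecordingStore()
bad = run_etl_on_branch(bad_store, "etl-run-2", failing_etl)
```

With an orchestrator, `run_etl_on_branch` (or its three stages as separate tasks) would be the unit the scheduler runs; a failed run leaves main untouched and the branch available for debugging.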

Seungchan Lee

04/24/2023, 10:27 PM
Right

Iddo Avneri

04/24/2023, 10:27 PM
We have a good example with Airflow:

https://youtu.be/HuQQUvmVjhU


Seungchan Lee

04/24/2023, 10:27 PM
OK this is helpful
👍 1
I’ll watch it and then try it out with my project. Thank you!

Iddo Avneri

04/24/2023, 10:28 PM
Sure thing. Hope you enjoy lakeFS!

Seungchan Lee

04/24/2023, 10:28 PM
It looks very promising - thank you!
🙏 1

Iddo Avneri

04/24/2023, 10:28 PM
If you have additional questions, feel free to post those.
👍 1

Seungchan Lee

04/24/2023, 10:29 PM
Definitely - thanks!
Oh sorry, one more quick thing - I was planning to use MLflow for experiment tracking, but a recent post on the lakeFS blog mentions it can also track experiments. Is there a good resource for me to dig into how lakeFS does it and what the pros/cons are compared to something like MLflow?

Iddo Avneri

04/24/2023, 10:41 PM
That’s a great point. lakeFS will version the data on top of MLflow (as opposed to instead of it).
Another resource available is

https://youtu.be/O0u72YHi7qY

We are actually working on another ML webinar that might be even more relevant. Stay tuned 🙂
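One way to picture versioning the data on top of an experiment tracker is to record which data version a run used alongside the run's other metadata. The sketch below only builds such a record. The function name, tag keys, repository name, commit ID, and path are hypothetical; the `lakefs://<repository>/<ref>/<path>` form is the lakeFS URI scheme.

```python
def data_version_tags(repo: str, commit_id: str, path: str) -> dict:
    """Build tags that pin an experiment run to an exact data version.

    Hypothetical helper: the tag keys are made up for this sketch.
    """
    return {
        "lakefs_repo": repo,
        "lakefs_commit": commit_id,
        # lakeFS URIs take the form lakefs://<repository>/<ref>/<path>.
        "dataset_uri": f"lakefs://{repo}/{commit_id}/{path.lstrip('/')}",
    }


# Placeholder values, not real identifiers.
tags = data_version_tags("example-repo", "abc123", "/datasets/train.parquet")
```

With MLflow, a dict like this could be attached to the active run via `mlflow.set_tags(tags)`, so the tracker keeps the parameters and metrics while the commit ID identifies the exact data.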

Seungchan Lee

04/24/2023, 10:56 PM
OK thank you!

Iddo Avneri

04/24/2023, 11:09 PM
Enjoy!
👍 1