<@U01867G8AA0> could it be an idea to introduce co...
# help
i
@Ariel Shaqed (Scolnicov) could it be an idea to introduce the concept of staging changes? This would allow multiple writers to write to the same branch, with each writer committing only the changes it made 🤟
a
Wow, so that's a big one to spec out. Here's my chain of thought on this. This does not mean I am right, it's my attempt to explain the lakeFS model. One big difference between git and lakeFS is that git lives in a single directory on a single machine, while lakeFS lives at a single URL on the network. That's why lakeFS doesn't really have a worktree, only a staging area. But suppose I wanted to add multiple staging areas. lakeFS doesn't run on the same machine as the job, and for Spark there's more than one machine running the job anyway. So each staging area would need a name. Ignoring URL issues, I'd probably call them main$my-spark-job, main$my-pandas, or something. My Spark job would access a URL lakefs://repo/main$my-spark-job/spark/output/, and for Pandas I'd use a URL lakefs://repo/main$my-pandas/pandas/data/. Now I need to be able to handle merges etc. during commits. So I pretty much have a bunch of branches main$my-spark-job, main$my-pandas. Basically, these multiple staging areas would just be branches.
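To make the naming idea concrete, here is a minimal sketch of how a job might build its URL. Everything here is an assumption for illustration: `staging_uri` is a made-up helper, and the `main$<writer>` naming is only the hypothetical convention from this message, not something lakeFS supports today.

```python
def staging_uri(repo: str, branch: str, writer: str, path: str) -> str:
    """Build a lakefs:// URI for a writer-specific staging area of a branch.

    Hypothetical convention: the staging area is addressed as "<branch>$<writer>",
    which in practice is just another ref name in the URI.
    """
    ref = f"{branch}${writer}"
    return f"lakefs://{repo}/{ref}/{path.lstrip('/')}"


spark_uri = staging_uri("repo", "main", "my-spark-job", "spark/output/")
pandas_uri = staging_uri("repo", "main", "my-pandas", "pandas/data/")
print(spark_uri)
print(pandas_uri)
```

Each writer ends up with its own ref in the URL, which is why these staging areas behave like ordinary branches.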
i
But then each staging area won't be aware of the other staging area.
Ideally the uncommitted changes are aware of each other, and only after being selected do they move to the staging area, which could perhaps be represented as a branch underneath
a
Yeah, but the same happens with Git, whether you use separate worktrees or separate trees.
(I have been toying around with commits that are not attached to a branch, which are essentially a lot like your staging areas or Git stashes or the Git staging area). But layering a worktree in front of a staging area, and then silently changing the staging area, would be really weird.
Could you give an example of the intended behaviour on 2 staging areas, when both modify an object at the same path? (Of course you might need to invent new lakectl commands, feel free to invent them and I'll ask if I don't understand.)
i
Ok let me try to sketch something out with draw.io xd
staging commits.png
Since one writer will always finish earlier than the other, we can add hooks into the writer that stage those files and then do a lakeFS commit
a
I don't understand this system, though. Both writer 1 and writer 2 need to access lakeFS, and they need to use some URL to do that. Delta writing also involves reading, and it is distributed. So IIUC machines in writer 1 use a different URL from machines in writer 2. So how does writer 2 see "0001.json" to decide it wants to write "0002.json"?
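A rough sketch of why the listing matters: a Delta writer picks its next commit file from what it can see in `_delta_log`. The helper name `next_commit_name` is made up, and the 4-digit names follow the shorthand used in this thread (real Delta logs zero-pad versions to 20 digits).

```python
import re

# Matches commit files such as "0001.json"; checkpoint files and others are ignored.
_COMMIT_RE = re.compile(r"^(\d+)\.json$")


def next_commit_name(log_listing: list[str], width: int = 20) -> str:
    """Return the commit file name a writer would try to create next."""
    versions = [
        int(m.group(1)) for name in log_listing if (m := _COMMIT_RE.match(name))
    ]
    next_version = max(versions, default=-1) + 1
    return f"{next_version:0{width}d}.json"


# Writer 2 sees writer 1's commit, so it moves on to version 2.
shared_view = next_commit_name(["0000.json", "0001.json"], width=4)
# Writer 2 reads through a URL that hides writer 1's commit, so it reuses version 1.
isolated_view = next_commit_name(["0000.json"], width=4)
print(shared_view, isolated_view)
```

With isolated views both writers would produce "0001.json", and the collision only shows up later, at merge time.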
i
So the way I am seeing it is, the uncommitted changes are visible on the main branch
Staging a couple files should not mean they are not visible on the main branch anymore
just that they are in a different state that can now be committed
a
Okay, I think I get it! We're talking about this issue I think!
i
Yup sounds the same 🙂
a
Yeah, I like this use-case: commit a single (Delta or Iceberg) table change.
❤️ 1
i
Having such a thing + CopyIfNotExists could give the ergonomics for writer concurrency without doing the current approach of: writer start -> create lakefs branch -> write -> pray merge does not fail XD, else retry
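Roughly what I have in mind, as a toy sketch. This is not lakeFS code: `InMemoryStore` is a made-up stand-in, and I am assuming CopyIfNotExists behaves like a put-if-absent primitive. The 4-digit commit names are the shorthand from this thread.

```python
import threading


class InMemoryStore:
    """Toy object store with a put-if-absent primitive (hypothetical stand-in)."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key: str, data: bytes) -> bool:
        """Write data at key only if nothing exists there; report success."""
        with self._lock:
            if key in self._objects:
                return False
            self._objects[key] = data
            return True

    def list_keys(self, prefix: str) -> list[str]:
        with self._lock:
            return sorted(k for k in self._objects if k.startswith(prefix))


def commit_with_retry(store: InMemoryStore, log_prefix: str, payload: bytes,
                      max_attempts: int = 5) -> str:
    """Claim the next commit slot; on conflict, re-list the log and try again."""
    for _ in range(max_attempts):
        # Assumes commit versions are contiguous and start at 0.
        version = len(store.list_keys(log_prefix))
        key = f"{log_prefix}{version:04d}.json"
        if store.put_if_absent(key, payload):
            return key
    raise RuntimeError("gave up after repeated conflicts")


store = InMemoryStore()
first = commit_with_retry(store, "table/_delta_log/", b"writer 1")
second = commit_with_retry(store, "table/_delta_log/", b"writer 2")
lost_race = store.put_if_absent(first, b"writer 2")  # slot already taken
print(first, second, lost_race)
```

The point is that a conflict is detected at the moment of writing the commit file, so the retry is cheap compared to redoing a whole branch and merge.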
a
We'd still need to add code to every writer, I guess.
Will you enrich the issue or shall I?
i
Go ahead 🙂
Yeah either something on the writer side needs to become aware of all the files it's written, or something in between. Just thinking out loud here but maybe we could add metadata within lakeFS during the creation of files, that would allow a user to filter the changes on that and then push them to the staged changes
a
I'd probably just go off and read the changes from 0001.json when I commit it.
i
Yeah you could go back through the history, and fetch the commit file and the parquet files in the add actions
This actually could be done on lakefs side 🤔, if it's aware that it's a Delta Table
And we have some kind of functionality that takes the Delta commit version as input, it could derive what to put as staging changes
👌🏼 1
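A minimal sketch of that derivation, assuming the Delta commit file is newline-delimited JSON where each line is one action and `add` actions carry a `path` relative to the table root. The function name `paths_for_delta_commit` is made up, and this ignores absolute or URL-encoded paths.

```python
import json
import posixpath


def paths_for_delta_commit(table_root: str, commit_name: str,
                           commit_text: str) -> list[str]:
    """List the objects that belong to one Delta commit: the commit file
    itself plus every data file named in an "add" action."""
    paths = [posixpath.join(table_root, "_delta_log", commit_name)]
    for line in commit_text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        add = action.get("add")
        if add is not None:
            paths.append(posixpath.join(table_root, add["path"]))
    return paths


# Example commit content, invented for illustration.
commit_text = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"add": {"path": "part-0001.parquet", "size": 10, "dataChange": True}}),
    json.dumps({"remove": {"path": "part-0000.parquet", "dataChange": True}}),
])
to_stage = paths_for_delta_commit("spark/output", "0001.json", commit_text)
print(to_stage)
```

Only `add` actions contribute paths here, since a `remove` action marks a file as logically deleted without writing a new object.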