< Ariel Shaqed Scolnicov > could it be an idea to introduce lakeFS #help

Ion

04/08/2024, 3:14 PM

@Ariel Shaqed (Scolnicov) could it be an idea to introduce the concept of staging changes? This would allow multiple writers to write to the same branch, with each writer then committing only the changes it made 🤟

Ariel Shaqed (Scolnicov)

04/08/2024, 3:22 PM

Wow, so that's a big one to spec out. Here's my chain of thought on this. This does not mean I am right, it's my attempt to explain the lakeFS model. One big difference between git and lakeFS is that git lives in a single directory on a single machine, while lakeFS lives at a single URL on the network. That's why lakeFS doesn't really have a worktree, only a staging area. But suppose I wanted to add multiple staging areas. lakeFS doesn't run on the same machine as the job, and for Spark there's more than one machine running the job anyway. So each staging area would need a name. Ignoring URL issues, I'd probably call them main$my-spark-job, main$my-pandas, or something. My Spark job would access a URL lakefs://repo/main$my-spark-job/spark/output/ and for Pandas I'd use a URL lakefs://repo/main$my-pandas/pandas/data/. Now I need to be able to handle merges etc. during commits. So I pretty much have a bunch of branches main$my-spark-job, main$my-pandas. Basically, these multiple staging areas would just be branches.
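A minimal sketch of the naming scheme described above, assuming the hypothetical `branch$staging-area` convention from the message. The helper name `staging_url` is invented for illustration, and whether lakeFS would accept `$` in a ref name is exactly the "URL issue" the message sets aside.

```python
def staging_url(repo: str, branch: str, staging_area: str, path: str) -> str:
    """Build a lakefs:// URL for a named staging area on a branch.

    The "<branch>$<staging_area>" ref is a hypothetical convention taken
    from the discussion, not an existing lakeFS feature.
    """
    ref = f"{branch}${staging_area}"
    return f"lakefs://{repo}/{ref}/{path.lstrip('/')}"


# Each job addresses its own named staging area on top of main.
spark_url = staging_url("repo", "main", "my-spark-job", "spark/output/")
pandas_url = staging_url("repo", "main", "my-pandas", "pandas/data/")
```

Because every staging area ends up with its own ref in the URL, it behaves like a separate branch from the client's point of view, which is the conclusion the message reaches.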

Ion

04/08/2024, 3:29 PM

But then each staging area won't be aware of the other staging area.

Ion

04/08/2024, 3:30 PM

Ideally the uncommitted changes are aware of each other, and only after being selected do they move to the staging area, which could perhaps be represented as a branch underneath

Ariel Shaqed (Scolnicov)

04/08/2024, 3:30 PM

Yeah, but the same happens with Git, whether you use separate worktrees or separate trees.

Ariel Shaqed (Scolnicov)

04/08/2024, 3:32 PM

(I have been toying around with commits that are not attached to a branch, which are essentially a lot like your staging areas or Git stashes or the Git staging area). But layering a worktree in front of a staging area, and then silently changing the staging area, would be really weird.

Ariel Shaqed (Scolnicov)

04/08/2024, 3:35 PM

Could you give an example of the intended behaviour on 2 staging areas, when both modify an object at the same path? (Of course you might need to invent new lakectl commands, feel free to invent them and I'll ask if I don't understand.)

Ion

04/08/2024, 3:36 PM

Ok let me try to sketch something out with draw.io xd

Ion

04/08/2024, 3:56 PM

staging commits.png

Ion

04/08/2024, 3:56 PM

Since one writer will always finish earlier than the other, we can add hooks to the writer that stage those files and then do a lakeFS commit

Ariel Shaqed (Scolnicov)

04/08/2024, 3:58 PM

I don't understand this system, though. Both writer 1 and writer 2 need to access lakeFS, and they need to use some URL to do that. Delta writing also involves reading, and it is distributed. So IIUC machines in writer 1 use a different URL from machines in writer 2. So how does writer 2 see "0001.json" to decide it wants to write "0002.json"?

Ion

04/08/2024, 3:59 PM

So the way I am seeing it is, the uncommitted changes are visible on the main branch

Ion

04/08/2024, 4:00 PM

Staging a couple files should not mean they are not visible on the main branch anymore

Ion

04/08/2024, 4:00 PM

just that they are in a different state that can now be committed

Ariel Shaqed (Scolnicov)

04/08/2024, 4:01 PM

Okay, I think I get it! We're talking about this issue I think!

Ion

04/08/2024, 4:03 PM

Yup sounds the same 🙂

Ariel Shaqed (Scolnicov)

04/08/2024, 4:04 PM

Yeah, I like this use-case: commit a single (Delta or Iceberg) table change.

❤️ 1

Ion

04/08/2024, 4:05 PM

Having such a thing + CopyIfNotExists could give the ergonomics for writer concurrency without doing the current approach of: writer start -> create lakefs branch -> write -> pray merge does not fail XD, else retry
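The "current approach" in the message above (branch, write, merge, retry on failure) can be sketched with a toy in-memory model. This is an illustration only: `ToyRepo`, `MergeConflict`, and `write_with_retry` are hypothetical names, not the lakeFS API, and the conflict rule is simplified to "the destination changed a path that the branch also changed".

```python
class MergeConflict(Exception):
    """Raised when the branch and the destination both changed a path."""


class ToyRepo:
    """In-memory stand-in for a versioned object store (not the lakeFS API)."""

    def __init__(self):
        self.branches = {"main": {}}
        self.bases = {}  # branch name -> snapshot of its source at branch time

    def create_branch(self, name, source="main"):
        snapshot = dict(self.branches[source])
        self.branches[name] = dict(snapshot)
        self.bases[name] = snapshot

    def write(self, branch, path, data):
        self.branches[branch][path] = data

    def merge(self, branch, dest="main"):
        base, src, dst = self.bases[branch], self.branches[branch], self.branches[dest]
        # Paths the branch changed relative to where it started.
        changed = {p for p in src if src[p] != base.get(p)}
        # Conflict: the destination also changed one of those paths meanwhile.
        conflicts = sorted(p for p in changed if dst.get(p) != base.get(p))
        if conflicts:
            raise MergeConflict(conflicts)
        for p in changed:
            dst[p] = src[p]


def write_with_retry(repo, writer, max_attempts=3):
    """Branch, write the next Delta-style log entry, merge; retry on conflict."""
    for attempt in range(1, max_attempts + 1):
        branch = f"{writer}-attempt-{attempt}"
        repo.create_branch(branch)
        # The next version is the number of log entries visible on this branch.
        version = sum(1 for p in repo.branches[branch] if p.startswith("_delta_log/"))
        repo.write(branch, f"_delta_log/{version:020d}.json", writer)
        try:
            repo.merge(branch)
            return version
        except MergeConflict:
            continue  # main moved underneath us; start over from a fresh branch
    raise RuntimeError(f"{writer}: merge still failing after {max_attempts} attempts")
```

If two writers branch from the same state of main, both pick the same next log version, so whichever merges second hits a conflict and has to redo its work on a fresh branch. That redo is the cost the proposed staging approach aims to avoid.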

Ariel Shaqed (Scolnicov)

04/08/2024, 4:06 PM

We'd still need to add code to every writer, I guess.

Ariel Shaqed (Scolnicov)

04/08/2024, 4:06 PM

Will you enrich the issue or shall I?

Ion

04/08/2024, 4:06 PM

Go ahead 🙂

Ion

04/08/2024, 4:08 PM

Yeah, either something on the writer side needs to become aware of all the files it has written, or something in between. Just thinking out loud here, but maybe we could add metadata within lakeFS during the creation of files, which would allow a user to filter the changes on that and then push them to the staged changes

Ariel Shaqed (Scolnicov)

04/08/2024, 4:09 PM

I'd probably just go off and read the changes from 0001.json when I commit it.

Ion

04/08/2024, 4:11 PM

Yeah, you could inspect the history, and fetch the commit file and the Parquet files in the add actions

Ion

04/08/2024, 4:12 PM

This could actually be done on the lakeFS side 🤔, if it's aware that it's a Delta table

Ion

04/08/2024, 4:13 PM

And we have some kind of functionality that takes the Delta commit version as input, it could derive what to put as staging changes
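A rough sketch of that idea: given a Delta commit version and the text of its commit file, derive the list of paths that would need to be staged. It relies on the Delta log layout (`_delta_log/<20-digit version>.json`, one JSON action per line). The function names are hypothetical, and the sketch assumes `add` paths are relative to the table root and need no URL decoding. It ignores `remove` actions, which a real implementation would have to handle as deletions.

```python
import json


def delta_log_path(table_prefix: str, version: int) -> str:
    """Path of the commit file for a given Delta table version."""
    return f"{table_prefix.rstrip('/')}/_delta_log/{version:020d}.json"


def paths_to_stage(table_prefix: str, version: int, commit_text: str) -> list:
    """Return the commit file plus every data file it adds.

    commit_text is the newline-delimited JSON content of the commit file.
    Assumes 'add' paths are relative to the table root (an assumption of
    this sketch; Delta also permits absolute URIs).
    """
    prefix = table_prefix.rstrip("/")
    paths = [delta_log_path(prefix, version)]
    for line in commit_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        action = json.loads(line)
        if "add" in action:
            paths.append(f"{prefix}/{action['add']['path']}")
    return paths


# Example commit content with one commitInfo action and two add actions.
example_commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"add": {"path": "part-00000.parquet", "size": 10, "dataChange": True}}),
    json.dumps({"add": {"path": "part-00001.parquet", "size": 12, "dataChange": True}}),
])
staged = paths_to_stage("tables/events/", 1, example_commit)
```

The resulting list is what a table-aware commit would cover: the commit file itself and the Parquet files it references, leaving other writers' uncommitted files on the branch untouched.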

👌🏼 1
