# help
i
Is it possible to specify which files you want to add to a commit?
a
Hi @Ion, we do not currently support this. lakeFS steps away from the Git model here for various reasons: it has only "staged" and "committed" objects, whereas Git also has "worktree" objects that aren't even staged. As an alternative, if you know in advance what you'll need to commit at each step, you could use branches, write in isolation on those branches, and merge to get something similar. If you can describe a specific scenario, please open an issue; there we'll try to understand how it should behave with concurrent access, or at least we might be able to offer a workaround.
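For illustration, here is a minimal sketch of that branch-and-merge pattern against the lakeFS REST API using Python's requests. The endpoint, repository name and credentials are hypothetical placeholders, and the paths are the v1 API endpoints as I understand them, so check them against your lakeFS version.

```python
import requests

# Hypothetical deployment details -- replace with your own.
API = "https://lakefs.example.com/api/v1"
AUTH = ("AKIAEXAMPLE", "secret")   # lakeFS access key / secret key (basic auth)
REPO = "my-repo"

def create_branch(name: str, source: str = "main") -> None:
    # Branch out so each writer (or pipeline step) works in isolation.
    r = requests.post(f"{API}/repositories/{REPO}/branches",
                      json={"name": name, "source": source}, auth=AUTH)
    r.raise_for_status()

def commit(branch: str, message: str) -> None:
    # Everything currently staged on the branch becomes one commit.
    r = requests.post(f"{API}/repositories/{REPO}/branches/{branch}/commits",
                      json={"message": message}, auth=AUTH)
    r.raise_for_status()

def merge(source: str, dest: str = "main") -> None:
    # Merge the isolated work back; an HTTP 409 means a conflict to resolve and retry.
    r = requests.post(f"{API}/repositories/{REPO}/refs/{source}/merge/{dest}",
                      json={}, auth=AUTH)
    r.raise_for_status()

create_branch("writer-1")
# ... write objects to lakefs://my-repo/writer-1/... here ...
commit("writer-1", "step 1 output")
merge("writer-1", "main")
```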
i
Yeah, I was just thinking: if you have a Delta writer, you know which files it wrote, so you could pick those files and put them into one commit. That would then allow concurrent writers within one branch, because Delta can do the commit resolution directly without having to rewrite the data on a new branch.
a
Sure, but I don't think you'd even need that. If you intend to serialise on the latest JSON, everything else has guaranteed unique names, and you can change the writing library, then put-if-absent is back (as of today) and should work equally well. But equally, Delta knows which objects it's writing, so with the same assumptions I'd prefer to branch out for each writer and try to merge back; if the merge fails due to a conflict I just need to update my JSON and try again. As a bonus, every Delta commit becomes a lakeFS commit, which means history becomes meaningful and useful for lineage. So I'm not sure this point is make-or-break; I think we have multiple solutions here. The challenges are elsewhere, I believe. That said, a log writer would be great.
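To make the put-if-absent idea concrete, here is a hedged sketch of trying to claim the next Delta log entry through the lakeFS objects API. It assumes, as the thread suggests was just added, that uploadObject honors an `If-None-Match: *` header and fails with 412 Precondition Failed when the object already exists; the endpoint, credentials, table path and version number are all placeholders.

```python
import requests

API = "https://lakefs.example.com/api/v1"   # hypothetical endpoint
AUTH = ("AKIAEXAMPLE", "secret")
REPO, BRANCH = "my-repo", "main"

def put_if_absent(path: str, body: bytes) -> bool:
    """Try to create `path` exclusively; return False if another writer got there first."""
    r = requests.post(
        f"{API}/repositories/{REPO}/branches/{BRANCH}/objects",
        params={"path": path},
        headers={"If-None-Match": "*"},      # assumption: honored by uploadObject
        files={"content": ("content", body)},
        auth=AUTH,
    )
    if r.status_code == 412:                 # object already exists -> lost the race
        return False
    r.raise_for_status()
    return True

# A writer serialises on the next Delta log entry (version number is illustrative):
claimed = put_if_absent("tables/events/_delta_log/00000000000000000042.json",
                        b'{"commitInfo": {}}')
if not claimed:
    print("another writer committed version 42 first; re-read the log and retry")
```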
i
Yeah, it was just an idea, perhaps as another short-term alternative. Regarding the Delta writer, you mean 'just update' as Oz explained yesterday? Also, put-if-absent would only work, I guess, if you have a writer that talks to the native lakeFS API instead of the S3 gateway API, or am I misunderstanding here?
a
Yes, on all counts. I don't think you can have a safe log writer that uses only S3 operations; if there were one, we probably wouldn't need to have this conversation. But as Oz said, we'll get there. I expect the issues to be more around supporting the runtime environments of our users.
❤️ 1
n
Hi, if I may step in: using the lakeFS HadoopFS implementation and configuring Hadoop's s3a filesystem to point to your underlying blockstore, you will be able to write the data directly to the blockstore while still leveraging lakeFS's put-if-absent.
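For reference, a hedged sketch of what that setup might look like from PySpark: the lakeFS HadoopFS handles metadata through the lakeFS API while s3a writes data straight to the underlying store. Endpoints, keys and paths are placeholders, it assumes the lakeFS Hadoop filesystem jar is already on the classpath, and the exact property names should be checked against the lakeFS docs for your version.

```python
from pyspark.sql import SparkSession

# Assumes io.lakefs:hadoop-lakefs-assembly is on the Spark classpath;
# all values below are placeholders.
spark = (
    SparkSession.builder
    .appName("lakefs-hadoopfs-sketch")
    # lakeFS HadoopFS: metadata goes through the lakeFS API.
    .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
    .config("spark.hadoop.fs.lakefs.endpoint", "https://lakefs.example.com/api/v1")
    .config("spark.hadoop.fs.lakefs.access.key", "AKIAEXAMPLE")
    .config("spark.hadoop.fs.lakefs.secret.key", "secret")
    # s3a points at the underlying blockstore, so the data path bypasses lakeFS.
    .config("spark.hadoop.fs.s3a.access.key", "AKIAUNDERLYING")
    .config("spark.hadoop.fs.s3a.secret.key", "secret")
    .getOrCreate()
)

# Writes go to a lakefs:// URI; lakeFS stages the objects, data lands on the blockstore.
spark.range(10).write.mode("overwrite").parquet("lakefs://my-repo/main/tables/demo/")
```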
i
@Niro thanks, so in that case do I still use the S3 gateway?
How about the lakeFS Hadoop filesystem, which allows writing directly to a lakefs:// URI? Ah, this is AWS only.
n
Sorry, I just noticed you were using Azure as your underlying storage. The Hadoop Azure FS supports only the wasb[s] scheme, which lakeFS doesn't currently support.
i
What's the difference exactly? Would pre-signed mode use put-if-absent?
n
No, this is unrelated to pre-sign. The lakeFS HadoopFS supports the overwrite-mode flag. It uses the get/link physical address API for writing objects, so that lakeFS is not in the data path. But because we provide an HTTP URL for the Azure blockstore, and not a wasb one, you can't use the Hadoop Azure FS to write the data directly to the storage. In AWS this is not an issue.
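As a rough illustration of that get/link flow, here is a sketch against the lakeFS staging API as I understand it; the field names and endpoints may differ between lakeFS versions, and the endpoint, credentials and object path are placeholders.

```python
import hashlib
import requests

API = "https://lakefs.example.com/api/v1"   # hypothetical endpoint
AUTH = ("AKIAEXAMPLE", "secret")
REPO, BRANCH, PATH = "my-repo", "main", "tables/events/part-0000.parquet"
DATA = b"example bytes the writer produced"

# 1. Ask lakeFS where the object should physically live (staging API).
r = requests.get(f"{API}/repositories/{REPO}/branches/{BRANCH}/staging/backing",
                 params={"path": PATH}, auth=AUTH)
r.raise_for_status()
staging_location = r.json()                 # includes a physical_address on the blockstore

# 2. Write DATA directly to staging_location["physical_address"] using the cloud SDK
#    (s3a / boto3 / azure-storage-blob) -- lakeFS is not in this data path.

# 3. Link the written object back into the branch's staging area.
r = requests.put(f"{API}/repositories/{REPO}/branches/{BRANCH}/staging/backing",
                 params={"path": PATH},
                 json={"staging": staging_location,
                       "checksum": hashlib.md5(DATA).hexdigest(),
                       "size_bytes": len(DATA)},
                 auth=AUTH)
r.raise_for_status()
```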
i
Any plans to add support for the Hadoop FS on Azure directly, then? It seems that for Spark there currently isn't any concurrency possible.
I think it will be more expensive in Spark to retry a commit in lakeFS unless I do some smart caching of the query plan.
a
Currently, writing Delta from Spark provides the same guarantees as S3A, that is, no support for multiple writers. This should change: the current lakeFS API suffices to write a good implementation, and user need for the feature helps prioritize it. Trust me, your use case is heavily noted in support of how soon we should do this! One thing I would ask you to do is open an issue requesting concurrent multi-writer support for Delta tables. You could then encourage people to express their need for the feature on that issue.
👍 1
i
Alright, got it. I got a bit confused there by the mention of put-if-absent. I'll make a GitHub issue later today.