# help
Ion:
So S3 requires a locking client for concurrent writers to write to the same Delta table, but I'm wondering whether that's also required for the S3 emulation in lakeFS if the actual storage backend is Azure ADLS. Are you able to shed some light on this? :)
Oz Katz:
Hey @Ion! 👋 Welcome 🙂 Sure - let me try and explain. The short answer is yes: you'd still need a locking client for concurrent writers, even if the backing object store is ADLS. With non-S3 object stores (for example, ADLS), Delta's log store implementation would use a conditional header such as `If-None-Match` (see conditional requests) to make sure new log entries don't accidentally overwrite another writer's log entries. Since the S3 gateway in lakeFS implements the S3 protocol, and S3 doesn't support conditional writes, even if we were to add the required headers, no S3 client would know how to use them, regardless of the underlying storage used by lakeFS. This is yet another reason why we prefer native clients when possible.
Let me know if that makes sense - happy to elaborate!
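(For illustration - a minimal sketch of the conditional-write mechanism described above, using the Rust `object_store` crate that delta-rs builds on. The account/container/key values are made up, and the API shape assumes object_store ~v0.11; `PutMode::Create` is what maps to an `If-None-Match: *` header on stores that support it.)

```rust
// Sketch: committing a Delta log entry with a "create only" conditional put.
// Assumes the object_store crate (~v0.11); all names/credentials are placeholders.
use object_store::azure::MicrosoftAzureBuilder;
use object_store::{path::Path, Error, ObjectStore, PutMode, PutOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical ADLS account/container, purely for illustration.
    let store = MicrosoftAzureBuilder::new()
        .with_account("myaccount")
        .with_container_name("mycontainer")
        .with_access_key("...")
        .build()?;

    let entry = Path::from("table/_delta_log/00000000000000000001.json");
    let commit = br#"{"commitInfo":{}}"#.to_vec(); // stand-in for a real log entry

    // PutMode::Create is translated into a conditional request (If-None-Match: *)
    // on backends that support it, such as ADLS - so this fails rather than
    // clobbering a log entry another writer committed first.
    let opts = PutOptions { mode: PutMode::Create, ..Default::default() };
    match store.put_opts(&entry, commit.into(), opts).await {
        Ok(_) => println!("log entry committed"),
        Err(Error::AlreadyExists { .. }) => {
            // Lost the race: re-read the log and retry with the next version.
            println!("version already exists - retry with the next log entry");
        }
        Err(e) => return Err(e.into()),
    }
    Ok(())
}
```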
Ion:
That makes sense! Thanks for explaining :) Unfortunate to hear, though, because that would make it difficult for us to start using this.
Regarding a native integration, do you reckon your team could work on a lakeFS client crate that implements the `ObjectStore` trait methods?
Or we could make it a joint effort here, because once such a thing is available I can integrate it directly into delta-rs and make it a first-class citizen of the library.
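(To make the idea concrete - a rough skeleton of what such a crate could look like. This is not an existing crate: `LakeFSObjectStore` and its fields are made up, and the trait signatures are recalled from object_store ~v0.11, so they should be checked against whichever version delta-rs pins.)

```rust
// Hypothetical skeleton of a lakeFS-native ObjectStore implementation.
// Method signatures assume object_store ~v0.11, which uses async_trait.
use async_trait::async_trait;
use futures::stream::BoxStream;
use object_store::{
    path::Path, GetOptions, GetResult, ListResult, MultipartUpload, ObjectMeta,
    ObjectStore, PutMultipartOpts, PutOptions, PutPayload, PutResult, Result,
};

/// Made-up store that would speak the lakeFS API directly, addressing objects
/// as repository/branch/path so every read and write is versioned by lakeFS.
#[derive(Debug)]
struct LakeFSObjectStore {
    endpoint: String,   // e.g. "https://lakefs.example.com/api/v1" (made up)
    repository: String,
    branch: String,
}

impl std::fmt::Display for LakeFSObjectStore {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "lakefs://{}/{}", self.repository, self.branch)
    }
}

#[async_trait]
impl ObjectStore for LakeFSObjectStore {
    async fn put_opts(&self, location: &Path, payload: PutPayload, opts: PutOptions) -> Result<PutResult> {
        // Would call the lakeFS upload-object API; PutMode::Create would map to
        // lakeFS "fail if exists" semantics, giving Delta an atomic commit.
        todo!()
    }
    async fn put_multipart_opts(&self, location: &Path, opts: PutMultipartOpts) -> Result<Box<dyn MultipartUpload>> {
        todo!()
    }
    async fn get_opts(&self, location: &Path, options: GetOptions) -> Result<GetResult> {
        todo!()
    }
    async fn delete(&self, location: &Path) -> Result<()> {
        todo!()
    }
    fn list(&self, prefix: Option<&Path>) -> BoxStream<'_, Result<ObjectMeta>> {
        todo!()
    }
    async fn list_with_delimiter(&self, prefix: Option<&Path>) -> Result<ListResult> {
        todo!()
    }
    async fn copy(&self, from: &Path, to: &Path) -> Result<()> {
        todo!()
    }
    async fn copy_if_not_exists(&self, from: &Path, to: &Path) -> Result<()> {
        todo!()
    }
}
```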
Oz Katz:
Sure! There's some Rust experience on the team, but not a huge amount - collaborating on this sounds like a great idea!
Ion:
Btw @Oz Katz, one question that still popped up about the Hadoop file system integration: the docs suggest that presigned mode does write operations directly to the storage, and mention that this is supported on Azure. Can I assume that in this mode, since it writes directly to ADLS, I don't need to bother with a locking client? Did I understand this correctly?
Oz Katz:
Well, not quite 🙂 There's currently no LogStore implementation for the Hadoop file system. What we typically recommend is using branching and merging to ensure safe writes: if every write is a sequence of creating an isolated branch, writing to the table, then committing and merging back into the original branch, any attempt to overwrite an existing log entry will (rightfully) result in a merge conflict. So yes, it works without a locking client, but at the moment it requires you to branch/write/merge yourself (see the sketch below). A LogStore implementation could do that for you transparently in the future.
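(A rough sketch of that branch/write/merge loop against the lakeFS REST API. The endpoint paths follow the lakeFS OpenAPI spec as recalled here, so verify them against your lakeFS version; authentication and the actual table write are elided, and `reqwest` with its `json` feature is assumed.)

```rust
// Sketch: isolated-branch write pattern for safe Delta commits via lakeFS.
use reqwest::Client;
use serde_json::json;

async fn isolated_write(client: &Client, base: &str, repo: &str) -> Result<(), reqwest::Error> {
    // 1. Create a throwaway branch per job run (the process id is just a
    //    stand-in for a real run id or UUID).
    let branch = format!("job-{}", std::process::id());
    client
        .post(format!("{base}/repositories/{repo}/branches"))
        .json(&json!({ "name": branch, "source": "main" }))
        .send().await?.error_for_status()?;

    // 2. Write the Delta table to the isolated branch here (elided) -
    //    e.g. through the S3 gateway at s3://{repo}/{branch}/path/to/table.

    // 3. Commit the changes on the branch.
    client
        .post(format!("{base}/repositories/{repo}/branches/{branch}/commits"))
        .json(&json!({ "message": "delta commit" }))
        .send().await?.error_for_status()?;

    // 4. Merge back into main. If another writer already merged a conflicting
    //    Delta log entry, this fails with a conflict instead of silently
    //    overwriting it - that is the safety property.
    client
        .post(format!("{base}/repositories/{repo}/refs/{branch}/merge/main"))
        .json(&json!({}))
        .send().await?.error_for_status()?;
    Ok(())
}
```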
Ion:
Ah I see, that seems doable - each job execution just needs its own random branch.
Any reason why it doesn't use the ADLS capability to do file locks here?
Oz Katz:
lakeFS provides indirection between a logical path and the actual stored object: `0001.json` on branch A needs to point to a different object than `0001.json` on branch B (for isolation). lakeFS makes sure the underlying objects are always unique, so even if ADLS supports a conditional put on an object, that condition won't help - whether a logical path already exists is knowledge that lakeFS has but ADLS doesn't.
Hope that's clear enough?
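(A toy model of that indirection - not lakeFS internals, just an illustration of why a conditional put on the physical object can't enforce uniqueness of the logical path.)

```rust
// Toy illustration: lakeFS-style mapping from logical path to physical object.
use std::collections::HashMap;

fn main() {
    // Logical view, keyed by (branch, path) - this mapping lives in lakeFS.
    let mut refs: HashMap<(&str, &str), String> = HashMap::new();

    // Two writers race on the same logical path, on different branches.
    // Each write lands on a fresh, unique physical key in ADLS:
    refs.insert(("branch-a", "_delta_log/0001.json"), format!("data/{}", next_id()));
    refs.insert(("branch-b", "_delta_log/0001.json"), format!("data/{}", next_id()));

    // An ADLS If-None-Match condition would apply to the physical keys, which
    // never collide - only lakeFS knows that both entries mean "0001.json".
    for ((branch, path), physical) in &refs {
        println!("{branch}/{path} -> {physical}");
    }
}

// Stand-in for a unique id generator, to keep the sketch dependency-free.
fn next_id() -> u64 {
    use std::sync::atomic::{AtomicU64, Ordering};
    static N: AtomicU64 = AtomicU64::new(0);
    N.fetch_add(1, Ordering::Relaxed)
}
```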
Ion:
That helps! Thanks @Oz Katz