# help
Waqas Zubairy:
Hi everyone, I have a question regarding integration of lakeFS with dashboarding tools like Power BI or Tableau. My background is more on the Azure side. How best can we integrate data in lakeFS with dashboarding tools?
e:
lakeFS should be connected to your Azure Blob or ADLS Gen2 storage. Your visualisation tool will then work over lakeFS, which provides an S3 interface. In other words, lakeFS is compatible with your Azure storage and exposes an S3 interface to the tools you use to access the data. Visualisation tools are one example; Azure Databricks is another. You can configure the visualisation tool to use the lakeFS storage API through the S3 gateway.
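For illustration, any S3-capable client can be pointed at the lakeFS S3 gateway. A minimal sketch with boto3, where the endpoint, repository name ("analytics"), and credentials are placeholders:

```python
import boto3

# The lakeFS S3 gateway treats the repository as the bucket and the branch
# as the first path element. Endpoint, repository and credentials below are
# placeholders for your own setup.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # your lakeFS server
    aws_access_key_id="<lakeFS access key>",
    aws_secret_access_key="<lakeFS secret key>",
)

# List objects on the "main" branch of the "analytics" repository.
resp = s3.list_objects_v2(Bucket="analytics", Prefix="main/")
for obj in resp.get("Contents", []):
    print(obj["Key"])
```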
Ariel Shaqed (Scolnicov):
Hi @Waqas Zubairy, that's a big question! My go-to method would be to start by configuring the dashboard to read from lakeFS using the S3 gateway.
Ion:
@Ariel Shaqed (Scolnicov) has there been any consideration of creating an Azure gateway API as well?
Ariel Shaqed (Scolnicov):
Hi @Ion, I'm not aware of any such plans. I imagine one way to make a compelling case for such a feature might be to show apps that have an Azure interface but no S3 interface, whose users could benefit from versioning.
Ion:
@Ariel Shaqed (Scolnicov) I would say the main reason is that Azure can place locks on files, while S3 can't. Currently, using lakeFS on Azure means you have to use the S3 gateway API, but all the locking client implementations out there, like DynamoDB, are AWS-only. So you end up with a worse solution than native ADLS, and you lose the ability to have concurrent writers.
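To make that trade-off concrete, here is a hedged sketch of writing a Delta table with polars/delta-rs through an S3-compatible endpoint (such as the lakeFS gateway) when no locking provider is available. The endpoint, repository, and option names are assumptions to verify against your delta-rs version:

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Placeholder endpoint and credentials; with the lakeFS S3 gateway the
# repository is the bucket and the branch is the first path element.
storage_options = {
    "AWS_ENDPOINT_URL": "https://lakefs.example.com",
    "AWS_ACCESS_KEY_ID": "<lakeFS access key>",
    "AWS_SECRET_ACCESS_KEY": "<lakeFS secret key>",
    # No DynamoDB-style locking provider is available outside AWS, so
    # delta-rs has to be told to skip its safe-rename guarantee:
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}

df.write_delta(
    "s3://analytics/main/tables/events",
    storage_options=storage_options,
)
```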
Ariel Shaqed (Scolnicov):
Hi @Ion, You are of course correct: the object store semantics of lakeFS are modelled more on S3 than on Azure. Our concurrency semantics are modelled on those of Git. So one plan for concurrent writes, for instance, is to branch out to a temporary branch, write, and merge back. A version of this is implemented in the OSS "high level" Python SDK wrapper. We have not yet written similar wrappers for other ecosystems. If there is a specific environment in which you are interested, could you please open an issue and provide details? At the very least it would help indicate demand, which might help prioritize it.
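For reference, the branch-out/write/merge-back pattern via the high-level Python SDK looks roughly like the sketch below. Repository, branch, and object names are placeholders, and the exact calls (`lakefs.repository`, `Branch.transact`, `upload`) should be checked against the SDK documentation:

```python
import lakefs

repo = lakefs.repository("analytics")   # placeholder repository name
main = repo.branch("main")

# Writes inside the transaction go to an ephemeral branch; on leaving the
# block the branch is committed and merged back into "main", so a conflicting
# concurrent write surfaces as a merge conflict instead of a silent overwrite.
with main.transact(commit_message="concurrent-safe write") as tx:
    tx.object("tables/events/data.parquet").upload(data=b"...", mode="wb")
```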
Personally I would add that Azure locks are leases, so the write operation is conditional any way you handle things. Leases can have better average performance under load than the optimistic concurrency of the lakeFS model, but to do this they significantly sacrifice their performance in bad cases (say, above p95). It is also not trivial what it means to commit a branch (the fundamental operation of lakeFS!) when one of the objects has been leased. Obviously a lease on a single object cannot be allowed to block commits; otherwise, leases on two or more objects could prevent commits. That leaves two surprising possible outcomes: either the server breaks leases on commit, or the lease survives. If leases break on commit, then their eventual writes fail, but the client is unaware until it next renews the lease. If leases survive, then a commit followed by branching out will give results that appear to break consistency. Users of tools rarely need to care about this, but we do. So I would prefer for us to discuss at the higher level of what table writing operations can do. We can obviously do better with concurrent writes.
Ion:
@Ariel Shaqed (Scolnicov) does lakeFS automatically resolve the merge conflict when you have two appending write operations in Delta with the same commit version, or will one of the two fail so that you have to retry? If it simply fails, that means we have to rewrite and try to commit again. In delta-rs, however, we have commit resolution built in, so the data is written only once and the commit is retried; only when even that resolution fails do you have to rewrite.
Ariel Shaqed (Scolnicov):
Currently it does not. We could do this in a Delta log writer. Would you be interested in talking about specific use cases? Things like size and frequency of writes, platform details, etc. We may have workarounds, or it might help us prioritize this.
Ion:
Sure. The biggest use case is a pipeline that's partitioned by 20 values (for example, product line); each step of this pipeline writes to a Delta table for that step using a partition overwrite action. Write frequency can be hundreds of times during the day while in active development, and we use ADLS for storage. This means that multiple writers will write to the same Delta table at the same time. Let's assume we are at table version 5 and three writers try to commit at the same time. Since we don't have locks (because S3), the last writer will overwrite the previous writers' commits with version 6. Using the lakeFS branch-and-commit approach you run into the same thing, except now you get a merge conflict, so only the first writer survives; the other two writers need to redo the write action completely, again with a chance of failure because now two writers are trying to write. This is currently a bit of a showstopper for using lakeFS in production.
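To illustrate the collision and retry being described, here is a hedged sketch of a branch-per-writer partition overwrite that retries on a merge conflict. The helper, the SDK calls, and the exception type are assumptions to verify, not a recommended implementation:

```python
import lakefs
from lakefs.exceptions import ConflictException  # exact exception type is an assumption

repo = lakefs.repository("analytics")            # placeholder repository name
main = repo.branch("main")

def write_partition_overwrite(branch) -> None:
    """Hypothetical helper: rewrite the partition's data files and append the
    next Delta log entry on the given branch, e.g. with delta-rs pointed at
    s3://analytics/<branch-name>/tables/step1 via the S3 gateway."""
    ...

def safe_overwrite(writer_id: str, attempts: int = 3) -> None:
    # Each attempt starts from the current "main", so a retry sees the Delta
    # version committed by whichever writer won the previous round.
    for attempt in range(attempts):
        branch = repo.branch(f"writer-{writer_id}-{attempt}").create(source_reference="main")
        write_partition_overwrite(branch)
        branch.commit(message=f"partition overwrite by writer {writer_id}")
        try:
            branch.merge_into(main)              # first writer to merge wins
            return
        except ConflictException:
            # Losing writers hit a merge conflict and must redo the whole
            # write, which is exactly the cost described above.
            continue
    raise RuntimeError("gave up after repeated merge conflicts")
```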
Ariel Shaqed (Scolnicov):
We completely understand the failure case. lakeFS can use Azure Blob Storage for its backend storage, but obviously that does not resolve your issues; it merely means we don't have another issue.
1. What is the desired behaviour on choosing concurrent writes?
2. Where is all of this running? Managed Spark / self-hosted Spark / Databricks / Power BI / pandas / ...?
3. Can you estimate the frequency of colliding writes?
4. How much data are we talking about?
Ion:
1. This question I don't understand; do you mean why I need concurrent writes?
2. Some writers are on k8s using polars + delta-rs; some other writers are on Databricks using delta-spark.
3. I guess 20-50 times per run, with a max of 100 jobs per day; kicking off a partitioned run triggers many subruns which could collide.
4. Between 1 and 200 million records.
Ariel Shaqed (Scolnicov):
Sorry about (1). I meant to ask what should happen during a concurrent write. For instance, a union of all rows written is commonly the required logic, but this is of course business logic. In any case, thanks for these details; I think they fit well with our product plans! I'd like to talk with our product team tomorrow and then get back to you, if that sounds okay?
Ion:
What should happen during a concurrent write depends on the type of write action; for appends or partitioned overwrites, I think it's a matter of retrying the commit with a higher version.
Sure let me know :)