# help
u
Following this integration tutorial on running lakeFS with Databricks, I think I am missing something there. is there a library I need to install on the databricks cluster to make them work? Which dependencies should be there?
u
Hey @Adi Polak, in the documentation there is only an explanation on using Databricks with the lakeFS S3 gateway. When using the lakeFS S3 gateway there is no need to install any library. You would only need to install a library if you would like to use the lakeFS-specific Hadoop FileSystem. Using the lakeFS-specific Hadoop FileSystem with Databricks is currently not documented; I am opening an issue for that. Thanks for pointing this out šŸ™‚
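In the meantime, here is a rough sketch of what wiring up the lakeFS-specific Hadoop FileSystem in Spark could look like. The Maven coordinate, version placeholder, and `fs.lakefs.*` keys below are assumptions based on the lakeFS client conventions, so treat it as an illustration rather than official Databricks instructions:

```python
# Hedged sketch: configuring the lakeFS-specific Hadoop FileSystem (lakeFSFS) in Spark.
# The Maven coordinate and the fs.lakefs.* keys are assumptions; check the lakeFS docs
# for the exact, current values. On Databricks you would typically install the jar as a
# cluster library instead of using spark.jars.packages.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Assumed artifact name; <version> is a placeholder
    .config("spark.jars.packages", "io.lakefs:hadoop-lakefs-assembly:<version>")
    # Register the lakefs:// scheme with the lakeFS FileSystem implementation
    .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
    # lakeFS API endpoint and credentials (placeholders)
    .config("spark.hadoop.fs.lakefs.endpoint", "https://lakefs.example.com/api/v1")
    .config("spark.hadoop.fs.lakefs.access.key", "<lakefs-access-key>")
    .config("spark.hadoop.fs.lakefs.secret.key", "<lakefs-secret-key>")
    .getOrCreate()
)

# Data paths then use the lakefs:// scheme: lakefs://<repo>/<branch>/<path>
df = spark.read.parquet("lakefs://repo/branch/path/to/table")
```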
u
thanks, Guy! what is the difference between the two options?
u
lakeFS is S3 compatible, meaning you could configure your S3 endpoint to be lakeFS. In Spark this is done by configuring `spark.hadoop.fs.s3a.endpoint`. In that case you would be using the S3 committer (which is already contained in Databricks), but instead of accessing S3 it would access lakeFS, so all the data goes through your lakeFS server. In order for your Spark job to write the data directly to S3 (your underlying storage), lakeFS provides the lakeFS-specific Hadoop FileSystem.
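For illustration, a minimal sketch of pointing Spark's S3A client at the lakeFS S3 gateway; the endpoint URL and credentials below are placeholders:

```python
# Hedged sketch: routing Spark's S3A client through the lakeFS S3 gateway.
# Endpoint and credentials are placeholders for your own lakeFS installation.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Point S3A at the lakeFS server instead of AWS S3
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    # Credentials issued by lakeFS, not your AWS keys
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs-access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs-secret-key>")
    # Path-style access is commonly needed when the endpoint is not AWS S3
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# With the gateway, the "bucket" is the repository and the first path segment is the branch
df = spark.read.parquet("s3a://repo/branch/path/to/table")
```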
u
@Guy Hardonag this is interesting. I imagine that if one configures the lakeFS-specific Hadoop FileSystem, it will use the "official" S3 endpoint and have a higher throughput
u
what about the commit index that you maintain in Postgres?
u
Good question. The lakeFS-specific Hadoop FileSystem still uses lakeFS for managing metadata; it's only the data itself that is written directly to S3.
u
@Edmondo Porcu you are absolutely correct, and so is @Guy Hardonag! The lakeFS Hadoop FileSystem (unofficially we call it "lakeFSFS" around here...) performs all data operations directly on S3, but all metadata operations still occur via lakeFS. So for instance, in order to read an object `lakefs://repo/branch/path/to/obj`, lakeFSFS will ask lakeFS to look up path `/path/to/obj` on branch `branch` of the repository named `repo`. lakeFS will reply with the object metadata, which includes a path `s3://storage/namespace/opaqueopaqueopaque` ("opaque" means that portion of the path is essentially meaningless). Now lakeFSFS calls out to the Hadoop S3AFileSystem to open that path. Writing is a bit more complex, but the details are the same: ask lakeFS for somewhere to write, upload the file directly to S3, tell lakeFS to link that file into its new location on whatever branch.
u
Silly of me, I posted before remembering @Tal Sofer wrote a great blog post about the lakeFS Hadoop FileSystem that goes into these details and more, complete with diagrams. Best to read that post if you are at all interested in using Spark with lakeFS and worried about scale!
u
Does lakeFSFS support Azure (Blob|ADLS)? It sounds like only S3-compatible storage is currently supported when using it. At least I could not figure out how to use lakeFSFS with Azure šŸ™‚
u
Hey @Micha Kunze, lakeFSFS currently supports only S3
u
Makes sense to have both. Can you point me to the code for the lakeFS Hadoop FileSystem?
u
It sure does make sense, and we are planning to support them in the future. But currently lakeFSFS uses `s3a`. You can find the code here.
u
Makes sense to have both.
@Edmondo Porcu this is certainly a future plan for us. We would love to hear about the use cases you and @Micha Kunze have!
u
I am not sure I understood šŸ˜ž
u
@Edmondo Porcu I’m not sure I understood 😁 can you help me understand where the confusion is?
u
Oh sorry, "makes sense to have both" referred to Azure!
u
Yes, that's how I understood this :)
u
I did not mean to hijack this thread 😬 I am mostly interested in using lakeFSFS for our Spark work/data pipelines. At our current scale (10-50TB/day) running the data through the server might not be a concern, but if we can avoid that we would like to.
u
My company runs exclusively in Azure today - hence the question. I was actually setting this up today and came across this thread while searching Slack.
u
Run MinIO on Kubernetes to provide an S3-like API on top of Azure? šŸ˜„
u
Yes, this was one of our instant reactions, but we would like to avoid running extra components just to run lakeFS, if possible.
u
Hi @Micha Kunze, I'm sorry that your use case might not be covered. Obviously this is something we want to support if there is a large enough user base on Azure. Could you please open an issue so that we can track it? It would be great if you could add some of your expected numbers to it.
u
No need to be sorry - I think for now we are OK just using lakeFS as an S3 endpoint with Spark. I will open an issue to track this. Regarding numbers, it is something we really need to test now on our side. Our team owns ~300TB at rest (compressed Parquet) and we crunch quite a bit of that volume every day (guessing 10-50TB, since we typically compute full snapshots).