# help
u
Following this integration tutorial on running lakeFS with Databricks, I think I am missing something there. is there a library I need to install on the databricks cluster to make them work? Which dependencies should be there?
u
Hey @Adi Polak, in the documentation there is only an explanation on using Databricks with the lakeFS S3 gateway. When using the lakeFS S3 gateway there is no need to install any library. You would only need to install a library if you would like to use the lakeFS-specific Hadoop FileSystem. Using the lakeFS-specific Hadoop FileSystem with Databricks is currently not documented; I am opening an issue for that. Thanks for pointing this out šŸ™‚
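In the meantime, here is a rough sketch of what wiring up the lakeFS-specific Hadoop FileSystem in Spark could look like. The Maven coordinate, version placeholder, and `fs.lakefs.*` keys below are assumptions based on the lakeFS client conventions, so treat it as an illustration rather than official Databricks instructions:

```python
# Hedged sketch: configuring the lakeFS-specific Hadoop FileSystem (lakeFSFS) in Spark.
# The Maven coordinate and the fs.lakefs.* keys are assumptions; check the lakeFS docs
# for the exact, current values. On Databricks you would typically install the jar as a
# cluster library instead of using spark.jars.packages.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Assumed artifact name; <version> is a placeholder
    .config("spark.jars.packages", "io.lakefs:hadoop-lakefs-assembly:<version>")
    # Register the lakefs:// scheme with the lakeFS FileSystem implementation
    .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
    # lakeFS API endpoint and credentials (placeholders)
    .config("spark.hadoop.fs.lakefs.endpoint", "https://lakefs.example.com/api/v1")
    .config("spark.hadoop.fs.lakefs.access.key", "<lakefs-access-key>")
    .config("spark.hadoop.fs.lakefs.secret.key", "<lakefs-secret-key>")
    .getOrCreate()
)

# Data paths then use the lakefs:// scheme: lakefs://<repo>/<branch>/<path>
df = spark.read.parquet("lakefs://repo/branch/path/to/table")
```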
u
thanks, Guy! what is the difference between the two options?
u
lakeFS is S3 compatible, meaning you could configure your S3 endpoint to be lakeFS. In Spark this is done by configuring `spark.hadoop.fs.s3a.endpoint`. In that case you would be using the S3 committer (which is already contained in Databricks), but instead of accessing S3 it would access lakeFS, so all the data goes through your lakeFS server. In order for your Spark job to write the data directly to S3 (your underlying storage), lakeFS provides the lakeFS-specific Hadoop FileSystem.
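For illustration, a minimal sketch of pointing Spark's S3A client at the lakeFS S3 gateway; the endpoint URL and credentials below are placeholders:

```python
# Hedged sketch: routing Spark's S3A client through the lakeFS S3 gateway.
# Endpoint and credentials are placeholders for your own lakeFS installation.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Point S3A at the lakeFS server instead of AWS S3
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    # Credentials issued by lakeFS, not your AWS keys
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs-access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs-secret-key>")
    # Path-style access is commonly needed when the endpoint is not AWS S3
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# With the gateway, the "bucket" is the repository and the first path segment is the branch
df = spark.read.parquet("s3a://repo/branch/path/to/table")
```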
u
@Guy Hardonag this is interesting. I imagine that if one configures the lakeFS-specific Hadoop FileSystem, it will use the "official" S3 endpoint and have a higher throughput
u
what about the commit index that you maintain in Postgres?
u
Good question. The lakeFS-specific Hadoop FileSystem still uses lakeFS for managing metadata; it's only the data itself that is written directly to S3.
u
@Edmondo Porcu you are absolutely correct, and so is @Guy Hardonag! The lakeFS Hadoop FileSystem (unofficially we call it "lakeFSFS" around here...) performs all data operations directly on S3, but all metadata operations still occur via lakeFS. So for instance, in order to read an object `lakefs://repo/branch/path/to/obj`, lakeFSFS will ask lakeFS to look up path `/path/to/obj` on branch `branch` of the repository named `repo`. lakeFS will reply with the object metadata, which includes a path `s3://storage/namespace/opaqueopaqueopaque` ("opaque" means that portion of the path is essentially meaningless). Now lakeFSFS calls out to the Hadoop S3AFileSystem to open that path. Writing is a bit more complex, but the details are the same: ask lakeFS for somewhere to write, upload the file directly to S3, tell lakeFS to link that file into its new location on whatever branch.
u
Silly of me, I posted before remembering @Tal Sofer wrote a great blog post about the lakeFS Hadoop FileSystem that goes into these details and more, complete with diagrams. Best to read that post if you are at all interested in using Spark with lakeFS and worried about scale!
u
Does lakeFSFS support Azure (Blob|ADLS)? It sounds like only S3-compatible storage is currently supported when using it. At least I could not figure out how to use lakeFSFS with Azure šŸ™‚
u
Hey @Micha Kunze, lakeFSFS currently supports only S3
u
Makes sense to have both. Can you point me to the code for the lakeFS Hadoop FileSystem?
u
It sure does make sense, and we are planning to support them in the future. But currently lakeFSFS uses `s3a`. You can find the code here.
u
Makes sense to have both.
@Edmondo Porcu this is certainly a future plan for us. We would love to hear about the use cases you and @Micha Kunze have!
u
I am not sure I understood šŸ˜ž
u
@Edmondo Porcu I’m not sure I understood 😁 can you help me understand where the confusion is?
u
Oh sorry, "makes sense to have both" referred to Azure!
u
Yes, that's how I understood this :)
u
I did not mean to hijack this thread 😬 I am mostly interested in using lakeFSFS for our Spark work/data pipelines. At our current scale (10-50TB/day) running the data through the server might not be a concern, but if we can avoid that we would like to.
u
My company runs exclusively in Azure today - hence the question. I was actually setting this up today and came across this thread while searching Slack.
u
Run MinIO on Kubernetes to provide an S3-like API on top of Azure? šŸ˜„
u
Yes, this was one of our instant reactions, but we would like to avoid running extra components just to run lakeFS, if possible.
u
Hi @Micha Kunze, I'm sorry that your use case might not be covered. Obviously this is something we want to support if there is a large enough user base on Azure. Could you please open an issue so that we can track it? It would be great if you could add some of your expected numbers to it.
u
No need to be sorry - I think for now we are OK just using lakeFS as an S3 endpoint with Spark. I will open an issue to track this. Regarding numbers, it is something we really need to test now on our side. Our team owns ~300TB at rest (compressed Parquet) and we crunch quite a bit of that volume every day (guessing 10-50TB, since we typically compute full snapshots).