# help
q
Hi, does anyone have issues with Databricks since today? I cannot write dataframes to lakeFS using Databricks today, but it worked fine yesterday and I did not change anything. I am using the latest runtime (12.0) and this Spark config:
```
spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint s3.eu-west-1.amazonaws.com
spark.hadoop.fs.lakefs.secret.key ...
spark.hadoop.fs.lakefs.access.key ...
spark.hadoop.fs.lakefs.endpoint http://...:8000/api/v1
spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
```
and `io.lakefs:hadoop-lakefs-assembly:0.1.9` installed on the cluster. When I try to write any data with Spark I get the following error:
```
java.io.IOException: get object metadata using underlying wrapped s3 client
```
In the stacktrace I also see:
```
Caused by: java.lang.NoSuchMethodException: com.databricks.sql.acl.fs.CredentialScopeFileSystem.getWrappedFs()
```
I am wondering if Databricks changed something 🤔 (Note that using Spark locally, or the Python client on Databricks, I can upload objects, so it seems really related to Spark on Databricks.)
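For context, the kind of Python-client upload mentioned above might look roughly like this. This is a minimal sketch, not the code used in the thread: the host, credentials, repository, branch and paths are placeholders, and attribute names can differ slightly between lakefs_client versions.

```python
# Minimal sketch of uploading an object via the lakeFS Python client (lakefs_client).
# All values are placeholders; depending on the client version the host may need the
# /api/v1 suffix.
import lakefs_client
from lakefs_client.client import LakeFSClient

configuration = lakefs_client.Configuration(host="http://<lakefs-host>:8000")
configuration.username = "<lakeFS access key>"
configuration.password = "<lakeFS secret key>"
client = LakeFSClient(configuration)

with open("/tmp/users.parquet", "rb") as f:
    client.objects.upload_object(
        repository="datamarts", branch="main", path="users/users.parquet", content=f
    )
```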
t
Hi @Quentin Nambot! Let me try to reproduce and get back to you 🙂 It sounds like Databricks potentially changed something. In the meantime, do you mind pasting the command you are running?
q
I tried a lot of different things, but basically writing a simple dataframe as Parquet/CSV fails, like:
```python
import random
df = spark.createDataFrame(data=[(f"uid{i}", random.randint(1, 100)) for i in range(10)], schema=["user", "age"])
df.write.parquet("lakefs://datamarts/main/users/")
```
t
Thanks 🙂
q
I am trying to see if using the `s3a` gateway fails too
👍 1
And it seems to work with the S3 gateway 🤔
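For reference, "using the S3 gateway" here means pointing the s3a connector at the lakeFS server itself instead of AWS, and writing to an s3a:// path of the form `<repo>/<branch>/...`. A minimal sketch, assuming the placeholder endpoint and credentials below, and assuming the cluster allows changing these Hadoop settings at runtime (on Databricks they are usually set in the cluster's Spark config instead):

```python
# Sketch of writing through the lakeFS S3 gateway instead of the lakeFS Hadoop filesystem.
# Endpoint and credentials are placeholders.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.endpoint", "http://<lakefs-host>:8000")   # the lakeFS server, not AWS S3
hconf.set("fs.s3a.access.key", "<lakeFS access key>")
hconf.set("fs.s3a.secret.key", "<lakeFS secret key>")
hconf.set("fs.s3a.path.style.access", "true")

df.write.parquet("s3a://datamarts/main/users/")              # s3a://<repo>/<branch>/<path>
```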
t
Will try to run it myself and let you know what I find! 🙂
Hi @Quentin Nambot! I managed to use the lakeFS filesystem to write with DBR 12.0 without running into any error. The only difference in my setup is that I'm using the default fs.s3a.endpoint because my data is in us-east-1, and I see that you are using s3.eu-west-1.amazonaws.com. I will open an issue for it and try to reproduce with that setup 🙂 Do you mind sharing the full stacktrace?
Opened this issue to track the problem https://github.com/treeverse/lakeFS/issues/4923
q
Thank you! I will try with a US bucket, and I'll send you the full stacktrace very soon.
๐Ÿ™ 1
Here is the full stacktrace:
t
Thank you!
q
I tried using a US bucket, without overriding `fs.s3a.endpoint`, and I still have the same issue 🤔 (I can read, but I cannot write.)
t
Thanks for the update! We are looking into it.
q
(Note that I attached an `s3:*` policy to my Databricks cluster to be sure that it is not related to IAM rights.)
😮 I found something interesting: with a No Isolation Shared cluster it works, but it doesn't work with a Single User cluster (same Spark config, same notebook).
t
Thanks for sharing your findings, that helps us move forward 🙂 On your end, are you OK with using the No Isolation Shared cluster access mode for now?
q
Yes totally 👌
:sunglasses_lakefs: 1
🤗 1
a
@Quentin Nambot, just to make sure: do you also have
```
spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
```
in your configuration? (This is the end of my longer comment on the issue.)
q
Yes. (Sorry, I forgot to send the message on the issue.) My complete Spark configuration is:
```
spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint s3.eu-west-1.amazonaws.com
spark.hadoop.fs.lakefs.secret.key ...
spark.hadoop.fs.lakefs.access.key ...
spark.hadoop.fs.lakefs.endpoint http://...:8000/api/v1
spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
```
And my only lib installed is `io.lakefs:hadoop-lakefs-assembly:0.1.9`.
a
Strange. I'll try to figure out where that `com.databricks.sql.acl.fs.CredentialScopeFileSystem` comes from. Did you run any Spark SQL code as part of your notebook or job? Context for this strangeness: lakeFSFS needs to get the AWS S3 client used by the S3A filesystem in order to call getObjectMetadata on the S3 object that S3A generates. So it tries a variety of methods in io.lakefs.MetadataClient.getObjectMetadata. The second one is to call a nonpublic method, S3AFileSystem.getAmazonS3Client... but I am beginning to suspect that in your case there's actually a different filesystem there!
I think I know how I will steal the AWS S3 client even through this one, but until I manage to reproduce the failure... it will be hard to test the fix.
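One quick way to check that suspicion from a notebook is to ask Hadoop which filesystem class actually serves s3a:// paths on the cluster. A rough diagnostic sketch (the bucket name is a placeholder):

```python
# Rough diagnostic: print the concrete Hadoop FileSystem class behind s3a:// on this cluster.
# On a Single User / Unity Catalog cluster this may turn out to be a wrapper such as
# com.databricks.sql.acl.fs.CredentialScopeFileSystem rather than the shaded S3AFileSystem
# that lakeFSFS expects to reach.
jvm = spark._jvm
hconf = spark._jsc.hadoopConfiguration()
uri = jvm.java.net.URI.create("s3a://some-bucket/")   # placeholder bucket
fs = jvm.org.apache.hadoop.fs.FileSystem.get(uri, hconf)
print(fs.getClass().getName())
```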
Ah! How do you authenticate to S3? That's different!
q
No, nothing; my entire notebook is:
a
That is strange: lakeFSFS writes directly to S3 through the S3AFileSystem. And that needs to authenticate to S3 AFAICT.
q
🤔 A Single User cluster can access Databricks Unity Catalog, so it is possible that the library is different there.
So it is possible that there is a different filesystem.
a
Oooh, @Tal Sofer, maybe we don't support Databricks Unity? Sounds like the "un" in "fun". 😕
😄 1
t
Interesting! We don't yet support Databricks Unity. But @Quentin Nambot, maybe you are using an instance profile to authenticate to S3?
q
Ok, so that's why! Yes, I am using an instance profile on my cluster.
a
Still not sure which of the two it is. A quick search for the filesystem in the error yields this SO answer (your bug report is the only other hit!), which also says "Unity Catalog". Right now I have a plan, but I would like to reproduce first. Sorry, it will take me a while because of some unrelated stuff. If I cannot reproduce by Wednesday I might have to trouble you with testing solutions. I would rather not do that because it may take multiple attempts: the solution involves Java reflection, so type checking for silly bugs that I make will only happen when it runs.
q
It seems like Single User clusters have different libs underneath to support Unity Catalog... it doesn't surprise me, it's not the first time I've had magical issues like this with Databricks.
a
Thanks for your offer! Right now we suspect it is related to using Unity Catalog, possibly in relation to using a Single User cluster. I've updated the issue accordingly with short-term and medium-term proposals to fix it. We hope to have access to Unity Catalog soon, which will allow us to continue the work. I prefer to do it directly because of the number of dependencies involved. A principal issue is that all the relevant code uses reflection to access JVM code for which we have no defined interface and which uses classes that are not available at compile time, so there is no type checking and trivial errors are likely. Having our own cluster will shorten the loop, save time, and be less annoying for everyone.
๐Ÿ‘ 1