# help
q
Hi, does anyone have issues with Databricks since today? I cannot write dataframes to lakeFS using Databricks today, but it worked fine yesterday and I did not change anything. I am using the latest runtime (12.0) and this Spark config:
```
spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint s3.eu-west-1.amazonaws.com
spark.hadoop.fs.lakefs.secret.key ...
spark.hadoop.fs.lakefs.access.key ...
spark.hadoop.fs.lakefs.endpoint http://...:8000/api/v1
spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
```
and `io.lakefs:hadoop-lakefs-assembly:0.1.9` installed on the cluster. When I try to write any data with Spark I get the following error:
```
java.io.IOException: get object metadata using underlying wrapped s3 client
```
In the stacktrace I also see:
```
Caused by: java.lang.NoSuchMethodException: com.databricks.sql.acl.fs.CredentialScopeFileSystem.getWrappedFs()
```
I am wondering if Databricks changed something 🤔 (Note that using Spark locally, or the Python client on Databricks, I can upload objects, so it seems really related to Spark on Databricks.)
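For context, the kind of Python-client upload mentioned above might look roughly like this. This is a minimal sketch, not the code used in the thread: the host, credentials, repository, branch and paths are placeholders, and attribute names can differ slightly between lakefs_client versions.

```python
# Minimal sketch of uploading an object via the lakeFS Python client (lakefs_client).
# All values are placeholders; depending on the client version the host may need the
# /api/v1 suffix.
import lakefs_client
from lakefs_client.client import LakeFSClient

configuration = lakefs_client.Configuration(host="http://<lakefs-host>:8000")
configuration.username = "<lakeFS access key>"
configuration.password = "<lakeFS secret key>"
client = LakeFSClient(configuration)

with open("/tmp/users.parquet", "rb") as f:
    client.objects.upload_object(
        repository="datamarts", branch="main", path="users/users.parquet", content=f
    )
```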
t
Hi @Quentin Nambot! Let me try to reproduce and get back to you 🙂 It sounds like Databricks potentially changed something. In the meantime, do you mind pasting the command you are running?
q
I tried a lot of different things, but basically writing a simple dataframe as Parquet/CSV fails, like:
```python
import random
df = spark.createDataFrame(data=[(f"uid{i}", random.randint(1, 100)) for i in range(10)], schema=["user", "age"])
df.write.parquet("lakefs://datamarts/main/users/")
```
t
Thanks 🙂
q
I am trying to see if using the `s3a` gateway fails too
👍 1
And it seems to work with the S3 gateway 🤔
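For reference, "using the S3 gateway" here means pointing the s3a connector at the lakeFS server itself instead of AWS, and writing to an s3a:// path of the form `<repo>/<branch>/...`. A minimal sketch, assuming the placeholder endpoint and credentials below, and assuming the cluster allows changing these Hadoop settings at runtime (on Databricks they are usually set in the cluster's Spark config instead):

```python
# Sketch of writing through the lakeFS S3 gateway instead of the lakeFS Hadoop filesystem.
# Endpoint and credentials are placeholders.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.endpoint", "http://<lakefs-host>:8000")   # the lakeFS server, not AWS S3
hconf.set("fs.s3a.access.key", "<lakeFS access key>")
hconf.set("fs.s3a.secret.key", "<lakeFS secret key>")
hconf.set("fs.s3a.path.style.access", "true")

df.write.parquet("s3a://datamarts/main/users/")              # s3a://<repo>/<branch>/<path>
```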
t
Will try to run it myself and let you know what I find! 🙂
Hi @Quentin Nambot! I managed to use the lakeFS filesystem to write with DBR 12.0 without running into any error. The only difference in my setup is that I'm using the default fs.s3a.endpoint because my data is in us-east-1, and I see that you are using s3.eu-west-1.amazonaws.com. I will open an issue for it and try to reproduce with that setup 🙂 Do you mind sharing the full stacktrace?
Opened this issue to track the problem https://github.com/treeverse/lakeFS/issues/4923
q
Thank you! I will try with a US bucket, and I'll send you the full stacktrace very soon.
๐Ÿ™ 1
Here is the full stacktrace:
t
Thank you!
q
I tried using a US bucket, without overriding `fs.s3a.endpoint`, and I still have the same issue 🤔 (I can read, but I cannot write.)
t
Thanks for the update! We are looking into it.
q
(Note that I attached an `s3:*` policy to my Databricks cluster to be sure that it is not related to IAM rights.)
😮 I found something interesting: with a No Isolation Shared cluster it works, but it doesn't work with a Single User cluster (same Spark config, same notebook).
t
Thanks for sharing your findings, that helps us move forward 🙂 On your end, are you OK with using the No Isolation Shared cluster access mode for now?
q
Yes totally 👌
:sunglasses_lakefs: 1
🤗 1
a
@Quentin Nambot, just to make sure: do you also have
```
spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
```
in your configuration? (This is the end of my longer comment on the issue.)
q
Yes. (Sorry, I forgot to send the message on the issue.) My complete Spark configuration is:
```
spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint s3.eu-west-1.amazonaws.com
spark.hadoop.fs.lakefs.secret.key ...
spark.hadoop.fs.lakefs.access.key ...
spark.hadoop.fs.lakefs.endpoint http://...:8000/api/v1
spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
```
And my only lib installed is `io.lakefs:hadoop-lakefs-assembly:0.1.9`.
a
Strange. I'll try to figure out where that `com.databricks.sql.acl.fs.CredentialScopeFileSystem` comes from. Did you run any Spark SQL code as part of your notebook or job? Context for this strangeness: lakeFSFS needs to get the AWS S3 client used by the S3A filesystem in order to call getObjectMetadata on the S3 object that S3A generates. So it tries a variety of methods in io.lakefs.MetadataClient.getObjectMetadata. The second one is to call a nonpublic method, S3AFileSystem.getAmazonS3Client... but I am beginning to suspect that in your case there's actually a different filesystem there!
I think I know how I will steal the AWS S3 client even through this one, but until I manage to reproduce the failure... it will be hard to test the fix.
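One quick way to check that suspicion from a notebook is to ask Hadoop which filesystem class actually serves s3a:// paths on the cluster. A rough diagnostic sketch (the bucket name is a placeholder):

```python
# Rough diagnostic: print the concrete Hadoop FileSystem class behind s3a:// on this cluster.
# On a Single User / Unity Catalog cluster this may turn out to be a wrapper such as
# com.databricks.sql.acl.fs.CredentialScopeFileSystem rather than the shaded S3AFileSystem
# that lakeFSFS expects to reach.
jvm = spark._jvm
hconf = spark._jsc.hadoopConfiguration()
uri = jvm.java.net.URI.create("s3a://some-bucket/")   # placeholder bucket
fs = jvm.org.apache.hadoop.fs.FileSystem.get(uri, hconf)
print(fs.getClass().getName())
```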
Ah! How do you authenticate to S3? That's different!
q
No, nothing; my entire notebook is:
a
That is strange: lakeFSFS writes directly to S3 through the S3AFileSystem. And that needs to authenticate to S3 AFAICT.
q
🤔 A Single User cluster can access Databricks Unity Catalog, so it is possible that the library is different there.
So it is possible that there is a different filesystem.
a
Oooh, @Tal Sofer, maybe we don't support Databricks Unity? Sounds like the "un" in "fun". 😕
😄 1
t
Interesting! We don't yet support Databricks Unity. But @Quentin Nambot, maybe you are using an instance profile to authenticate to S3?
q
Ok, so that's why! Yes, I am using an instance profile on my cluster.
a
Still not sure which of the two it is. A quick search for the filesystem in the error yields this SO answer (your bug report is the only other hit!), which also says "Unity Catalog". Right now I have a plan, but I would like to reproduce first. Sorry, it will take me a while because of some unrelated stuff. If I cannot reproduce by Wednesday I might have to trouble you with testing solutions. I would rather not do that because it may take multiple attempts: the solution involves Java reflection, so type checking for silly bugs that I make will only happen when it runs.
q
It seems like Single User clusters have different libs underneath to support Unity Catalog... it doesn't surprise me, it's not the first time I've had magical issues like this with Databricks.
a
Thanks for your offer! Right now we suspect it is related to using Unity Catalog, possibly in relation to using a Single User cluster. I've updated the issue accordingly with short-term and medium-term proposals to fix it. We hope to have access to Unity Catalog soon, which will allow us to continue the work. I prefer to do it directly because of the number of dependencies involved. A principal issue is that all the relevant code uses reflection to access JVM code for which we have no defined interface and which uses classes that are not available at compile time, so there is no type checking and trivial errors are likely. Having our own cluster will shorten the loop, save time, and be less annoying for everyone.
๐Ÿ‘ 1