# help
Clinton Monk
Hi! I am trying to use the LakeFSFileSystem in Databricks Spark, but I get an error when writing to a lakefs:// URI:
Copy code
Caused by: java.lang.NoSuchMethodException: com.databricks.s3a.S3AFileSystem.getWrappedFs()
The command fails when running in Databricks Runtimes 7.3 LTS and 9.1 LTS. However, the command succeeds in Databricks Runtime 6.4. Reading a lakefs:// URI works in all three of those Databricks Runtimes. Is this an issue anyone else has experienced? Does anyone have guidance on how I could get this to work for Databricks Runtime 9.1 LTS (Spark 3.1)?
👀 1
(attached: full logs.txt)
I'm using s3://treeverse-clients-us-east/hadoop/hadoop-lakefs-assembly-0.1.6.jar
Edmondo Porcu
Hi Clinton, not a LakeFS expert but a JVM one here. This problem is typically due to multiple versions of the same class being available in different locations on the Java classpath.
So, it really depends on how your classes are loaded by the JVM in memory. You can do some interesting things to understand what happens:
Copy code
Class.forName("com.databricks.s3a.S3AFileSystem").getClassLoader()
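You could also check from the same notebook whether the method lakeFS reflects on is even declared. A minimal sketch, assuming the Databricks class is on the driver classpath:
Copy code
// Which classloader served the class, and does this build still declare the
// method lakeFS looks up reflectively?
val cls = Class.forName("com.databricks.s3a.S3AFileSystem")
println(cls.getClassLoader)
println(cls.getDeclaredMethods.exists(_.getName == "getWrappedFs"))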
Barak Amar
Currently we have an open issue about using the lakeFS filesystem with Spark 3.1: https://github.com/treeverse/lakeFS/issues/2336
❤️ 1
Edmondo Porcu
You can use this Scala snippet in a notebook to see where the S3AFileSystem is loaded from. I imagine the fat-assembly procedure for hadoop-lakefs-assembly might need to be updated.
Copy code
val s3AFileSystemClass = Class.forName("com.databricks.s3a.S3AFileSystem")
s3AFileSystemClass.getProtectionDomain().getCodeSource().getLocation()
👍 1
👍🏼 1
Copy code
res1: java.net.URL = file:/databricks/jars/s3--s3-spark_3.2_2.12_deploy.jar
Barak Amar
I assume that the exception occurs because of the write operation, which tries to access the object metadata after the write completes.
Edmondo Porcu
@Barak Amar it looks like a class loading exception
Copy code
Caused by: java.lang.NoSuchMethodException: com.databricks.s3a.S3AFileSystem.getWrappedFs()
    at java.lang.Class.getDeclaredMethod(Class.java:2130)
    at io.lakefs.MetadataClient.getObjectMetadata(MetadataClient.java:72)
Barak Amar
The call comes from lakeFS's getObjectMetadata.
We try to access the underlying Hadoop filesystem's S3 information and have a couple of fallbacks.
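Roughly, the shape of that reflective access. This is only an illustrative sketch, not the actual MetadataClient code; the getWrappedFs() name is taken from the stack trace above:
Copy code
import org.apache.hadoop.fs.FileSystem

// Sketch: reflectively unwrap a Databricks S3AFileSystem to reach the
// underlying filesystem, falling back when the method is not declared.
def tryUnwrap(fs: FileSystem): Option[FileSystem] =
  try {
    val m = fs.getClass.getDeclaredMethod("getWrappedFs") // absent on DBR 7.3/9.1
    m.setAccessible(true)
    Some(m.invoke(fs).asInstanceOf[FileSystem])
  } catch {
    case _: NoSuchMethodException => None // the next fallback takes over here
  }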
Barak Amar
True. We considered using an additional S3 client (maybe we will revisit this idea) to access the new object after the write completes, in order to read its real metadata.
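In notebook terms the idea would look something like this. A sketch using the AWS SDK v1; the bucket and key are placeholders:
Copy code
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// A separate S3 client, independent of the Hadoop filesystem, used to HEAD
// the freshly written object. Bucket and key are placeholders.
val s3 = AmazonS3ClientBuilder.defaultClient()
val meta = s3.getObjectMetadata("some-bucket", "path/to/new/object")
println(meta.getETag) // the "real" metadata of the object, e.g. its ETag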
Clinton Monk
I wonder if it is also because Databricks had a change in its S3 file system in Runtime 7:
org.apache.hadoop.fs.s3native.NativeS3FileSystem and org.apache.hadoop.fs.s3.S3FileSystem are no longer supported for accessing S3.
We strongly encourage you to use com.databricks.s3a.S3AFileSystem, which is the default for s3a://, s3://, and s3n:// file system schemes in Databricks Runtime. If you need assistance with migration to com.databricks.s3a.S3AFileSystem, contact Databricks support or your Databricks representative.
https://docs.databricks.com/release-notes/runtime/7.x-migration.html
Edmondo Porcu
You can even run that code yourself, @Clinton Monk, in a notebook on the different runtimes.
Barak Amar
The first implementation we wrote used AWS's s3a implementation - having access to the real S3 client under the underlying filesystem solves a lot of issues vs. connecting to the underlying storage with an additional client you'd have to embed.
Clinton Monk
Maybe that new filesystem is missing part of the interface lakeFS expects.
Barak Amar
Pretty sure - we will need to look into the changes and address it in our implementation.
Edmondo Porcu
@Barak Amar are these jars (s3--s3-spark_3.2_2.12, s3-spark) available somewhere? Or are they Databricks-maintained and closed source?
Barak Amar
Multiple S3A implementations plus different versions produce a large matrix of libs to support. We tried to address the common ones at the time we wrote this code.
@Edmondo Porcu closed source
Edmondo Porcu
Makes sense 😞 That's annoying; it looks like having integration tests that capture these funny behaviours is the only way to catch these issues 😞
Barak Amar
Yes, feel free to add a comment on the issue (or +1 it); this helps us focus on the specific versions users are using.
👍 1
Ryan Green
@Barak Amar thanks for the information! What's the timeline for supporting this? I think this is going to be necessary for us to include lakeFS in the solution we're thinking of implementing. cc: @Iddo Avneri
👍 1
Barak Amar
Hi Ryan, will discuss this tomorrow with the team and update here.
👍 1
❤️ 1
Yoni Augarten
Hey @Ryan Green @Clinton Monk, just to keep you in the loop: I will be investigating this issue and will try to come up with a solution for you
❤️ 2
jumping lakefs 4
Clinton Monk
Thanks! In case it is useful, here are some of my findings from looking into it a bit more yesterday: I tried configuring a Runtime 9.1 cluster (Spark 3.1) to use org.apache.hadoop.fs.s3.S3FileSystem rather than the lakeFS-incompatible com.databricks.s3a.S3AFileSystem. I had to install hadoop-aws to add it back. However, I ran into a class-incompatibility issue: the Hadoop S3FileSystem uses the public jets3t library, but Databricks has its own jets3t library installed in the cluster, and classes from one didn't seem to be compatible with the other (or perhaps I had a library version mismatch somewhere). I didn't go much further, but it feels like the solution is probably to update MetadataClient.getObjectMetadata() to be compatible with the new com.databricks.s3a.S3AFileSystem. I'll leave that decision up to you all though!
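For reference, the cluster Spark config for that experiment looked roughly like this (a sketch; the exact scheme key depends on which scheme your paths go through):
Copy code
fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem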
🙏🏻 1
👀 1
Yoni Augarten
Updating here after discussing in private:
1. When I tried to reproduce the problem, I found that on my clusters, shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem was used as the filesystem for s3a. This is consistent across different Databricks Runtime versions. Only when I manually set the filesystem to com.databricks.s3a.S3AFileSystem was I able to reproduce the problem.
2. We will update the code to support the Databricks proprietary filesystem.
3. I will contact Databricks support to understand why we are getting different filesystems.
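You can check which implementation your own cluster resolves with something like this in a Scala notebook; the bucket URI is a placeholder:
Copy code
import java.net.URI
import org.apache.hadoop.fs.FileSystem

// Ask Hadoop which FileSystem class backs the s3a scheme on this cluster.
val fs = FileSystem.get(new URI("s3a://some-bucket/"), spark.sparkContext.hadoopConfiguration)
println(fs.getClass.getName)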
👏 3
Clinton Monk
After some follow-up, we got it working with Runtime 9.1 LTS (Spark 3.1) by using this setting:
Copy code
fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
Thanks @Yoni Augarten and @Barak Amar! 🙌 🍾
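If you want to apply it per notebook instead of in the cluster Spark config, something like this should work too, assuming it runs before the s3a filesystem is first instantiated:
Copy code
// Same value as the cluster setting above; Hadoop caches filesystems per
// scheme, so set it before the first s3a access.
spark.sparkContext.hadoopConfiguration.set(
  "fs.s3a.impl",
  "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem")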
🙏 4
🙏🏻 1