# help
Clinton Monk
Hi! I am trying to use the LakeFSFileSystem in Databricks Spark, but I get an error when writing to a lakefs:// URI:
Copy code
Caused by: java.lang.NoSuchMethodException: com.databricks.s3a.S3AFileSystem.getWrappedFs()
The command fails when running in Databricks Runtimes 7.3 LTS and 9.1 LTS. However, the command succeeds in Databricks Runtime 6.4. Reading a lakefs:// URI works in all three of those Databricks Runtimes. Is this an issue anyone else has experienced? Does anyone have guidance on how I could get this to work for Databricks Runtime 9.1 LTS (Spark 3.1)?
👀 1
(attached: full logs.txt)
I'm using s3://treeverse-clients-us-east/hadoop/hadoop-lakefs-assembly-0.1.6.jar
Edmondo Porcu
Hi Clinton, not a LakeFS expert but a JVM one here. This problem is typically due to multiple versions of the same class being available in different locations on the Java classpath.
So, it really depends on how your classes are loaded by the JVM in memory. You can do some interesting things to understand what happens:
Copy code
Class.forName("com.databricks.s3a.S3AFileSystem").getClassLoader()
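You could also check from the same notebook whether the method lakeFS reflects on is even declared. A minimal sketch, assuming the Databricks class is on the driver classpath:
Copy code
// Which classloader served the class, and does this build still declare the
// method lakeFS looks up reflectively?
val cls = Class.forName("com.databricks.s3a.S3AFileSystem")
println(cls.getClassLoader)
println(cls.getDeclaredMethods.exists(_.getName == "getWrappedFs"))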
Barak Amar
Currently we have an open issue about using the lakeFS filesystem with Spark 3.1: https://github.com/treeverse/lakeFS/issues/2336
❤️ 1
Edmondo Porcu
You can use this Scala snippet in a notebook to see where the S3AFileSystem is loaded from. I imagine the fat-assembly procedure for hadoop-lakefs-assembly might need to be updated.
Copy code
val s3AFileSystemClass = Class.forName("com.databricks.s3a.S3AFileSystem")
s3AFileSystemClass.getProtectionDomain().getCodeSource().getLocation()
👍 1
👍🏼 1
Copy code
res1: java.net.URL = file:/databricks/jars/s3--s3-spark_3.2_2.12_deploy.jar
Barak Amar
I assume that the exception occurs because of the write operation, which tries to access the object metadata after the write completes.
Edmondo Porcu
@Barak Amar it looks like a class loading exception
Copy code
Caused by: java.lang.NoSuchMethodException: com.databricks.s3a.S3AFileSystem.getWrappedFs()
    at java.lang.Class.getDeclaredMethod(Class.java:2130)
    at io.lakefs.MetadataClient.getObjectMetadata(MetadataClient.java:72)
Barak Amar
The call comes from lakeFS's getObjectMetadata.
We try to access the underlying Hadoop filesystem's S3 information and have a couple of fallbacks.
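Roughly, the shape of that reflective access. This is only an illustrative sketch, not the actual MetadataClient code; the getWrappedFs() name is taken from the stack trace above:
Copy code
import org.apache.hadoop.fs.FileSystem

// Sketch: reflectively unwrap a Databricks S3AFileSystem to reach the
// underlying filesystem, falling back when the method is not declared.
def tryUnwrap(fs: FileSystem): Option[FileSystem] =
  try {
    val m = fs.getClass.getDeclaredMethod("getWrappedFs") // absent on DBR 7.3/9.1
    m.setAccessible(true)
    Some(m.invoke(fs).asInstanceOf[FileSystem])
  } catch {
    case _: NoSuchMethodException => None // the next fallback takes over here
  }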
Barak Amar
True. We considered using an additional S3 client (maybe we will revisit this idea) to access the new object after the write completes, in order to read its real metadata.
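In notebook terms the idea would look something like this. A sketch using the AWS SDK v1; the bucket and key are placeholders:
Copy code
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// A separate S3 client, independent of the Hadoop filesystem, used to HEAD
// the freshly written object. Bucket and key are placeholders.
val s3 = AmazonS3ClientBuilder.defaultClient()
val meta = s3.getObjectMetadata("some-bucket", "path/to/new/object")
println(meta.getETag) // the "real" metadata of the object, e.g. its ETag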
Clinton Monk
I wonder if it is also because Databricks had a change in its S3 file system in Runtime 7:
org.apache.hadoop.fs.s3native.NativeS3FileSystem and org.apache.hadoop.fs.s3.S3FileSystem are no longer supported for accessing S3.
We strongly encourage you to use com.databricks.s3a.S3AFileSystem, which is the default for s3a://, s3://, and s3n:// file system schemes in Databricks Runtime. If you need assistance with migration to com.databricks.s3a.S3AFileSystem, contact Databricks support or your Databricks representative.
https://docs.databricks.com/release-notes/runtime/7.x-migration.html
Edmondo Porcu
You can even run that code yourself, @Clinton Monk, in a notebook on the different runtimes.
Barak Amar
The first implementation we wrote used AWS's s3a implementation - having access to the real S3 client under the underlying filesystem solves a lot of issues vs. connecting to the underlying storage with an additional client you'd have to embed.
Clinton Monk
Maybe that new filesystem is missing part of the interface lakeFS expects.
Barak Amar
Pretty sure - we will need to look into the changes and address it in our implementation.
Edmondo Porcu
@Barak Amar are these jars (s3--s3-spark_3.2_2.12, s3-spark) available somewhere? Or are they Databricks-maintained and closed source?
Barak Amar
Multiple S3A implementations plus different versions produce a large matrix of libs to support. We tried to address the common ones at the time we wrote this code.
@Edmondo Porcu closed source
Edmondo Porcu
Makes sense 😞 That's annoying; it looks like having integration tests that capture these funny behaviours is the only way to catch these issues 😞
Barak Amar
Yes, feel free to add a comment on the issue (or +1 it); this helps us focus on the specific versions users are using.
👍 1
Ryan Green
@Barak Amar thanks for the information! What's the timeline for supporting this? I think this is going to be necessary for us to include lakeFS in the solution we're thinking of implementing. cc: @Iddo Avneri
👍 1
Barak Amar
Hi Ryan, will discuss this tomorrow with the team and update here.
👍 1
❤️ 1
Yoni Augarten
Hey @Ryan Green @Clinton Monk, just to keep you in the loop: I will be investigating this issue and will try to come up with a solution for you
❤️ 2
jumping lakefs 4
Clinton Monk
Thanks! In case it is useful, here are some of my findings from looking into it a bit more yesterday: I tried configuring a Runtime 9.1 cluster (Spark 3.1) to use org.apache.hadoop.fs.s3.S3FileSystem rather than the lakeFS-incompatible com.databricks.s3a.S3AFileSystem. I had to install hadoop-aws to add it back. However, I ran into a class-incompatibility issue: the Hadoop S3FileSystem uses the public jets3t library, but Databricks has its own jets3t library installed in the cluster, and classes from one didn't seem to be compatible with the other (or perhaps I had a library version mismatch somewhere). I didn't go much further, but it feels like the solution is probably to update MetadataClient.getObjectMetadata() to be compatible with the new com.databricks.s3a.S3AFileSystem. I'll leave that decision up to you all though!
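For reference, the cluster Spark config for that experiment looked roughly like this (a sketch; the exact scheme key depends on which scheme your paths go through):
Copy code
fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem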
🙏🏻 1
👀 1
Yoni Augarten
Updating here after discussing in private:
1. When I tried to reproduce the problem, I found that on my clusters, shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem was used as the filesystem for s3a. This is consistent across different Databricks Runtime versions. Only when I manually set the filesystem to com.databricks.s3a.S3AFileSystem was I able to reproduce the problem.
2. We will update the code to support the Databricks proprietary filesystem.
3. I will contact Databricks support to understand why we are getting different filesystems.
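You can check which implementation your own cluster resolves with something like this in a Scala notebook; the bucket URI is a placeholder:
Copy code
import java.net.URI
import org.apache.hadoop.fs.FileSystem

// Ask Hadoop which FileSystem class backs the s3a scheme on this cluster.
val fs = FileSystem.get(new URI("s3a://some-bucket/"), spark.sparkContext.hadoopConfiguration)
println(fs.getClass.getName)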
👏 3
Clinton Monk
After some follow-up, we got it working with Runtime 9.1 LTS (Spark 3.1) by using this setting:
Copy code
fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
Thanks @Yoni Augarten and @Barak Amar! 🙌 🍾
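If you want to apply it per notebook instead of in the cluster Spark config, something like this should work too, assuming it runs before the s3a filesystem is first instantiated:
Copy code
// Same value as the cluster setting above; Hadoop caches filesystems per
// scheme, so set it before the first s3a access.
spark.sparkContext.hadoopConfiguration.set(
  "fs.s3a.impl",
  "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem")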
🙏 4
🙏🏻 1