# help
u
Hi Team, I’m trying to write a Parquet file to lakeFS for the first time. I’m working with Scala, locally, using SBT. I included this in my dependencies:
libraryDependencies += "io.lakefs" % "hadoop-lakefs" % "0.1.6"
Yet when trying to write out a dataframe:
outputDf.write.parquet(s"lakefs://${repo}/${branch}/example.parquet")
I’m getting:
Exception in thread "main" org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "lakefs"
I’m executing the program in IntelliJ, in a main method of an object. What am I missing here? Maybe I should explicitly mention --packages somewhere in the run configuration, even though lakefs is in the dependencies?
u
Hey Gideon! I think that there are some missing configurations, in this case fs.lakefs.impl. Our docs list all the needed configurations; I would start with the one above, plus fs.lakefs.access.key, fs.lakefs.secret.key & fs.lakefs.endpoint.
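For reference, the same settings can also be passed when the session is built, via Spark’s spark.hadoop.* prefix (Spark copies those into the Hadoop configuration). This is just a sketch: the keys and endpoint below are the placeholder examples from the docs, and local[*] is only for a local run:
import org.apache.spark.sql.SparkSession

// sketch only: endpoint and keys are placeholders, replace with your own lakeFS values
val spark = SparkSession.builder()
  .appName("lakefs-poc")
  .master("local[*]")
  .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
  .config("spark.hadoop.fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE")
  .config("spark.hadoop.fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
  .config("spark.hadoop.fs.lakefs.endpoint", "https://lakefs.example.com/api/v1")
  .getOrCreate()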
u
Let me know if that doesn’t work for you.
u
Thanks Itai. I’ve put these in hdfs-site.xml, which I placed in the folder /usr/local/Cellar/hadoop/3.3.1/libexec/etc/hadoop. And in the run configurations I set SPARK_HOME to that location. The question is whether it was actually picked up by Spark…
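One quick way to check whether the hdfs-site.xml values were actually picked up is to print them from the running session (assuming spark is your SparkSession); a small sanity-check sketch:
// prints null if the property never reached Spark's Hadoop configuration
println(spark.sparkContext.hadoopConfiguration.get("fs.lakefs.impl"))
println(spark.sparkContext.hadoopConfiguration.get("fs.lakefs.endpoint"))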
u
I guess you’re right, can you try passing the configurations explicitly and not with hdfs-site?
u
Or set it in your code for now:
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "https://lakefs.example.com/api/v1")
u
Ok, I can try. But don’t you think it’s something related to the lakefs jar not being taken into account?
u
If the jar wasn’t loaded, then this configuration won’t work either, but at least we’ll have narrowed down the problem.
u
ok, I’ll check, thanks
u
Thanks Itai - Looks like setting the values explicitly in Scala has helped (though I still wonder how I should make it discover the XML). So now I’m getting a different error when trying to write out the parquet:
22/06/29 11:12:41 WARN FileSystem: Failed to initialize fileystem s3a://lakefs-poc/: java.lang.NumberFormatException: For input string: "64M"
22/06/29 11:12:41 WARN FileSystem: Failed to initialize fileystem lakefs://lakefs-poc-repo/from-app-2022-06-28/urlf-example-01.parquet: java.lang.NumberFormatException: For input string: "64M"
Exception in thread "main" java.lang.NumberFormatException: For input string: "64M"
.....
I saw some threads suggesting this might be related to a version incompatibility between Spark and Hadoop, but my dependencies are all aligned to version 3.2.1:
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1"
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "3.2.1"
libraryDependencies += "org.apache.hadoop" % "hadoop-auth" % "3.2.1"
libraryDependencies += "io.lakefs" % "hadoop-lakefs" % "0.1.6"
u
The same happens, by the way, if I try to write it out as CSV:
outputDf.write.csv(s"lakefs://${repo}/${branch}/hello-world.csv")
u
I think there’s a dependency mismatch. Can you try to use hadoop-common & hadoop-auth at version 2.7.7?
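In build.sbt that change would look something like:
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.7.7"
libraryDependencies += "org.apache.hadoop" % "hadoop-auth" % "2.7.7"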
u
I don’t think it likes it:
Exception in thread "main" java.lang.NoSuchMethodError: 'void org.apache.hadoop.security.HadoopKerberosName.setRuleMechanism(java.lang.String)'
	at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:84)
	at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:315)
	at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:300)
	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:575)
That’s the same error I was getting before I realized I should add hadoop-common and hadoop-auth as dependencies in the first place.
u
Hi Gideon, I can see that you use Hadoop 3.3.1; unfortunately, lakeFSFS is built for Hadoop 2.7.7. Would it be possible for you to use Spark (whichever version) with Hadoop 2.7.7? For example, this Spark version.
u
Is that a bundle of spark 3.2.1 with hadoop 2.7.7?
u
Why are these things bundled?
u
Thanks Jonathan. In any case, it is important for me to point out that I’m not using HDFS here. For my POC I’m using a local MinIO (in the same docker-compose as lakeFS). In production I’m planning to use our corporate installation of Scality (which in the future may change to Amazon S3). What does that mean in terms of the Hadoop jar versions I have to use?
u
Spark can work with different versions of Hadoop's API. This is a bundle of Spark with the Hadoop jars that are compatible with Hadoop 2.7.x (which is the Hadoop API version compatible with lakeFSFS). This has nothing to do with HDFS, so we're cool 😎 If you download this Spark version you'll have all the necessary Hadoop API jars, and you'll be able to work with lakeFSFS. If you need any help configuring the environment please let me know!
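If it helps, here is a small sketch for double-checking which Hadoop version actually ends up on the runtime classpath (VersionInfo ships with hadoop-common):
import org.apache.hadoop.util.VersionInfo

// prints the version of the hadoop-common jar that was actually loaded
println(s"Hadoop version on classpath: ${VersionInfo.getVersion}")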
u
Ok, thanks Jonathan. Can this version be downloaded only manually? Is it not available via mavencentral or something?
u
You can download it manually or build it from the source code. You can see the options here. Personally, I think it would be much quicker to download it manually. You can always run a Spark Docker image, but remember to specify the correct Hadoop version.
u
I’m now getting this when trying to read a parquet file (i.e. upon spark.read.parquet):
Symbol 'type org.apache.spark.sql.Row' is missing from the classpath.
This symbol is required by 'type org.apache.spark.sql.DataFrame'.
Make sure that type Row is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
A full rebuild may help if 'package.class' was compiled against an incompatible version of org.apache.spark.sql.
I did sbt clean and even deleted the target folder, but the error persists…
u
When you say you commented them out of the dependencies, what do you mean? Where are they located?
u
You are using a Spark installation to run your Spark program. The jars you added might be shadowed by, or dependent on, other jars that are located in SPARK_HOME/jars. Do you mind downloading this, configuring SPARK_HOME to be <the spark-3.2.1-bin-hadoop2.7 directory> in the terminal you use to run the Spark program, and letting me know if it works?
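A related sbt pattern, for what it's worth (not something suggested in the thread): when running against a Spark installation picked up via SPARK_HOME, the Spark artifacts are often marked as provided so that only the jars in SPARK_HOME/jars end up on the runtime classpath. A hypothetical build.sbt excerpt:
// hypothetical sketch; versions should match the downloaded Spark/Hadoop bundle
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1" % "provided"
libraryDependencies += "io.lakefs" % "hadoop-lakefs" % "0.1.6"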
u
Sorry, was in a sequence of meetings. The dependencies are listed in the build.sbt file.
u
The thing is that I’m trying to run this in intellij…
u
That's ok, you can add an environment variable to your run configuration. Download the version, point SPARK_HOME in your IntelliJ run configuration to that location, and run it. It's supposed to work...
u
ok, I’ll try
u
thanks
u
Sure thing. Please update with the results 🙂
u
Hi, Since I’m running directly in IntelliJ (and not via command-line spark-submit), I placed the jars from the tgz in the lib folder, and commented out the Spark dependencies from build.sbt. I’m using Scala version 2.12.12. There was an exception that the version of jackson-databind needs to be 2.12.*, so I explicitly put 2.12.7 in build.sbt. So what I’m getting now is an IndexOutOfBounds when trying to read the parquet file (an error that I wasn’t getting before):
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 34826 out of bounds for length 198
	at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
	at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
	at com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
	at com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
	at com.fasterxml.jackson.module.scala.introspect.JavaParameterIntrospector$.getCtorParamNames(JavaParameterIntrospector.scala:12)
	at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:41)
	at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$2(BeanIntrospector.scala:61)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:61)
	at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$23(BeanIntrospector.scala:203)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.map(TraversableLike.scala:285)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
	at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$18(BeanIntrospector.scala:197)
	at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$18$adapted(BeanIntrospector.scala:194)
	at scala.collection.immutable.List.flatMap(List.scala:366)
	at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:194)
	at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:154)
	at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.fieldName(ScalaAnnotationIntrospectorModule.scala:165)
	at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.findImplicitPropertyName(ScalaAnnotationIntrospectorModule.scala:46)
	at com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findImplicitPropertyName(AnnotationIntrospectorPair.java:502)
	at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector._addFields(POJOPropertiesCollector.java:530)
	at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:421)
	at com.faste…
u
Hi Gideon, Playing the dependency game can be painful because the jars themselves might depend on each other. I would advise you to follow the steps here (look for IntelliJ) to work with SBT in IntelliJ. It also gives an example of how to use Hadoop 2.7.
u
Hi Jonathan. I’m now retrying this with Gradle and command-line spark-submit. Is there a compatible combination of Maven Central (or similar) dependencies I can write in my build.gradle, or do I have to unzip that tar.gz directly into my lib folder?
u
Hi Gideon, I'm not aware of such a combination. As I mentioned above, it would be much faster if you used a pre-bundled Spark home directory with every necessary jar. Other than that, you'll have to figure out the interdependencies of hadoop-common-2.7.x, hadoop-aws-2.7.x and the other jars your application code needs.
u
ok, I’ll try. With spark submit it should be easier. Thanks.
u
LMK if it helped!
u
Getting this when attempting to write:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class io.lakefs.LakeFSFileSystem not found
Tried to add --packages io.lakefs:lakefs-spark-client-301_2.12:0.1.8 but it doesn’t seem to help…
u
But the error persists…
u
Ok, when I look at the contents of the resulting jar, I see that this class wasn’t packed inside…
u
Did you try running the resulting jar with the --packages io.lakefs:hadoop-lakefs-assembly:0.1.6 flag?
u
can you share the command that you use to run the spark job?
u
Thanks, it looks like we’re getting somewhere! I now have this:
AmazonHttpClient: Unable to execute HTTP request: lakefs-poc.localhost: nodename nor servname provided, or not known
lakefs-poc is the name of my bucket in the MinIO behind lakeFS.
u
I think you need to add the configuration for the path style..
u
Tried this:
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "false")
u
Doesn’t seem to have an effect. Is there some other style you can suggest?
u
Can you share the stack trace of the error? It’s hard to tell at which point it fails.
u
java.net.UnknownHostException: lakefs-poc.minio: nodename nor servname provided, or not known
	at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:932)
	at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1517)
	at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:851)
	at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1507)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1366)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1300)
	at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45)
	at org.apache.http.impl.conn.DefaultClientConnectionOperator.resolveHostname(DefaultClientConnectionOperator.java:263)
	at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:162)
	at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:326)
	at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:605)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:440)
	at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:835)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
	at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at io.lakefs.LakeFSFileSystem.initializeWithClient(LakeFSFileSystem.java:93)
	at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:67)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:461)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:556)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
	at com.tutorial.spark.TopDomainRankingsLakeFsJob$.main(TopDomainRankingsLakeFsJob.scala:32)
	at com.tutorial.spark.TopDomainRankingsLakeFsJob.main(TopDomainRankingsLakeFsJob.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.ref…
u
I tried these two options of pointing to minio:
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "http://localhost:9000")
and
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "http://minio:9000")
That is, since “minio” is the name of the service in docker compose. The error seems to be the same in both cases. Let me re-check….
u
Looks pretty much the same in the case of localhost as well:
java.net.UnknownHostException: lakefs-poc.localhost: nodename nor servname provided, or not known
	at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:932)
	at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1517)
	at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:851)
	at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1507)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1366)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1300)
	at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45)
	at org.apache.http.impl.conn.DefaultClientConnectionOperator.resolveHostname(DefaultClientConnectionOperator.java:263)
	at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:162)
	at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:326)
	at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:605)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:440)
	at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:835)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
	at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at io.lakefs.LakeFSFileSystem.initializeWithClient(LakeFSFileSystem.java:93)
	at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:67)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:461)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:556)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
	at com.tutorial.spark.TopDomainRankingsLakeFsJob$.main(TopDomainRankingsLakeFsJob.scala:32)
	at com.tutorial.spark.TopDomainRankingsLakeFsJob.main(TopDomainRankingsLakeFsJob.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(…
u
You’re running in docker-compose, so you need to add a link. docker-compose knows how to route the host minio to that container, but not lakefs-poc.minio.
u
Add this to the definition of your spark container:
links:
      - minio:lakefs-poc.minio
u
Wow… Docker-compose doesn’t deal with subdomains out of the box?….
u
If I’m not mistaken 🤷
u
Ok, I’ll check how to do that, thanks. So in Spark I should use the storage service hostname that is resolvable from the lakefs service (and not necessarily from the driver), right? i.e. in my case - minio and not localhost, right?
u
It should be resolvable from both. The lakeFS server needs storage access. Spark also needs direct storage access, because the lakeFS client sometimes uploads directly to the storage.
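Putting the pieces from this thread together, a minimal sketch of the Spark-side settings for lakeFS over a local MinIO might look like this (endpoints and credentials are placeholders, and the s3a endpoint must be resolvable as discussed above):
val hc = spark.sparkContext.hadoopConfiguration

// lakeFS filesystem, API endpoint and lakeFS credentials
hc.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
hc.set("fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE")
hc.set("fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
hc.set("fs.lakefs.endpoint", "https://lakefs.example.com/api/v1")

// direct access to the underlying MinIO storage (the lakeFS client also writes data directly)
hc.set("fs.s3a.endpoint", "http://minio:9000")
hc.set("fs.s3a.access.key", "<minio-access-key>")
hc.set("fs.s3a.secret.key", "<minio-secret-key>")
hc.set("fs.s3a.path.style.access", "true")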
u
Please use:
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
We currently don't support virtual hosted style…
u
Hi Jonathan, Should setting
fs.s3a.path.style.access
to true prevent it from making requests to buckets using subdomains?
u
Yes
u
Strange. Even with it set to true, I’m still getting: java.net.UnknownHostException: lakefs-poc.minio
u
Is it possible that you left out this piece of code:
links:
      - minio:lakefs-poc.minio
u
Thanks. I had to add both this piece of code and a similar reference in /etc/hosts, to route lakefs-poc.minio to localhost. It would be too good to be true, though, for it to finally work 🤣 So here’s what I’m getting now:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: null, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: null
	at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
	at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at io.lakefs.LakeFSFileSystem.initializeWithClient(LakeFSFileSystem.java:93)
	at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:67)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:461)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:556)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
u
Looks like an auth issue, did you set your MinIO and lakeFS credentials in Spark appropriately?
u
It looks like it’s failing on a doesBucketExist error. So can you check the following: 1. The bucket does exist. 2. As @Itai Admi said, that the lakeFS configuration is correct, especially the s3 endpoint part.
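If it's easier to test that from the Spark side, here is a small sketch that triggers the same doesBucketExist check with the current s3a settings (assuming spark is your SparkSession and lakefs-poc is the bucket):
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// initializing an s3a filesystem for the bucket goes through the same doesBucketExist call
val fs = FileSystem.get(new URI("s3a://lakefs-poc/"), spark.sparkContext.hadoopConfiguration)
println(fs.exists(new Path("/")))  // should print true if the endpoint and credentials are right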
u
/Users/gcatz/Downloads/spark-3.2.1-bin-hadoop2.7/bin/spark-submit --class com.tutorial.spark.TopDomainRankingsLakeFsJob --conf spark.driver.extraJavaOptions="--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" --packages io.lakefs:hadoop-lakefs-assembly:0.1.6 build/libs/simple-scala-spark-gradle.jar
u
RELEASE.2022-06-02T02-11-04Z