user
06/29/2022, 7:33 AM
libraryDependencies += "io.lakefs" % "hadoop-lakefs" % "0.1.6"
Yet when trying to write out a dataframe:
outputDf.write.parquet(s"lakefs://${repo}/${branch}/example.parquet")
I’m getting:
Exception in thread "main" org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "lakefs"
I’m executing the program in IntelliJ, in a main method of an object.
What am I missing here? Maybe I should explicitly mention --packages somewhere in the run configuration, even though lakefs is in the dependencies?
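When the program is launched from the IDE there is no spark-submit step, so a --packages flag would have nothing to act on; the client jar has to come from the build file instead. A minimal build.sbt sketch, assuming the hadoop-lakefs-assembly artifact that comes up later in this thread:
```
// build.sbt sketch: put the lakeFS Hadoop client on the application classpath for IDE runs.
// The coordinates follow the assembly artifact used later in the thread.
libraryDependencies += "io.lakefs" % "hadoop-lakefs-assembly" % "0.1.6"
```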
user
06/29/2022, 7:40 AM
fs.lakefs.impl: our docs list all the needed configurations. I would start with that one, plus fs.lakefs.access.key, fs.lakefs.secret.key & fs.lakefs.endpoint.
user
06/29/2022, 7:41 AM
user
06/29/2022, 7:46 AM
/usr/local/Cellar/hadoop/3.3.1/libexec/etc/hadoop.
And in the run configurations I set SPARK_HOME to that location.
The question is whether it was actually picked up by spark…
user
06/29/2022, 7:49 AM
hdfs-site?
user
06/29/2022, 7:50 AM
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "<https://lakefs.example.com/api/v1>")
user
06/29/2022, 7:50 AM
user
06/29/2022, 7:51 AM
user
06/29/2022, 7:52 AM
user
06/29/2022, 8:23 AM
22/06/29 11:12:41 WARN FileSystem: Failed to initialize fileystem s3a://lakefs-poc/: java.lang.NumberFormatException: For input string: "64M"
22/06/29 11:12:41 WARN FileSystem: Failed to initialize fileystem lakefs://lakefs-poc-repo/from-app-2022-06-28/urlf-example-01.parquet: java.lang.NumberFormatException: For input string: "64M"
Exception in thread "main" java.lang.NumberFormatException: For input string: "64M"
.....
I saw some threads that this might be related to version incompatibility between spark and hadoop, but my dependencies are all aligned to version 3.2.1:
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1"
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "3.2.1"
libraryDependencies += "org.apache.hadoop" % "hadoop-auth" % "3.2.1"
libraryDependencies += "io.lakefs" % "hadoop-lakefs" % "0.1.6"
user
06/29/2022, 8:34 AM
outputDf.write.csv(s"lakefs://${repo}/${branch}/hello-world.csv")
user
06/29/2022, 8:35 AM
hadoop-common & hadoop-auth at version 2.7.7
user
06/29/2022, 8:50 AM
Exception in thread "main" java.lang.NoSuchMethodError: 'void org.apache.hadoop.security.HadoopKerberosName.setRuleMechanism(java.lang.String)'
at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:84)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:315)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:300)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:575)
That’s the same error I was getting before I realized I should add hadoop-common and hadoop-auth as dependencies in the first place.
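setRuleMechanism only exists in newer Hadoop security classes, so this NoSuchMethodError generally means the hadoop-common doing the call is newer than the hadoop-auth that actually got loaded. One way to see which versions end up on the classpath (a sketch, assuming sbt):
```
# Sketch: list the resolved Hadoop jars to spot a 2.x / 3.x mix on the classpath.
sbt "show Compile/dependencyClasspath" | tr ',' '\n' | grep hadoop
```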
user
06/29/2022, 8:53 AM
user
06/29/2022, 9:09 AM
user
06/29/2022, 9:09 AM
user
06/29/2022, 9:10 AM
user
06/29/2022, 9:18 AM
user
06/29/2022, 9:23 AM
user
06/29/2022, 11:58 AM
Now I’m getting this error (on spark.read.parquet):
Symbol 'type org.apache.spark.sql.Row' is missing from the classpath.
This symbol is required by 'type org.apache.spark.sql.DataFrame'.
Make sure that type Row is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
A full rebuild may help if 'package.class' was compiled against an incompatible version of org.apache.spark.sql.
I did sbt clean and even deleted the target folder, but the error persists…
user
06/29/2022, 12:16 PM
user
06/29/2022, 12:32 PM
Can you try setting SPARK_HOME to <the spark-3.2.1-bin-hadoop2.7 directory> in the terminal you use to run the Spark program, and let me know if it works?
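A sketch of what that might look like; the path, main class and jar name are placeholders (the actual working invocation appears at the end of the thread):
```
# Sketch: point SPARK_HOME at the downloaded distribution and launch through its
# spark-submit, so that distribution's Hadoop jars are the ones used at runtime.
export SPARK_HOME="$HOME/Downloads/spark-3.2.1-bin-hadoop2.7"
"$SPARK_HOME/bin/spark-submit" --class com.example.MyLakeFSJob target/scala-2.12/my-app.jar
```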
user
06/29/2022, 2:59 PM
build.sbt file.
user
06/29/2022, 3:00 PM
user
06/29/2022, 3:02 PM
user
06/29/2022, 3:04 PM
user
06/29/2022, 3:04 PM
user
06/29/2022, 3:05 PM
user
06/30/2022, 5:48 AM
build.sbt.
I’m using Scala version 2.12.12.
There was an exception that the version of jackson-databind needs to be 2.12.*, so I explicitly put 2.12.7 in build.sbt.
So what I’m getting now is an IndexOutOfBounds when trying to read the parquet file (an error that I wasn’t getting before):
```
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 34826 out of bounds for length 198
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
at com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
at com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
at com.fasterxml.jackson.module.scala.introspect.JavaParameterIntrospector$.getCtorParamNames(JavaParameterIntrospector.scala:12)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:41)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$2(BeanIntrospector.scala:61)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:61)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$23(BeanIntrospector.scala:203)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.map(TraversableLike.scala:285)
at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$18(BeanIntrospector.scala:197)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$18$adapted(BeanIntrospector.scala:194)
at scala.collection.immutable.List.flatMap(List.scala:366)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:194)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:154)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.fieldName(ScalaAnnotationIntrospectorModule.scala:165)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.findImplicitPropertyName(ScalaAnnotationIntrospectorModule.scala:46)
at com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findImplicitPropertyName(AnnotationIntrospectorPair.java:502)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector._addFields(POJOPropertiesCollector.java:530)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:421)
at com.faste…
```
user
06/30/2022, 8:11 AM
user
07/03/2022, 11:06 AM
Can I just add it to build.gradle, or do I have to unzip that tar.gz directly into my lib folder?
user
07/03/2022, 11:15 AM
user
07/03/2022, 11:19 AM
user
07/03/2022, 12:23 PM
user
07/03/2022, 1:32 PM
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class io.lakefs.LakeFSFileSystem not found
Tried to add --packages io.lakefs:lakefs-spark-client-301_2.12:0.1.8 but it doesn’t seem to help…
user
07/03/2022, 1:51 PM
user
07/03/2022, 1:54 PM
user
07/03/2022, 2:02 PM
Did you try the --packages io.lakefs:hadoop-lakefs-assembly:0.1.6 flag?
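The io.lakefs.LakeFSFileSystem class ships with the hadoop-lakefs-assembly artifact, not with the lakefs-spark-client package tried above, so passing the assembly through --packages should make the class resolvable at runtime. A sketch of the invocation, with a placeholder main class and jar (the full working command shows up at the end of the thread):
```
# Sketch: load the lakeFS Hadoop FileSystem via --packages; class and jar are placeholders.
spark-submit \
  --packages io.lakefs:hadoop-lakefs-assembly:0.1.6 \
  --class com.example.MyLakeFSJob \
  build/libs/my-app.jar
```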
user
07/03/2022, 2:03 PM
user
07/03/2022, 2:12 PM
AmazonHttpClient: Unable to execute HTTP request: lakefs-poc.localhost: nodename nor servname provided, or not known
lakefs-poc is the name of my bucket in the MinIO behind lakeFS.
user
07/03/2022, 2:31 PM
user
07/03/2022, 2:31 PM
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "false")
user
07/03/2022, 2:32 PM
user
07/03/2022, 2:35 PM
user
07/03/2022, 2:36 PM
user
07/03/2022, 2:38 PM
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "http://localhost:9000")
and
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "http://minio:9000")
That is, since “minio” is the name of the service in docker compose.
The error seems to be the same in both cases. Let me re-check…
user
07/03/2022, 2:39 PM
user
07/03/2022, 2:40 PM
Docker Compose maps minio to that container, but not lakefs-poc.minio
user
07/03/2022, 2:41 PM
links:
- minio:lakefs-poc.minio
user
07/03/2022, 2:42 PM
user
07/03/2022, 2:42 PM
user
07/03/2022, 2:45 PM
minio and not localhost, right?
user
07/03/2022, 2:48 PM
lakeFS server needs storage access. Spark also needs direct storage access because the lakeFS client sometimes uploads directly to the storage.
user
07/03/2022, 2:49 PM
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
We currently don't support virtual hosted style…
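Putting the pieces together for this setup, the S3A side of the configuration would look roughly like the sketch below (endpoint and credentials are placeholders for the MinIO behind lakeFS). With path-style access the request goes to http://minio:9000/lakefs-poc/... instead of http://lakefs-poc.minio:9000/..., which is why the unresolved lakefs-poc.minio hostname stops being an issue.
```
// Sketch: direct S3A access to the MinIO backing lakeFS, using path-style requests.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "http://minio:9000")    // placeholder endpoint
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<minio-access-key>") // placeholder
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<minio-secret-key>") // placeholder
```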
user
07/04/2022, 6:56 AM
Doesn’t setting fs.s3a.path.style.access to true prevent it from making requests to buckets using subdomains?
user
07/04/2022, 7:01 AM
user
07/04/2022, 7:04 AM
user
07/04/2022, 7:05 AM
links:
- minio:lakefs-poc.minio
user
07/04/2022, 7:38 AM
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: null, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: null
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at io.lakefs.LakeFSFileSystem.initializeWithClient(LakeFSFileSystem.java:93)
at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:67)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:461)
at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:556)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
user
07/04/2022, 8:03 AM
user
07/04/2022, 8:40 AM
That’s a doesBucketExist error.
So can you check the following:
1. The bucket does exist.
2. As @Itai Admi said, that the lakeFS configuration is correct, especially the s3 endpoint part.
user
07/06/2022, 9:12 AM
/Users/gcatz/Downloads/spark-3.2.1-bin-hadoop2.7/bin/spark-submit --class com.tutorial.spark.TopDomainRankingsLakeFsJob --conf spark.driver.extraJavaOptions="--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" --packages io.lakefs:hadoop-lakefs-assembly:0.1.6 build/libs/simple-scala-spark-gradle.jar
user
07/06/2022, 10:00 AM