# help
u
Hi Team, I’m trying to write a Parquet file to lakeFS for the first time. I’m working with Scala, locally, using SBT. I included this in my dependencies:
libraryDependencies += "io.lakefs" % "hadoop-lakefs" % "0.1.6"
Yet when trying to write out a dataframe:
outputDf.write.parquet(s"lakefs://${repo}/${branch}/example.parquet")
I’m getting:
Exception in thread "main" org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "lakefs"
I’m executing the program in IntelliJ, in a main method of an object. What am I missing here? Maybe I should explicitly mention --packages somewhere in the run configuration, even though lakefs is in the dependencies?
u
Hey Gideon! I think that there are some missing configurations, in this case fs.lakefs.impl. Our docs list all the needed configurations; I would start with the one above, plus fs.lakefs.access.key, fs.lakefs.secret.key & fs.lakefs.endpoint.
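For reference, the same settings can also be passed when the session is built, via Spark’s spark.hadoop.* prefix (Spark copies those into the Hadoop configuration). This is just a sketch: the keys and endpoint below are the placeholder examples from the docs, and local[*] is only for a local run:
import org.apache.spark.sql.SparkSession

// sketch only: endpoint and keys are placeholders, replace with your own lakeFS values
val spark = SparkSession.builder()
  .appName("lakefs-poc")
  .master("local[*]")
  .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
  .config("spark.hadoop.fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE")
  .config("spark.hadoop.fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
  .config("spark.hadoop.fs.lakefs.endpoint", "https://lakefs.example.com/api/v1")
  .getOrCreate()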
u
Let me know if that doesn’t work for you.
u
Thanks Itai. I’ve put these in hdfs-site.xml, which I placed in the folder /usr/local/Cellar/hadoop/3.3.1/libexec/etc/hadoop. And in the run configurations I set SPARK_HOME to that location. The question is whether it was actually picked up by Spark…
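One quick way to check whether the hdfs-site.xml values were actually picked up is to print them from the running session (assuming spark is your SparkSession); a small sanity-check sketch:
// prints null if the property never reached Spark's Hadoop configuration
println(spark.sparkContext.hadoopConfiguration.get("fs.lakefs.impl"))
println(spark.sparkContext.hadoopConfiguration.get("fs.lakefs.endpoint"))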
u
I guess you’re right, can you try passing the configurations explicitly and not with hdfs-site?
u
Or set it in your code for now:
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "https://lakefs.example.com/api/v1")
u
Ok, I can try. But don’t you think it’s something related to the lakefs jar not being taken into account?
u
If the jar wasn’t loaded, then this configuration won’t work either, but at least we’ll have narrowed down the problem.
u
ok, I’ll check, thanks
u
Thanks Itai - Looks like setting the values explicitly in Scala has helped (though I still wonder how I should make it discover the XML). So now I’m getting a different error when trying to write out the parquet:
22/06/29 11:12:41 WARN FileSystem: Failed to initialize fileystem s3a://lakefs-poc/: java.lang.NumberFormatException: For input string: "64M"
22/06/29 11:12:41 WARN FileSystem: Failed to initialize fileystem lakefs://lakefs-poc-repo/from-app-2022-06-28/urlf-example-01.parquet: java.lang.NumberFormatException: For input string: "64M"
Exception in thread "main" java.lang.NumberFormatException: For input string: "64M"
.....
I saw some threads suggesting this might be related to a version incompatibility between Spark and Hadoop, but my dependencies are all aligned to version 3.2.1:
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1"
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "3.2.1"
libraryDependencies += "org.apache.hadoop" % "hadoop-auth" % "3.2.1"
libraryDependencies += "io.lakefs" % "hadoop-lakefs" % "0.1.6"
u
The same happens, by the way, if I try to write it out as CSV:
outputDf.write.csv(s"lakefs://${repo}/${branch}/hello-world.csv")
u
I think there’s a dependency mismatch. Can you try to use hadoop-common & hadoop-auth at version 2.7.7?
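In build.sbt that change would look something like:
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.7.7"
libraryDependencies += "org.apache.hadoop" % "hadoop-auth" % "2.7.7"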
u
I don’t think it likes it:
Exception in thread "main" java.lang.NoSuchMethodError: 'void org.apache.hadoop.security.HadoopKerberosName.setRuleMechanism(java.lang.String)'
	at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:84)
	at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:315)
	at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:300)
	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:575)
That’s the same error I was getting before I realized I should add hadoop-common and hadoop-auth as dependencies in the first place.
u
Hi Gideon, I can see that you use Hadoop 3.3.1; unfortunately, lakeFSFS is built for Hadoop 2.7.7. Would it be possible for you to use Spark (whichever version) with Hadoop 2.7.7? For example, this Spark version.
u
Is that a bundle of spark 3.2.1 with hadoop 2.7.7?
u
Why are these things bundled?
u
Thanks Jonathan. In any case, it is important for me to point out that I’m not using HDFS here. For my POC I’m using a local MinIO (in the same docker-compose as lakeFS). In production I’m planning to use our corporate installation of Scality (which in the future may change to Amazon S3). What does that mean in terms of the Hadoop jar versions I have to use?
u
Spark can work with different versions of Hadoop's API. This is a bundle of Spark with the Hadoop jars that are compatible with Hadoop 2.7.x (which is the Hadoop API version compatible with lakeFSFS). This has nothing to do with HDFS, so we're cool 😎 If you download this Spark version you'll have all the necessary Hadoop API jars, and you'll be able to work with lakeFSFS. If you need any help configuring the environment please let me know!
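If it helps, here is a small sketch for double-checking which Hadoop version actually ends up on the runtime classpath (VersionInfo ships with hadoop-common):
import org.apache.hadoop.util.VersionInfo

// prints the version of the hadoop-common jar that was actually loaded
println(s"Hadoop version on classpath: ${VersionInfo.getVersion}")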
u
Ok, thanks Jonathan. Can this version be downloaded only manually? Is it not available via mavencentral or something?
u
You can download it manually or build it from the source code. You can see the options here. Personally, I think it would be much quicker to download it manually. You can always run a Spark Docker image, but remember to specify the correct Hadoop version.
u
I’m now getting this when trying to read a parquet file (i.e. upon spark.read.parquet):
Symbol 'type org.apache.spark.sql.Row' is missing from the classpath.
This symbol is required by 'type org.apache.spark.sql.DataFrame'.
Make sure that type Row is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
A full rebuild may help if 'package.class' was compiled against an incompatible version of org.apache.spark.sql.
I did sbt clean and even deleted the target folder, but the error persists…
u
When you say you commented them out of the dependencies, what do you mean? Where are they located?
u
You are using a Spark installation to run your Spark program. The jars you added might be shadowed by, or dependent on, other jars that are located in SPARK_HOME/jars. Do you mind downloading this, configuring SPARK_HOME to be <the spark-3.2.1-bin-hadoop2.7 directory> in the terminal you use to run the Spark program, and letting me know if it works?
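A related sbt pattern, for what it's worth (not something suggested in the thread): when running against a Spark installation picked up via SPARK_HOME, the Spark artifacts are often marked as provided so that only the jars in SPARK_HOME/jars end up on the runtime classpath. A hypothetical build.sbt excerpt:
// hypothetical sketch; versions should match the downloaded Spark/Hadoop bundle
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1" % "provided"
libraryDependencies += "io.lakefs" % "hadoop-lakefs" % "0.1.6"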
u
Sorry, was in a sequence of meetings. The dependencies are listed in the build.sbt file.
u
The thing is that I’m trying to run this in intellij…
u
That's ok, you can add an environment variable to your run configuration. Download the version, point SPARK_HOME in your IntelliJ run configuration to that location, and run it. It's supposed to work...
u
ok, I’ll try
u
thanks
u
Sure thing. Please update with the results 🙂
u
Hi, Since I’m running directly in IntelliJ (and not via command-line spark-submit), I placed the jars from the tgz in the lib folder, and commented out the Spark dependencies from build.sbt. I’m using Scala version 2.12.12. There was an exception that the version of jackson-databind needs to be 2.12.*, so I explicitly put 2.12.7 in build.sbt. So what I’m getting now is an IndexOutOfBounds when trying to read the parquet file (an error that I wasn’t getting before):
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 34826 out of bounds for length 198
	at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
	at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
	at com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
	at com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
	at com.fasterxml.jackson.module.scala.introspect.JavaParameterIntrospector$.getCtorParamNames(JavaParameterIntrospector.scala:12)
	at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:41)
	at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$2(BeanIntrospector.scala:61)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:61)
	at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$23(BeanIntrospector.scala:203)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.map(TraversableLike.scala:285)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
	at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$18(BeanIntrospector.scala:197)
	at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$18$adapted(BeanIntrospector.scala:194)
	at scala.collection.immutable.List.flatMap(List.scala:366)
	at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:194)
	at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:154)
	at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.fieldName(ScalaAnnotationIntrospectorModule.scala:165)
	at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.findImplicitPropertyName(ScalaAnnotationIntrospectorModule.scala:46)
	at com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findImplicitPropertyName(AnnotationIntrospectorPair.java:502)
	at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector._addFields(POJOPropertiesCollector.java:530)
	at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:421)
	at com.faste…
u
Hi Gideon, Playing the dependency game can be painful because the jars themselves might depend on each other. I would advise you to follow the steps here (look for IntelliJ) to work with SBT in IntelliJ. It also gives an example of how to use Hadoop 2.7.
u
Hi Jonathan. I’m now retrying this with Gradle and command-line spark-submit. Is there a compatible combination of Maven Central (or similar) dependencies I can write in my build.gradle, or do I have to unzip that tar.gz directly into my lib folder?
u
Hi Gideon, I'm not aware of such a combination. As I mentioned above, it would be much faster if you used a pre-bundled Spark home directory with every necessary jar. Other than that, you'll have to figure out the interdependencies of hadoop-common-2.7.x, hadoop-aws-2.7.x and the other jars your application code needs.
u
ok, I’ll try. With spark submit it should be easier. Thanks.
u
LMK if it helped!
u
Getting this when attempting to write:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class io.lakefs.LakeFSFileSystem not found
Tried to add --packages io.lakefs:lakefs-spark-client-301_2.12:0.1.8 but it doesn’t seem to help…
u
But the error persists…
u
Ok, when I look at the contents of the resulting jar, I see that this class wasn’t packed inside…
u
Did you try running the resulting jar with the --packages io.lakefs:hadoop-lakefs-assembly:0.1.6 flag?
u
can you share the command that you use to run the spark job?
u
Thanks, it looks like we’re getting somewhere! I now have this:
AmazonHttpClient: Unable to execute HTTP request: lakefs-poc.localhost: nodename nor servname provided, or not known
lakefs-poc is the name of my bucket in the MinIO behind lakeFS.
u
I think you need to add the configuration for the path style..
u
Tried this:
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "false")
u
Doesn’t seem to have an effect. Is there some other style you can suggest?
u
Can you share the stack trace of the error? It’s hard to tell at which point it fails.
u
java.net.UnknownHostException: lakefs-poc.minio: nodename nor servname provided, or not known
	at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:932)
	at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1517)
	at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:851)
	at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1507)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1366)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1300)
	at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45)
	at org.apache.http.impl.conn.DefaultClientConnectionOperator.resolveHostname(DefaultClientConnectionOperator.java:263)
	at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:162)
	at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:326)
	at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:605)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:440)
	at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:835)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
	at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at io.lakefs.LakeFSFileSystem.initializeWithClient(LakeFSFileSystem.java:93)
	at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:67)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:461)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:556)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
	at com.tutorial.spark.TopDomainRankingsLakeFsJob$.main(TopDomainRankingsLakeFsJob.scala:32)
	at com.tutorial.spark.TopDomainRankingsLakeFsJob.main(TopDomainRankingsLakeFsJob.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.ref…
u
I tried these two options of pointing to minio:
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "http://localhost:9000")
and
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "http://minio:9000")
That is, since “minio” is the name of the service in docker compose. The error seems to be the same in both cases. Let me re-check….
u
Looks pretty much the same in the case of localhost as well:
java.net.UnknownHostException: lakefs-poc.localhost: nodename nor servname provided, or not known
	at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:932)
	at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1517)
	at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:851)
	at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1507)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1366)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1300)
	at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45)
	at org.apache.http.impl.conn.DefaultClientConnectionOperator.resolveHostname(DefaultClientConnectionOperator.java:263)
	at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:162)
	at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:326)
	at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:605)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:440)
	at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:835)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
	at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at io.lakefs.LakeFSFileSystem.initializeWithClient(LakeFSFileSystem.java:93)
	at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:67)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:461)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:556)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
	at com.tutorial.spark.TopDomainRankingsLakeFsJob$.main(TopDomainRankingsLakeFsJob.scala:32)
	at com.tutorial.spark.TopDomainRankingsLakeFsJob.main(TopDomainRankingsLakeFsJob.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(…
u
You’re running in docker-compose, so you need to add a link. docker-compose knows how to route the host minio to that container, but not lakefs-poc.minio.
u
Add this to the definition of your spark container:
links:
      - minio:lakefs-poc.minio
u
Wow… Docker-compose doesn’t deal with subdomains out of the box?….
u
If I’m not mistaken 🤷
u
Ok, I’ll check how to do that, thanks. So in Spark I should use the storage service hostname that is resolvable from the lakefs service (and not necessarily from the driver), right? i.e. in my case - minio and not localhost, right?
u
It should be resolvable from both. The lakeFS server needs storage access. Spark also needs direct storage access, because the lakeFS client sometimes uploads directly to the storage.
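Putting the pieces from this thread together, a minimal sketch of the Spark-side settings for lakeFS over a local MinIO might look like this (endpoints and credentials are placeholders, and the s3a endpoint must be resolvable as discussed above):
val hc = spark.sparkContext.hadoopConfiguration

// lakeFS filesystem, API endpoint and lakeFS credentials
hc.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
hc.set("fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE")
hc.set("fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
hc.set("fs.lakefs.endpoint", "https://lakefs.example.com/api/v1")

// direct access to the underlying MinIO storage (the lakeFS client also writes data directly)
hc.set("fs.s3a.endpoint", "http://minio:9000")
hc.set("fs.s3a.access.key", "<minio-access-key>")
hc.set("fs.s3a.secret.key", "<minio-secret-key>")
hc.set("fs.s3a.path.style.access", "true")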
u
Please use:
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
We currently don't support virtual hosted style…
u
Hi Jonathan, Should setting
fs.s3a.path.style.access
to true prevent it from making requests to buckets using subdomains?
u
Yes
u
Strange. Even with it set to true, I’m still getting: java.net.UnknownHostException: lakefs-poc.minio
u
Is it possible that you left out this piece of code:
links:
      - minio:lakefs-poc.minio
u
Thanks. I had to add both this piece of code and a similar reference in /etc/hosts, to route lakefs-poc.minio to localhost. It would be too good to be true, though, for it to finally work 🤣 So here’s what I’m getting now:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: null, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: null
	at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
	at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at io.lakefs.LakeFSFileSystem.initializeWithClient(LakeFSFileSystem.java:93)
	at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:67)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:461)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:556)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
u
Looks like an auth issue, did you set your MinIO and lakeFS credentials in Spark appropriately?
u
It looks like it’s failing on a doesBucketExist error. So can you check the following: 1. The bucket does exist. 2. As @Itai Admi said, that the lakeFS configuration is correct, especially the s3 endpoint part.
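If it's easier to test that from the Spark side, here is a small sketch that triggers the same doesBucketExist check with the current s3a settings (assuming spark is your SparkSession and lakefs-poc is the bucket):
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// initializing an s3a filesystem for the bucket goes through the same doesBucketExist call
val fs = FileSystem.get(new URI("s3a://lakefs-poc/"), spark.sparkContext.hadoopConfiguration)
println(fs.exists(new Path("/")))  // should print true if the endpoint and credentials are right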
u
/Users/gcatz/Downloads/spark-3.2.1-bin-hadoop2.7/bin/spark-submit --class com.tutorial.spark.TopDomainRankingsLakeFsJob --conf spark.driver.extraJavaOptions="--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" --packages io.lakefs:hadoop-lakefs-assembly:0.1.6 build/libs/simple-scala-spark-gradle.jar
u
RELEASE.2022-06-02T02-11-04Z