# help
u
Hi @Itai Admi, @Jonathan Rosenberg. Continuing our previous thread, here’s a status update: I’m currently getting a “Bad Request” Amazon S3 exception during a `doesBucketExist` request, as part of an attempt to write data to lakeFS:
Copy code
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 5A4G0DYMDXP6H486, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: RlgVlqdoXIpa4ieQL/mUU4kaRthFB4HrwvS7RpYawg2MYG2laCbapsgmrEog7L5+YBOsjRL2QwE=
	at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
	at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at io.lakefs.LakeFSFileSystem.initializeWithClient(LakeFSFileSystem.java:93)
	at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:67)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
...
Might there be a version incompatibility between the Hadoop, lakeFS, and Amazon jars? I’m using the pre-bundled environment with Spark 3.2.1 and Hadoop 2.7. Thanks
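For context, the failing operation is presumably a plain DataFrame write to a lakefs:// path. A minimal sketch, assuming standard Spark APIs; the repository, branch and file name are taken from log output later in the thread, and the input path and job logic are made up for illustration:
Copy code
import org.apache.spark.sql.SparkSession

// Hypothetical reconstruction of the failing job: read something, then write it
// to a lakefs:// URI (repository "lakefs-poc-repo", branch "main").
val spark = SparkSession.builder().appName("TopDomainRankingsLakeFsJob").getOrCreate()
val df = spark.read.parquet("s3a://some-source-bucket/top-domain-rankings/") // hypothetical input
df.write.mode("overwrite").parquet("lakefs://lakefs-poc-repo/main/urlf-example-01.parquet")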
u
Hi @Gideon Catz, sorry to hear you are having difficulties. I went over the thread you had with @Jonathan Rosenberg and @Itai Admi. Did you happen to look over what Jonathan said, i.e. checking the bucket in S3 and that the lakeFS configuration is correct (see the Configuration Reference)?
u
Additionally, for the `lakefs://` URI to work properly, you'd have to configure the `fs.lakefs.impl` setting (and the rest of the settings that are covered in the integration doc). If you've done so, could you share the configuration you're using?
u
Hi, for the sake of the POC, and to rule out any issue with how properties are loaded, I’ve configured everything directly in the code for the time being:
Copy code
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ...my access key in AWS...)
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", ...my secret key in AWS...)
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint.region", "eu-central-1")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "<https://s3.eu-central-1.amazonaws.com>")
	
    spark.sparkContext.hadoopConfiguration.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
    spark.sparkContext.hadoopConfiguration.set("fs.lakefs.access.key", "...the access key of the user I created in lakefs...)
    spark.sparkContext.hadoopConfiguration.set("fs.lakefs.secret.key", ...the secret key of the user I created in lakefs...)
    spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "<http://localhost:8000/api/v1>") // lakeFS runs locally in docker
u
As for the S3 bucket, I have no difficulties performing operations through the lakeFS UI, so I suppose there are no configuration problems there.
u
Thanks for the information. Another small question: are you using the same credentials in the UI and in the Spark configuration?
u
Yes, I rechecked the credentials in my docker-compose: I have the same corresponding values in `LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID` and `LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_SECRET_KEY` as the ones I configured through my code in `fs.s3a.access.key` and `fs.s3a.secret.key`, respectively.
u
Hi @Gideon Catz, I think you may still be running into some Spark/Hadoop versioning difficulties. This is unfortunately all too hard to debug. (On a side note, we hope to upgrade to Hadoop 3 support and everything will be beautiful rainbows 🌈 butterflies 🦋 and unicorns...). I read this on the previous thread, but I am not sure: can you please show which Spark and Hadoop versions you are using? Here's what I did on my box to verify versions (this shows Spark 2.4.8 with Hadoop 2.7.3, but I also verified that this works for Spark 3.1.3 compiled with Hadoop 3.2.0, which makes me reasonably confident that it should work for you too!):
Copy code
❯ spark-shell
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/ariels/stuff/spark-2.4.8-bin-hadoop2.7/jars/spark-unsafe_2.11-2.4.8.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/07/14 17:06:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://fedora:4040
Spark context available as 'sc' (master = local[*], app id = local-1657807584702).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.8
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.15)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.hadoop.util.VersionInfo
import org.apache.hadoop.util.VersionInfo

scala> VersionInfo.getVersion
res0: String = 2.7.3

scala> :quit
u
Hi @Ariel Shaqed (Scolnicov), I am indeed using the pre-bundled environment of Spark 3.2.1 and Hadoop 2.7. Here’s the output on my machine:
Copy code
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.1
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 16.0.2)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.hadoop.util.VersionInfo
import org.apache.hadoop.util.VersionInfo

scala> VersionInfo.getVersion
res0: String = 2.7.4
(btw, when are you planning to switch to Hadoop 3? 😁)
u
Great, thanks! Let's talk about the Hadoop 3 timeline on a separate thread. (I'm really happy to hear requests for this, they really help gauge user demand. And (you didn't hear it from me...) a separate thread makes more noise...) Meanwhile, let's go in the opposite direction! I had a quick look at our integration tests: there we run lakeFSFS on the Bitnami Spark 3 image, which uses Spark 3.3.0 on Hadoop 3.3.2. I think the lakeFS Everything Bagel also uses that version. Can we move you up to these newest versions? Sorry for all this back-and-forth.
u
Hi Ariel, thank you for the insights. Well, I think for the sake of the POC I can indeed run the environment you are suggesting. We’ve recently upgraded the Spark version to 3.2.1 in our staging environment and it went quite smoothly, so I don’t expect problems upgrading to 3.3.0 later on. Is there some pre-bundled tar.gz, or can I only run Bitnami via docker-compose?
u
The official download on Apache right now is Spark 3.3.0 with Hadoop 3.3+; I would try that one. Again, sorry for the versioning issues. Spark loads packages strangely, which probably makes plugging in harder than it should be.
u
OK, thanks, I’ll try that. But will lakeFS run with Hadoop 3+?
u
Anyway, now I’m getting this NumberFormatException:
Copy code
22/07/17 10:27:21 WARN FileSystem: Failed to initialize fileystem s3a://dwh-lakefs: java.lang.NumberFormatException: For input string: "64M"
22/07/17 10:27:21 WARN FileSystem: Failed to initialize fileystem lakefs://lakefs-poc-repo/main/urlf-example-01.parquet: java.lang.NumberFormatException: For input string: "64M"
22/07/17 10:27:21 INFO SparkUI: Stopped Spark web UI at http://10.100.102.37:4041
java.lang.NumberFormatException: For input string: "64M"
	at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
	at java.base/java.lang.Long.parseLong(Long.java:714)
	at java.base/java.lang.Long.parseLong(Long.java:839)
	at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:1586)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:248)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at io.lakefs.LakeFSFileSystem.initializeWithClient(LakeFSFileSystem.java:93)
	at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:67)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:461)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:558)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:390)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:363)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:793)
	at com.tutorial.spark.TopDomainRankingsLakeFsJob$.main(TopDomainRankingsLakeFsJob.scala:34)
	at com.tutorial.spark.TopDomainRankingsLakeFsJob.main(TopDomainRankingsLakeFsJob.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:567)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
u
From what I’m reading, this might be caused by mixing hadoop-common with an older version of hadoop-aws. What I have in my Gradle dependencies is only this:
Copy code
implementation 'org.apache.spark:spark-sql_2.12:3.3.0'
u
That may be an issue. Hadoop dependencies have to match across the board.
u
Do you have any `hadoop` lines in your Gradle dependencies?
u
I’m using this parameter when running:
Copy code
--packages io.lakefs:hadoop-lakefs-assembly:0.1.6,org.apache.hadoop:hadoop-aws:3.3.0
u
Do you have any specific Gradle dependency? (Or, can you drop the hadoop-aws package entirely?)
u
There’s nothing about hadoop in my build.gradle. I removed `org.apache.hadoop:hadoop-aws:3.3.0` from the `--packages` list, and I still get the same exception.
u
Can you run `ls $SPARK_HOME/jars | grep hadoop` and post the versions you use? (At this point, please make sure that $SPARK_HOME points to the Spark bundle location you downloaded.)
u
Here’s what I see:
Copy code
hadoop-client-api-3.3.2.jar
hadoop-client-runtime-3.3.2.jar
hadoop-shaded-guava-1.1.1.jar
hadoop-yarn-server-web-proxy-3.3.2.jar
parquet-hadoop-1.12.2.jar
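(Note that this listing only covers the jars shipped inside the Spark bundle; anything pulled in with --packages, such as hadoop-lakefs-assembly or hadoop-aws, is resolved into the local Ivy cache instead. A sketch for checking, from inside spark-shell, which jar each relevant class was actually loaded from:)
Copy code
// Print the origin of the classes involved; mismatched Hadoop versions between
// these locations would explain errors like the "64M" NumberFormatException.
Seq(
  "org.apache.hadoop.conf.Configuration",   // Hadoop core
  "org.apache.hadoop.fs.s3a.S3AFileSystem", // hadoop-aws
  "io.lakefs.LakeFSFileSystem"              // hadoop-lakefs-assembly
).foreach { name =>
  val source = Option(Class.forName(name).getProtectionDomain.getCodeSource)
    .map(_.getLocation.toString)
    .getOrElse("(unknown source)")
  println(s"$name -> $source")
}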
u
@Jonathan Rosenberg, let's try to split the issue. Do you reckon the following will work?
1. Assemble a lakeFSFS package based on Hadoop 3.3.0. Give this (manually) to @Gideon Catz.
2. Revisit the lakeFSFS-on-Spark3 test in our CI to understand why it works for us but not for @Gideon Catz. Fix that. (I suspect they are using Spark from some provider (EMR??), which ends up supplying something subtly different...)
@Gideon Catz, would that work for you? You would need to supply the assembled JAR via `--jars` and some S3 path, rather than as a package, and it would not be the current release that we supply.
u
Thanks Ariel. If that turns out to work, then it’ll definitely be better than the current state of things :-)
u
Hi @Gideon Catz and friends, there's a problem reaching the `eu-central-1` region with earlier `hadoop-aws` versions, because their S3 client cannot be instructed to use V4 signing. You have two main options here:
1. Use a bucket from a different region.
2. Add the following configuration to your spark-submit command:
Copy code
--conf spark.executor.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
--conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4'
Try it and let me know if it worked...
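(Those two flags only set the AWS SDK system property com.amazonaws.services.s3.enableV4 on the driver and executor JVMs. For a local POC run, a sketch of the equivalent programmatic form, assuming it runs before any s3a:// or lakefs:// path is first touched; on a real cluster the extraJavaOptions flags remain the safer route, since executors are separate JVMs:)
Copy code
// Equivalent to the --conf flags above, but only for the current (driver) JVM.
// Must run before the S3A filesystem is initialized, since the AWS client reads
// the property at construction time.
System.setProperty("com.amazonaws.services.s3.enableV4", "true")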
u
@Gideon Catz I'm surprised that plain hadoop-aws s3a (with no lakeFSFS or lakeFS) works on that bucket without that setting!
u
@Ariel Shaqed (Scolnicov), I did use `--packages io.lakefs:hadoop-lakefs-assembly:0.1.6` in the spark-submit command. That provides the necessary classes, doesn’t it?
u
Hi @Jonathan Rosenberg, I tried submitting based on the pre-bundled Spark 3.2.1 & Hadoop 2.7 environment with these two parameters, and ended up with this exception:
Copy code
java.io.IOException: listObjects
	at io.lakefs.LakeFSFileSystem$ListingIterator.readNextChunk(LakeFSFileSystem.java:893)
	at io.lakefs.LakeFSFileSystem$ListingIterator.hasNext(LakeFSFileSystem.java:873)
	at io.lakefs.LakeFSFileSystem.exists(LakeFSFileSystem.java:793)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:117)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
u
Hi @Gideon Catz, happy to see we're getting stuck further down the chain. The stack trace you sent has a cause; the line that throws it looks like this:
Copy code
} catch (ApiException e) {
    throw new IOException("listObjects", e);
}
Could you send the complete error report, including the message and (especially) the cause? The specific ApiException from lakeFS might be able to say more about what went wrong. THANKS!
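(One way to capture the cause chain being asked for here: the lakeFS ApiException is attached as the cause of that IOException, so wrapping the failing write and walking getCause, or simply posting the full stack trace with its "Caused by:" sections, will expose its message. A sketch, with df standing in for whatever DataFrame the job writes:)
Copy code
// Print every exception in the cause chain, then rethrow, so the underlying
// lakeFS ApiException message (and any HTTP response detail) is visible.
try {
  df.write.parquet("lakefs://lakefs-poc-repo/main/urlf-example-01.parquet")
} catch {
  case e: Throwable =>
    var cause: Throwable = e
    while (cause != null) {
      println(s"${cause.getClass.getName}: ${cause.getMessage}")
      cause = cause.getCause
    }
    throw e
}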