# help
u
Hi @Itai Admi, @Jonathan Rosenberg. Continuing our previous thread, here’s a status update: I’m currently getting a “Bad Request” Amazon S3 exception during a `doesBucketExist` request, as part of an attempt to write data to lakeFS:
Copy code
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 5A4G0DYMDXP6H486, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: RlgVlqdoXIpa4ieQL/mUU4kaRthFB4HrwvS7RpYawg2MYG2laCbapsgmrEog7L5+YBOsjRL2QwE=
	at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
	at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at io.lakefs.LakeFSFileSystem.initializeWithClient(LakeFSFileSystem.java:93)
	at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:67)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
...
Might there be a version incompatibility between the Hadoop, lakeFS, and Amazon jars? I’m using the pre-bundled environment with Spark 3.2.1 and Hadoop 2.7. Thanks
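For context, the failing operation is presumably a plain DataFrame write to a lakefs:// path. A minimal sketch, assuming standard Spark APIs; the repository, branch and file name are taken from log output later in the thread, and the input path and job logic are made up for illustration:
Copy code
import org.apache.spark.sql.SparkSession

// Hypothetical reconstruction of the failing job: read something, then write it
// to a lakefs:// URI (repository "lakefs-poc-repo", branch "main").
val spark = SparkSession.builder().appName("TopDomainRankingsLakeFsJob").getOrCreate()
val df = spark.read.parquet("s3a://some-source-bucket/top-domain-rankings/") // hypothetical input
df.write.mode("overwrite").parquet("lakefs://lakefs-poc-repo/main/urlf-example-01.parquet")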
u
Hi @Gideon Catz, sorry to hear you are having difficulties. I went over the thread you had with @Jonathan Rosenberg and @Itai Admi. Did you happen to look over what Jonathan said, i.e. checking the bucket in S3 and that the lakeFS configuration is correct (see the Configuration Reference)?
u
Additionally, for the `lakefs://` URI to work properly, you'd have to configure the `fs.lakefs.impl` setting (and the rest of the settings that are covered in the integration doc). If you've done so, could you share the configuration you're using?
u
Hi, for the sake of the POC, and to rule out any issue with how properties are loaded, I’ve configured everything directly in the code for the time being:
Copy code
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ...my access key in AWS...)
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", ...my secret key in AWS...)
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint.region", "eu-central-1")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "<https://s3.eu-central-1.amazonaws.com>")
	
    spark.sparkContext.hadoopConfiguration.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
    spark.sparkContext.hadoopConfiguration.set("fs.lakefs.access.key", "...the access key of the user I created in lakefs...)
    spark.sparkContext.hadoopConfiguration.set("fs.lakefs.secret.key", ...the secret key of the user I created in lakefs...)
    spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "<http://localhost:8000/api/v1>") // lakeFS runs locally in docker
u
As for the S3 bucket, I have no difficulties performing operations through the lakeFS UI, so I suppose there are no configuration problems there.
u
Thanks for the information. Another small question: are you using the same credentials in the UI and in the Spark configuration?
u
Yes, I rechecked the credentials in my docker-compose: I have the same corresponding values in `LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID` and `LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_SECRET_KEY` as the ones I configured through my code in `fs.s3a.access.key` and `fs.s3a.secret.key`, respectively.
u
Hi @Gideon Catz, I think you may still be running into some Spark/Hadoop versioning difficulties. This is unfortunately all too hard to debug. (On a side note, we hope to upgrade to Hadoop 3 support and everything will be beautiful rainbows 🌈 butterflies 🦋 and unicorns...). I read this on the previous thread, but I am not sure: can you please show which Spark and Hadoop versions you are using? Here's what I did on my box to verify versions (this shows Spark 2.4.8 with Hadoop 2.7.3, but I also verified that this works for Spark 3.1.3 compiled with Hadoop 3.2.0, which makes me reasonably confident that it should work for you too!):
Copy code
❯ spark-shell
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/ariels/stuff/spark-2.4.8-bin-hadoop2.7/jars/spark-unsafe_2.11-2.4.8.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/07/14 17:06:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://fedora:4040
Spark context available as 'sc' (master = local[*], app id = local-1657807584702).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.8
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.15)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.hadoop.util.VersionInfo
import org.apache.hadoop.util.VersionInfo

scala> VersionInfo.getVersion
res0: String = 2.7.3

scala> :quit
u
Hi @Ariel Shaqed (Scolnicov), I am indeed using the pre-bundled environment of Spark 3.2.1 and Hadoop 2.7. Here’s the output on my machine:
Copy code
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.1
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 16.0.2)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.hadoop.util.VersionInfo
import org.apache.hadoop.util.VersionInfo

scala> VersionInfo.getVersion
res0: String = 2.7.4
(btw, when are you planning to switch to Hadoop 3? 😁)
u
Great, thanks! Let's talk about the Hadoop 3 timeline on a separate thread. (I'm really happy to hear requests for this, they really help gauge user demand. And (you didn't hear it from me...) a separate thread makes more noise...) Meanwhile, let's go in the opposite direction! I had a quick look at our integration tests: there we run lakeFSFS on the Bitnami Spark 3 image, which uses Spark 3.3.0 on Hadoop 3.3.2. I think the lakeFS Everything Bagel also uses that version. Can we move you up to these newest versions? Sorry for all this back-and-forth.
u
Hi Ariel, thank you for the insights. Well, I think for the sake of the POC I can indeed run the environment you are suggesting. We’ve recently upgraded the Spark version to 3.2.1 in our staging environment and it went quite smoothly, so I don’t expect problems upgrading to 3.3.0 later on. Is there some pre-bundled tar.gz, or can I only run Bitnami via docker-compose?
u
The official download on Apache right now is Spark 3.3.0 with Hadoop 3.3+; I would try that one. Again, sorry for the versioning issues. Spark loads packages strangely, which probably makes plugging in harder than it should be.
u
OK, thanks, I’ll try that. But will lakeFS run with Hadoop 3+?
u
Anyway, now I’m getting this NumberFormatException:
Copy code
22/07/17 10:27:21 WARN FileSystem: Failed to initialize fileystem s3a://dwh-lakefs: java.lang.NumberFormatException: For input string: "64M"
22/07/17 10:27:21 WARN FileSystem: Failed to initialize fileystem lakefs://lakefs-poc-repo/main/urlf-example-01.parquet: java.lang.NumberFormatException: For input string: "64M"
22/07/17 10:27:21 INFO SparkUI: Stopped Spark web UI at http://10.100.102.37:4041
java.lang.NumberFormatException: For input string: "64M"
	at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
	at java.base/java.lang.Long.parseLong(Long.java:714)
	at java.base/java.lang.Long.parseLong(Long.java:839)
	at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:1586)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:248)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at io.lakefs.LakeFSFileSystem.initializeWithClient(LakeFSFileSystem.java:93)
	at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:67)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:461)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:558)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:390)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:363)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:793)
	at com.tutorial.spark.TopDomainRankingsLakeFsJob$.main(TopDomainRankingsLakeFsJob.scala:34)
	at com.tutorial.spark.TopDomainRankingsLakeFsJob.main(TopDomainRankingsLakeFsJob.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:567)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
u
From what I’m reading, this might be caused by mixing hadoop-common with an older version of hadoop-aws. What I have in my Gradle dependencies is only this:
Copy code
implementation 'org.apache.spark:spark-sql_2.12:3.3.0'
u
That may be an issue. Hadoop dependencies have to match across the board.
u
Do you have any `hadoop` lines in your Gradle dependencies?
u
I’m using this parameter when running:
Copy code
--packages io.lakefs:hadoop-lakefs-assembly:0.1.6,org.apache.hadoop:hadoop-aws:3.3.0
u
Do you have any specific Gradle dependency? (Or, can you drop the hadoop-aws package entirely?)
u
There’s nothing about hadoop in my build.gradle. I removed `org.apache.hadoop:hadoop-aws:3.3.0` from the `--packages` list, and I still get the same exception.
u
Can you run `ls $SPARK_HOME/jars | grep hadoop` and post the versions you use? (At this point, please make sure that $SPARK_HOME points to the Spark bundle location you downloaded.)
u
Here’s what I see:
Copy code
hadoop-client-api-3.3.2.jar
hadoop-client-runtime-3.3.2.jar
hadoop-shaded-guava-1.1.1.jar
hadoop-yarn-server-web-proxy-3.3.2.jar
parquet-hadoop-1.12.2.jar
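(Note that this listing only covers the jars shipped inside the Spark bundle; anything pulled in with --packages, such as hadoop-lakefs-assembly or hadoop-aws, is resolved into the local Ivy cache instead. A sketch for checking, from inside spark-shell, which jar each relevant class was actually loaded from:)
Copy code
// Print the origin of the classes involved; mismatched Hadoop versions between
// these locations would explain errors like the "64M" NumberFormatException.
Seq(
  "org.apache.hadoop.conf.Configuration",   // Hadoop core
  "org.apache.hadoop.fs.s3a.S3AFileSystem", // hadoop-aws
  "io.lakefs.LakeFSFileSystem"              // hadoop-lakefs-assembly
).foreach { name =>
  val source = Option(Class.forName(name).getProtectionDomain.getCodeSource)
    .map(_.getLocation.toString)
    .getOrElse("(unknown source)")
  println(s"$name -> $source")
}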
u
@Jonathan Rosenberg, let's try to split the issue. Do you reckon the following will work?
1. Assemble a lakeFSFS package based on Hadoop 3.3.0. Give this (manually) to @Gideon Catz.
2. Revisit the lakeFSFS-on-Spark3 test in our CI to understand why it works for us but not for @Gideon Catz. Fix that. (I suspect they are using Spark from some provider (EMR??), which ends up supplying something subtly different...)
@Gideon Catz, would that work for you? You would need to supply the assembled JAR via `--jars` and some S3 path, rather than as a package, and it would not be the current release that we supply.
u
Thanks Ariel. If that turns out to work, then it’ll definitely be better than the current state of things :-)
u
Hi @Gideon Catz and friends, there's a problem reaching the `eu-central-1` region with earlier `hadoop-aws` versions, because their S3 client cannot be instructed to use V4 signing. You have two main options here:
1. Use a bucket from a different region.
2. Add the following configuration to your spark-submit command:
Copy code
--conf spark.executor.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
--conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4'
Try it and let me know if it worked...
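(Those two flags only set the AWS SDK system property com.amazonaws.services.s3.enableV4 on the driver and executor JVMs. For a local POC run, a sketch of the equivalent programmatic form, assuming it runs before any s3a:// or lakefs:// path is first touched; on a real cluster the extraJavaOptions flags remain the safer route, since executors are separate JVMs:)
Copy code
// Equivalent to the --conf flags above, but only for the current (driver) JVM.
// Must run before the S3A filesystem is initialized, since the AWS client reads
// the property at construction time.
System.setProperty("com.amazonaws.services.s3.enableV4", "true")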
u
@Gideon Catz I'm surprised that plain hadoop-aws s3a (with no lakeFSFS or lakeFS) works on that bucket without that setting!
u
@Ariel Shaqed (Scolnicov), I did use `--packages io.lakefs:hadoop-lakefs-assembly:0.1.6` in the spark-submit command. That provides the necessary classes, doesn’t it?
u
Hi @Jonathan Rosenberg, I tried submitting based on the pre-bundled Spark 3.2.1 & Hadoop 2.7 environment with these two parameters, and ended up with this exception:
Copy code
java.io.IOException: listObjects
	at io.lakefs.LakeFSFileSystem$ListingIterator.readNextChunk(LakeFSFileSystem.java:893)
	at io.lakefs.LakeFSFileSystem$ListingIterator.hasNext(LakeFSFileSystem.java:873)
	at io.lakefs.LakeFSFileSystem.exists(LakeFSFileSystem.java:793)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:117)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
u
Hi @Gideon Catz, happy to see we're getting stuck further down the chain. The stack trace you sent has a cause; the line that throws it looks like this:
Copy code
} catch (ApiException e) {
    throw new IOException("listObjects", e);
}
Could you send the complete error report, including the message and (especially) the cause? The specific ApiException from lakeFS might be able to say more about what went wrong. THANKS!
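(One way to capture the cause chain being asked for here: the lakeFS ApiException is attached as the cause of that IOException, so wrapping the failing write and walking getCause, or simply posting the full stack trace with its "Caused by:" sections, will expose its message. A sketch, with df standing in for whatever DataFrame the job writes:)
Copy code
// Print every exception in the cause chain, then rethrow, so the underlying
// lakeFS ApiException message (and any HTTP response detail) is visible.
try {
  df.write.parquet("lakefs://lakefs-poc-repo/main/urlf-example-01.parquet")
} catch {
  case e: Throwable =>
    var cause: Throwable = e
    while (cause != null) {
      println(s"${cause.getClass.getName}: ${cause.getMessage}")
      cause = cause.getCause
    }
    throw e
}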