user
07/14/2022, 10:05 AM
I’m getting the following error on a doesBucketExist request, as part of an attempt to write data to lakeFS:
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 5A4G0DYMDXP6H486, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: RlgVlqdoXIpa4ieQL/mUU4kaRthFB4HrwvS7RpYawg2MYG2laCbapsgmrEog7L5+YBOsjRL2QwE=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at io.lakefs.LakeFSFileSystem.initializeWithClient(LakeFSFileSystem.java:93)
at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:67)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
...
Might it be that there’s a version incompatibility between the Hadoop/lakeFS/Amazon jars?
I’m using the pre-bundled environment with Spark 3.2.1 and Hadoop 2.7.
Thanks

user
07/14/2022, 10:29 AM
For the lakefs:// URI to work properly, you’d have to configure the fs.lakefs.impl setting (and the rest of the settings that are covered in the integration doc). If you’ve done so, could you share the configuration you’re using?

user
07/14/2022, 11:08 AM
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ...my access key in AWS...)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", ...my secret key in AWS...)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint.region", "eu-central-1")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.eu-central-1.amazonaws.com")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.access.key", ...the access key of the user I created in lakeFS...)
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.secret.key", ...the secret key of the user I created in lakeFS...)
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "http://localhost:8000/api/v1") // lakeFS runs locally in docker

user
07/14/2022, 1:40 PM
I configured the lakeFS server with the same LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID and LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY as the ones I configured through my code in fs.s3a.access.key and fs.s3a.secret.key, respectively.
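One way to isolate this: the 400 in the first trace is raised while S3A initializes against the backing bucket (LakeFSFileSystem.initialize calls into S3AFileSystem.initialize), before any lakeFS API call is made. A sketch of a direct write using the same fs.s3a.* settings (the bucket name s3a://dwh-lakefs comes from a later log in this thread; the key prefix is arbitrary):

spark.range(10).write.mode("overwrite").parquet("s3a://dwh-lakefs/connectivity-test/")

If this fails with the same 400, the problem is between hadoop-aws and the bucket, not in the lakeFS layer.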

user
07/14/2022, 2:07 PM
❯ spark-shell
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/ariels/stuff/spark-2.4.8-bin-hadoop2.7/jars/spark-unsafe_2.11-2.4.8.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/07/14 17:06:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://fedora:4040
Spark context available as 'sc' (master = local[*], app id = local-1657807584702).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.8
/_/
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.15)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.apache.hadoop.util.VersionInfo
import org.apache.hadoop.util.VersionInfo
scala> VersionInfo.getVersion
res0: String = 2.7.3
scala> :quit

user
07/14/2022, 2:56 PM
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.2.1
/_/
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 16.0.2)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.apache.hadoop.util.VersionInfo
import org.apache.hadoop.util.VersionInfo
scala> VersionInfo.getVersion
res0: String = 2.7.4
(btw, when are you planning to switch to Hadoop 3? 😁)

user
07/17/2022, 7:30 AM
22/07/17 10:27:21 WARN FileSystem: Failed to initialize fileystem s3a://dwh-lakefs: java.lang.NumberFormatException: For input string: "64M"
22/07/17 10:27:21 WARN FileSystem: Failed to initialize fileystem lakefs://lakefs-poc-repo/main/urlf-example-01.parquet: java.lang.NumberFormatException: For input string: "64M"
22/07/17 10:27:21 INFO SparkUI: Stopped Spark web UI at http://10.100.102.37:4041
java.lang.NumberFormatException: For input string: "64M"
at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
at java.base/java.lang.Long.parseLong(Long.java:714)
at java.base/java.lang.Long.parseLong(Long.java:839)
at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:1586)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:248)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at io.lakefs.LakeFSFileSystem.initializeWithClient(LakeFSFileSystem.java:93)
at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:67)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:461)
at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:558)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:390)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:363)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:793)
at com.tutorial.spark.TopDomainRankingsLakeFsJob$.main(TopDomainRankingsLakeFsJob.scala:34)
at com.tutorial.spark.TopDomainRankingsLakeFsJob.main(TopDomainRankingsLakeFsJob.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:567)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
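The "64M" here is a classic sign of mixed Hadoop versions on the classpath: newer Hadoop cores ship fs.s3a.multipart.size with the default "64M" and read it with Configuration.getLongBytes, while an older S3AFileSystem still reads it with Configuration.getLong (visible in the trace above), which cannot parse the unit suffix. As a stopgap, if the jars can't be aligned immediately, a sketch of an override with a plain byte count (the value here is just an example):

spark.sparkContext.hadoopConfiguration.set("fs.s3a.multipart.size", "104857600") // 100 MB, in bytes rather than "100M"

The underlying fix is to align the hadoop-aws version with the Hadoop that Spark bundles.
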
user
07/17/2022, 10:16 AM
implementation 'org.apache.spark:spark-sql_2.12:3.3.0'

user
07/17/2022, 10:17 AM
Do you have any hadoop lines in your gradle dependencies?

user
07/17/2022, 10:18 AM
--packages io.lakefs:hadoop-lakefs-assembly:0.1.6,org.apache.hadoop:hadoop-aws:3.3.0

user
07/17/2022, 10:21 AM
I removed org.apache.hadoop:hadoop-aws:3.3.0 from the --packages - still the same exception

user
07/17/2022, 10:51 AM
Could you run ls $SPARK_HOME/jars | grep hadoop and post the versions you use? (At this point please make sure that $SPARK_HOME is configured to the Spark bundle location you downloaded.)

user
07/17/2022, 11:15 AM
hadoop-client-api-3.3.2.jar
hadoop-client-runtime-3.3.2.jar
hadoop-shaded-guava-1.1.1.jar
hadoop-yarn-server-web-proxy-3.3.2.jar
parquet-hadoop-1.12.2.jar
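That listing means the Spark bundle runs Hadoop 3.3.2, so any hadoop-aws pulled in should match that version exactly. A sketch of an aligned invocation (the 3.3.2 is assumed from the listing above; hadoop-aws pulls in a compatible aws-java-sdk-bundle transitively):

spark-submit \
  --packages io.lakefs:hadoop-lakefs-assembly:0.1.6,org.apache.hadoop:hadoop-aws:3.3.2 \
  ...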

user
07/17/2022, 12:24 PM
... with --jars and some S3 path rather than as a package, and it would not be the current release that we supply.

user
07/21/2022, 2:33 PM
There’s a known problem with buckets in the eu-central-1 region and earlier hadoop-aws versions, because their S3 client cannot be instructed to use V4 signing. You have two main options here:
1. Use a bucket from a different region.
2. Add the following configuration to your Spark submit command:
--conf spark.executor.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
--conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4'
Try it and let me know if it worked...
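If you create the session from code rather than via spark-submit, a rough driver-side equivalent is the sketch below (it must run before the first S3 client is instantiated; executors still need the extraJavaOptions flag above):

System.setProperty("com.amazonaws.services.s3.enableV4", "true")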

user
07/24/2022, 6:06 AM
I’m passing --packages io.lakefs:hadoop-lakefs-assembly:0.1.6 in the spark-submit command. That provides the necessary classes, doesn’t it?

user
07/24/2022, 6:09 AM
java.io.IOException: listObjects
at io.lakefs.LakeFSFileSystem$ListingIterator.readNextChunk(LakeFSFileSystem.java:893)
at io.lakefs.LakeFSFileSystem$ListingIterator.hasNext(LakeFSFileSystem.java:873)
at io.lakefs.LakeFSFileSystem.exists(LakeFSFileSystem.java:793)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:117)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)

user
07/24/2022, 7:14 AM
That IOException comes from this wrapper in the lakeFS filesystem code:
} catch (ApiException e) {
throw new IOException("listObjects", e);
}
Could you send the complete error report, including the message and (especially) the cause? The specific ApiException from lakeFS might be able to say more about what went wrong.
THANKS!
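A minimal sketch of how to surface that cause chain (assuming df is the DataFrame being written, and the lakefs:// path from the earlier log):

try {
  df.write.parquet("lakefs://lakefs-poc-repo/main/urlf-example-01.parquet")
} catch {
  case e: Throwable =>
    // Walk the cause chain so the wrapped lakeFS ApiException (with its HTTP
    // status and response body) gets printed, not just "listObjects".
    var t: Throwable = e
    while (t != null) {
      println(s"${t.getClass.getName}: ${t.getMessage}")
      t = t.getCause
    }
    throw e
}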