Cristian Caloian
05/16/2023, 9:29 AM
I am running Spark 3.1.2 with the following dependencies installed:
https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.12.178/aws-java-sdk-1.12.178.jar
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.7/hadoop-aws-2.7.7.jar
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
https://repo1.maven.org/maven2/io/lakefs/hadoop-lakefs-assembly/0.1.9/hadoop-lakefs-assembly-0.1.9.jar
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-312-hadoop3/0.7.2/lakefs-spark-client-312-hadoop3-assembly-0.7.2.jar
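(For reference, a minimal sketch of fetching these jars into $SPARK_HOME/jars, assuming that is where Spark loads them from; the URLs are exactly the ones listed above:)
# Hypothetical: download each dependency next to Spark's bundled jars
cd $SPARK_HOME/jars
for url in \
  https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar \
  https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.12.178/aws-java-sdk-1.12.178.jar \
  https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.7/hadoop-aws-2.7.7.jar \
  https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar \
  https://repo1.maven.org/maven2/io/lakefs/hadoop-lakefs-assembly/0.1.9/hadoop-lakefs-assembly-0.1.9.jar \
  http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-312-hadoop3/0.7.2/lakefs-spark-client-312-hadoop3-assembly-0.7.2.jar
do
  wget "$url"   # fetch each jar into the current directory
done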
I am using the following command to run the garbage collector:
spark-submit --class io.treeverse.clients.GarbageCollector \
--jars http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/hadoop/hadoop-lakefs-assembly-0.1.12.jar \
-c spark.hadoop.lakefs.api.url="https://<api-url>/api/v1" \
-c spark.hadoop.lakefs.api.access_key="<api-access-key>" \
-c spark.hadoop.lakefs.api.secret_key="<api-secret-key>" \
-c spark.hadoop.fs.s3a.access.key="<s3a-access-key>" \
-c spark.hadoop.fs.s3a.secret.key="<s3a-secret-key>" \
$SPARK_HOME/jars/lakefs-spark-client-312-hadoop3-assembly-0.7.2.jar <repo-name> eu-west-1
This is the error I’m getting:
Exception in thread "main" java.lang.NumberFormatException: For input string: "100M"
at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.base/java.lang.Long.parseLong(Long.java:692)
at java.base/java.lang.Long.parseLong(Long.java:817)
at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:1538)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:248)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:795)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:595)
at io.treeverse.clients.GarbageCollector.getCommitsDF(GarbageCollector.scala:105)
at io.treeverse.clients.GarbageCollector.getExpiredAddresses(GarbageCollector.scala:217)
at io.treeverse.clients.GarbageCollector$.markAddresses(GarbageCollector.scala:500)
at io.treeverse.clients.GarbageCollector$.main(GarbageCollector.scala:382)
at io.treeverse.clients.GarbageCollector.main(GarbageCollector.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Jonathan Rosenberg
05/16/2023, 9:33 AM
What is your hadoop-common version?
ls $SPARK_HOME/jars | grep hadoop-common
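(For reference, the same check widened to also surface the hadoop-aws jar, so both versions can be compared at a glance:)
ls $SPARK_HOME/jars | grep -E 'hadoop-(common|aws)'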
Cristian Caloian
05/16/2023, 9:37 AM
hadoop-common-3.2.0.jar
I also see hadoop-yarn-common-3.2.0.jar, if that makes a difference.
Jonathan Rosenberg
05/17/2023, 8:36 AM
The hadoop-aws version should match the hadoop-common version, and currently it doesn't.
Can you add the
--packages org.apache.hadoop:hadoop-aws:3.2.0
option to your spark-submit GC job so that they match?
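(For reference, a sketch of the earlier command with that flag added; all placeholder values are unchanged from the original:)
spark-submit --class io.treeverse.clients.GarbageCollector \
--packages org.apache.hadoop:hadoop-aws:3.2.0 \
--jars http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/hadoop/hadoop-lakefs-assembly-0.1.12.jar \
-c spark.hadoop.lakefs.api.url="https://<api-url>/api/v1" \
-c spark.hadoop.lakefs.api.access_key="<api-access-key>" \
-c spark.hadoop.lakefs.api.secret_key="<api-secret-key>" \
-c spark.hadoop.fs.s3a.access.key="<s3a-access-key>" \
-c spark.hadoop.fs.s3a.secret.key="<s3a-secret-key>" \
$SPARK_HOME/jars/lakefs-spark-client-312-hadoop3-assembly-0.7.2.jar <repo-name> eu-west-1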
Cristian Caloian
05/17/2023, 8:43 AM
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
23/05/17 08:40:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.fs.Path.getName(Path.java:414)
at org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:136)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$8(SparkSubmit.scala:376)
at scala.Option.map(Option.scala:230)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:376)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I will try updating the hadoop-aws version and run it again.
Cristian Caloian
05/17/2023, 10:01 AM
Exception in thread "main" java.lang.IllegalArgumentException: Expected URL scheme 'http' or 'https' but no colon was found
at io.lakefs.spark.shade.okhttp3.HttpUrl$Builder.parse$okhttp(HttpUrl.kt:1260)
at io.lakefs.spark.shade.okhttp3.HttpUrl$Companion.get(HttpUrl.kt:1633)
at io.lakefs.spark.shade.okhttp3.Request$Builder.url(Request.kt:184)
at io.lakefs.clients.api.ApiClient.buildRequest(ApiClient.java:1077)
at io.lakefs.clients.api.ApiClient.buildCall(ApiClient.java:1052)
at io.lakefs.clients.api.ConfigApi.getStorageConfigCall(ConfigApi.java:422)
at io.lakefs.clients.api.ConfigApi.getStorageConfigValidateBeforeCall(ConfigApi.java:429)
at io.lakefs.clients.api.ConfigApi.getStorageConfigWithHttpInfo(ConfigApi.java:464)
at io.lakefs.clients.api.ConfigApi.getStorageConfig(ConfigApi.java:447)
at io.treeverse.clients.ApiClient$$anon$7.get(ApiClient.scala:223)
at io.treeverse.clients.ApiClient$$anon$7.get(ApiClient.scala:222)
at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:243)
at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:74)
at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:193)
at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376)
at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112)
at io.treeverse.clients.RequestRetryWrapper.wrapWithRetry(ApiClient.scala:315)
at io.treeverse.clients.ApiClient.getBlockstoreType(ApiClient.scala:225)
at io.treeverse.clients.GarbageCollector$.main(GarbageCollector.scala:334)
at io.treeverse.clients.GarbageCollector.main(GarbageCollector.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Jonathan Rosenberg
05/17/2023, 10:38 AM
What are you passing in the spark.hadoop.lakefs.api.url flag?
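(For comparison, a well-formed value carries an explicit scheme, as in the original command; the host below is a placeholder:)
-c spark.hadoop.lakefs.api.url="https://lakefs.example.com/api/v1"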
Cristian Caloian
05/17/2023, 11:19 AM
I am running it as a SparkApplication in Kubernetes. Do you happen to have a SparkApplication example for running the GC?
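(For reference, a minimal sketch of such a manifest, assuming the Kubeflow spark-operator CRD; the image, namespace, service account, and all bracketed values are placeholders, not a lakeFS-provided example:)
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: lakefs-gc
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "<spark-3.1.2-hadoop3.2-image>"   # placeholder Spark image
  mainClass: io.treeverse.clients.GarbageCollector
  mainApplicationFile: http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-312-hadoop3/0.7.2/lakefs-spark-client-312-hadoop3-assembly-0.7.2.jar
  arguments:
    - "<repo-name>"
    - eu-west-1
  sparkVersion: "3.1.2"
  sparkConf:
    "spark.hadoop.lakefs.api.url": "https://<api-url>/api/v1"
    "spark.hadoop.lakefs.api.access_key": "<api-access-key>"
    "spark.hadoop.lakefs.api.secret_key": "<api-secret-key>"
    "spark.hadoop.fs.s3a.access.key": "<s3a-access-key>"
    "spark.hadoop.fs.s3a.secret.key": "<s3a-secret-key>"
  driver:
    cores: 1
    memory: "2g"
    serviceAccount: spark   # placeholder service account
  executor:
    instances: 2
    cores: 1
    memory: "2g"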
Cristian Caloian
05/23/2023, 8:06 AM
I am now trying to use an assumed role by setting:
-c fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
-c fs.s3a.assumed.role.arn="arn:aws:iam::<my-role>"
But I get the following error:
Exception in thread "main" java.nio.file.AccessDeniedException: lakefs-eu-aws-volvocars-ai: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider InstanceProfileCredentialsProvider : com.amazonaws.AmazonServiceException: Bad Gateway (Service: null; Status Code: 502; Error Code: null; Request ID: null)
Any idea what might be happening?
I am using the following versions:
hadoop-aws-3.2.0.jar
hadoop-lakefs-assembly-0.1.14.jar
lakefs-spark-client-312-hadoop3-assembly-0.7.3.jar
Jonathan Rosenberg
05/23/2023, 10:15 AM
I have seen fs.s3a.aws.credentials.provider set to both:
org.apache.hadoop.fs.s3a.AssumedRoleCredentialProvider
org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
Not exactly sure why, but you can try setting it to org.apache.hadoop.fs.s3a.AssumedRoleCredentialProvider instead of org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider.
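(Spelled out against the flags above; note the original spark-submit passed its other fs.s3a.* options with a spark.hadoop. prefix, so the same may apply here as well — an assumption, not something confirmed in the thread:)
-c fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AssumedRoleCredentialProvider
-c fs.s3a.assumed.role.arn="arn:aws:iam::<my-role>"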
Cristian Caloian
05/23/2023, 12:21 PM
Exception in thread "main" org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on <MY-BUCKET-NAME>: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connection reset: Unable to execute HTTP request: Connection reset
The bucket does exist, but I might be facing some other networking issues. I will try to investigate this.