# help
c
Hi lakeFS, I am trying to run the garbage collector and I think I am running into some dependency issues. Can you please take a look at my setup and the error I'm getting, and let me know if you have an idea of what might be going wrong? Thanks in advance! I am using Spark 3.1.2 with the following dependencies installed:
Copy code
https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.12.178/aws-java-sdk-1.12.178.jar
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.7/hadoop-aws-2.7.7.jar
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
https://repo1.maven.org/maven2/io/lakefs/hadoop-lakefs-assembly/0.1.9/hadoop-lakefs-assembly-0.1.9.jar
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-312-hadoop3/0.7.2/lakefs-spark-client-312-hadoop3-assembly-0.7.2.jar
I am using the following command to run the garbage collector:
Copy code
spark-submit --class io.treeverse.clients.GarbageCollector  \
    --jars http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/hadoop/hadoop-lakefs-assembly-0.1.12.jar \
    -c spark.hadoop.lakefs.api.url="https://<api-url>/api/v1" \
    -c spark.hadoop.lakefs.api.access_key="<api-access-key>" \
    -c spark.hadoop.lakefs.api.secret_key="<api-secret-key>" \
    -c spark.hadoop.fs.s3a.access.key="<s3a-access-key>" \
    -c spark.hadoop.fs.s3a.secret.key="<s3a-secret-key>" \
    $SPARK_HOME/jars/lakefs-spark-client-312-hadoop3-assembly-0.7.2.jar <repo-name> eu-west-1
This is the error I’m getting:
Copy code
Exception in thread "main" java.lang.NumberFormatException: For input string: "100M"
	at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.base/java.lang.Long.parseLong(Long.java:692)
	at java.base/java.lang.Long.parseLong(Long.java:817)
	at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:1538)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:248)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
	at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:795)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:595)
	at io.treeverse.clients.GarbageCollector.getCommitsDF(GarbageCollector.scala:105)
	at io.treeverse.clients.GarbageCollector.getExpiredAddresses(GarbageCollector.scala:217)
	at io.treeverse.clients.GarbageCollector$.markAddresses(GarbageCollector.scala:500)
	at io.treeverse.clients.GarbageCollector$.main(GarbageCollector.scala:382)
	at io.treeverse.clients.GarbageCollector.main(GarbageCollector.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
j
Hi @Cristian Caloian! Let me take a look.
This looks like a classic Hadoop dependency issue. What's your hadoop-common version?
Copy code
ls $SPARK_HOME/jars | grep hadoop-common
c
Thank you for looking into this @Jonathan Rosenberg
Copy code
hadoop-common-3.2.0.jar
I also see I have hadoop-yarn-common-3.2.0.jar, if that makes a difference.
j
Basically your hadoop-aws version should match the hadoop-common version, and currently it doesn't. Can you add the:
Copy code
--packages org.apache.hadoop:hadoop-aws:3.2.0
option to your spark-submit GC job so it will match?
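Something along these lines should work (untested on my end; same placeholders as your original command, and note that in spark-submit all options need to come before the application jar):
Copy code
# Hypothetical full command: same placeholders as the original, with --packages added
# so hadoop-aws matches the hadoop-common 3.2.0 jar already on the Spark classpath.
spark-submit --class io.treeverse.clients.GarbageCollector \
    --packages org.apache.hadoop:hadoop-aws:3.2.0 \
    --jars http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/hadoop/hadoop-lakefs-assembly-0.1.12.jar \
    -c spark.hadoop.lakefs.api.url="https://<api-url>/api/v1" \
    -c spark.hadoop.lakefs.api.access_key="<api-access-key>" \
    -c spark.hadoop.lakefs.api.secret_key="<api-secret-key>" \
    -c spark.hadoop.fs.s3a.access.key="<s3a-access-key>" \
    -c spark.hadoop.fs.s3a.secret.key="<s3a-secret-key>" \
    $SPARK_HOME/jars/lakefs-spark-client-312-hadoop3-assembly-0.7.2.jar <repo-name> eu-west-1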
c
Just adding the suggested line to the existing command gives me the following error:
Copy code
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
23/05/17 08:40:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.NullPointerException
	at org.apache.hadoop.fs.Path.getName(Path.java:414)
	at org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:136)
	at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$8(SparkSubmit.scala:376)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:376)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I will try to build a new image with that hadoop-aws version and run it.
j
Cool, please update me with your results.
c
I guess we're making some progress; I get a different error 🙂
Copy code
Exception in thread "main" java.lang.IllegalArgumentException: Expected URL scheme 'http' or 'https' but no colon was found
    at io.lakefs.spark.shade.okhttp3.HttpUrl$Builder.parse$okhttp(HttpUrl.kt:1260)
    at io.lakefs.spark.shade.okhttp3.HttpUrl$Companion.get(HttpUrl.kt:1633)
    at io.lakefs.spark.shade.okhttp3.Request$Builder.url(Request.kt:184)
    at io.lakefs.clients.api.ApiClient.buildRequest(ApiClient.java:1077)
    at io.lakefs.clients.api.ApiClient.buildCall(ApiClient.java:1052)
    at io.lakefs.clients.api.ConfigApi.getStorageConfigCall(ConfigApi.java:422)
    at io.lakefs.clients.api.ConfigApi.getStorageConfigValidateBeforeCall(ConfigApi.java:429)
    at io.lakefs.clients.api.ConfigApi.getStorageConfigWithHttpInfo(ConfigApi.java:464)
    at io.lakefs.clients.api.ConfigApi.getStorageConfig(ConfigApi.java:447)
    at io.treeverse.clients.ApiClient$$anon$7.get(ApiClient.scala:223)
    at io.treeverse.clients.ApiClient$$anon$7.get(ApiClient.scala:222)
    at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:243)
    at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
    at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:74)
    at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:193)
    at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376)
    at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112)
    at io.treeverse.clients.RequestRetryWrapper.wrapWithRetry(ApiClient.scala:315)
    at io.treeverse.clients.ApiClient.getBlockstoreType(ApiClient.scala:225)
    at io.treeverse.clients.GarbageCollector$.main(GarbageCollector.scala:334)
    at io.treeverse.clients.GarbageCollector.main(GarbageCollector.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
j
Did you specify the spark.hadoop.lakefs.api.url flag?
c
I did, but I am wondering if it has to do with the fact that I'm running it as a SparkApplication in Kubernetes. Do you happen to have a SparkApplication example for running the GC?
j
Do you mean you're running it in Kubernetes deployment mode?
c
Yes.
j
I’m sorry I don’t have such an example
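If it helps, though, here's a rough, untested sketch of submitting the GC directly to the Kubernetes API with plain spark-submit in cluster deploy mode (not an operator SparkApplication manifest); the master URL, namespace, and image name are placeholders I made up:
Copy code
# Hypothetical sketch: submit the GC job to the Kubernetes API in cluster mode.
# The lakeFS client jar must be reachable from the driver pod, e.g. baked into
# the image (local:// path) or served over http/s3.
spark-submit --master k8s://https://<k8s-api-server>:443 \
    --deploy-mode cluster \
    --name lakefs-gc \
    --class io.treeverse.clients.GarbageCollector \
    --packages org.apache.hadoop:hadoop-aws:3.2.0 \
    --conf spark.kubernetes.namespace=<namespace> \
    --conf spark.kubernetes.container.image=<your-spark-3.1.2-image> \
    --conf spark.hadoop.lakefs.api.url="https://<api-url>/api/v1" \
    --conf spark.hadoop.lakefs.api.access_key="<api-access-key>" \
    --conf spark.hadoop.lakefs.api.secret_key="<api-secret-key>" \
    --conf spark.hadoop.fs.s3a.access.key="<s3a-access-key>" \
    --conf spark.hadoop.fs.s3a.secret.key="<s3a-secret-key>" \
    local:///opt/spark/jars/lakefs-spark-client-312-hadoop3-assembly-0.7.2.jar <repo-name> eu-west-1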
c
Thanks @Jonathan Rosenberg 🙏 I will try to debug this and get back to you.
j
great
c
Hi again @Jonathan Rosenberg, I think I've made a bit more progress with this, but it's still not fully functional. I am trying to authenticate to S3 using an AWS IAM role with the following configuration:
Copy code
-c fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
-c fs.s3a.assumed.role.arn="arn:aws:iam::<my-role>"
But I get the following error:
Copy code
Exception in thread "main" java.nio.file.AccessDeniedException: lakefs-eu-aws-volvocars-ai: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider InstanceProfileCredentialsProvider : com.amazonaws.AmazonServiceException: Bad Gateway (Service: null; Status Code: 502; Error Code: null; Request ID: null)
Any idea what might be happening? I am using the following versions:
Copy code
hadoop-aws-3.2.0.jar
hadoop-lakefs-assembly-0.1.14.jar
lakefs-spark-client-312-hadoop3-assembly-0.7.3.jar
j
According to this, you should try setting fs.s3a.aws.credentials.provider to both:
Copy code
org.apache.hadoop.fs.s3a.AssumedRoleCredentialProvider
  org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
Not exactly sure why, but you can try setting it to org.apache.hadoop.fs.s3a.AssumedRoleCredentialProvider instead of org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider.
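Spelled out as spark-submit flags it would look something like this (untested sketch; note the spark.hadoop. prefix, which is needed for the values to reach the Hadoop configuration, and swap in whichever of the two provider classes above ends up working for you):
Copy code
# Hypothetical flags only; <my-role> is the placeholder from the snippet above.
-c spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider \
-c spark.hadoop.fs.s3a.assumed.role.arn="arn:aws:iam::<my-role>"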
c
Thanks for the suggestion 🙏 I think now I am hitting an issue on my end.
Copy code
Exception in thread "main" org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on <MY-BUCKET-NAME>: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connection reset: Unable to execute HTTP request: Connection reset
The bucket does exist, but I might be facing some other networking issue. I will try to investigate this.