# help
c
Hi lakeFS, I am trying to run the garbage collector and I think I am running into some dependency issues. Can you please take a look at my setup and the error I'm getting, and let me know if you have an idea of what might be going wrong? Thanks in advance! I am using Spark 3.1.2 with the following dependencies installed:
Copy code
https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.12.178/aws-java-sdk-1.12.178.jar
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.7/hadoop-aws-2.7.7.jar
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
https://repo1.maven.org/maven2/io/lakefs/hadoop-lakefs-assembly/0.1.9/hadoop-lakefs-assembly-0.1.9.jar
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-312-hadoop3/0.7.2/lakefs-spark-client-312-hadoop3-assembly-0.7.2.jar
I am using the following command to run the garbage collector:
Copy code
spark-submit --class io.treeverse.clients.GarbageCollector  \
    --jars http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/hadoop/hadoop-lakefs-assembly-0.1.12.jar \
    -c spark.hadoop.lakefs.api.url="https://<api-url>/api/v1" \
    -c spark.hadoop.lakefs.api.access_key="<api-access-key>" \
    -c spark.hadoop.lakefs.api.secret_key="<api-secret-key>" \
    -c spark.hadoop.fs.s3a.access.key="<s3a-access-key>" \
    -c spark.hadoop.fs.s3a.secret.key="<s3a-secret-key>" \
    $SPARK_HOME/jars/lakefs-spark-client-312-hadoop3-assembly-0.7.2.jar <repo-name> eu-west-1
This is the error I’m getting:
Copy code
Exception in thread "main" java.lang.NumberFormatException: For input string: "100M"
	at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.base/java.lang.Long.parseLong(Long.java:692)
	at java.base/java.lang.Long.parseLong(Long.java:817)
	at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:1538)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:248)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
	at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:795)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:595)
	at io.treeverse.clients.GarbageCollector.getCommitsDF(GarbageCollector.scala:105)
	at io.treeverse.clients.GarbageCollector.getExpiredAddresses(GarbageCollector.scala:217)
	at io.treeverse.clients.GarbageCollector$.markAddresses(GarbageCollector.scala:500)
	at io.treeverse.clients.GarbageCollector$.main(GarbageCollector.scala:382)
	at io.treeverse.clients.GarbageCollector.main(GarbageCollector.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
j
Hi @Cristian Caloian! Let me take a look.
This looks like a classic Hadoop dependency issue. What's your hadoop-common version?
Copy code
ls $SPARK_HOME/jars | grep hadoop-common
c
Thank you for looking into this @Jonathan Rosenberg
Copy code
hadoop-common-3.2.0.jar
I also see I have hadoop-yarn-common-3.2.0.jar, if that makes a difference.
j
Basically your hadoop-aws version should match the hadoop-common version, and currently it doesn't. Can you add the:
Copy code
--packages org.apache.hadoop:hadoop-aws:3.2.0
option to your spark-submit GC job so it will match?
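Something along these lines should work (untested on my end; same placeholders as your original command, and note that in spark-submit all options need to come before the application jar):
Copy code
# Hypothetical full command: same placeholders as the original, with --packages added
# so hadoop-aws matches the hadoop-common 3.2.0 jar already on the Spark classpath.
spark-submit --class io.treeverse.clients.GarbageCollector \
    --packages org.apache.hadoop:hadoop-aws:3.2.0 \
    --jars http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/hadoop/hadoop-lakefs-assembly-0.1.12.jar \
    -c spark.hadoop.lakefs.api.url="https://<api-url>/api/v1" \
    -c spark.hadoop.lakefs.api.access_key="<api-access-key>" \
    -c spark.hadoop.lakefs.api.secret_key="<api-secret-key>" \
    -c spark.hadoop.fs.s3a.access.key="<s3a-access-key>" \
    -c spark.hadoop.fs.s3a.secret.key="<s3a-secret-key>" \
    $SPARK_HOME/jars/lakefs-spark-client-312-hadoop3-assembly-0.7.2.jar <repo-name> eu-west-1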
c
Just adding the suggested line to the existing command gives me the following error:
Copy code
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
23/05/17 08:40:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.NullPointerException
	at org.apache.hadoop.fs.Path.getName(Path.java:414)
	at org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:136)
	at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$8(SparkSubmit.scala:376)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:376)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I will try to build a new image with that hadoop-aws version and run it.
j
Cool, please update me with your results.
c
I guess we're making some progress; I get a different error 🙂
Copy code
Exception in thread "main" java.lang.IllegalArgumentException: Expected URL scheme 'http' or 'https' but no colon was found
    at io.lakefs.spark.shade.okhttp3.HttpUrl$Builder.parse$okhttp(HttpUrl.kt:1260)
    at io.lakefs.spark.shade.okhttp3.HttpUrl$Companion.get(HttpUrl.kt:1633)
    at io.lakefs.spark.shade.okhttp3.Request$Builder.url(Request.kt:184)
    at io.lakefs.clients.api.ApiClient.buildRequest(ApiClient.java:1077)
    at io.lakefs.clients.api.ApiClient.buildCall(ApiClient.java:1052)
    at io.lakefs.clients.api.ConfigApi.getStorageConfigCall(ConfigApi.java:422)
    at io.lakefs.clients.api.ConfigApi.getStorageConfigValidateBeforeCall(ConfigApi.java:429)
    at io.lakefs.clients.api.ConfigApi.getStorageConfigWithHttpInfo(ConfigApi.java:464)
    at io.lakefs.clients.api.ConfigApi.getStorageConfig(ConfigApi.java:447)
    at io.treeverse.clients.ApiClient$$anon$7.get(ApiClient.scala:223)
    at io.treeverse.clients.ApiClient$$anon$7.get(ApiClient.scala:222)
    at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:243)
    at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
    at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:74)
    at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:193)
    at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376)
    at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112)
    at io.treeverse.clients.RequestRetryWrapper.wrapWithRetry(ApiClient.scala:315)
    at io.treeverse.clients.ApiClient.getBlockstoreType(ApiClient.scala:225)
    at io.treeverse.clients.GarbageCollector$.main(GarbageCollector.scala:334)
    at io.treeverse.clients.GarbageCollector.main(GarbageCollector.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
j
Did you specify the spark.hadoop.lakefs.api.url flag?
c
I did, but I am wondering if it has to do with the fact that I'm running it as a SparkApplication in Kubernetes. Do you happen to have a SparkApplication example for running the GC?
j
Do you mean you're running it in Kubernetes deployment mode?
c
Yes.
j
I’m sorry I don’t have such an example
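If it helps, though, here's a rough, untested sketch of submitting the GC directly to the Kubernetes API with plain spark-submit in cluster deploy mode (not an operator SparkApplication manifest); the master URL, namespace, and image name are placeholders I made up:
Copy code
# Hypothetical sketch: submit the GC job to the Kubernetes API in cluster mode.
# The lakeFS client jar must be reachable from the driver pod, e.g. baked into
# the image (local:// path) or served over http/s3.
spark-submit --master k8s://https://<k8s-api-server>:443 \
    --deploy-mode cluster \
    --name lakefs-gc \
    --class io.treeverse.clients.GarbageCollector \
    --packages org.apache.hadoop:hadoop-aws:3.2.0 \
    --conf spark.kubernetes.namespace=<namespace> \
    --conf spark.kubernetes.container.image=<your-spark-3.1.2-image> \
    --conf spark.hadoop.lakefs.api.url="https://<api-url>/api/v1" \
    --conf spark.hadoop.lakefs.api.access_key="<api-access-key>" \
    --conf spark.hadoop.lakefs.api.secret_key="<api-secret-key>" \
    --conf spark.hadoop.fs.s3a.access.key="<s3a-access-key>" \
    --conf spark.hadoop.fs.s3a.secret.key="<s3a-secret-key>" \
    local:///opt/spark/jars/lakefs-spark-client-312-hadoop3-assembly-0.7.2.jar <repo-name> eu-west-1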
c
Thanks @Jonathan Rosenberg 🙏 I will try to debug this and get back to you.
j
great
c
Hi again @Jonathan Rosenberg, I think I've made a bit more progress with this, but it's still not fully functional. I am trying to authenticate to S3 using an AWS IAM role with the following configuration:
Copy code
-c fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
-c fs.s3a.assumed.role.arn="arn:aws:iam::<my-role>"
But I get the following error:
Copy code
Exception in thread "main" java.nio.file.AccessDeniedException: lakefs-eu-aws-volvocars-ai: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider InstanceProfileCredentialsProvider : com.amazonaws.AmazonServiceException: Bad Gateway (Service: null; Status Code: 502; Error Code: null; Request ID: null)
Any idea what might be happening? I am using the following versions:
Copy code
hadoop-aws-3.2.0.jar
hadoop-lakefs-assembly-0.1.14.jar
lakefs-spark-client-312-hadoop3-assembly-0.7.3.jar
j
According to this, you should try setting fs.s3a.aws.credentials.provider to both:
Copy code
org.apache.hadoop.fs.s3a.AssumedRoleCredentialProvider
  org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
Not exactly sure why, but you can try setting it to org.apache.hadoop.fs.s3a.AssumedRoleCredentialProvider instead of org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider.
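Spelled out as spark-submit flags it would look something like this (untested sketch; note the spark.hadoop. prefix, which is needed for the values to reach the Hadoop configuration, and swap in whichever of the two provider classes above ends up working for you):
Copy code
# Hypothetical flags only; <my-role> is the placeholder from the snippet above.
-c spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider \
-c spark.hadoop.fs.s3a.assumed.role.arn="arn:aws:iam::<my-role>"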
c
Thanks for the suggestion 🙏 I think now I am hitting an issue on my end.
Copy code
Exception in thread "main" org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on <MY-BUCKET-NAME>: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connection reset: Unable to execute HTTP request: Connection reset
The bucket does exist, but I might be facing some other networking issue. I will try to investigate this.