#help
cweng

10/18/2022, 2:59 PM
#help Hello, I tried to set up a garbage collection routine based on this user manual. I have a local dev lakeFS running which uses S3 as storage. When I tried to start the GC job (a Spark job), I got the following error:

22/10/18 16:46:07 INFO S3AFileSystem: Caught an AmazonServiceException
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: CS2AC7CA2JJC5W3Z, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: ADF087vKmPaF3bhLuJ9ty+mlbW6H3eIprmlbopQt/lOO5+fb4or8OP2E70lY5q9KOYksyDHaP3k=
22/10/18 16:46:07 INFO S3AFileSystem: Error Message: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: CS2AC7CA2JJC5W3Z, AWS Error Code: null, AWS Error Message: Forbidden
22/10/18 16:46:07 INFO S3AFileSystem: HTTP Status Code: 403
22/10/18 16:46:07 INFO S3AFileSystem: AWS Error Code: null
22/10/18 16:46:07 INFO S3AFileSystem: Error Type: Client
22/10/18 16:46:07 INFO S3AFileSystem: Request ID: CS2AC7CA2JJC5W3Z
22/10/18 16:46:07 INFO S3AFileSystem: Stack
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: CS2AC7CA2JJC5W3Z, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: ADF087vKmPaF3bhLuJ9ty+mlbW6H3eIprmlbopQt/lOO5+fb4or8OP2E70lY5q9KOYksyDHaP3k=
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
    ...
3:06 PM
More info:
• I am sure the AWS credentials and policies have been set up correctly, as I can use lakeFS on localhost for all functions except the GC.
• I searched a bit on the internet and suspected that the error is caused by Hadoop. I've tried different Spark-Hadoop bundles, including spark-3.2.1-bin-hadoop2.7 and spark-3.3.0-bin-hadoop2, etc., but none of them helped.
• I've tried to run the script both on Windows and on EC2 (Linux), but the same error occurs.
• The attached picture shows the script (on Windows) that triggers the GC, where the s3a access key and secret are the ones assigned to my AWS user:
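(The script itself isn't reproduced in this export; roughly it has this shape, where everything below is a placeholder rather than my real values and paths:)

spark-submit --class io.treeverse.clients.GarbageCollector \
  --conf spark.hadoop.lakefs.api.url=http://localhost:8000/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
  --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
  --conf spark.hadoop.fs.s3a.access.key=<AWS_ACCESS_KEY_ID> \
  --conf spark.hadoop.fs.s3a.secret.key=<AWS_SECRET_ACCESS_KEY> \
  <path-to-lakefs-spark-client-assembly.jar> \
  <repository-name> <region>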
Ariel Shaqed (Scolnicov)

10/18/2022, 3:11 PM
Hi @cweng, Sorry to hear that this is not working for you. Spark with Hadoop 2.7 should definitely work. Let me dig around a bit...
3:14 PM
How narrowly are permissions set for the access key that you use with S3A to access S3? I ask because this file access is on a somewhat different path from data objects. If your access key is allowed to access the entire storage namespace for your repository then everything should be OK; if you've narrowed it down even further then we may have this difficulty.
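One quick check, using the exact same key pair you hand to S3A (bucket and prefix below are placeholders for your repository's storage namespace):

# Same credentials that the GC job passes to S3A (placeholders)
export AWS_ACCESS_KEY_ID=<S3A_ACCESS_KEY>
export AWS_SECRET_ACCESS_KEY=<S3A_SECRET_KEY>
# GC touches metadata under the storage namespace (e.g. the _lakefs/ prefix),
# not only the data objects themselves:
aws s3 ls "s3://<bucket>/<storage-namespace-prefix>/_lakefs/" --recursive | head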
cweng

10/18/2022, 3:59 PM
Thank you for the quick response @Ariel Shaqed (Scolnicov). The access key has S3 full access, and the user has been added to S3 -> Permissions -> Policies. I also tried with an admin user in my own AWS account, but the error still occurs.
Ariel Shaqed (Scolnicov)

10/18/2022, 4:06 PM
This is strange. Another thing that sometimes trips me up is trying to run with the s3a endpoint set to hit the lakeFS s3 gateway. I know it's a reach, but could you verify that you have no other s3a properties set?
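One way to rule it out (the paths below are just the usual config locations; adjust for your setup):

# Look for stray fs.s3a.* settings in the usual places
grep -rn "fs.s3a" "$SPARK_HOME/conf" /etc/hadoop/conf 2>/dev/null
# In particular, an fs.s3a.endpoint pointing at the lakeFS S3 gateway would make S3A
# authenticate against lakeFS with the AWS key pair instead of against S3.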
cweng

10/18/2022, 4:11 PM
Yes, I can confirm that. While running the Spark GC job, the only s3a setup is specifying the access key and secret, as shown in the previously attached figure. Here is the setup for running the lakeFS server:
Ariel Shaqed (Scolnicov)

10/18/2022, 4:24 PM
Thanks! I will try to reproduce; I think I have everything I need. The next update will be by tomorrow at 0900 UTC.
cweng

10/19/2022, 7:38 AM
Ariel Shaqed (Scolnicov)

10/19/2022, 7:47 AM
Thanks!
9:28 AM
Hi, I'm still trying to reproduce. There might be a specific issue with S3A authentication; it will take 2-3 hours to be sure.
cweng

10/19/2022, 9:41 AM
Please take your time @Ariel Shaqed (Scolnicov)
Ariel Shaqed (Scolnicov)

10/19/2022, 12:05 PM
I'm trying to switch to using "provided" Hadoop packages. If all goes well, it will considerably simplify setup and improve compatibility. However, you might need to add
--packages org.apache.hadoop:hadoop-aws:2.7.7
after this change.
12:49 PM
The fix seems to work nicely in a variety of slightly mismatched Spark versions. We should have an RC client version tomorrow.
cweng

10/19/2022, 2:24 PM
thank you @Ariel Shaqed (Scolnicov). I'll try it right away.
Ariel Shaqed (Scolnicov)

10/20/2022, 1:19 PM
We released this as part of 0.5.1. Please let me know whether it solves your issue! Note that you will now have to provide
hadoop-aws
yourself with a suitable version, probably 2.7.7.
cweng

10/20/2022, 2:35 PM
Hello @Ariel Shaqed (Scolnicov), the job is running now! It's just that the job gets interrupted partway through by the following error. I've been trying to fix it, and I think it's just a dependency issue. I'll update when it's done.
Ariel Shaqed (Scolnicov)

10/20/2022, 2:37 PM
I think I may have seen that one before. You might be using a different hadoop-aws version than the hadoop-common already in your Spark... or perhaps you have a bad AWS SDK in there?
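A quick way to see what's actually there (assuming the standard Spark layout):

# List the Hadoop and AWS SDK jars bundled with the Spark distribution
ls "$SPARK_HOME/jars" | grep -E "hadoop-common|hadoop-aws|aws-java-sdk"
# The hadoop-aws version passed via --packages should match the bundled hadoop-common version.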
cweng

10/20/2022, 2:51 PM
"You might be using a different hadoop-aws version than the hadoop-common already in your Spark" You are right. The hadoop-common in the current spark is hadoop-common-2.7.4. I'll use a spark-without-hadoop, and a hadoop 2.7.7 to see if it does the trick.
Ariel Shaqed (Scolnicov)

10/20/2022, 2:51 PM
You'll still need a Hadoop. Can you try hadoop-aws:2.7.4 first, please?
cweng

10/20/2022, 2:52 PM
Sure, I'll do it
2:56 PM
"hadoop-aws:2.7.4": Still the same error... Actually by "spark-without-hadoop, and a hadoop 2.7.7", I meant I use spark with my own hadoop downloaded separately. I'll try. But now I guess the issue is mainly "perhaps you have a bad AWS SDK in there?".
9:34 PM
Hi @Ariel Shaqed (Scolnicov), finally, it works! As you said, it was an AWS SDK version issue. Lower versions of aws-java-sdk (e.g., 1.7.4) do not have the "AWSStaticCredentialsProvider" used in S3ClientBuilder. But when I switched to higher versions, other errors occurred, such as what is shown in the 1st screenshot in this chat. I think this might be due to mixed AWS SDK versions in the dependencies (hadoop-aws 2.7.7 and aws-java-sdk-bundle 1.12.194), but I am not sure, and I don't know why it only happened in my case. Perhaps I overlooked something. My current solution is to use lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar, which depends on a higher version of hadoop-aws. I set up the GC job as shown in the 2nd screenshot. The hadoop-3.2.1 I used comes with aws-java-sdk-bundle-1.11.375 (rather than aws-java-sdk-1.7.4.jar as in hadoop-2.7.7), which seems to work well with the client. I think there should be better solutions; I will let you know if I find anything later. I think the tricky part is the compatibility of AWS SDK versions. Thank you for all the help you provided! 👍
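(The 2nd screenshot isn't attached in this export; in rough outline the working setup is the same GC command pointed at the hadoop3 client, with my separately downloaded Hadoop 3.2.1 on the classpath. All paths, keys, and names below are placeholders, not my exact values:)

# Spark "without hadoop" build + separately downloaded hadoop-3.2.1
export HADOOP_HOME=/path/to/hadoop-3.2.1
export SPARK_DIST_CLASSPATH=$("$HADOOP_HOME/bin/hadoop" classpath)

spark-submit --class io.treeverse.clients.GarbageCollector \
  --jars "$HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-3.2.1.jar,$HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar" \
  --conf spark.hadoop.lakefs.api.url=http://localhost:8000/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
  --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
  --conf spark.hadoop.fs.s3a.access.key=<AWS_ACCESS_KEY_ID> \
  --conf spark.hadoop.fs.s3a.secret.key=<AWS_SECRET_ACCESS_KEY> \
  lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar \
  <repository-name> <region>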
Ariel Shaqed (Scolnicov)

10/20/2022, 9:51 PM
Wow, well done! Really glad to hear it's working for you, and sorry it didn't work right away. These AWS SDK issues are really frustrating. Thanks for the exhaustive analysis, too, of course.