# help
c
#help Hello, I tried to set up a garbage collection routine based on this user manual. I have a local dev lakeFS running which uses S3 as storage. When I tried to start the GC job (Spark job), I got the following error:
22/10/18 16:46:07 INFO S3AFileSystem: Caught an AmazonServiceException com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: CS2AC7CA2JJC5W3Z, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: ADF087vKmPaF3bhLuJ9ty+mlbW6H3eIprmlbopQt/lOO5+fb4or8OP2E70lY5q9KOYksyDHaP3k=
22/10/18 16:46:07 INFO S3AFileSystem: Error Message: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: CS2AC7CA2JJC5W3Z, AWS Error Code: null, AWS Error Message: Forbidden
22/10/18 16:46:07 INFO S3AFileSystem: HTTP Status Code: 403
22/10/18 16:46:07 INFO S3AFileSystem: AWS Error Code: null
22/10/18 16:46:07 INFO S3AFileSystem: Error Type: Client
22/10/18 16:46:07 INFO S3AFileSystem: Request ID: CS2AC7CA2JJC5W3Z
22/10/18 16:46:07 INFO S3AFileSystem: Stack com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: CS2AC7CA2JJC5W3Z, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: ADF087vKmPaF3bhLuJ9ty+mlbW6H3eIprmlbopQt/lOO5+fb4or8OP2E70lY5q9KOYksyDHaP3k=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
...
More info:
• I am sure the AWS credentials and policies have been set up correctly, as I can use lakeFS on localhost for all functions except for GC.
• I searched a little bit on the internet and suspected that the error is caused by Hadoop. I've tried different Spark-Hadoop bundles, including spark-3.2.1-bin-hadoop2.7 and spark-3.3.0-bin-hadoop2, etc., but none of them helped.
• I've tried to run the script both on Windows and on EC2 (Linux), but the same error occurs.
• The attached picture shows the script (on Windows) to trigger the GC, where the s3a access key and secret are the ones assigned to my AWS user:
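(The screenshot itself isn't reproduced here; the command was roughly of this shape, following the GC docs, with the repo name, region, and all key values below being placeholders rather than my real ones:)
```
spark-submit --class io.treeverse.clients.GarbageCollector \
  --conf spark.hadoop.lakefs.api.url=http://localhost:8000/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
  --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
  --conf spark.hadoop.fs.s3a.access.key=<AWS_ACCESS_KEY> \
  --conf spark.hadoop.fs.s3a.secret.key=<AWS_SECRET_KEY> \
  lakefs-spark-client-301-assembly-0.5.0.jar \
  my-repo us-east-1
```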
a
Hi @cweng, Sorry to hear that this is not working for you. Spark with Hadoop 2.7 should definitely work. Let me dig around a bit...
How narrowly are permissions set for the access key that you use with S3A to access S3? I ask because this file access is on a somewhat different path from data objects. If your access key is allowed to access the entire storage namespace for your repository then everything should be OK; if you've narrowed it down even further then we may have this difficulty.
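For example, purely as a sanity check, you could verify with the same key pair that the whole namespace is readable (bucket and prefix below are placeholders for your repository's storage namespace):
```
# Placeholders: s3://<BUCKET>/<PREFIX>/ is the repository's storage namespace,
# and the keys are the same ones you pass to s3a.
export AWS_ACCESS_KEY_ID=<S3A_ACCESS_KEY>
export AWS_SECRET_ACCESS_KEY=<S3A_SECRET_KEY>
aws s3 ls "s3://<BUCKET>/<PREFIX>/" --recursive | head
```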
c
Thank you for the quick response @Ariel Shaqed (Scolnicov). The access key has S3 full access, and the user has been added under S3 -> Permissions -> Policies. I also tried with an admin user in my own AWS account, but the error still occurs.
a
This is strange. Another thing that sometimes trips me up is trying to run with the s3a endpoint set to hit the lakeFS s3 gateway. I know it's a reach, but could you verify that you have no other s3a properties set?
c
Yes, I can confirm. While running the Spark GC job, the only s3a setup is to specify the access key and secret, as shown in the previously attached figure. Here is the setup for running the lakeFS server:
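(Roughly, it's along these lines; the values below are placeholders rather than my exact config:)
```
# Local dev lakeFS with S3 as the block store (placeholder values).
export LAKEFS_DATABASE_CONNECTION_STRING="postgres://localhost:5432/postgres?sslmode=disable"
export LAKEFS_AUTH_ENCRYPT_SECRET_KEY="<SOME_RANDOM_SECRET>"
export LAKEFS_BLOCKSTORE_TYPE="s3"
./lakefs run
```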
a
Thanks! I will try to reproduce; I think I have everything. The next update will be by tomorrow at 09:00 UTC.
šŸ‘ 1
c
In case more info is needed, the applications I used are from the links below:
# Download lakefs
wget https://github.com/treeverse/lakeFS/releases/download/v0.80.2/lakeFS_0.80.2_Linux_x86_64.tar.gz
# Download java (path after extraction will be added to JAVA_HOME)
wget https://javadl.oracle.com/webapps/download/AutoDL?BundleId=246799_424b9da4b48848379167015dcc250d8d
# Download spark (for gc)
wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz
# Download gc task
wget http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-301/0.5.0/lakefs-spark-client-301-assembly-0.5.0.jar
In S3, I set the bucket permission as follows:
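(The screenshot isn't reproduced here; the bucket policy was roughly of this form, with the account ID, user name, and bucket name as placeholders:)
```
# Sketch of the bucket policy, not the exact one from the screenshot.
aws s3api put-bucket-policy --bucket <BUCKET> --policy '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<ACCOUNT_ID>:user/<USER>" },
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::<BUCKET>/*"
    },
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<ACCOUNT_ID>:user/<USER>" },
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::<BUCKET>"
    }
  ]
}'
```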
a
Thanks!
Hi, I'm still trying to reproduce. There might be a specific issue with S3A authentication; it will take 2-3 hours to be sure.
šŸ‘ 1
c
Please take your time @Ariel Shaqed (Scolnicov)
levitating lakefs 1
a
Trying to switch to using "provided" Hadoop packages. If all goes well, it will considerably simplify setup and improve compatibility. However, you might need to add
--packages org.apache.hadoop:hadoop-aws:2.7.7
after this change.
The fix seems to work nicely in a variety of slightly mismatched Spark versions. We should have an RC client version tomorrow.
c
Thank you @Ariel Shaqed (Scolnicov). I'll try it right away.
šŸ‘šŸ¼ 1
a
We released this as part of 0.5.1. Please let me know whether it solves your issue! Note that you will now have to provide hadoop-aws yourself, with a suitable version, probably 2.7.7.
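So the invocation would look roughly like this (the jar name, keys, repo, and region are placeholders for your own values):
```
spark-submit --class io.treeverse.clients.GarbageCollector \
  --packages org.apache.hadoop:hadoop-aws:2.7.7 \
  --conf spark.hadoop.lakefs.api.url=http://localhost:8000/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
  --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
  --conf spark.hadoop.fs.s3a.access.key=<AWS_ACCESS_KEY> \
  --conf spark.hadoop.fs.s3a.secret.key=<AWS_SECRET_KEY> \
  lakefs-spark-client-301-assembly-0.5.1.jar \
  my-repo us-east-1
```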
👀 1
c
Hello @Ariel Shaqed (Scolnicov), now the job is running! It's just that the job is interrupted in the middle by the following error. I've been trying to fix it and I think it's just a dependency issue. I'll update when it's done.
a
I think I may have seen that one before. You might be using a different hadoop-aws version than the hadoop-common already in your Spark... or perhaps you have a bad AWS SDK in there?
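One quick way to check what your Spark distribution ships with (assuming a stock tarball layout):
```
# Show the Hadoop and AWS SDK jars bundled under the Spark install.
ls "$SPARK_HOME"/jars | grep -E 'hadoop-(common|aws)|aws-java-sdk'
```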
c
"You might be using a different hadoop-aws version than the hadoop-common already in your Spark" You are right. The hadoop-common in the current spark is hadoop-common-2.7.4. I'll use a spark-without-hadoop, and a hadoop 2.7.7 to see if it does the trick.
a
You'll still need a Hadoop. Can you try hadoop-aws:2.7.4 first, please?
c
Sure, I'll do it.
"hadoop-aws:2.7.4": Still the same error... Actually by "spark-without-hadoop, and a hadoop 2.7.7", I meant I use spark with my own hadoop downloaded separately. I'll try. But now I guess the issue is mainly "perhaps you have a bad AWS SDK in there?".
šŸ‘šŸ¼ 1
Hi @Ariel Shaqed (Scolnicov), finally, it works! As you said, it was the AWS SDK version issue. Lower versions of aws-java-sdk (e.g., 1.7.4) do not have the "AWSStaticCredentialsProvider" used in S3ClientBuilder, but when I switch to higher versions, other errors occur, such as what is shown in the 1st screenshot in this chat. I think this might be due to mixing versions of the AWS SDK in the dependencies (hadoop-aws 2.7.7 and aws-java-sdk-bundle 1.12.194), but I am not sure, and I don't know why it only happened in my case. Perhaps I overlooked something.
My current solution is to use lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar, which depends on a higher version of hadoop-aws. I set up the GC job as shown in the 2nd screenshot. The hadoop-3.2.1 I use comes with aws-java-sdk-bundle-1.11.375 (rather than aws-java-sdk-1.7.4.jar as in hadoop-2.7.7), which seems to work well with the client. I think there should be better solutions; I will let you know if I find anything later. I think the tricky part is the compatibility of AWS SDK versions. Thank you for all the help you provided! 👍
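(The 2nd screenshot isn't reproduced here; the working invocation was roughly of this shape, with keys, repo, and region as placeholders:)
```
# Spark 3.x with Hadoop 3.2.1 on the classpath (its aws-java-sdk-bundle-1.11.375 is picked up).
spark-submit --class io.treeverse.clients.GarbageCollector \
  --conf spark.hadoop.lakefs.api.url=http://localhost:8000/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
  --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
  --conf spark.hadoop.fs.s3a.access.key=<AWS_ACCESS_KEY> \
  --conf spark.hadoop.fs.s3a.secret.key=<AWS_SECRET_KEY> \
  lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar \
  my-repo us-east-1
```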
a
Wow, well done! Really glad to hear it's working for you, and sorry it didn't right away. These aws sdk issues are really frustrating. Thanks for the exhaustive analysis, too, of course.
šŸ‘ 1