Kevin Vasko
01/30/2023, 7:06 PM
Elad Lachmi
01/30/2023, 7:13 PM
Kevin Vasko
01/30/2023, 9:12 PM
Iddo Avneri
01/30/2023, 9:24 PM
Kevin Vasko
01/30/2023, 9:24 PM
Iddo Avneri
01/30/2023, 10:30 PM
Kevin Vasko
01/30/2023, 10:32 PM
Iddo Avneri
01/30/2023, 10:32 PM
Kevin Vasko
01/30/2023, 10:34 PM
Iddo Avneri
01/30/2023, 10:38 PM
_lakefs/retention/gc/addresses/mark_id=MARK_ID
Kevin Vasko
01/30/2023, 10:42 PM
Iddo Avneri
01/30/2023, 10:53 PM
Elad Lachmi
01/31/2023, 3:42 PM
Kevin Vasko
01/31/2023, 3:51 PM
Elad Lachmi
01/31/2023, 3:52 PM
Kevin Vasko
01/31/2023, 3:53 PM
Elad Lachmi
01/31/2023, 3:56 PM
> Essentially I’m just trying to clean up all the crap that people have deleted.
Yep, that's exactly what it's for 🙂
Kevin Vasko
01/31/2023, 3:57 PM
Elad Lachmi
01/31/2023, 3:58 PM
csv file is created using lakeFS's role, while the cleanup is done with the AWS credentials, but I just want to make sure before we dig deeper
Kevin Vasko
01/31/2023, 4:00 PM
Elad Lachmi
01/31/2023, 4:12 PM
Kevin Vasko
01/31/2023, 4:13 PM
Elad Lachmi
01/31/2023, 4:13 PM
Kevin Vasko
01/31/2023, 4:13 PM
Elad Lachmi
01/31/2023, 4:20 PM
Kevin Vasko
01/31/2023, 4:21 PM
Elad Lachmi
01/31/2023, 4:22 PM
AWS_REGION to your region in your terminal session and trying again?
Kevin Vasko
01/31/2023, 4:37 PM
Elad Lachmi
01/31/2023, 4:38 PM
export AWS_REGION=<your region>
in the terminal and running the job again
Kevin Vasko
01/31/2023, 4:40 PM
Elad Lachmi
01/31/2023, 4:43 PM
-Dcom.amazonaws.services.s3.enableV4 if they want SigV4 to work, and you'll probably need to configure an S3 endpoint with the Spark configuration property spark.hadoop.fs.s3a.endpoint
Kevin Vasko
01/31/2023, 4:44 PM
Elad Lachmi
01/31/2023, 4:45 PM
spark-submit --conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
  --conf spark.executor.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
  ... (other spark options)
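A minimal sketch of how the flags from the two messages above could be assembled when the job is launched from a script. It assumes Python is used as the launcher; the function name `sigv4_spark_args` and the endpoint value are hypothetical and not taken from the thread.

```python
import shlex

# JVM system property named in the thread for enabling SigV4.
SIGV4_FLAG = "-Dcom.amazonaws.services.s3.enableV4"


def sigv4_spark_args(endpoint):
    """Build spark-submit arguments that pass the SigV4 flag to the driver
    and executors and set the S3A endpoint (hypothetical helper)."""
    return [
        "spark-submit",
        "--conf", f"spark.driver.extraJavaOptions={SIGV4_FLAG}",
        "--conf", f"spark.executor.extraJavaOptions={SIGV4_FLAG}",
        "--conf", f"spark.hadoop.fs.s3a.endpoint={endpoint}",
    ]


if __name__ == "__main__":
    # Placeholder endpoint; substitute the one for your bucket's region.
    # The remaining spark-submit options are still needed after these.
    print(shlex.join(sigv4_spark_args("s3.example-region.amazonaws.com")))
```

Building the arguments as a list avoids shell quoting mistakes around the `-D` option.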
Kevin Vasko
01/31/2023, 5:04 PM
Elad Lachmi
01/31/2023, 5:09 PM
Kevin Vasko
01/31/2023, 5:10 PM
Elad Lachmi
01/31/2023, 5:11 PM
Kevin Vasko
01/31/2023, 5:13 PM
Elad Lachmi
01/31/2023, 5:13 PM
Kevin Vasko
01/31/2023, 5:14 PM
Elad Lachmi
01/31/2023, 5:16 PM
Kevin Vasko
01/31/2023, 5:19 PM
Elad Lachmi
01/31/2023, 5:21 PM
Kevin Vasko
01/31/2023, 5:22 PM
Elad Lachmi
01/31/2023, 5:23 PM
Kevin Vasko
01/31/2023, 5:23 PM
Elad Lachmi
01/31/2023, 5:23 PM
Kevin Vasko
01/31/2023, 5:24 PM
Elad Lachmi
01/31/2023, 5:24 PM
Kevin Vasko
01/31/2023, 5:25 PM
Elad Lachmi
01/31/2023, 5:27 PM
Kevin Vasko
01/31/2023, 5:28 PM
Elad Lachmi
01/31/2023, 5:28 PM
Kevin Vasko
01/31/2023, 5:29 PM
Elad Lachmi
01/31/2023, 5:30 PM
Kevin Vasko
01/31/2023, 5:32 PM
Elad Lachmi
01/31/2023, 5:34 PM
Kevin Vasko
01/31/2023, 5:35 PM
Elad Lachmi
01/31/2023, 5:35 PM
Kevin Vasko
01/31/2023, 5:35 PM
Elad Lachmi
01/31/2023, 5:37 PM
Kevin Vasko
01/31/2023, 5:54 PM
Elad Lachmi
01/31/2023, 5:55 PM
Kevin Vasko
01/31/2023, 8:33 PM
Elad Lachmi
01/31/2023, 8:34 PM
> ok I got it to successfully run…
First things first... 🥳
Kevin Vasko
01/31/2023, 8:36 PM
Elad Lachmi
01/31/2023, 8:37 PM
> spark.hadoop.fs.s3a.endpoint.region option didn’t exist until hadoop-aws:3.3.2
I see... so it might have been related to the issue I saw in the issue tracker 🤔
Kevin Vasko
01/31/2023, 8:37 PM
Elad Lachmi
01/31/2023, 8:37 PM
Kevin Vasko
01/31/2023, 8:38 PM
Elad Lachmi
01/31/2023, 8:38 PM
Kevin Vasko
01/31/2023, 8:41 PM
Elad Lachmi
01/31/2023, 8:41 PM
lakectl or via the UI
https://docs.lakefs.io/howto/garbage-collection.html#configuring-gc-rules
Kevin Vasko
01/31/2023, 8:47 PM
Elad Lachmi
01/31/2023, 8:47 PM
_lakefs/retention/gc/addresses
If one doesn't exist or has very few objects, that could point us towards a policy/configuration issue.
If it has all the objects, but doesn't hard-delete them, then we'll look into that.
Kevin Vasko
01/31/2023, 8:49 PM
Elad Lachmi
01/31/2023, 8:50 PM
Kevin Vasko
01/31/2023, 8:50 PM
Elad Lachmi
01/31/2023, 8:50 PM
Kevin Vasko
01/31/2023, 8:52 PM
Elad Lachmi
01/31/2023, 8:54 PM
Kevin Vasko
01/31/2023, 9:01 PM
Elad Lachmi
01/31/2023, 9:03 PM
Kevin Vasko
01/31/2023, 9:03 PM
Elad Lachmi
01/31/2023, 9:05 PM
Kevin Vasko
01/31/2023, 9:12 PM
Elad Lachmi
01/31/2023, 9:14 PM
Kevin Vasko
01/31/2023, 9:26 PM
Elad Lachmi
01/31/2023, 9:28 PM
Kevin Vasko
01/31/2023, 9:30 PM
Elad Lachmi
01/31/2023, 9:30 PM
fs.s3.impl or fs.s3a.impl
Kevin Vasko
01/31/2023, 9:30 PM
Elad Lachmi
01/31/2023, 9:31 PM
fs.s3a.impl
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
s3:// path
Can you try s3a:// instead, before configuring more stuff?
Kevin Vasko
01/31/2023, 9:33 PM
Elad Lachmi
01/31/2023, 9:39 PM
Kevin Vasko
01/31/2023, 9:40 PM
Elad Lachmi
02/02/2023, 3:46 PM
Kevin Vasko
02/02/2023, 3:57 PM
> I think you have a missing config param fs.s3.impl or fs.s3a.impl
Elad Lachmi
02/02/2023, 4:01 PM
Kevin Vasko
02/02/2023, 4:01 PM
Elad Lachmi
02/02/2023, 4:24 PM
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
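The setting above points the s3:// scheme at the S3A filesystem class. The alternative raised earlier in the thread is to use s3a:// paths directly; a minimal sketch of that rewrite follows, assuming the paths are plain strings. The helper name `to_s3a` is hypothetical.

```python
def to_s3a(uri):
    """Rewrite an s3:// URI to s3a://; leave any other URI unchanged
    (hypothetical helper, not part of lakeFS or Hadoop)."""
    prefix = "s3://"
    if uri.startswith(prefix):
        return "s3a://" + uri[len(prefix):]
    return uri


if __name__ == "__main__":
    # Placeholder bucket name for illustration only.
    print(to_s3a("s3://example-bucket/_lakefs/retention/gc/addresses"))
```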
Kevin Vasko
02/02/2023, 4:59 PM
Elad Lachmi
02/02/2023, 5:09 PM
Kevin Vasko
02/02/2023, 5:09 PM
Elad Lachmi
02/02/2023, 5:11 PM
Kevin Vasko
02/02/2023, 5:15 PM
Elad Lachmi
02/02/2023, 5:15 PM
Kevin Vasko
02/02/2023, 5:16 PM
Elad Lachmi
02/02/2023, 5:16 PM
Kevin Vasko
02/02/2023, 6:06 PM
Elad Lachmi
02/02/2023, 6:14 PM
Kevin Vasko
02/02/2023, 6:15 PM
Elad Lachmi
02/02/2023, 6:20 PM
Kevin Vasko
02/02/2023, 6:21 PM
Elad Lachmi
02/02/2023, 6:23 PM
Kevin Vasko
02/02/2023, 6:37 PM
Elad Lachmi
02/02/2023, 6:49 PM
Kevin Vasko
02/02/2023, 6:50 PM
Elad Lachmi
02/02/2023, 6:53 PM
Kevin Vasko
02/02/2023, 6:53 PM
Elad Lachmi
02/02/2023, 7:05 PM
spark.hadoop.lakefs.gc.do_mark=false
spark.hadoop.lakefs.gc.mark_id=<MARK_ID> # Replace <MARK_ID> with the identifier you used on a previous mark-only run
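As a rough sketch, the two sweep-only properties above could be turned into `--conf` arguments as shown below. Only the property names come from the message; the helper names `sweep_only_conf` and `as_conf_args` are hypothetical, and the mark ID must be supplied by the caller.

```python
def sweep_only_conf(mark_id):
    """Spark properties for a sweep-only run that reuses an earlier mark
    (hypothetical helper; property names are the ones quoted above)."""
    if not mark_id:
        raise ValueError("mark_id must be the identifier of a previous mark-only run")
    return {
        "spark.hadoop.lakefs.gc.do_mark": "false",
        "spark.hadoop.lakefs.gc.mark_id": mark_id,
    }


def as_conf_args(conf):
    """Flatten a property dict into spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # "MARK_ID" is a placeholder, exactly as in the message above.
    print(as_conf_args(sweep_only_conf("MARK_ID")))
```

Rejecting an empty mark ID up front avoids starting a sweep that has nothing to reuse.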
Kevin Vasko
02/02/2023, 7:06 PM
Elad Lachmi
02/02/2023, 7:20 PM
deleting marked addresses: <mark_id>
Kevin Vasko
02/02/2023, 7:31 PM
Elad Lachmi
02/02/2023, 7:32 PM
Kevin Vasko
02/02/2023, 7:32 PM
Elad Lachmi
02/02/2023, 7:33 PM
fs. or lakefs., which seems ok
Kevin Vasko
02/02/2023, 7:44 PM
Elad Lachmi
02/02/2023, 7:47 PM
Kevin Vasko
02/02/2023, 7:47 PM
Elad Lachmi
02/02/2023, 7:51 PM
Kevin Vasko
02/02/2023, 7:53 PM
> it’s like once it gets to actually doing the deletes i bet it fails
Yeah, that's for sure. The question is where: is it not getting the credentials, getting the wrong credentials, or not setting up the client correctly? (Or some other option that I'm not even thinking about right now.)
Kevin Vasko
02/02/2023, 8:12 PM
Elad Lachmi
02/02/2023, 8:12 PM
> When you are testing credentials in Boto3: The error you receive may say this,
> ClientError: An error occurred (InvalidAccessKeyId) when calling the ListBuckets operation: The AWS Access Key Id you provided does not exist in our records.
> but may mean you are missing an aws_session_token if you are using temporary credentials (in my case, role-based credentials).
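A minimal sketch of a pre-flight check for the situation the quote describes. It relies on the documented AWS convention that access key IDs issued by STS (temporary credentials) start with "ASIA", while long-term keys start with "AKIA". The function name and the dict layout are hypothetical, and the check only inspects values without calling AWS.

```python
def missing_session_token(creds):
    """Return True when the access key looks temporary (STS-issued)
    but no session token accompanies it (hypothetical helper)."""
    key_id = creds.get("aws_access_key_id", "")
    # STS-issued access key IDs start with "ASIA"; long-term ones with "AKIA".
    looks_temporary = key_id.startswith("ASIA")
    return looks_temporary and not creds.get("aws_session_token")


if __name__ == "__main__":
    # Made-up key material for illustration only.
    creds = {
        "aws_access_key_id": "ASIAEXAMPLEKEYID0000",
        "aws_secret_access_key": "example-secret",
    }
    if missing_session_token(creds):
        print("Temporary credentials detected without aws_session_token")
```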
Kevin Vasko
02/02/2023, 8:19 PM
Elad Lachmi
02/02/2023, 8:20 PM
Kevin Vasko
02/02/2023, 8:20 PM
Elad Lachmi
02/02/2023, 8:21 PM
Kevin Vasko
02/02/2023, 8:22 PM
Elad Lachmi
02/02/2023, 8:24 PM
Kevin Vasko
02/02/2023, 8:25 PM
Elad Lachmi
02/02/2023, 8:26 PM
Kevin Vasko
02/02/2023, 8:37 PM
Elad Lachmi
02/02/2023, 8:39 PM
Kevin Vasko
02/02/2023, 8:40 PM
Elad Lachmi
02/02/2023, 8:40 PM
Use access key ID
Kevin Vasko
02/02/2023, 8:43 PM
Elad Lachmi
02/02/2023, 8:44 PM
Kevin Vasko
02/02/2023, 8:44 PM
Elad Lachmi
02/02/2023, 8:45 PM
Kevin Vasko
02/02/2023, 8:46 PM
Elad Lachmi
02/02/2023, 10:30 PM
Kevin Vasko
02/03/2023, 2:29 PM
Elad Lachmi
02/03/2023, 2:45 PM
Kevin Vasko
02/17/2023, 3:58 PM
Jonathan Rosenberg
02/17/2023, 4:20 PM
Kevin Vasko
02/17/2023, 7:48 PM
Ariel Shaqed (Scolnicov)
02/17/2023, 8:06 PM
Kevin Vasko
02/17/2023, 8:06 PM
Iddo Avneri
02/17/2023, 8:25 PM