To run garbage collection is the assumption that I need to h lakeFS #help

Join Slack

To run garbage collection, is the assumption that ...

# help

Kevin Vasko

01/30/2023, 7:06 PM

To run garbage collection, is the assumption that I need to have a spark cluster set up?

Elad Lachmi

01/30/2023, 7:13 PM

Hi @Kevin Vasko, Yes, lakeFS assumes you have some means of submitting and running spark jobs See here for additional details on running GC jobs

Kevin Vasko

01/30/2023, 9:12 PM

Thanks.

Kevin Vasko

01/30/2023, 9:14 PM

I am seeing “Error while looking for metadata directory in the path: s3a://bucket/projects/myproject/_lakefs/retention/gc/commits/run_id={id}/commits.csv_

Kevin Vasko

01/30/2023, 9:14 PM

“

Kevin Vasko

01/30/2023, 9:24 PM

Do i have to create that path?

Iddo Avneri

01/30/2023, 9:24 PM

Hi Kevin! Which user are you using to run the spark job and can you share it’s permissions?

Kevin Vasko

01/30/2023, 9:24 PM

i’m using my account which is an admin account

Kevin Vasko

01/30/2023, 9:25 PM

well it’s in the “Admins” group

Kevin Vasko

01/30/2023, 9:29 PM

It seems to only be a warning but the next line is AWSBadRequestException: getFileStatus on {path i mentioned above}

Iddo Avneri

01/30/2023, 10:30 PM

Hi Kevin, let me try to collect some data to assist here: 1. Which version of the lakeFS Spark client are you using? 2. Can you share your spark-submit command (without the values for the keys of course) 3. What version of lakeFS are you running?

Kevin Vasko

01/30/2023, 10:32 PM

0.6.0 i tried the 3.0.1 and the 3.1.2 and got same result. I’ll send this tomorrow (not in front of computer) 0.80

Iddo Avneri

01/30/2023, 10:32 PM

Awesome. Thank you!

Kevin Vasko

01/30/2023, 10:34 PM

so i did see it was creating the commits file (i didn’t see it bc it wasn’t showing in lakefs ui).

Kevin Vasko

01/30/2023, 10:34 PM

but it’s in the aws s3 console

Kevin Vasko

01/30/2023, 10:35 PM

how closely tied to the spark version do I need?

Kevin Vasko

01/30/2023, 10:36 PM

i was just running a local standalone spark cluster on my local machine and grabbed the latest spark version

Iddo Avneri

01/30/2023, 10:38 PM

When you say “i did see it was creating the commits file” Do you mean the the list of expired objects? under

_lakefs/retention/gc/addresses/mark_id=MARK_ID

Kevin Vasko

01/30/2023, 10:42 PM

yes

Kevin Vasko

01/30/2023, 10:43 PM

i saw it in the s3 console but wasn’t seeing it lakefs ui so i didn’t think it was making it

Kevin Vasko

01/30/2023, 10:43 PM

but it is

Iddo Avneri

01/30/2023, 10:53 PM

Got it. Thanks

Elad Lachmi

01/31/2023, 3:42 PM

Hi @Kevin Vasko , Just to make sure - you're configuring the job with both your lakeFS and AWS S3 access key id/secret access key pairs?

Kevin Vasko

01/31/2023, 3:51 PM

@Iddo Avneri @Elad Lachmi Yup! See attached error log. I also included the script I’m using to run the code. https://gist.github.com/vaskokj/621cdcc328f4bbbf4586e96a3968a16b

Elad Lachmi

01/31/2023, 3:52 PM

Great, I'll take a look

Kevin Vasko

01/31/2023, 3:53 PM

I am using a local spark cluster. I just spun up a 3.3.1 Standalone spark cluster on my local box

Kevin Vasko

01/31/2023, 3:54 PM

Not sure if I need to match the lakefs-spark-client-312-hadoop3-assembly-0.6.0.jar

Kevin Vasko

01/31/2023, 3:55 PM

Essentially I’m just trying to clean up all the crap that people have deleted.

Elad Lachmi

01/31/2023, 3:56 PM

From the looks of it it's failing earlier It's getting a 400 HTTP error from AWS I'll need to dig into this a bit

Elad Lachmi

01/31/2023, 3:57 PM

Essentially I’m just trying to clean up all the crap that people have deleted.

Yep, that's exactly what it's for 🙂

Kevin Vasko

01/31/2023, 3:57 PM

@Elad Lachmi so what’s weird is the credentials are correct…because it creates the commit.csv file

Elad Lachmi

01/31/2023, 3:58 PM

I believe the

csv

file is created using lakeFS's role, while the cleanup is done with the AWS credentials, but I just want to make sure before we dig deeper

Kevin Vasko

01/31/2023, 4:00 PM

also could this be an issue with not passing an endpoint URL?

Kevin Vasko

01/31/2023, 4:00 PM

When i pass an endpoint url I had more issues

Elad Lachmi

01/31/2023, 4:12 PM

Still looking into it

Kevin Vasko

01/31/2023, 4:13 PM

no rush. i’ll be around. It’s probably a mistake on my end

Elad Lachmi

01/31/2023, 4:13 PM

Either way, I'd be happy to assist

Kevin Vasko

01/31/2023, 4:13 PM

Or something to do with my environment

Elad Lachmi

01/31/2023, 4:20 PM

Seems like there's an issue with STS credentials and S3A GetFileStatus It's an old issue, but the circumstances seem too similar to be a coincidence I'll read up on it a bit

Kevin Vasko

01/31/2023, 4:21 PM

Issue with lakeFS client or issue with something else?

Kevin Vasko

01/31/2023, 4:22 PM

The AWS account I have access to requires me to use a session token

Elad Lachmi

01/31/2023, 4:22 PM

Hadoop + STS credentials + S3A GetFileStatus

Elad Lachmi

01/31/2023, 4:22 PM

Yes, I understand

Elad Lachmi

01/31/2023, 4:37 PM

Can you try something real quick (if you haven't already) Can you try setting

AWS_REGION

to your region in your terminal session and trying again?

Kevin Vasko

01/31/2023, 4:37 PM

can you clarify?

Kevin Vasko

01/31/2023, 4:37 PM

it’s set in the command

Kevin Vasko

01/31/2023, 4:38 PM

not sure how to set it in “terminal session”

Kevin Vasko

01/31/2023, 4:38 PM

as in “export” a variable?

Elad Lachmi

01/31/2023, 4:38 PM

I mean

export AWS_REGION=<your region>

in the terminal and running the job again

Kevin Vasko

01/31/2023, 4:40 PM

same error

Elad Lachmi

01/31/2023, 4:43 PM

So I think you'll need to enable sigV4 and specifically configure an S3 endpoint both the driver and all worker Spark nodes must run Java with

Copy code

-Dcom.amazonaws.services.s3.enableV4

if they want SigV4 to work and you'll probably need to configure an S3 endpoint with the Spark configuration property

spark.hadoop.fs.s3a.endpoint

Kevin Vasko

01/31/2023, 4:44 PM

ok let me see if i can figure that out

Elad Lachmi

01/31/2023, 4:45 PM

I'm not a Java expert by any stretch of the imagination, but I'll try to assist as best I can

Elad Lachmi

01/31/2023, 4:45 PM

Hopefully, we can work it out

Elad Lachmi

01/31/2023, 4:46 PM

I'll do some reading up meanwhile Let me know how it goes

Elad Lachmi

01/31/2023, 5:03 PM

Just in case you're searching for how to use that parameter, here's an example

Copy code

spark-submit --conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
    --conf spark.executor.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
    ... (other spark options)

Kevin Vasko

01/31/2023, 5:04 PM

yup, i got it. Different error now i’m pastebinning it

Kevin Vasko

01/31/2023, 5:06 PM

@Elad Lachmi https://gist.github.com/vaskokj/29904888ad20739bd95977da2fc1ce78

Kevin Vasko

01/31/2023, 5:07 PM

acts like it’s almost complaining about the structure of something

Kevin Vasko

01/31/2023, 5:07 PM

“the authorization header is malformed”

Kevin Vasko

01/31/2023, 5:08 PM

looks like a bug

Elad Lachmi

01/31/2023, 5:09 PM

I think I saw an issue with the AWS Java SDK re S3A + vpc endpoints Let me try to find it

Elad Lachmi

01/31/2023, 5:09 PM

I think you're right It's either a bug or a "feature" (a.k.a it's unsupported)

Kevin Vasko

01/31/2023, 5:10 PM

same exact error https://github.com/aws-amplify/aws-sdk-android/issues/3018

Kevin Vasko

01/31/2023, 5:10 PM

but that’s for aws android sdk core

Elad Lachmi

01/31/2023, 5:11 PM

I'm guessing Java SDK and Android SDK have common ancestry

Elad Lachmi

01/31/2023, 5:13 PM

I see that in the command you ran you didn't enable sigV4

Kevin Vasko

01/31/2023, 5:13 PM

https://github.com/aws/aws-sdk-java/issues/1451

Elad Lachmi

01/31/2023, 5:13 PM

I think both are needed

Kevin Vasko

01/31/2023, 5:14 PM

But my error seems to be more similar to the first in a parsing issue

Elad Lachmi

01/31/2023, 5:16 PM

Yes, but I'm not sure of all the differences between sigV2 and sigV4 I think it's worth making sure we have both the correct region picked up by the Java SDK, the correct endpoint, and that we're using sigV4 before we dig deeper AFAIK, you'll need to configure this anyway, so might as well take care of it now and then move on Otherwise we might hit on the right solution and not even know it

Kevin Vasko

01/31/2023, 5:19 PM

exact problem here

Kevin Vasko

01/31/2023, 5:19 PM

https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html

Kevin Vasko

01/31/2023, 5:19 PM

“Authorization Header is Malformed”(400) exception when PrivateLink URL is used in “fs.s3a.endpoint” When PrivateLink URL is used instead of standard s3a endpoint, it returns “authorization header is malformed” exception. So, if we set fs.s3a.endpoint=bucket.vpce -<some_string>.s3.ca-central-1.vpce.amazonaws.com and make s3 calls we get: com.amazonaws.services.s3.model.AmazonS3Exception: The authorization header is malformed; the region 'vpce' is wrong; expecting 'ca-central-1' (Service: Amazon S3; Status Code: 400; Error Code: AuthorizationHeaderMalformed; Request ID: req-id; S3 Extended Request ID: req-id-2), S3 Extended Request ID: req-id-2AuthorizationHeaderMalformed The authorization header is malformed; the region 'vpce' is wrong; expecting 'ca-central-1' (Service: Amazon S3; Status Code: 400; Error Code: AuthorizationHeaderMalformed; Request ID: req-id; Cause: Since, endpoint parsing is done in a way that it assumes the AWS S3 region would be the 2nd component of the fs.s3a.endpoint URL delimited by “.”, in case of PrivateLink URL, it can’t figure out the region and throws an authorization exception. Thus, to add support to using PrivateLink URLs we use fs.s3a.endpoint.region to set the region and bypass this parsing of fs.s3a.endpoint, in the case shown above to make it work we’ll set the AWS S3 region as ca-central-1.

Elad Lachmi

01/31/2023, 5:21 PM

Yes, looks like it

Elad Lachmi

01/31/2023, 5:21 PM

So you want to try that?

Kevin Vasko

01/31/2023, 5:22 PM

yup, i’m doing that now

lakefs 1

Kevin Vasko

01/31/2023, 5:22 PM

it hasn’t crashed…yet

Kevin Vasko

01/31/2023, 5:22 PM

been running longer than i’ve seen it run before

Kevin Vasko

01/31/2023, 5:22 PM

not sure if it’s hung up lol

Elad Lachmi

01/31/2023, 5:23 PM

well, that's a good sign at least

Kevin Vasko

01/31/2023, 5:23 PM

no logging

Kevin Vasko

01/31/2023, 5:23 PM

so i’m not seeing anything

Elad Lachmi

01/31/2023, 5:23 PM

java + aws = always a great time

Kevin Vasko

01/31/2023, 5:24 PM

aws + anything imo has some god awful esoteric errors that don’t mean anything half the time

Elad Lachmi

01/31/2023, 5:24 PM

that's true

Kevin Vasko

01/31/2023, 5:25 PM

like i was getting a “could not start instance error” no logs, no nothing, machine just died on start up. finally figured out that the system did have permissions to access the key system in aws

Kevin Vasko

01/31/2023, 5:25 PM

it couldn’t decrypt the disk drive or something…

Kevin Vasko

01/31/2023, 5:26 PM

it also doesn’t help that this account i don’t own it’s work so i don’t have global admin and i don’t know what’s “set up” behind the scenes

Kevin Vasko

01/31/2023, 5:26 PM

it still hasn’t crashed…but unsure on what it’s doing

Elad Lachmi

01/31/2023, 5:27 PM

yeah, I know the feeling flying blind

Elad Lachmi

01/31/2023, 5:27 PM

It can take a while Depending on the number of objects/prefixes/refs

Kevin Vasko

01/31/2023, 5:28 PM

only like 2500 objects

Kevin Vasko

01/31/2023, 5:28 PM

just a test location

Elad Lachmi

01/31/2023, 5:28 PM

that's not too bad, but much more strongly correlated to the number of refs

Elad Lachmi

01/31/2023, 5:29 PM

branches, commits, objects (in lakeFS, not S3)

Kevin Vasko

01/31/2023, 5:29 PM

like 2 commits 1 branch, 2500 objects total

Kevin Vasko

01/31/2023, 5:30 PM

in this “testproject”

Elad Lachmi

01/31/2023, 5:30 PM

hmm... then it shouldn't take too long, but it still takes time I wouldn't be too worried about it taking time

Kevin Vasko

01/31/2023, 5:32 PM

It should be doing it for only the single project right?

Kevin Vasko

01/31/2023, 5:33 PM

feature request…logging of some sort ;)

Elad Lachmi

01/31/2023, 5:34 PM

Yes, I think so

Elad Lachmi

01/31/2023, 5:34 PM

Noted 🙂

Kevin Vasko

01/31/2023, 5:35 PM

ah damn, didn’t work

Elad Lachmi

01/31/2023, 5:35 PM

😞

Kevin Vasko

01/31/2023, 5:35 PM

complaining it couldn’t find the end point

Kevin Vasko

01/31/2023, 5:36 PM

so i’m not sure what those instructions are telling me to do

Kevin Vasko

01/31/2023, 5:36 PM

i have to pass the endpoint

Elad Lachmi

01/31/2023, 5:37 PM

tbh I'm not sure not a Java expert and the minimalism in the "solution" isn't appreciated in this case

Elad Lachmi

01/31/2023, 5:38 PM

Maybe try searching for a similar issue with maybe a more detailed solution

Elad Lachmi

01/31/2023, 5:38 PM

Elad Lachmi

01/31/2023, 5:53 PM

Meanwhile I'm working on other things Please feel free to let me know if I can assist further or if you've made any interesting progress

Kevin Vasko

01/31/2023, 5:54 PM

yup! appreciate it! thanks!

lakefs 1

Elad Lachmi

01/31/2023, 5:55 PM

Sure, np

Kevin Vasko

01/31/2023, 8:33 PM

@Elad Lachmi ok I got it to successfully run… I had a couple problem… 1) The comment and troubleshooting issue here: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html That spark.hadoop.fs.s3a.endpoint.region option didn’t exist until hadoop-aws:3.3.2 See this: https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-17705 The other issue I had was subtle but…you have to specify fs.s3a.endpoint=bucket.vpce -<some_string>.s3.ca-central-1.vpce.amazonaws.com So now it successfully ran…however it didn’t clear anything out or this ref

Elad Lachmi

01/31/2023, 8:34 PM

ok I got it to successfully run…

First things first... 🥳

Kevin Vasko

01/31/2023, 8:36 PM

Should it not delete from s3 all of these object?

Elad Lachmi

01/31/2023, 8:37 PM

spark.hadoop.fs.s3a.endpoint.region option didn’t exist until hadoop-aws:3.3.2

I see... so it might have been related to the issue I saw in the issue tracker 🤔

Kevin Vasko

01/31/2023, 8:37 PM

there is 12k objects 508MB worth

Elad Lachmi

01/31/2023, 8:37 PM

Did you change anything else in the command you ran since you sent it to me last?

Elad Lachmi

01/31/2023, 8:37 PM

(besides the S3 configuration, of course)

Kevin Vasko

01/31/2023, 8:38 PM

yeah so in the documentation of lakeFS needed to change —packages org.apache.hadoophadoop aws3.3.2

Kevin Vasko

01/31/2023, 8:38 PM

Elad Lachmi

01/31/2023, 8:38 PM

Elad Lachmi

01/31/2023, 8:39 PM

checking...

Kevin Vasko

01/31/2023, 8:41 PM

here is my working command https://gist.github.com/vaskokj/0d1a5e2602a112f0c02a48aad51f7f58

Elad Lachmi

01/31/2023, 8:41 PM

So lets look at the simpler options first...

Elad Lachmi

01/31/2023, 8:42 PM

• Any object that is accessible from any branch’s HEAD. • Objects stored outside the repository’s storage namespace. For example, objects imported using the lakeFS import UI are not collected. • Uncommitted objects, see below, These three categories of objects aren't candidates for GC, which is important to note

Elad Lachmi

01/31/2023, 8:44 PM

The second thing I'd double-check is the GC rules policy 1. That one exists 2. That it's configured in a sensible way (whatever sensible means in your context)

Elad Lachmi

01/31/2023, 8:45 PM

You can see a ref for configuring this either via

lakectl

or via the UI https://docs.lakefs.io/howto/garbage-collection.html#configuring-gc-rules

Kevin Vasko

01/31/2023, 8:47 PM

so if i uploaded a bunch of files and they were never committed they wouldn’t bc GCed it seems?

Kevin Vasko

01/31/2023, 8:47 PM

Also where is the “import UI”

Kevin Vasko

01/31/2023, 8:47 PM

nvm i’m an idiot

😅 1

Elad Lachmi

01/31/2023, 8:47 PM

You might also want to check the file in

_lakefs/retention/gc/addresses

If one doesn't exist or has very few objects, that could point us towards a policy/configuration issue If it has all the objects, but doesn't hard-delete them, then we'll look into that

Kevin Vasko

01/31/2023, 8:49 PM

yeah 99% of these files were never committed

Elad Lachmi

01/31/2023, 8:50 PM

I see... so that's a diff type of GC It's called uncommitted GC

Kevin Vasko

01/31/2023, 8:50 PM

yup reading that now

Elad Lachmi

01/31/2023, 8:50 PM

Same method of running and all, but a bit of a different purpose

Kevin Vasko

01/31/2023, 8:52 PM

yeah, makes sense

Kevin Vasko

01/31/2023, 8:52 PM

got to upgrade to later version of lakefs

Elad Lachmi

01/31/2023, 8:54 PM

From my experience, it's worth it That's where the real ROI comes in, because people make copies of huge data sets over and over and many of the branches are just abandoned after a bit of experimentation, but the objects remain and pile up real quick

Elad Lachmi

01/31/2023, 8:55 PM

But in terms of the endpoint setup, it should be the same

Kevin Vasko

01/31/2023, 9:01 PM

if i ran the migrate with 0.80.1 do i need to upgrade to 0.80.2 and migrate again?

Kevin Vasko

01/31/2023, 9:01 PM

or can i just drop the new binary in place?

Elad Lachmi

01/31/2023, 9:03 PM

Migrations automatically trigger a minor version bump, so it's probably a drop in, but let me make sure

Kevin Vasko

01/31/2023, 9:03 PM

i’m going to move all the way up to the latest (0.91.0)

Kevin Vasko

01/31/2023, 9:04 PM

but unsure if i need to use 0.80.2 to migrate up

Elad Lachmi

01/31/2023, 9:05 PM

0.80.1 requires a migrate up The next few do not

Elad Lachmi

01/31/2023, 9:07 PM

From what I can see, there aren't any migrations needed from 0.80.1 up to latest 0.91.0

Kevin Vasko

01/31/2023, 9:12 PM

ok cool!

Elad Lachmi

01/31/2023, 9:14 PM

If you get the chance, I'd love to hear how things went and of course, if you have any further questions, feel free to reach out again

Kevin Vasko

01/31/2023, 9:26 PM

lol round 2 https://gist.github.com/vaskokj/2b396384cf6b121c38c2b1e920d57a6f

Elad Lachmi

01/31/2023, 9:28 PM

🤦🏻‍♂️

Kevin Vasko

01/31/2023, 9:30 PM

unsure on if i’m missing something

Elad Lachmi

01/31/2023, 9:30 PM

I think you have a missing config param

fs.s3.impl

fs.s3a.impl

Kevin Vasko

01/31/2023, 9:30 PM

is that in the docs?

Elad Lachmi

01/31/2023, 9:31 PM

search here for

fs.s3a.impl

https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html

Elad Lachmi

01/31/2023, 9:33 PM

Oh, wait... maybe it's configured ok I just noticed you were using an

s3://

path Can you try

s3a://

instead, before configuring more stuff

Kevin Vasko

01/31/2023, 9:33 PM

sorry i’m confused where should i pass that?

Kevin Vasko

01/31/2023, 9:34 PM

i’m doing the same thing as I did with the other

Kevin Vasko

01/31/2023, 9:35 PM

i copied and pasted, changed the class and the -c options accordingly for do_sweep

Elad Lachmi

01/31/2023, 9:39 PM

Now that I think of it, the support for only S3 for the uncommitted GC probably means we're using s3 and not s3a, so it might be correct that the s3:// protocol handler needs to be configured

Kevin Vasko

01/31/2023, 9:40 PM

hmmm

Kevin Vasko

01/31/2023, 9:40 PM

docs show s3a

Kevin Vasko

02/02/2023, 3:44 PM

hmmm i’m still struggling with this. i’ve messed with every setting i can think of but still getting the same error. Any thoughts on what to change?

Elad Lachmi

02/02/2023, 3:46 PM

Hi @Kevin Vasko, Can you please remind me what error message you're getting right now? We've been through a few 😅

Kevin Vasko

02/02/2023, 3:57 PM

I tried this

I think you have a missing config param

fs.s3.impl

fs.s3a.impl

Kevin Vasko

02/02/2023, 3:57 PM

I also tried changing the parameters into command line to s3 instead of s3a.

Elad Lachmi

02/02/2023, 4:01 PM

Let me check something It might take a few minutes

Kevin Vasko

02/02/2023, 4:01 PM

yup! no worries

Elad Lachmi

02/02/2023, 4:24 PM

Can you try adding this to your spark command?

Copy code

--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

Kevin Vasko

02/02/2023, 4:59 PM

trying

Kevin Vasko

02/02/2023, 5:05 PM

no dice but different error message…getting error message for you

Kevin Vasko

02/02/2023, 5:07 PM

https://gist.github.com/vaskokj/bd505ff7ea2b00ae0ee7800040f1e7cb

Elad Lachmi

02/02/2023, 5:09 PM

That looks like the error we originally got while trying to run GC, I think

Kevin Vasko

02/02/2023, 5:09 PM

it does but i still have all the other parameters in it

Kevin Vasko

02/02/2023, 5:10 PM

is there a fs.s3 set of properties i need to set?

Elad Lachmi

02/02/2023, 5:11 PM

I think when you set it to use s3a it users the same conf params, but I'm not sure

Kevin Vasko

02/02/2023, 5:15 PM

oops, typo in my key!

Kevin Vasko

02/02/2023, 5:15 PM

it’s working!

Elad Lachmi

02/02/2023, 5:15 PM

nice!

Kevin Vasko

02/02/2023, 5:16 PM

well the mark part is haha

Elad Lachmi

02/02/2023, 5:16 PM

I was thinking to myself "This should work... why isn't this working" 🙂

Elad Lachmi

02/02/2023, 5:16 PM

Another step in the right direction - I'll take it

Kevin Vasko

02/02/2023, 6:06 PM

well damn….

Kevin Vasko

02/02/2023, 6:06 PM

so close lol

Kevin Vasko

02/02/2023, 6:07 PM

Now i’m trying to run the sweep…it starts running and then blows up saying the AWS Acess Key Id you provided does not exist in our records

Kevin Vasko

02/02/2023, 6:07 PM

https://gist.github.com/vaskokj/3e827bd3999653ebb38fcb375c527558

Elad Lachmi

02/02/2023, 6:14 PM

So close! 😑

Kevin Vasko

02/02/2023, 6:15 PM

lol no joke

Kevin Vasko

02/02/2023, 6:15 PM

but the job starts running…

Kevin Vasko

02/02/2023, 6:16 PM

so it’s like it’s in the actual lakeFS code

Elad Lachmi

02/02/2023, 6:20 PM

It's still an AWS SDK error The job uses the SDK as well

Elad Lachmi

02/02/2023, 6:20 PM

But it's in the right direction

Kevin Vasko

02/02/2023, 6:21 PM

yeah

Elad Lachmi

02/02/2023, 6:23 PM

I think what it's starting to do is divyy up the work and as soon as tasks start going, it tries to access S3 and hits that error

Elad Lachmi

02/02/2023, 6:28 PM

Just to make sure - you used the mark ID from the latest mark run, right?

Kevin Vasko

02/02/2023, 6:37 PM

yup

Kevin Vasko

02/02/2023, 6:45 PM

i had to add ‘spark.hadoop.lakefs.gc.do_sweep=true’ because it errored saying ‘Nothing to do, must specify at least one of mark, sweep. Exiting”

Elad Lachmi

02/02/2023, 6:49 PM

Yes, you either run with do_sweep=false and do_mark=true first and then the other way around or you can mark and sweep in one go (I think 🤔)

Kevin Vasko

02/02/2023, 6:50 PM

Yeah, in the docs it only has one

Kevin Vasko

02/02/2023, 6:51 PM

In step 1 shows do_sweep=false and nothing else

Kevin Vasko

02/02/2023, 6:51 PM

step 3 for sweep it shows do_mark=false

Elad Lachmi

02/02/2023, 6:53 PM

Yeah, maybe adding a mark_id and setting do_mark to false'll do it

Kevin Vasko

02/02/2023, 6:53 PM

yeah, i did do that

Kevin Vasko

02/02/2023, 6:54 PM

it’s got me to this point haha

Elad Lachmi

02/02/2023, 7:05 PM

From the docs, sweep-only mode should be configured like this

Copy code

spark.hadoop.lakefs.gc.do_mark=false
spark.hadoop.lakefs.gc.mark_id=<MARK_ID> # Replace <MARK_ID> with the identifier you used on a previous mark-only run

Kevin Vasko

02/02/2023, 7:06 PM

yup, and that errors with above issue

Kevin Vasko

02/02/2023, 7:07 PM

the docs are wrong, have to be

Elad Lachmi

02/02/2023, 7:20 PM

I'm looking through the code to see if we've missed something I'll be a minute

Elad Lachmi

02/02/2023, 7:24 PM

Ok, can confirm that both do_sweep and do_mark are handled in code

Elad Lachmi

02/02/2023, 7:28 PM

so do_mark needs to be false, do_sweep needs to be true and mark_id needs to be the ID generated in the latest mark run

Elad Lachmi

02/02/2023, 7:30 PM

If that's all good, you should see one of the first lines in the log output will be

Copy code

deleting marked addresses: <mark_id>

Kevin Vasko

02/02/2023, 7:31 PM

yeah, that’s what I already did. Unfortunately where I’m at is the error regarding missing AWS Key that I linked

Kevin Vasko

02/02/2023, 7:31 PM

See: https://gist.github.com/vaskokj/3e827bd3999653ebb38fcb375c527558

Elad Lachmi

02/02/2023, 7:32 PM

Yes, but I'm trying to work through the code to see where it's potentially getting stuck and what it's trying to do

Kevin Vasko

02/02/2023, 7:32 PM

that includes the new parameters with true/false for mark and sweep with the mark_id specified

👍🏻 1

Kevin Vasko

02/02/2023, 7:32 PM

So now i’m at the error of the AWS Key error

Elad Lachmi

02/02/2023, 7:33 PM

Yeah, that the key ID you provided doesn't exist

Elad Lachmi

02/02/2023, 7:33 PM

I'm trying to find where it's using the AWS SDK and what it's doing exactly

Elad Lachmi

02/02/2023, 7:35 PM

It's picking up all of the configuration params that start with

fs.

lakefs.

, which seems ok

Kevin Vasko

02/02/2023, 7:44 PM

hmmm, yeah that’s odd

Kevin Vasko

02/02/2023, 7:45 PM

maybe a typo?

Kevin Vasko

02/02/2023, 7:45 PM

I’m unsure on how the mark option works but not the sweep

Elad Lachmi

02/02/2023, 7:47 PM

Yes, that's very strange That's why I'm looking through the code to maybe pick up on something in the difference between how the sweep and mark work

Kevin Vasko

02/02/2023, 7:47 PM

makes sense

Kevin Vasko

02/02/2023, 7:47 PM

is this code public? if so i can look at it too

Elad Lachmi

02/02/2023, 7:51 PM

Yes, it is

Elad Lachmi

02/02/2023, 7:53 PM

btw: the input validation makes sure that the combination of existence/non-existence/values of the mark, sweep, and mark_id make sense So that's probably not the issue

Kevin Vasko

02/02/2023, 7:53 PM

yeah, it’s fully into the code at this point, it’s doing it’s maps and splits and stuff

Kevin Vasko

02/02/2023, 7:54 PM

it’s like once it gets to actually doing the deletes i bet it fails

Elad Lachmi

02/02/2023, 7:54 PM

https://github.com/treeverse/lakeFS/blob/master/clients/spark/core/src/main/scala/io/treeverse/gc/UncommittedGarbageCollector.scala

Elad Lachmi

02/02/2023, 7:56 PM

it’s like once it gets to actually doing the deletes i bet it fails

Yeah, that's for sure The question is where is it not getting the credentials, getting the wrong credentials, or not setting up the client correctly? (or some other option, which I'm not even thinking about right now)

Elad Lachmi

02/02/2023, 8:06 PM

I have a feeling it's not using the session token, but I'm not 100% sure yet

Elad Lachmi

02/02/2023, 8:11 PM

I think that might be the issue I'm not the scala or spark expert around these parts, so I'll pass it on to one of my colleagues and we'll take a look together

Kevin Vasko

02/02/2023, 8:12 PM

ahhh makes perfect sense

Elad Lachmi

02/02/2023, 8:12 PM

I guess that without the session token it can't look up the access key ID, so it's like it doesn't exist

Elad Lachmi

02/02/2023, 8:15 PM

It would seem that that's the error you'd get if you use an STS key pair without a session token Which kind of makes sense, if you imagine how they've implemented it, I guess

Elad Lachmi

02/02/2023, 8:18 PM

When you are testing credentials in Boto3: The error you receive may say this,

ClientError: An error occurred (InvalidAccessKeyId) when calling the ListBuckets operation: The AWS Access Key Id you provided does not exist in our records.

but may mean you are missing an
aws_session_token
if you are using temporary credentials (in my case, role-based credentials).

Elad Lachmi

02/02/2023, 8:19 PM

Looks like that's going to need fixing

Kevin Vasko

02/02/2023, 8:19 PM

dang! ok

Kevin Vasko

02/02/2023, 8:20 PM

well glad it wasn’t me

Elad Lachmi

02/02/2023, 8:20 PM

this time 😂

Kevin Vasko

02/02/2023, 8:20 PM

the next obvious question, do you know how long it might take to fix? I’m sure it’s low priority

Kevin Vasko

02/02/2023, 8:21 PM

haha yup. well technically most of these issues have been errors with environment

Elad Lachmi

02/02/2023, 8:21 PM

I'm not sure that it is... I'll have to talk to someone from that team

Elad Lachmi

02/02/2023, 8:22 PM

I think we can compromise on blaming Java and/or AWS

Kevin Vasko

02/02/2023, 8:22 PM

lol yup would agree

Kevin Vasko

02/02/2023, 8:23 PM

is there a bug report system that this needs to be documented in?

Elad Lachmi

02/02/2023, 8:24 PM

That's a great idea, but first I'd like to verify that I'm not missing something, although I'm relatively sure We work with GitHub issues for bug reports

Kevin Vasko

02/02/2023, 8:25 PM

got it

Kevin Vasko

02/02/2023, 8:25 PM

Not sure if there is a workaround

Elad Lachmi

02/02/2023, 8:26 PM

Well, there might be a couple, but they defeat the purpose of using STS, so I wouldn't recommend them

Elad Lachmi

02/02/2023, 8:27 PM

However, we might be able to push this through relatively quickly I'd like to at least try (once I confirm that that is the issue)

Kevin Vasko

02/02/2023, 8:37 PM

seems like an easy fix hopefully

Elad Lachmi

02/02/2023, 8:39 PM

I'm not sure what the wider context might be It might be a more complex issue than I'm thinking or it might be a different issue As soon as I have more information, I'll let you know

Kevin Vasko

02/02/2023, 8:40 PM

cross my fingers thst it’s simple :)

Kevin Vasko

02/02/2023, 8:40 PM

thanks!

Elad Lachmi

02/02/2023, 8:40 PM

same

Elad Lachmi

02/02/2023, 8:40 PM

sure, np

Elad Lachmi

02/02/2023, 8:41 PM

One quick thing I'd like to confirm In the full log of the run, do you see a line starting with

Copy code

Use access key ID

Kevin Vasko

02/02/2023, 8:43 PM

I do not

Elad Lachmi

02/02/2023, 8:44 PM

ok, thanks!

Kevin Vasko

02/02/2023, 8:44 PM

should I?

Elad Lachmi

02/02/2023, 8:45 PM

I'm not sure, but I was trying to differentiate between two possible paths

Kevin Vasko

02/02/2023, 8:46 PM

got it

Elad Lachmi

02/02/2023, 10:30 PM

@Kevin Vasko Quick update It would seem that's indeed the case Would you be able to open an issue in the repo Send me a link to it and I'll add all he need tags

Kevin Vasko

02/03/2023, 2:29 PM

@Elad Lachmi https://github.com/treeverse/lakeFS/issues/5179

Kevin Vasko

02/03/2023, 2:29 PM

done! Thanks!

👍🏻 1

Elad Lachmi

02/03/2023, 2:45 PM

Thank you 🙏🏻

Kevin Vasko

02/17/2023, 3:58 PM

@Elad Lachmi know of any way to get a timeline for this? weeks? months? potentially never? just trying to find a workaround

Jonathan Rosenberg

02/17/2023, 4:20 PM

Hi @Kevin Vasko, there is a related design proposal and an improvement that might be handy to support STS in the Spark metadata client (which GC uses). We’ll examine them in the beginning of next week and hopefully have an estimate on that.

Kevin Vasko

02/17/2023, 7:48 PM

Sounds good thanks. I guess in the meantime is there any recommendations on running the GC to maybe bypass STS?

Ariel Shaqed (Scolnicov)

02/17/2023, 8:06 PM

Hi @Kevin Vasko, Sorry you're having difficulty with this. Indeed for various reasons we do not currently support sts during sweep. I'm afraid that the only current workaround is not to use sts. We do support straight access key and secret key, or if you're using Spark 3 on Hadoop 3 you might be able to get cross-account delegation. Alternatively, if it's a small installation you might be able to get by with directly listing and then deleting all the swept objects.

Kevin Vasko

02/17/2023, 8:06 PM

yeah it’s a small instance only 10k objects or so

Kevin Vasko

02/17/2023, 8:07 PM

might write a script to do that

😞 1

👍 1

Iddo Avneri

02/17/2023, 8:25 PM

That’s a possibility. Let us know if you run into issues.

18 Views

Open in Slack

Previous Next