# help
k
To run garbage collection, is the assumption that I need to have a spark cluster set up?
e
Hi @Kevin Vasko, Yes, lakeFS assumes you have some means of submitting and running Spark jobs. See here for additional details on running GC jobs
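As a rough sketch (all values are placeholders, and the exact packages/jar depend on your setup), a GC run is just a spark-submit against the lakeFS GarbageCollector class:
Copy code
spark-submit --class io.treeverse.clients.GarbageCollector \
    --packages org.apache.hadoop:hadoop-aws:<version matching your Hadoop> \
    --conf spark.hadoop.lakefs.api.url=https://<LAKEFS_ENDPOINT>/api/v1 \
    --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY_ID> \
    --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
    --conf spark.hadoop.fs.s3a.access.key=<AWS_ACCESS_KEY_ID> \
    --conf spark.hadoop.fs.s3a.secret.key=<AWS_SECRET_KEY> \
    <path to lakefs-spark-client assembly jar> \
    <repository name> <region>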
k
Thanks.
I am seeing “Error while looking for metadata directory in the path: s3a://bucket/projects/myproject/_lakefs/retention/gc/commits/run_id={id}/commits.csv”
Do i have to create that path?
i
Hi Kevin! Which user are you using to run the Spark job, and can you share its permissions?
k
i’m using my account which is an admin account
well it’s in the “Admins” group
It seems to only be a warning but the next line is AWSBadRequestException: getFileStatus on {path i mentioned above}
i
Hi Kevin, let me try to collect some data to assist here:
1. Which version of the lakeFS Spark client are you using?
2. Can you share your spark-submit command (without the values for the keys, of course)?
3. What version of lakeFS are you running?
k
0.6.0. I tried the 3.0.1 and the 3.1.2 builds and got the same result. I’ll send the spark-submit command tomorrow (not in front of my computer). lakeFS is 0.80
i
Awesome. Thank you!
k
so i did see it was creating the commits file (i didn’t see it bc it wasn’t showing in lakefs ui).
but it’s in the aws s3 console
how closely tied to the spark version do I need to be?
i was just running a local standalone spark cluster on my local machine and grabbed the latest spark version
i
When you say “i did see it was creating the commits file”, do you mean the list of expired objects under
_lakefs/retention/gc/addresses/mark_id=MARK_ID
k
yes
i saw it in the s3 console but wasn’t seeing it in the lakefs ui so i didn’t think it was making it
but it is
i
Got it. Thanks
e
Hi @Kevin Vasko , Just to make sure - you're configuring the job with both your lakeFS and AWS S3 access key id/secret access key pairs?
k
@Iddo Avneri @Elad Lachmi Yup! See attached error log. I also included the script I’m using to run the code. https://gist.github.com/vaskokj/621cdcc328f4bbbf4586e96a3968a16b
e
Great, I'll take a look
k
I am using a local spark cluster. I just spun up a 3.3.1 Standalone spark cluster on my local box
Not sure if I need to match it to the lakefs-spark-client-312-hadoop3-assembly-0.6.0.jar
Essentially I’m just trying to clean up all the crap that people have deleted.
e
From the looks of it, it's failing earlier. It's getting a 400 HTTP error from AWS. I'll need to dig into this a bit
Essentially I’m just trying to clean up all the crap that people have deleted.
Yep, that's exactly what it's for 🙂
k
@Elad Lachmi so what’s weird is the credentials are correct…because it creates the commit.csv file
e
I believe the csv file is created using lakeFS's role, while the cleanup is done with the AWS credentials, but I just want to make sure before we dig deeper
k
also could this be an issue with not passing an endpoint URL?
When i passed an endpoint url I had more issues
e
Still looking into it
k
no rush. i’ll be around. It’s probably a mistake on my end
e
Either way, I'd be happy to assist
k
Or something to do with my environment
e
Seems like there's an issue with STS credentials and S3A GetFileStatus. It's an old issue, but the circumstances seem too similar to be a coincidence. I'll read up on it a bit
k
Issue with lakeFS client or issue with something else?
The AWS account I have access to requires me to use a session token
e
Hadoop + STS credentials + S3A GetFileStatus
Yes, I understand
Can you try something real quick (if you haven't already)? Can you try setting AWS_REGION to your region in your terminal session and trying again?
k
can you clarify?
it’s set in the command
not sure how to set it in “terminal session”
as in “export” a variable?
e
I mean
export AWS_REGION=<your region>
in the terminal and running the job again
k
same error
e
So I think you'll need to enable SigV4 and specifically configure an S3 endpoint. Both the driver and all worker Spark nodes must run Java with
Copy code
-Dcom.amazonaws.services.s3.enableV4
if they want SigV4 to work, and you'll probably need to configure an S3 endpoint with the Spark configuration property spark.hadoop.fs.s3a.endpoint
k
ok let me see if i can figure that out
e
I'm not a Java expert by any stretch of the imagination, but I'll try to assist as best I can
Hopefully, we can work it out
I'll do some reading up meanwhile. Let me know how it goes
Just in case you're searching for how to use that parameter, here's an example
Copy code
spark-submit --conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
    --conf spark.executor.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
    ... (other spark options)
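And the endpoint itself goes in the same way, as just another conf (the value below is a placeholder for your actual endpoint):
Copy code
    --conf spark.hadoop.fs.s3a.endpoint=<your S3 endpoint> \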
k
yup, i got it. Different error now i’m pastebinning it
acts like it’s almost complaining about the structure of something
“the authorization header is malformed”
looks like a bug
e
I think I saw an issue with the AWS Java SDK re S3A + VPC endpoints. Let me try to find it
I think you're right. It's either a bug or a "feature" (a.k.a. it's unsupported)
k
but that’s for aws android sdk core
e
I'm guessing Java SDK and Android SDK have common ancestry
I see that in the command you ran you didn't enable sigV4
e
I think both are needed
k
But my error seems to be more similar to the first in a parsing issue
e
Yes, but I'm not sure of all the differences between sigV2 and sigV4. I think it's worth making sure we have the correct region picked up by the Java SDK, the correct endpoint, and that we're using SigV4 before we dig deeper. AFAIK, you'll need to configure this anyway, so might as well take care of it now and then move on. Otherwise we might hit on the right solution and not even know it
k
exact problem here
“Authorization Header is Malformed” (400) exception when PrivateLink URL is used in “fs.s3a.endpoint”
When a PrivateLink URL is used instead of the standard s3a endpoint, it returns an “authorization header is malformed” exception. So, if we set fs.s3a.endpoint=bucket.vpce-<some_string>.s3.ca-central-1.vpce.amazonaws.com and make S3 calls we get:
com.amazonaws.services.s3.model.AmazonS3Exception: The authorization header is malformed; the region 'vpce' is wrong; expecting 'ca-central-1' (Service: Amazon S3; Status Code: 400; Error Code: AuthorizationHeaderMalformed; Request ID: req-id; S3 Extended Request ID: req-id-2)
Cause: Endpoint parsing is done in a way that assumes the AWS S3 region is the 2nd component of the fs.s3a.endpoint URL delimited by “.”, so in the case of a PrivateLink URL it can’t figure out the region and throws an authorization exception. Thus, to add support for PrivateLink URLs we use fs.s3a.endpoint.region to set the region and bypass the parsing of fs.s3a.endpoint; in the case shown above, to make it work we set the AWS S3 region to ca-central-1.
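so the fix boils down to setting both of these (using the doc’s example values here):
Copy code
spark.hadoop.fs.s3a.endpoint=bucket.vpce-<some_string>.s3.ca-central-1.vpce.amazonaws.com
spark.hadoop.fs.s3a.endpoint.region=ca-central-1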
e
Yes, looks like it
So you want to try that?
k
yup, i’m doing that now
lakefs 1
it hasn’t crashed…yet
been running longer than i’ve seen it run before
not sure if it’s hung up lol
e
well, that's a good sign at least
k
no logging
so i’m not seeing anything
e
java + aws = always a great time
k
aws + anything imo has some god awful esoteric errors that don’t mean anything half the time
e
that's true
k
like i was getting a “could not start instance error” - no logs, no nothing, machine just died on start up. finally figured out that the system didn’t have permissions to access the key system in aws
it couldn’t decrypt the disk drive or something…
it also doesn’t help that this account i don’t own it’s work so i don’t have global admin and i don’t know what’s “set up” behind the scenes
it still hasn’t crashed…but unsure on what it’s doing
e
yeah, I know the feeling - flying blind
It can take a while, depending on the number of objects/prefixes/refs
k
only like 2500 objects
just a test location
e
that's not too bad, but much more strongly correlated to the number of refs
branches, commits, objects (in lakeFS, not S3)
k
like 2 commits 1 branch, 2500 objects total
in this “testproject”
e
hmm... then it shouldn't take too long, but it still takes time. I wouldn't be too worried about it taking time
k
It should be doing it for only the single project right?
feature request…logging of some sort ;)
e
Yes, I think so
Noted 🙂
k
ah damn, didn’t work
e
😞
k
complaining it couldn’t find the endpoint
so i’m not sure what those instructions are telling me to do
i have to pass the endpoint
e
tbh I'm not sure - not a Java expert, and the minimalism in the "solution" isn't appreciated in this case
Maybe try searching for a similar issue with a more detailed solution?
Meanwhile I'm working on other things. Please feel free to let me know if I can assist further or if you've made any interesting progress
k
yup! appreciate it! thanks!
lakefs 1
e
Sure, np
k
@Elad Lachmi ok I got it to successfully run… I had a couple of problems…
1) The comment and troubleshooting issue here: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html - that spark.hadoop.fs.s3a.endpoint.region option didn’t exist until hadoop-aws:3.3.2. See this: https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-17705
2) The other issue was subtle, but… you have to specify fs.s3a.endpoint=bucket.vpce-<some_string>.s3.ca-central-1.vpce.amazonaws.com
So now it successfully ran… however it didn’t clear anything out for this ref
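roughly, the flags I ended up adding were along these lines (real endpoint/region redacted, just a sketch):
Copy code
    --packages org.apache.hadoop:hadoop-aws:3.3.2 \
    --conf spark.hadoop.fs.s3a.endpoint=bucket.vpce-<some_string>.s3.<region>.vpce.amazonaws.com \
    --conf spark.hadoop.fs.s3a.endpoint.region=<region> \
    --conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
    --conf spark.executor.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \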
e
ok I got it to successfully run…
First things first... 🥳
k
Should it not delete from s3 all of these object?
e
spark.hadoop.fs.s3a.endpoint.region option didn’t exist until hadoop-aws:3.3.2
I see... so it might have been related to the issue I saw in the issue tracker 🤔
k
there are 12k objects, 508MB worth
e
Did you change anything else in the command you ran since you sent it to me last?
(besides the S3 configuration, of course)
k
yeah so in the lakeFS documentation I needed to change --packages to org.apache.hadoop:hadoop-aws:3.3.2
no
e
ok
checking...
k
e
So let's look at the simpler options first...
• Any object that is accessible from any branch’s HEAD.
• Objects stored outside the repository’s storage namespace. For example, objects imported using the lakeFS import UI are not collected.
• Uncommitted objects, see below.
These three categories of objects aren't candidates for GC, which is important to note
The second thing I'd double-check is the GC rules policy:
1. That one exists
2. That it's configured in a sensible way (whatever sensible means in your context)
You can see a ref for configuring this either via lakectl or via the UI: https://docs.lakefs.io/howto/garbage-collection.html#configuring-gc-rules
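For reference, a GC rules policy is just per-branch retention days with a default; a minimal example (the day counts here are made up, adjust to whatever makes sense for you) looks like:
Copy code
{
  "default_retention_days": 21,
  "branches": [
    {"branch_id": "main", "retention_days": 28},
    {"branch_id": "dev", "retention_days": 7}
  ]
}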
k
so if i uploaded a bunch of files and they were never committed they wouldn’t be GCed it seems?
Also where is the “import UI”
nvm i’m an idiot
😅 1
e
You might also want to check the file in _lakefs/retention/gc/addresses
If one doesn't exist or has very few objects, that could point us towards a policy/configuration issue. If it has all the objects, but doesn't hard-delete them, then we'll look into that
k
yeah 99% of these files were never committed
e
I see... so that's a different type of GC. It's called uncommitted GC
k
yup reading that now
e
Same method of running and all, but a bit of a different purpose
k
yeah, makes sense
got to upgrade to a later version of lakefs
e
From my experience, it's worth it. That's where the real ROI comes in, because people make copies of huge data sets over and over, and many of the branches are just abandoned after a bit of experimentation, but the objects remain and pile up real quick
But in terms of the endpoint setup, it should be the same
k
if i ran the migrate with 0.80.1 do i need to upgrade to 0.80.2 and migrate again?
or can i just drop the new binary in place?
e
Migrations automatically trigger a minor version bump, so it's probably a drop in, but let me make sure
k
i’m going to move all the way up to the latest (0.91.0)
but unsure if i need to use 0.80.2 to migrate up
e
0.80.1 requires a migrate up. The next few do not
From what I can see, there aren't any migrations needed from 0.80.1 up to latest 0.91.0
k
ok cool!
e
If you get the chance, I'd love to hear how things went and of course, if you have any further questions, feel free to reach out again
k
e
🤦🏻‍♂️
k
unsure on if i’m missing something
e
I think you have a missing config param - fs.s3.impl or fs.s3a.impl
k
is that in the docs?
e
Oh, wait... maybe it's configured ok. I just noticed you were using an s3:// path. Can you try s3a:// instead, before configuring more stuff?
k
sorry i’m confused where should i pass that?
i’m doing the same thing as I did with the other
i copied and pasted, changed the class and the -c options accordingly for do_sweep
e
Now that I think of it, the fact that uncommitted GC only supports S3 probably means we're using s3 and not s3a, so it might be correct that the s3:// protocol handler needs to be configured
k
hmmm
docs show s3a
hmmm i’m still struggling with this. i’ve messed with every setting i can think of but still getting the same error. Any thoughts on what to change?
e
Hi @Kevin Vasko, Can you please remind me what error message you're getting right now? We've been through a few 😅
k
I tried this
I think you have a missing config param - fs.s3.impl or fs.s3a.impl
I also tried changing the parameters in the command line to s3 instead of s3a.
e
Let me check something. It might take a few minutes
k
yup! no worries
e
Can you try adding this to your spark command?
Copy code
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
k
trying
no dice, but different error message… getting the error message for you
e
That looks like the error we originally got while trying to run GC, I think
k
it does but i still have all the other parameters in it
is there a fs.s3 set of properties i need to set?
e
I think when you set it to use s3a it uses the same conf params, but I'm not sure
k
oops, typo in my key!
it’s working!
e
nice!
k
well the mark part is working haha
e
I was thinking to myself "This should work... why isn't this working" 🙂
Another step in the right direction - I'll take it
k
well damn….
so close lol
Now i’m trying to run the sweep… it starts running and then blows up saying the AWS Access Key Id you provided does not exist in our records
e
So close! 😑
k
lol no joke
but the job starts running…
so it’s like it’s in the actual lakeFS code
e
It's still an AWS SDK error The job uses the SDK as well
But it's in the right direction
k
yeah
e
I think what it's starting to do is divvy up the work, and as soon as tasks start going, it tries to access S3 and hits that error
Just to make sure - you used the mark ID from the latest mark run, right?
k
yup
i had to add ‘spark.hadoop.lakefs.gc.do_sweep=true’ because it errored saying ‘Nothing to do, must specify at least one of mark, sweep. Exiting’
e
Yes, you either run with do_sweep=false and do_mark=true first and then the other way around or you can mark and sweep in one go (I think 🤔)
k
Yeah, in the docs it only has one
In step 1 it shows do_sweep=false and nothing else
In step 3, for the sweep, it shows do_mark=false
e
Yeah, maybe adding a mark_id and setting do_mark to false'll do it
k
yeah, i did do that
it’s got me to this point haha
e
From the docs, sweep-only mode should be configured like this
Copy code
spark.hadoop.lakefs.gc.do_mark=false
spark.hadoop.lakefs.gc.mark_id=<MARK_ID> # Replace <MARK_ID> with the identifier you used on a previous mark-only run
k
yup, and that errors with above issue
the docs are wrong, have to be
e
I'm looking through the code to see if we've missed something. I'll be a minute
Ok, can confirm that both do_sweep and do_mark are handled in code
so do_mark needs to be false, do_sweep needs to be true and mark_id needs to be the ID generated in the latest mark run
If that's all good, you should see that one of the first lines in the log output is
Copy code
deleting marked addresses: <mark_id>
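Put together as spark-submit flags, the sweep-only run would look roughly like this (everything in angle brackets is a placeholder; the rest of the options stay the same as your mark run):
Copy code
spark-submit --class io.treeverse.clients.GarbageCollector \
    --conf spark.hadoop.lakefs.gc.do_mark=false \
    --conf spark.hadoop.lakefs.gc.do_sweep=true \
    --conf spark.hadoop.lakefs.gc.mark_id=<MARK_ID> \
    ... (same lakeFS/S3A options as before) \
    <assembly jar> <repository> <region>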
k
yeah, that’s what I already did. Unfortunately where I’m at is the error regarding missing AWS Key that I linked
e
Yes, but I'm trying to work through the code to see where it's potentially getting stuck and what it's trying to do
k
that includes the new parameters with true/false for mark and sweep with the mark_id specified
👍🏻 1
So now i’m at the error of the AWS Key error
e
Yeah, that the key ID you provided doesn't exist
I'm trying to find where it's using the AWS SDK and what it's doing exactly
It's picking up all of the configuration params that start with fs. or lakefs., which seems ok
k
hmmm, yeah that’s odd
maybe a typo?
I’m unsure how the mark option works but the sweep doesn’t
e
Yes, that's very strange. That's why I'm looking through the code, to maybe pick up on something in the difference between how the sweep and mark work
k
makes sense
is this code public? if so i can look at it too
e
Yes, it is
btw: the input validation makes sure that the combination of existence/non-existence/values of the mark, sweep, and mark_id makes sense, so that's probably not the issue
k
yeah, it’s fully into the code at this point, it’s doing its maps and splits and stuff
it’s like once it gets to actually doing the deletes i bet it fails
e
it’s like once it gets to actually doing the deletes i bet it fails
Yeah, that's for sure. The question is where is it not getting the credentials, getting the wrong credentials, or not setting up the client correctly? (or some other option, which I'm not even thinking about right now)
I have a feeling it's not using the session token, but I'm not 100% sure yet
I think that might be the issue. I'm not the Scala or Spark expert around these parts, so I'll pass it on to one of my colleagues and we'll take a look together
k
ahhh makes perfect sense
e
I guess that without the session token it can't look up the access key ID, so it's like it doesn't exist
It would seem that that's the error you'd get if you use an STS key pair without a session token. Which kind of makes sense, if you imagine how they've implemented it, I guess
When you are testing credentials in Boto3: The error you receive may say this,
ClientError: An error occurred (InvalidAccessKeyId) when calling the ListBuckets operation: The AWS Access Key Id you provided does not exist in our records.
but may mean you are missing an aws_session_token if you are using temporary credentials (in my case, role-based credentials).
Looks like that's going to need fixing
k
dang! ok
well glad it wasn’t me
e
this time 😂
k
the next obvious question, do you know how long it might take to fix? I’m sure it’s low priority
haha yup. well technically most of these issues have been errors with environment
e
I'm not sure that it is... I'll have to talk to someone from that team
I think we can compromise on blaming Java and/or AWS
k
lol yup would agree
is there a bug report system that this needs to be documented in?
e
That's a great idea, but first I'd like to verify that I'm not missing something, although I'm relatively sure. We work with GitHub issues for bug reports
k
got it
Not sure if there is a workaround
e
Well, there might be a couple, but they defeat the purpose of using STS, so I wouldn't recommend them
However, we might be able to push this through relatively quickly. I'd like to at least try (once I confirm that that is the issue)
k
seems like an easy fix hopefully
e
I'm not sure what the wider context might be. It might be a more complex issue than I'm thinking, or it might be a different issue. As soon as I have more information, I'll let you know
k
cross my fingers that it’s simple :)
thanks!
e
same
sure, np
One quick thing I'd like to confirm: in the full log of the run, do you see a line starting with
Copy code
Use access key ID
k
I do not
e
ok, thanks!
k
should I?
e
I'm not sure, but I was trying to differentiate between two possible paths
k
got it
e
@Kevin Vasko Quick update: it would seem that's indeed the case. Would you be able to open an issue in the repo? Send me a link to it and I'll add all the needed tags
k
done! Thanks!
👍🏻 1
e
Thank you 🙏🏻
k
@Elad Lachmi know of any way to get a timeline for this? weeks? months? potentially never? just trying to find a workaround
j
Hi @Kevin Vasko, there is a related design proposal and an improvement that might be handy to support STS in the Spark metadata client (which GC uses). We’ll examine them at the beginning of next week and hopefully have an estimate on that.
k
Sounds good, thanks. I guess in the meantime are there any recommendations on running the GC to maybe bypass STS?
a
Hi @Kevin Vasko, Sorry you're having difficulty with this. Indeed for various reasons we do not currently support sts during sweep. I'm afraid that the only current workaround is not to use sts. We do support straight access key and secret key, or if you're using Spark 3 on Hadoop 3 you might be able to get cross-account delegation. Alternatively, if it's a small installation you might be able to get by with directly listing and then deleting all the swept objects.
k
yeah it’s a small instance only 10k objects or so
might write a script to do that
😞 1
👍 1
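if i do write that script, it’d probably be something like this (just a sketch - assumes i can dump the marked addresses into a plain text list of keys; bucket and file names are made up):
Copy code
# delete each marked object key directly from the repo's storage namespace
while read -r key; do
  aws s3 rm "s3://<storage-namespace-bucket>/${key}"
done < marked_keys.txt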
i
That’s a possibility. Let us know if you run into issues.