
Daniel Satubi

01/01/2023, 12:54 PM
Hi guys! It’s been a while… 🙂 We’re trying to run the Garbage Collector and ran into some dependency issues. We tried running it (the sweep stage; the mark stage finished successfully) on Spark 2.4.8 (2.4.7 with --packages no longer works because of the Bintray deprecation). When specifying hadoop-aws 2.7.7 in --packages it pulls in aws-java-sdk 1.7.4 as a dependency, and we get a NoClassDefFoundError (the class definition exists in newer aws-java-sdk versions, but not in 1.7.4):
Caused by: java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSStaticCredentialsProvider
	at io.treeverse.clients.conditional.S3ClientBuilder$.build(S3ClientBuilder.scala:26)
	at io.treeverse.clients.BulkRemoverFactory$S3BulkRemover.getS3Client(BulkRemoverFactory.scala:69)
	at io.treeverse.clients.BulkRemoverFactory$S3BulkRemover.deleteObjects(BulkRemoverFactory.scala:58)
Could you help us figure out how to run the sweep? Thanks :)
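(For reference, a GC run of this shape, following the spark-submit pattern from the lakeFS GC docs for the Spark 2 client, looks roughly like the sketch below; the endpoint, credentials, jar name, repository, and region are placeholders, not values from this thread.)

```shell
# Sketch of a lakeFS GC spark-submit invocation (Spark 2 / Hadoop 2 pattern
# from the lakeFS docs). All values below are placeholders.
spark-submit --class io.treeverse.clients.GarbageCollector \
  --packages org.apache.hadoop:hadoop-aws:2.7.7 \
  --conf spark.hadoop.lakefs.api.url=https://lakefs.example.com/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
  --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
  lakefs-spark-client-assembly.jar \
  example-repo us-east-1
```

If memory serves, com.amazonaws.auth.AWSStaticCredentialsProvider only appeared in the aws-java-sdk 1.11.x line, which is why the 1.7.4 jar that hadoop-aws 2.7.7 declares cannot satisfy it.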

Oz Katz

01/01/2023, 1:13 PM
Hey @Daniel Satubi! Sure 🙂 looping in @Tal Sofer

Tal Sofer

01/01/2023, 1:20 PM
Thanks @Oz Katz! and hi @Daniel Satubi 🙂 looking into it

Daniel Satubi

01/01/2023, 1:47 PM
Thanks, I’ll add the full stack trace with the submit command

Tal Sofer

01/01/2023, 1:47 PM
Thank you!
Hi @Daniel Satubi, I managed to reproduce the issue and we already have a draft PR that fixes it. We would like to run more tests tomorrow to validate the solution, and then we will get the fix in. I will update you tomorrow on the progress and on when we expect to release the fix (I expect it to be a matter of 1-3 days). Thanks for reporting the issue and have a great evening!

Daniel Satubi

01/02/2023, 8:54 AM
Thanks! 🙂 let me know if I can help test it somehow
🙏 1

Tal Sofer

01/02/2023, 4:03 PM
Hi @Daniel Satubi 🙂 I’m getting back to you with updates. We ran some tests on the solution in the PR above and are planning to run more validations. So far things look good, so if everything works well we will release the fix by the end of this week, hopefully before Thursday. I will of course continue to share updates with you.

Daniel Satubi

01/03/2023, 9:52 AM
Thanks! 🙂 I read some of the discussion in the PR. Are you testing with 2.4.7 or 2.4.8? Which Hadoop version is in the Spark jars?

Tal Sofer

01/03/2023, 10:38 AM
I’m testing with 2.4.7, Hadoop version 2.7.7

Daniel Satubi

01/03/2023, 2:35 PM
We have 2.4.7 & 2.4.8 clusters with Hadoop 2.7.3. I’m not sure it’ll work; I’ll see what we can do…

Tal Sofer

01/04/2023, 1:21 PM
Hi @Daniel Satubi! I’m considering another possible solution that may be independent of your Hadoop version. I will keep you updated.

Daniel Satubi

01/04/2023, 1:53 PM
Thanks! looking forward… 🥳
:heart_lakefs: 2

Tal Sofer

01/05/2023, 2:29 PM
That’s perfect! 🤗 Thanks for letting us know, and for offering to test things out! Sharing updates on our current status: over the last few days we tested the solution in https://github.com/treeverse/lakeFS/pull/4920; while it works with Hadoop 2, it doesn’t work with Hadoop 3, which broke our Spark3-Hadoop3 client. We are now testing another solution that limits this change to our Hadoop 2 builds only. For that reason we will not be releasing the fix this week, but next week, after completing validations 🙂
I will update you early next week on when we expect to release. Once the client is out, we would appreciate your help in testing it.

Daniel Satubi

01/05/2023, 3:56 PM
Great! thank you very much 🙂 have a nice weekend
:heart_lakefs: 1
Hi, good morning ☀️ Any news? Sorry for pinging about this so much. We saw our storage costs go up linearly with our usage (branch per application run), but we don’t access the data after the application run (only main & ongoing app branches are relevant), so we’d really like to clean up our S3 🙂 🙏

Tal Sofer

01/09/2023, 8:57 AM
Hi Daniel, good morning!
I was planning to write to you in the next few hours, but you are faster! :lakefs: All testing went successfully, but we added an integration test that simulates a run of Spark 3 with Hadoop 2.7.4, and this test fails, probably for Hadoop reasons (unrelated to lakeFS). Manual testing on an environment matching your spec worked, but I’d like to understand the impact on your env before we release this. We will decide today how to proceed and when to release, and I will write back to you by EOD today. I hope this helps!

Daniel Satubi

01/09/2023, 9:17 AM
Thanks! If you’d like to send us the artifact for testing before releasing, we’d be more than happy 🙂

Tal Sofer

01/09/2023, 9:35 AM
Thanks for the kind offer! I will actually be happy to do so. What email can I send the artifact to?

Daniel Satubi

01/09/2023, 9:37 AM
daniels@windward.ai, or even here if it works (not sure if there’s a size limit in Slack)

Tal Sofer

01/09/2023, 9:37 AM
Thanks, will let you know once I send it 🙂 I’m in a meeting, will do it right after.
🙏 1
@Daniel Satubi I sent you the client via email, let me know if you have any issues with it. Looking forward to hearing how testing goes!

Daniel Satubi

01/09/2023, 10:47 AM
Thanks! Should I run it with the same command as in the docs, with --packages and the same dependency?

Tal Sofer

01/09/2023, 10:48 AM
Yes, it’s --packages org.apache.hadoop:hadoop-aws:2.7.7
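(A quick, hedged way to sanity-check which aws-java-sdk jar Spark actually resolved from --packages; the Ivy cache path below is illustrative and may differ on your cluster.)

```shell
# List the resolved aws-java-sdk jar and check for the class that is
# missing in 1.7.4 (Ivy cache location is illustrative).
unzip -l ~/.ivy2/jars/com.amazonaws_aws-java-sdk-*.jar \
  | grep AWSStaticCredentialsProvider \
  || echo "class not found in resolved jar"
```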

Daniel Satubi

01/09/2023, 10:58 AM
Thanks! Will run and report back 🙂
🤗 1
It finished successfully, but I don’t think the files we expected were deleted…
I’m re-running both mark & sweep together now instead of separately.
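(For readers landing here later: mark-only vs sweep-only runs are selected with Spark conf flags. The property names below are my reading of the lakeFS GC docs and should be verified against the docs for your client version.)

```shell
# Hedged sketch: splitting GC into separate mark and sweep jobs.
# Verify the exact property names against the lakeFS GC docs.

# Mark only - record the addresses to delete, but delete nothing:
#   --conf spark.hadoop.lakefs.gc.do_sweep=false

# Sweep only - delete the addresses recorded by a previous mark run:
#   --conf spark.hadoop.lakefs.gc.do_mark=false \
#   --conf spark.hadoop.lakefs.gc.mark_id=<MARK_ID reported by the mark run>

# With neither flag set, a single job runs mark + sweep together.
```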

Tal Sofer

01/09/2023, 11:58 AM
Thanks for reporting back 🙂
It finished successfully but I don’t think the files we expected were deleted…
For example, what did you expect to happen but didn’t? Do I understand correctly that you ran a sweep-only job on previously marked addresses?
I’m re-running both mark & sweep together now instead of separate.
Cool, let me know how this goes

Daniel Satubi

01/09/2023, 12:05 PM
Previously we deleted all stale branches and ran the mark job. We have a 300TB+ bucket for lakeFS with about 20TB of actively accessed data. We also see the data growing linearly, so we assume ~20TB is the size of “main”, give or take, and the ~300TB is the data from the branches. We expected to see some changes after the sweep, but none took effect… Can we set the job to log more info?
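(Side note on retention configuration, since sweep only deletes what the policy lets expire: lakeFS GC reads per-branch retention rules from a JSON document. The sketch below shows the shape of such a policy for a branch-per-run layout; the day values are made-up examples, not a recommendation.)

```python
import json

# Sketch of a lakeFS GC retention policy for a "branch per application run"
# layout: expire run branches quickly, keep main longer. Day values are
# made-up examples, not a recommendation.
rules = {
    "default_retention_days": 7,  # applies to short-lived per-run branches
    "branches": [
        {"branch_id": "main", "retention_days": 28},
    ],
}

policy_json = json.dumps(rules, indent=2)
print(policy_json)
```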

Tal Sofer

01/09/2023, 1:05 PM
Really happy that the artifact we shared with you fixes the dependency issue, thanks for testing it out! :heart_lakefs: As for your question, since it is related to the way GC works, can I ask you to move it to our #help channel? I’m sure other community members can benefit from it. When you move it there, I would add more details about your GC setup: for example, what retention policy you have configured, how you ran the job (mark and then sweep, or mark+sweep together), and attach any logs you feel comfortable sharing. If you think there is a bug, you are also welcome to open a GitHub issue describing it.
:lakefs: 1

Daniel Satubi

01/09/2023, 1:29 PM
Thanks! I’ll wait for the new run to finish, and if there’s any problem I’ll write in the #help channel. We’re very grateful for the quick replies and all the help 🙂

Tal Sofer

01/09/2023, 2:39 PM
Our pleasure! thank you!
Hi @Daniel Satubi, I’m getting back to you with an update on the client release. I merged the fix you’ve tested today; note that it will only work for a Hadoop 2.7.7 setup. Given that you already have the build, we will release it by the end of the week. Does this work for you? Our integration test setup was not accurate and therefore failed; we will enable it in another PR.

Daniel Satubi

01/09/2023, 3:57 PM
Still running… I’ll report back tomorrow with all the results and our cluster versions.

Tal Sofer

01/09/2023, 3:58 PM
Cool! thank you!