# help
u
Hello, I have a question regarding garbage collection. Sorry if it's somewhere in the docs, but I've read "everywhere" and I can't seem to figure this out. I started looking into the garbage collection rules and created a config. I applied it to a repository and all was well. Then I wanted to see how it actually runs, and suddenly there is a `spark-submit` in the docs. 🙂 So my question is: where do I run this? I can't see any other references to Spark (other than for interacting with lakeFS). Do I need to set up a Spark cluster for this? The documentation doesn't give much info; it feels like you are supposed to just know it. But maybe I just missed the info somewhere? Thank you for any help/clarification. 🙂
u
Hi @mwikstrom! That’s a good question. Yes, garbage collection does rely on Spark. This is a recent addition to lakeFS, so I apologize if the documentation is a little rough around the edges. You can use a managed Spark environment, as provided by all major cloud providers, or, for smaller environments, run Spark in standalone mode. Are you running on a public cloud (e.g. AWS, Azure)?
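For reference, the invocation looks roughly like the sketch below. This is a hedged example based on the lakeFS docs at the time: the client JAR path, endpoint URL, repository name, and region are placeholders, and exact configuration keys and versions may differ between releases, so please check the garbage collection docs for your lakeFS version.

```shell
# Sketch: run the lakeFS garbage collection job on an existing Spark
# environment. All <angle-bracket> values and the JAR path are placeholders.
spark-submit \
  --class io.treeverse.clients.GarbageCollector \
  -c spark.hadoop.lakefs.api.url=https://<lakefs-endpoint>/api/v1 \
  -c spark.hadoop.lakefs.api.access_key=<lakefs-access-key> \
  -c spark.hadoop.lakefs.api.secret_key=<lakefs-secret-key> \
  <path-to-lakefs-spark-client-jar> \
  <repository-name> <storage-region>
```

The job itself connects to the lakeFS API to read the GC rules you configured, then scans and deletes expired objects from the underlying object store, which is why it also needs credentials for that store (passed via the usual Hadoop configuration keys for your cloud).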
u
Hi, thanks for the quick answer. 🙂 Yeah, right now we’re running lakeFS in Kubernetes on GCP. We’re not running Spark today (or "yet", I might say). But I just wanted to know whether it was part of lakeFS or an external Spark environment, and now I know, thanks. 🙂 Is this the plan going forward? Putting some functionality "outside" of the core service itself?
u
Mostly just trying to use tools that are familiar to our users 🙂 Garbage collection is essentially a batch job, scanning and processing (ehm, deleting) potentially large amounts of data. Spark works really well for such use cases, providing a fast, distributed solution. I assume that for smaller environments and more ML-centric use cases, a non-distributed version without such a dependency could be beneficial, but that really depends on community feedback.
u
Can you elaborate a little on how you’re using lakeFS? Which technologies in your stack does it integrate with, and, if you can share, who are its primary users?
u
No, it makes sense to use Spark for that, and not to include it in the lakeFS core service. Makes your job easier too. 😉 And we’re not really using lakeFS yet. 🙂 We're still in an early phase in one of our projects and we’re investigating it. So right now it’s a very ML-centric use case, but we’re also looking at it from a broader perspective. We don’t have a Spark environment today, but that’s more about where we are in the project right now. I very much think we will have that (or something similar), and then this might not be a problem. My question was more related to the documentation, I guess; I hadn’t seen any reference to it before, so I wondered if I had missed something. 🙂
u
You haven't missed anything - we can and should do better. Can I ask you to open a GitHub issue to improve the documentation on the Spark requirements for garbage collection? (I can do it instead, if you prefer.)
u
No worries, I’ll open an issue. 🙂
u
Thanks!
u
Thanks! (cc @Guy Hardonag who will likely own this)