# help
u
Hello, I have a question regarding garbage collection. Sorry if it's somewhere in the docs, but I've read "everywhere" and I can't seem to figure this out. I started looking into the garbage collection rules and created a config. I applied it to a repository and all was well. Then I wanted to see how it actually runs, and suddenly there is a `spark-submit` in the docs. 🙂 So my question is: where do I run this? I can't see any other references to Spark (other than for interacting with lakeFS). Do I need to set up a Spark cluster for this? The documentation doesn't give much info; it feels like you are supposed to just know it. But maybe I just missed the info somewhere? Thank you for any help/clarification. 🙂
u
Hi @mwikstrom! That’s a good question. Yes, garbage collection does rely on Spark. This is a recent addition to lakeFS, so I apologize if the documentation is a little rough around the edges. You can use a managed Spark environment, as provided by all major cloud providers, or, for smaller environments, run Spark in standalone mode. Are you running on a public cloud (e.g. AWS, Azure)?
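For reference, the invocation looks roughly like the sketch below. This is a hedged example based on the lakeFS docs at the time: the client JAR path, endpoint URL, repository name, and region are placeholders, and exact configuration keys and versions may differ between releases, so please check the garbage collection docs for your lakeFS version.

```shell
# Sketch: run the lakeFS garbage collection job on an existing Spark
# environment. All <angle-bracket> values and the JAR path are placeholders.
spark-submit \
  --class io.treeverse.clients.GarbageCollector \
  -c spark.hadoop.lakefs.api.url=https://<lakefs-endpoint>/api/v1 \
  -c spark.hadoop.lakefs.api.access_key=<lakefs-access-key> \
  -c spark.hadoop.lakefs.api.secret_key=<lakefs-secret-key> \
  <path-to-lakefs-spark-client-jar> \
  <repository-name> <storage-region>
```

The job itself connects to the lakeFS API to read the GC rules you configured, then scans and deletes expired objects from the underlying object store, which is why it also needs credentials for that store (passed via the usual Hadoop configuration keys for your cloud).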
u
Hi, thanks for the quick answer. 🙂 Yeah, right now we’re running lakeFS in Kubernetes on GCP. We’re not running Spark today (or "yet", I might say). But I just wanted to know whether it was part of lakeFS or an external Spark environment, and now I know, thanks. 🙂 Is this the plan going forward? Putting some functionality "outside" of the core service itself?
u
Mostly just trying to use tools that are familiar to our users 🙂 Garbage collection is essentially a batch job, scanning and processing (ehm, deleting) potentially large amounts of data. Spark works really well for such use cases, providing a fast, distributed solution. I assume that for smaller environments and more ML-centric use cases, a non-distributed version without such a dependency could be beneficial, but that really depends on community feedback.
u
Can you elaborate a little on how you’re using lakeFS? Which technologies in your stack does it integrate with, and, if you can share, who are its primary users?
u
No, it makes sense to use Spark for that, and not to include it in the lakeFS core service. Makes your job easier too. 😉 And we’re not really using lakeFS yet. 🙂 We're still in an early phase in one of our projects and we’re investigating it. So right now it’s a very ML-centric use case, but we’re also looking at it from a broader perspective. We don’t have a Spark environment today, but that’s more about where we are in the project right now. I very much think we will have that (or something similar), and then this might not be a problem. My question was more related to the documentation, I guess; I hadn’t seen any reference to it before, so I wondered if I had missed something. 🙂
u
You haven't missed anything - we can and should do better. Can I ask you to open a GitHub issue to improve the documentation on the Spark requirements for garbage collection? (I can do it instead, if you prefer.)
u
No worries, I’ll open an issue. 🙂
u
Thanks!
u
Thanks! (cc @Guy Hardonag who will likely own this)