# help
j
I'm concerned about storage costs using lakeFS. Is Glacier an option, or are Intelligent-Tiering and GC as far as lakeFS optimizes for storage cost?
a
Hi @John Kim, tl;dr: lakeFS used to support this, but no longer offers an easy way to do it. It would be possible to reverse that decision; we simply never saw any use for the feature.

lakeFS uses S3 to hold your data objects and also its metadata objects. Metadata size is a small fraction of data size, and metadata can be shared between commits, so lakeFS offers no way to control its tier. Let's talk about data, then...

Data: you can hold your data objects on S3 in whatever tier you need. The putObject API has a storageClass argument that can do this, but lakeFS's support for it is deprecated; we can discuss why on this thread. Alternatively, you could upload your data directly into whatever storage class you like and then link it into lakeFS (see the sketch below). With the direct-link method, lakeFS will not GC those objects. It is hard for lakeFS to know in which tier to place objects: the same object, if unchanged, can appear throughout history on many branches. I am also not sure how easy this would be for you to do.

lakeFS could also help you change the storage class of inactive objects: if you ran GC in Mark mode it would return a list of old objects. You could then move those to your desired storage class, probably using your own tagging and lifecycle rules. But bear in mind that this is a pretty big bet on the future use of each object. This tagging could become an option for Sweep, which would give an automated solution.

I realize that this is discouraging. IIRC, storageClass support was actually my first feature on lakeFS, and it saw no use. Perhaps we had a bad use case for it; are you open to discussing yours?
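A minimal sketch of the direct-link and Mark-mode ideas above, in Python with boto3. The bucket, keys, file name, lakeFS endpoint, credentials, and the shape of the staging/backing call are illustrative assumptions, not anything confirmed in this thread; check your lakeFS server's API reference before relying on them.

```python
# Sketch: upload an object to S3 in a chosen storage class, then link its
# physical address into lakeFS. Endpoint path and payload fields below are
# assumptions based on the lakeFS staging API; verify against your API docs.
import boto3
import requests

S3_BUCKET = "my-lakefs-storage"            # assumption: your repo's backing bucket
PHYSICAL_KEY = "direct-uploads/part-0001"  # key outside lakeFS's managed prefix
LAKEFS = "http://localhost:8000/api/v1"    # assumption: local lakeFS endpoint
AUTH = ("AKIA-EXAMPLE", "secret")          # lakeFS access key / secret key

s3 = boto3.client("s3")

# 1. Upload directly to S3, choosing the storage class yourself.
body = open("part-0001.parquet", "rb").read()
resp = s3.put_object(
    Bucket=S3_BUCKET,
    Key=PHYSICAL_KEY,
    Body=body,
    StorageClass="GLACIER_IR",  # or INTELLIGENT_TIERING, DEEP_ARCHIVE, ...
)

# 2. Link the physical address into a lakeFS branch. Remember: lakeFS
#    will NOT garbage-collect objects linked this way.
requests.put(
    f"{LAKEFS}/repositories/my-repo/branches/main/staging/backing",
    params={"path": "datasets/part-0001.parquet"},
    json={
        "staging": {"physical_address": f"s3://{S3_BUCKET}/{PHYSICAL_KEY}"},
        "checksum": resp["ETag"].strip('"'),
        "size_bytes": len(body),
    },
    auth=AUTH,
).raise_for_status()

# 3. Mark-mode follow-up: given a list of old physical addresses from a GC
#    Mark run, you could transition each one in place with copy_object --
#    a big bet on future access patterns, as noted above.
def transition(bucket: str, key: str, storage_class: str = "GLACIER") -> None:
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        StorageClass=storage_class,
        MetadataDirective="COPY",
    )
```

If something like the `transition` helper matches your setup, the same call is what an automated Sweep option would run; tag-driven lifecycle rules are the no-code alternative.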
j
Thanks for your response. I'll look more into the direct linking option to see if it satisfies our use case.
👍 1
As a follow-up, related question: my organization needs to split storage costs across different orgs for accounting reasons. Is there a way to link/inventory the content-addressed data blocks in the S3 bucket to directories in lakeFS?
a
Directories won't work for this: the same content-addressed object can back paths in many directories and commits. If you can give each org its own repository, then just measure the storage namespace used by each repo separately (see the sketch below). Better still, give each org its own repo, and each repo its own bucket owned by that org; then AWS can do the accounting for you.
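A sketch of the per-repo measurement under the one-repo-per-org layout, assuming each repo's storage namespace is a distinct S3 prefix; the bucket and prefix names here are hypothetical:

```python
# Sketch: sum the stored bytes under each repo's storage namespace so cost
# can be attributed per org. Mapping below is a hypothetical example.
import boto3

NAMESPACES = {
    "org-a": ("lakefs-storage", "repos/org-a/"),
    "org-b": ("lakefs-storage", "repos/org-b/"),
}

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for org, (bucket, prefix) in NAMESPACES.items():
    total = sum(
        obj["Size"]
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
        for obj in page.get("Contents", [])
    )
    print(f"{org}: {total / 1e9:.2f} GB")
```

With one bucket per org you can skip this entirely and use AWS cost allocation on the buckets themselves.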
j
Could that result in data duplication as data moves across repositories?
a
Of course. I guess I'm not clear on who pays for what, sorry.
:gratitude-thank-you: 1
j
Ok, appreciate you answering my very naive questions!