# help
u
Hi all!! First of all, thanks for your work. LakeFS is an interesting project. I do have one query. I am trying to streamline my storage layer for machine learning and I came across your repo. I am planning to set up an Airflow pipeline where my users can commit changes in Azure and a Databricks ML notebook is then triggered. I am worried about storage cost. I am expecting a solution where the data is deleted from Azure storage every N days based on the branch, except for one folder which has the scripts and tools to regenerate the data. For example, if my repo structure is as below,
```
repo
|_ config
|_ input data
|_ ml output data
```
I want the `input data` and `ml output data` to be deleted on the 10th day after the commit, but `config` should live forever. Is it possible?
u
hey @Selva and welcome! lakeFS does provide built-in garbage collection capabilities, though I'm afraid it might not serve your exact use case out of the box, but you can achieve it with a different approach. Garbage collection detects the files that are no longer accessible from the HEAD of a branch and eventually deletes those files. This does solve your cost concern, but not the exact flow you presented. Note: if `config` always exists in the branch HEAD, it'll remain, and if I understand correctly, that's what you're looking for. Read more here on how garbage collection for lakeFS is designed.
u
Thanks @Or Tzabary. I did go through the GC docs and was hoping there would be a way to auto-delete based on a retention policy, with support for excluding folders/formats. Do you see a possible workaround for this, and do you think this is a less common use case?
u
Sure, happy to help 🙂 GC does automatically delete based on your rules; unfortunately, it doesn't take folders or formats into account, only branches and the number of days since the files were detached from HEAD. Mind sharing some information about the specific use case you're trying to achieve? I think we might be missing the point, as I don't fully understand your end goal. FYI, we have an SDK and API that might help you achieve your goal.
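Just to illustrate, the retention rules are keyed by branch and days only; they look roughly like this (field names are from the GC docs as I remember them, so please double-check against your lakeFS version):

```python
# Sketch of a lakeFS garbage-collection rules document, written as a Python
# dict. Field names follow the GC docs as I recall them; verify against your
# lakeFS version before applying them (via lakectl or the API).
gc_rules = {
    "default_retention_days": 21,  # fallback for branches not listed below
    "branches": [
        {"branch_id": "main", "retention_days": 10},
        {"branch_id": "experiments", "retention_days": 7},  # hypothetical branch
    ],
}
# Note: rules are per branch and per day count only; there is no per-folder
# or per-format filter, which is why `config` can't be excluded here.
```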
u
Hi @Selva - Just to make sure it's clear - all you need to do is delete files from `input data` and `ml output data` on your main branch, and run garbage collection. If you do this periodically (say, every 10 days), your storage costs should remain in check. Does that meet your requirements?
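For illustration only, such a periodic job (Airflow would fit here) could look roughly like the sketch below. It talks to the REST API directly; the endpoint paths and field names are my best reading of the API docs and may differ between lakeFS versions, and the repo, branch and credential values are placeholders:

```python
# Sketch of a scheduled cleanup job against the lakeFS REST API using `requests`.
# Endpoint paths follow the lakeFS OpenAPI spec as I understand it and may
# differ between versions; repo, branch, prefixes and credentials are placeholders.
import requests

LAKEFS = "https://lakefs.example.com/api/v1"   # assumed server address
AUTH = ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")  # assumed credentials
REPO, BRANCH = "repo", "main"
PREFIXES = ["input data/", "ml output data/"]  # `config/` is left untouched

for prefix in PREFIXES:
    after = ""
    while True:
        # List objects under the prefix on the branch head, page by page.
        resp = requests.get(
            f"{LAKEFS}/repositories/{REPO}/refs/{BRANCH}/objects/ls",
            params={"prefix": prefix, "after": after, "amount": 1000},
            auth=AUTH,
        )
        resp.raise_for_status()
        body = resp.json()
        for obj in body["results"]:
            # Delete each object from the branch (a soft delete until GC runs).
            requests.delete(
                f"{LAKEFS}/repositories/{REPO}/branches/{BRANCH}/objects",
                params={"path": obj["path"]},
                auth=AUTH,
            ).raise_for_status()
        if not body["pagination"]["has_more"]:
            break
        after = body["pagination"]["next_offset"]

# Commit the deletions; GC will later hard-delete the underlying data once
# the retention period for this branch has passed.
requests.post(
    f"{LAKEFS}/repositories/{REPO}/branches/{BRANCH}/commits",
    json={"message": "scheduled cleanup of input/output data"},
    auth=AUTH,
).raise_for_status()
```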
u
Hi @Oz Katz and @Or Tzabary Thanks a lot and sorry for the delay. Thanks for the suggestion on GC. Forgive me if I am asking wrong or basic questions. Use case: in my org, there are multiple machine learning repositories using Git LFS. However, we are facing challenges with it. Then, I came across DVC and LakeFS. LakeFS perfectly fits my use case; however, I am worried about what will happen when there are 1000s of commits over several years in the `main` branch which affect the `input data` and `ml output data`. I am worried about the storage cost of versioning all 1000 flavours of the data in the 1000 commits (note: `input data` and `ml output data` contain 30k binary files which are augmented/produced by a Python script). Since we are okay with losing data rather than paying a hefty storage cost, I am expecting a solution where the data (except for the `config` folder) is deleted for commits older than 365 days and the commit message shows an identifier that the data is missing. So, commits within 365 days will have the data and I can use it in Databricks, whereas commits older than 365 days will say data not found in Databricks. With this use case, I am not sure I will be able to delete it from the `main` branch because I want it for 365 days.
u
Hi Selva, so you want to keep the `config` folder from being deleted by the garbage collection. You need to make sure there is always a commit newer than 365 days old that contains it. That way the garbage collection won't delete it.
u
You can commit it once in a while, or, if you're already editing and committing it regularly, that's enough.
u
Thanks @Eden Ohana. Now I am thinking, did I misunderstand GC 😀. Does GC delete files from previous commits to save space, or does it delete files which don't have any commits associated with them (like a deleted branch, or the delete (unlink) command in LakeFS)?
u
It deletes files from commits that are older than your retention date only if you don't have a commit newer than your retention date containing them.
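In other words, a simplified illustration of that rule (just the semantics, not the actual GC algorithm):

```python
# Simplified illustration of the GC rule described above: a file is eligible
# for hard deletion only if *every* commit containing it is older than the
# branch's retention window. Dates here are made up for the example.
from datetime import datetime, timedelta

retention = timedelta(days=365)
now = datetime(2023, 1, 1)

# Commit dates of the commits that contain a given file.
commits_containing_file = [datetime(2021, 6, 1), datetime(2022, 11, 20)]

collectible = all(now - c > retention for c in commits_containing_file)
print(collectible)  # False: the 2022-11-20 commit is still within 365 days
```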
u
Thanks @Eden Ohana. Then that is what I am looking for, but I also need exclusion or filter support. That is, I want to set a different retention period (say 5y or inf) for a specific folder (`config`) in the repository. Is that possible? Btw, I was going through the LakeFS blogs and they are really informative. I came across https://lakefs.io/data-versioning/ and I think my use case is similar to the section ‘_Use TTL's to expire old versions_’ but with an exclude option.
u
Btw, however, https://docs.lakefs.io/reference/garbage-collection.html#considerations says GC works on data that has been deleted from a branch, with no mention of deleting older versions:
```
Garbage collection rules in lakeFS define for how long to retain objects after they have been deleted
```
u
We currently do not support an exclude-folder option. You're welcome to open an issue regarding it 🙂 I can suggest making sure it is included in a commit that hasn't expired. You can do that by committing it once in a while, or you can commit it to a separate branch and not have a retention policy on that branch.
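As a rough sketch of the separate-branch idea (same caveat: endpoint paths and field names are my reading of the API docs, and the branch name is just a placeholder):

```python
# Sketch: a dedicated branch that only ever receives the `config` folder.
# Endpoint paths and field names are assumptions based on the lakeFS API docs;
# "config-archive" and the credentials are placeholders.
import requests

LAKEFS = "https://lakefs.example.com/api/v1"
AUTH = ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")
REPO = "repo"

# Create the branch once, off main.
requests.post(
    f"{LAKEFS}/repositories/{REPO}/branches",
    json={"name": "config-archive", "source": "main"},
    auth=AUTH,
).raise_for_status()

# GC rules: short retention on main; one way to approximate "no retention
# policy" on the config branch is a very large retention value.
gc_rules = {
    "default_retention_days": 365,
    "branches": [
        {"branch_id": "main", "retention_days": 365},
        {"branch_id": "config-archive", "retention_days": 36500},  # ~100 years
    ],
}
```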
u
The GC will delete older versions. “lakeFS provides a Spark program to hard-delete objects that have been deleted and whose retention period has ended according to the GC rules.”
u
Thanks @Eden Ohana. I will definitely create an issue, as I see this will be useful. On your suggestion of using a separate branch, do you mean committing the `config` folder alone to a branch with an inf retention policy, say, every night? But then if I have 5 commits on that day, I will only be able to retain one commit. Am I right?
u
Are you updating the config folder on a daily basis?
u
Thanks for the quote:
```
The GC will delete older versions. “lakeFS provides a Spark program to hard-delete objects that have been deleted and whose retention period has ended according to the GC rules.”
```
When I read it earlier, I misunderstood it as meaning the file must both be deleted and have crossed the retention period for GC to delete it. Basically, I thought of GC as an automatic recycle-bin clean-up based on the retention policy for that branch.
u
Yeah. My config folder contains the ML Python scripts and documents on how the `input data` is generated or augmented. Since many people can work on the same repo, there is no strict check on whether someone is modifying the existing scripts/docs or deleting them and creating new ones.
u
Got it. And do you want to keep all versions of the config folder, or just make sure the latest version will not be deleted?
u
All the versions and the latest one. Basically, we believe one can reproduce the `input data` and `ml output data` by running the script and manually following the document in the config folder of that commit. For example, say we found a production issue and the version used by a user is 2 years old. Then, we will look into the commit and see the data is missing. Then, we will open the config folder and follow the steps laid out in there to reproduce the model. This is rarer; usually we see production issues in 6-12 month-old versions. So in those common cases, we will have the data and there is no need to reproduce it (since the retention will be 365 days).
u
You can have a dedicated branch for the config folder that will not have a retention policy, and write the input and output data to a separate branch with a retention policy. Then you will need to know which config version (commit) produced the output data (you can add it to the commit metadata). Or you can write the config folder twice: once to a dedicated branch that contains only it, without a retention policy (you can use it when it has expired from the main branch after 365 days), and again with the output data to a separate branch. That way you can reproduce config and output from a single commit within the last 365 days.
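A minimal sketch of the metadata idea, assuming the dedicated config branch from before (endpoint paths are my reading of the API docs; names, credentials and the commit message are placeholders):

```python
# Sketch: record which config version produced this output data by putting the
# config branch's head commit ID into the data commit's metadata.
# Endpoint paths are assumptions based on the lakeFS API docs; "config-archive",
# the credentials and the commit message are placeholders.
import requests

LAKEFS = "https://lakefs.example.com/api/v1"
AUTH = ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")
REPO = "repo"

# Look up the current head commit of the config branch.
branch = requests.get(
    f"{LAKEFS}/repositories/{REPO}/branches/config-archive", auth=AUTH
)
branch.raise_for_status()
config_commit = branch.json()["commit_id"]

# Commit the freshly written output data on main, pointing back at that config commit.
requests.post(
    f"{LAKEFS}/repositories/{REPO}/branches/main/commits",
    json={
        "message": "ml output for run 2024-05-01",
        "metadata": {"config_commit": config_commit},
    },
    auth=AUTH,
).raise_for_status()
```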
u
Thanks @Eden Ohana. Unfortunately I feel this approach will overcomplicate our setup, and that is also one reason we are looking for alternatives. Nevertheless, thanks for your support. I will create an issue on GitHub and hope one day this is available 🙂
u
Sure 😀 let me know if you have additional questions