user
11/06/2022, 3:58 PM
repo
|_ config
|_ input data
|_ ml output data
I want the input data and ml output data to be deleted on the 10th day after the commit, but the config should live forever. Is it possible?
user
11/06/2022, 4:08 PM
user
11/06/2022, 4:36 PM
GC and was hoping there would be a way to auto-delete based on a retention policy, with support for excluding folders/formats.
Do you see a possible workaround for this, and do you think this is a less common use case?
user
11/06/2022, 4:52 PM
user
11/07/2022, 11:48 AM
input data and ml output data on your main branch, and run garbage collection. If you do this periodically (say, every 10 days), your storage costs should remain in check.
Does that meet your requirements?
user
11/07/2022, 2:42 PM
GC. Forgive me if I am asking wrong or basic questions.
Use case: In my org, there are multiple machine learning repositories using Git LFS. However, we are facing challenges with it. Then I came across DVC and lakeFS. lakeFS fits my use case perfectly; however, I am worried about what will happen when there are 1000s of commits over several years in the main branch that affect the input data and ml output data. I am worried about the storage cost of versioning all 1000 flavours of the data in the 1000 commits (note: input data and ml output data contain 30k binary files which are augmented/produced by a Python script). Since we would rather lose data than pay a hefty storage cost, I am expecting a solution where the data (except for the config folder) is deleted for commits older than 365 days and the commit message shows an identifier that the data is missing. So commits within 365 days will have the data and I can use it in Databricks, whereas commits older than 365 days will say data not found in Databricks.
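The policy requested above can be sketched as a small predicate. This is only an illustration of the desired behaviour, not a lakeFS feature or API: the function name, the `config/` prefix, and the example paths are all hypothetical.

```python
from datetime import date

# Hypothetical policy values taken from the use case described above.
KEEP_FOREVER_PREFIXES = ("config/",)  # folders excluded from expiry
RETENTION_DAYS = 365                  # data in older commits may be dropped


def data_available(path, commit_date, today):
    """Return True if `path` in a commit made on `commit_date` should still exist."""
    if path.startswith(KEEP_FOREVER_PREFIXES):
        return True  # config is kept regardless of commit age
    return (today - commit_date).days <= RETENTION_DAYS


today = date(2022, 11, 7)
old_commit = date(2020, 11, 7)     # about two years old
recent_commit = date(2022, 6, 1)   # within the last year

print(data_available("config/steps.md", old_commit, today))
print(data_available("input data/img_0001.bin", old_commit, today))
print(data_available("input data/img_0001.bin", recent_commit, today))
```

Under these assumptions only the config file survives in the two-year-old commit, while the recent commit still has its data.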
With this use case, I am not sure I will be able to delete it from the main branch, because I want it for 365 days.
user
11/07/2022, 3:00 PM
user
11/07/2022, 3:07 PM
user
11/07/2022, 3:16 PM
GC 😀.
Does GC delete files from previous commits to save space, or does it delete files which don't have any commits associated (like a deleted branch or the delete (unlink) command in lakeFS)?
user
11/07/2022, 3:43 PM
user
11/07/2022, 3:53 PM
config) in the repository. Is that possible?
Btw, I was going through the lakeFS blogs and they are really informative. I came across https://lakefs.io/data-versioning/ and I think my use case is similar to the section ‘Use TTL's to expire old versions’, but with an exclude option.
user
11/07/2022, 3:55 PM
Garbage collection rules in lakeFS define for how long to retain objects after they have been deleted
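As a rough sketch, GC rules are written as a JSON document with a repository-wide default and per-branch overrides. The rules are keyed by branch rather than by path, which is why keeping one folder forever needs a branch-level workaround. The branch names and day counts below are assumptions for illustration, and the helper `build_gc_rules` is hypothetical; check the lakeFS documentation for how to apply the resulting file.

```python
import json


def build_gc_rules(default_days, per_branch):
    """Build a GC rules document: a default retention plus per-branch overrides."""
    return {
        "default_retention_days": default_days,
        "branches": [
            {"branch_id": name, "retention_days": days}
            for name, days in sorted(per_branch.items())
        ],
    }


# Hypothetical setup: `main` keeps a year of history, while a separate
# `config-archive` branch gets a very long retention period.
rules = build_gc_rules(21, {"main": 365, "config-archive": 36500})
print(json.dumps(rules, indent=2))
```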
user
11/07/2022, 4:03 PM
user
11/07/2022, 4:05 PM
user
11/07/2022, 4:10 PM
config folder alone to a branch with an infinite retention policy, say, every night? But then if I have 5 commits on that day, I will be able to retain only one of them. Am I right?
user
11/07/2022, 4:13 PM
user
11/07/2022, 4:14 PM
The GC will delete older versions. “lakeFS provides a Spark program to hard-delete objects that have been deleted and whose retention period has ended according to the GC rules.”
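A simplified model of the quoted sentence, assuming "deleted" means "no longer present at the branch head", so an overwritten older version counts as well as an explicit delete. The function and parameter names are hypothetical, and the real GC job works from commit metadata rather than a single timestamp.

```python
from datetime import datetime, timedelta


def eligible_for_hard_delete(removed_from_head_at, retention_days, now):
    """True once a version has left the branch head AND its retention has elapsed."""
    if removed_from_head_at is None:
        return False  # still the current version on the branch: never collected
    return now - removed_from_head_at >= timedelta(days=retention_days)


now = datetime(2022, 11, 7)
print(eligible_for_hard_delete(None, 10, now))                     # still at head
print(eligible_for_hard_delete(datetime(2022, 11, 1), 10, now))    # replaced 6 days ago
print(eligible_for_hard_delete(datetime(2022, 10, 20), 10, now))   # replaced 18 days ago
```

With a 10-day retention, only the version replaced 18 days ago would be collected in this model.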
When I read it earlier, I misunderstood it to mean that the file must have been deleted and, at the same time, must have passed the retention period for GC to delete it. Basically, I thought of GC as an automatic recycle-bin clean-up based on the retention policy for that branch.
user
11/07/2022, 4:17 PM
input data is generated or augmented. Since many people can work on the same repo, there is no strict check on whether someone is modifying the existing scripts/docs or deleting them and creating new ones.
user
11/07/2022, 4:22 PM
user
11/07/2022, 4:25 PM
input data and ml output data by running the script and manually following the document in the config folder of that commit.
For example, say we found a production issue and the version used by a user is 2 years old. Then we will look into the commit and see that the data is missing. Then we will open the config folder and follow the steps laid out there to reproduce the model. This is rare; usually we see production issues in versions that are 6-12 months old. So for those common issues we will have the data and will not need to reproduce it (since the retention will be 365 days).
user
11/07/2022, 4:40 PM
user
11/07/2022, 4:45 PM
user
11/07/2022, 4:55 PM