I think I know the answer but just want to be sure...
# help
h
I think I know the answer but just want to be sure: can we delete a commit?
e
Hi @HT, you know it… 😉 No, you cannot.
h
😅
a
No support for this at all. It would be a tough one to pull off. TBH I'm not even sure what the desired behaviour would be. It would probably depend on why exactly you are deleting it. Things like what happens to commits downstream of the deleted commit, or even whether the commit metadata should vanish or be kept (and accessible through its digest?) are all user-level requirements for such a feature. If you need it please open an issue so that we have somewhere to ask all these questions. I'm not even sure we will be able to do it or when -- but it's probably worth asking the question 😕
h
I had a look at how Git does it: it's quite complex! You basically need to rebase, rewriting every commit after the one you want to delete, which rewrites the repo history. In short: very hard. So I think for lakeFS, a simple "no, too hard" is the right answer... Especially: what happens when you add a file in commit c1 and delete it in c2, and then you delete commit c1? What happens to c2?
@Ariel Shaqed (Scolnicov) Do you still want me to open an issue, so we can have the discussion and keep a trace?
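To make the c1/c2 question concrete, here is a minimal toy sketch in Python. It is not how Git or lakeFS store commits: the `Commit` class, `build_history`, `replay`, and `drop_commit` are all hypothetical names invented for illustration. It only assumes that a commit ID is a hash of its parent ID plus its changes, which is enough to show that dropping c1 gives every later commit a new ID and leaves c2 trying to delete a file that was never added.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    """Toy commit: identified by a hash of its parent id and its changes."""
    parent: Optional[str]  # id of the parent commit, None for the root
    changes: tuple         # tuple of (op, path) pairs; op is "add" or "delete"

    @property
    def id(self) -> str:
        payload = repr((self.parent, self.changes)).encode()
        return hashlib.sha256(payload).hexdigest()[:12]


def build_history(change_sets):
    """Chain change sets into commits, oldest first."""
    history, parent = [], None
    for changes in change_sets:
        commit = Commit(parent, tuple(changes))
        history.append(commit)
        parent = commit.id
    return history


def replay(history):
    """Apply every commit in order; fail if a change no longer applies."""
    files = set()
    for commit in history:
        for op, path in commit.changes:
            if op == "add":
                files.add(path)
            elif path not in files:
                raise ValueError(f"conflict: cannot delete missing file {path}")
            else:
                files.remove(path)
    return files


def drop_commit(change_sets, index):
    """'Delete' a commit by rebuilding the chain without its change set."""
    return build_history(change_sets[:index] + change_sets[index + 1:])


change_sets = [
    [("add", "base.csv")],     # c0
    [("add", "data.csv")],     # c1: adds the file
    [("delete", "data.csv")],  # c2: deletes it again
    [("add", "report.csv")],   # c3
]
original = build_history(change_sets)
rewritten = drop_commit(change_sets, 1)  # drop c1

# c0 keeps its id; c2 and c3 get new ids because their ancestry changed.
print(original[0].id == rewritten[0].id)
print(original[2].id == rewritten[1].id)
```

Replaying `rewritten` raises the conflict error, because c2 now deletes a file that no earlier commit added. Someone (or some policy) has to decide what c2 means without c1.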
e
If it’s a feature you’d like to see in lakeFS and you have a compelling use-case, which might interest the community at large, then opening an issue is a good way to capture the use-case and to have a discussion about the requirements, possibilities, limitations, etc.
a
You're both right! But also note that your workflow using rebase on Git satisfies some but not all possible requirements. First, as you explained, you need to handle conflicts. But that's just metadata for a single branch. The commit is still there, it is still accessible by tag or by other branches or by digest. Now you need to delete all tags pointing at that commit, and also rebase all branches dependent on that commit. Some of these will include merges; there is no porcelain in Git for rebase-across-merges, so I think you lose merge metadata. It gets simpler after that: if you want that data deleted from your storage, that's still not enough. The data goes away only when Git gc's it. Eventually you'll have a GC run on your data, and then it gets deleted... if no commits reference it.
Depending on your use cases you may want each of these steps or not. This is why we implemented GC as we did. We delete only data, never metadata, so no explicit user action is required. This satisfies cases where the important thing is to get the data off your storage. Anything that requires rewriting history will need user interaction that is potentially painful. Few things are impossible, as we're seeing this week. Perhaps the next step is to talk about a rebase flow?
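The "still accessible by tag" point in the Git steps above can be sketched as a toy reachability model in Python. This is an illustration under assumptions, not Git's or lakeFS's GC implementation: the function names, the commit IDs (`c0`, `c1`, `c2`, `c2x`), and the file names are all hypothetical. It assumes only that an object may be collected once no commit reachable from any branch or tag references it.

```python
def reachable_commits(refs, parents):
    """Return every commit id reachable from any ref (branch or tag)."""
    seen, stack = set(), list(refs.values())
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue
        seen.add(commit_id)
        stack.extend(parents.get(commit_id, ()))
    return seen


def collectable_objects(refs, parents, objects_by_commit):
    """Objects that no reachable commit references any more."""
    live_objects = set()
    for commit_id in reachable_commits(refs, parents):
        live_objects |= objects_by_commit[commit_id]
    all_objects = set().union(*objects_by_commit.values())
    return all_objects - live_objects


# History: c0 <- c1 <- c2, where c1 added secret.csv and c2 removed it.
# c2x is the rebased copy of c2, sitting directly on top of c0.
parents = {"c0": [], "c1": ["c0"], "c2": ["c1"], "c2x": ["c0"]}
objects_by_commit = {
    "c0": {"base.csv"},
    "c1": {"base.csv", "secret.csv"},
    "c2": {"base.csv"},
    "c2x": {"base.csv"},
}

# After the rebase, a tag still points at c1, so nothing can be collected.
after_rebase = {"main": "c2x", "v1": "c1"}
print(collectable_objects(after_rebase, parents, objects_by_commit))

# Only once the tag is gone does secret.csv become collectable.
after_tag_delete = {"main": "c2x"}
print(collectable_objects(after_tag_delete, parents, objects_by_commit))
```

In this model, rewriting the branch alone frees nothing. The data becomes collectable only after every ref that can reach c1 has been removed or rewritten.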
h
Rebase starts to be advanced usage of Git... not sure how many data engineers are familiar with it. In our case, we will simply try not to go there, as rebasing and rewriting commits will probably break our use case, where we want to travel back in time and retrieve our dataset state at a given commit. If one starts to change commits around, then things will fall apart for us.
I mean, if it only impacted the commit that we want to delete, then fine, that commit is gone. But a rebase will change all child commits of the deleted one, so it will potentially break a lot of things in our use case.
a
Yeah, every time I try to spec it out I end up with "git rebase at scale but distributed actions should still work and also across merge bases and make sure GC still works". That will not be a usable feature. If I may be cynical: given that I usually try each rebase 3-5 times before it works, I just tried to trick you into spec'ing it out.