Title
#data-discussion
h

Hugh Nolan

08/29/2022, 4:42 PM
Hi, I'm looking at options for "rewriting history" for branches/tags/commits. Context - we have tabular files containing data from multiple users together. We're using LakeFS to version these files, then they get merged to act as complementary data/metadata for bigger less easily structured data. We're running on top of S3 storage. I'm looking to use tags/commits on data as immutable "data release" points in time, so we can do analysis on a specific version and revert if necessary for data traceability & lineage reasons. HOWEVER If a user wants to withdraw their data, or we need to remove a data column/feature/etc for some other reason (accidental commit, change of laws, data expiry, whatever), the data needs to be removed from history. I'm looking for thoughts on practicalities. Most straightforward <to me, naive user> working option: • iterate all commits, find all historical versions of a file that needs updated (stat object, look at physical address), then load/redact/save all of these back to S3 (sidestepping LakeFS). This seems to work in a test, but I'm not sure is there checksum/data integrity tests/etc that will break, or are in a roadmap to break. This is a useful flow as it means all the data loading/tagging/history works as it is, data appears immutable and historical analysis can be recreated (so long as redactions don't affect a specific analysis). Commit IDs remain unchanged so I don't have to mess around with tags, branches, etc. Alternatively, • Iterate all tags ("immutable" versions), add an extra commit with the updated/redacted file(s), delete/recreate tag, then delete old file from physical storage. This is "more legit" as the change is tracked in LakeFS history, but branch histories get all messed up - can't revert or checkout old commits, as the file will be missing in history. Any thoughts or ideas, things I've missed, other architectures/setups that might support this better?