I'm looking at options for "rewriting history" for branches/tags/commits.

Context: we have tabular files containing data from multiple users together. We're versioning these files with lakeFS; they then get merged to act as complementary data/metadata for bigger, less easily structured data. We're running on top of S3 storage.
I'm looking to use tags/commits on data as immutable "data release" points in time, so we can do analysis on a specific version and revert if necessary, for data traceability & lineage reasons.

HOWEVER: if a user wants to withdraw their data, or we need to remove a data column/feature/etc. for some other reason (accidental commit, change of laws, data expiry, whatever), the data needs to be removed from history. I'm looking for thoughts on the practicalities.

The most straightforward <to me, a naive user> working option:
• Iterate all commits, find all historical versions of a file that needs updating (stat the object, look at its physical address), then load/redact/save each of these back to S3 (sidestepping lakeFS). This seems to work in a test, but I'm not sure whether there are checksum/data-integrity checks that this will break, or that are on a roadmap to break it.
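A minimal sketch of that sidestep flow. The lakeFS and S3 calls are injected as placeholders since the exact client API depends on your SDK version; `redact_user` assumes a hypothetical `user_id` column in the tabular files:

```python
import csv
import io

def redact_user(csv_text: str, user_id: str) -> str:
    """Drop every row belonging to user_id; keep the header and all other rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row["user_id"] != user_id:
            writer.writerow(row)
    return out.getvalue()

def redact_history(commit_ids, path, user_id, stat_object, s3_get, s3_put):
    """Walk commits, resolve each historical version's physical address,
    and rewrite the object on S3 in place, sidestepping lakeFS.
    stat_object / s3_get / s3_put are injected callables: in production
    they would wrap lakeFS's stat-object call (which exposes the physical
    address) and boto3 get/put; injecting them keeps the sketch testable."""
    seen = set()
    for commit_id in commit_ids:
        addr = stat_object(commit_id, path)   # physical S3 address, or None if absent
        if addr is None or addr in seen:      # the same object may back many commits
            continue
        seen.add(addr)
        s3_put(addr, redact_user(s3_get(addr), user_id))
```

Deduplicating by physical address matters here: unchanged files share one underlying object across many commits, so each object should be rewritten once.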
This is a useful flow because all the data loading/tagging/history works as-is: data appears immutable and historical analysis can be recreated (so long as redactions don't affect a specific analysis). Commit IDs remain unchanged, so I don't have to mess around with tags, branches, etc.

Alternatively,
• Iterate all tags ("immutable" versions), add an extra commit with the updated/redacted file(s), delete/recreate the tag, then delete the old file from physical storage.
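The per-tag sequence could be driven with lakectl; here's a sketch that just builds the command list (repo/branch/tag/path names are placeholders, and the flags should be checked against your lakectl version before running anything):

```python
def retag_commands(repo: str, work_branch: str, tag: str, path: str, redacted_local: str):
    """Ordered lakectl invocations to repoint one 'immutable' tag at a
    commit containing the redacted file. All names are placeholders."""
    base = f"lakefs://{repo}"
    return [
        # branch off the tagged commit so the new commit descends from it
        f"lakectl branch create {base}/{work_branch} --source {base}/{tag}",
        # overwrite the offending file with the redacted version and commit
        f"lakectl fs upload --source {redacted_local} {base}/{work_branch}/{path}",
        f"lakectl commit {base}/{work_branch} -m 'redact {path}'",
        # recreate the tag on the new commit
        f"lakectl tag delete {base}/{tag}",
        f"lakectl tag create {base}/{tag} {base}/{work_branch}",
    ]
```

After the retag you would still delete the old object from the underlying S3 storage, per the original plan.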
This is "more legit" as the change is tracked in LakeFS history, but branch histories get all messed up - can't revert or checkout old commits, as the file will be missing in history.Any thoughts or ideas, things I've missed, other architectures/setups that might support this better?
4 weeks ago
Hi @Hugh Nolan, when thinking about this challenge I had the same two options in mind. For the first option, which I prefer, you can use our commit log to find the commits in which a file changed. The reason I prefer the first option is that it preserves reproducibility to the best of your data-governance ability: what you must remove is removed, but the other aspects of the version are kept.
Is that approximated reproducibility valuable in your opinion?
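For reference, the commit-log lookup can be sketched as a pure function; in practice the changed-path sets would come from diffing each commit against its parent via the API (or from lakectl's log filtering, where your version supports it):

```python
def commits_touching(log, path):
    """log: iterable of (commit_id, changed_paths) pairs, newest first,
    where changed_paths is the set of object paths that commit modified
    (obtained by diffing the commit against its parent).
    Returns the commit IDs whose version of `path` needs redacting."""
    return [commit_id for commit_id, changed in log if path in changed]
```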
1 month ago
Using Spark with S3, there is a triple (or double, depending on the output file committer algorithm version) write for each piece of output data, which is painful on S3. I guess lakeFS could solve this kind of problem. Is that already the case?
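For context, the extra writes come from the rename-based output commit; on plain S3 they are usually reduced with committer settings along these lines (a Spark conf sketch, not lakeFS-specific):

```properties
# Hadoop FileOutputCommitter v2: tasks commit output directly to the
# destination, skipping the job-level rename pass (one fewer copy,
# at the cost of weaker failure semantics)
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

# Alternatively, the S3A "magic" committer uses multipart uploads
# instead of renames entirely
spark.hadoop.fs.s3a.committer.name=magic
spark.hadoop.fs.s3a.committer.magic.enabled=true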
2 weeks ago
Hi <!here>, I am researching managed Hadoop offerings. I'm curious, from your experience, what are the pain points of EMR and S3 solutions? This is what I have learned so far.

EMR pain:
Upgrading Hadoop/Spark/Hive/Presto etc. requires a new EMR release, which often upgrades more than what you need – this can trigger a full migration. Some Docker images are supported, but not all of them. EMR bootstrap actions seem like an interesting solution for installing system dependencies.
1. Installing system dependencies doesn't always work as expected with bootstrap.
2. Spot – sometimes AWS will kill all your spot instances and cause clusters and jobs to fail (this happens infrequently enough that it's hard to justify paying for on-demand instances, but frequently enough that it causes a significant and draining level of operational toil).
3. Configuring Ranger authorization requires installing it outside the EMR cluster and creates conflicts with AWS's internal record-reader authorization. Conflicting security rules – these can be resolved but require more effort.

S3 pain:
1. S3 consistency used to be a pain; it was solved with strong read-after-write consistency, which made EMRFS obsolete. File consistency is no longer a pain, but the lack of atomic directory operations is – copying a large set of Parquet files (10k) with multiple readers and writers at the same time can be a challenge when you need consistency for the directory as a whole, not file by file.
2. AWS will often return HTTP 503 Slow Down errors for request rates that are much, much lower than their advertised limits.
3. You are supposed to code clients with backoffs/retries adequate to absorb arbitrary levels of HTTP 503 Slow Down until S3 finishes scaling (which is entirely unobservable and could take minutes, hours, or days).
4. Many readers on a table with 1K Parquet files plus one updater – the update is not an atomic action, so readers might read the wrong data.
5. Storage optimization – no performance optimization at the table level.
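On the 503 point above: the usual absorber is full-jitter exponential backoff (boto3 can also handle this via its retry configuration, e.g. the adaptive retry mode); a minimal sketch:

```python
import random

def backoff_delays(attempts: int, base: float = 0.1, cap: float = 20.0):
    """'Full jitter' exponential backoff: for retry i, wait a random
    amount in [0, min(cap, base * 2**i)] seconds before re-sending the
    request that got 503 Slow Down."""
    return [random.uniform(0.0, min(cap, base * (2 ** i))) for i in range(attempts)]
```

In client code each delay would be slept between retries; the cap keeps worst-case waits bounded while S3 scales behind the scenes.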