# help
j
Hi, I have a theoretical question. One of the things that worries me about using this technology is the potential for lakeFS metadata corruption. Not knowing what your underlying S3 write strategy is, I'm wondering if this is something to be concerned about. There are things that can affect the ability to reach S3: the service has an outage, or you're suddenly rate limited by S3 due to cardinality issues. Basically, anything that can cause S3 API failures has the potential to cause commits or merges to fail or be inconsistent. What happens if we're in the middle of a commit or merge, lakeFS is writing to S3, and partway through those writes S3 starts exhibiting failures?
o
Hey @Joe M - good questions! I'll try to cover them here and will also open an issue to make sure this is properly documented in the lakeFS docs. Here are a few important safety guarantees that lakeFS provides:
1. Commit and merge operations are atomic, meaning they cannot partially complete. A branch can only move from one complete commit to another.
2. All metadata that describes what a commit contains is stored on the object store itself (i.e. it gets to enjoy S3's 11 9's of durability).
3. While the lakeFS server doesn't typically write data objects to the underlying object store (this is relegated to lakeFS clients that leverage pre-signed URLs), the server does instruct clients where to write objects within its managed storage namespace. By doing so, it intentionally spreads data files across many S3 "partitions" to reduce the likelihood of throttling by the object store.
I'll try and walk through what happens if a commit fails mid-way, but first let's cover the execution flow of a commit in lakeFS:
1. The client issues a `commit` API call to the server, with a repository, branch ID, message and optional metadata.
2. The server takes note of the commit ID that the branch is currently pointing to.
3. The server creates a new staging token, sealing the current one for the branch, to ensure new writes are excluded. From here on out, we have a set of changes to commit that are guaranteed to be immutable.
4. All sealed staging tokens are then serialized to the object store, making up a tree of RocksDB-compatible SSTables.
5. A commit record is written, pointing to the root of that tree on the object store.
6. Once that's done (and this is the part that ensures atomicity), the branch pointer is modified to point to the commit we created. This is an atomic compare-and-swap operation: the new commit takes effect only if the current commit ID is still the one observed in step 2.
Failing at any point prior to step 6 means we may have created orphan metadata objects on the object store, but reading and writing from the branch always starts by de-referencing the current commit the branch is pointing to, so there are no other side effects. Failing at step 6 could happen for two reasons:
1. A generic error writing to the lakeFS backing KV store, in which case the server retries the KV write operation or gives up - in this case the commit operation fails and the branch still points to the existing commit.
2. A compare-and-swap predicate failure: this means someone "beat us to it" - another commit/merge successfully finished before ours did - in which case we restart the flow at step 2.
This ensures atomicity and also that parent-child relationships are properly maintained. Just like in Git, each commit points to its parents. As the saying goes, "If I had more time, I would have written a shorter Slack message" - sorry if this is a little long-winded, but I'm happy to elaborate or answer any follow-ups!
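To make step 6 a bit more concrete, here's a minimal Python sketch of the compare-and-swap commit flow described above. This is not the actual lakeFS implementation - the in-memory stores and the names (`InMemoryRefStore`, `InMemoryObjectStore`, `commit`) are purely illustrative stand-ins for the KV store and S3:
```python
import threading
import uuid


class CommitConflict(Exception):
    """The branch moved while we were committing (CAS predicate failed)."""


class InMemoryRefStore:
    """Stand-in for the lakeFS backing KV store that holds the branch pointer."""
    def __init__(self, initial_commit):
        self._lock = threading.Lock()
        self.branch_commit = initial_commit

    def compare_and_swap(self, expected, new):
        # Step 6: the branch pointer only moves if it still equals `expected`.
        with self._lock:
            if self.branch_commit != expected:
                raise CommitConflict(
                    f"branch is at {self.branch_commit}, expected {expected}")
            self.branch_commit = new


class InMemoryObjectStore:
    """Stand-in for S3: metadata objects are written here before the CAS."""
    def __init__(self):
        self.objects = {}

    def put(self, key, value):
        self.objects[key] = value
        return key


def commit(refs, store, changes, message, max_retries=3):
    for _ in range(max_retries):
        base = refs.branch_commit                         # step 2: note current commit
        # steps 3-4: the sealed changes are serialized to the object store
        meta_root = store.put(f"_lakefs/meta/{uuid.uuid4()}", changes)
        # step 5: a commit record points at the metadata root and its parent
        commit_id = str(uuid.uuid4())
        store.put(f"_lakefs/commits/{commit_id}",
                  {"parent": base, "meta_root": meta_root, "message": message})
        try:
            refs.compare_and_swap(expected=base, new=commit_id)  # step 6
            return commit_id
        except CommitConflict:
            continue  # another commit/merge landed first: restart from step 2
    raise RuntimeError("gave up; the branch still points at its previous commit")


if __name__ == "__main__":
    refs = InMemoryRefStore(initial_commit="c0")
    store = InMemoryObjectStore()
    print(commit(refs, store, {"path/to/file": "v1"}, "first commit"))
```
Note that if anything fails before the compare-and-swap, the only side effect is unreferenced metadata objects in the object store; the branch pointer never moves until the CAS succeeds.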
👍 1
j
OK great, thanks for that. It was not something that could be answered in one sentence, and I appreciate the step-by-step walkthrough. So the bottom line is: if something does fail midway for whatever reason, the orphaned metadata objects are just that, orphaned. The server and metadata store are still consistent as of the last commit hash before the one that failed, and so the integrity of the system is intact.
👌 1
o
Exactly that! This was a major design goal for the project.
👍 1