lingyu zhang 08/31/2023, 12:18 PM
of a file when comparing local files with remote commits. However, I believe there may be some risks associated with this approach. For instance, if two clients have different system times and both edit the same file, let's say
could be the same on both clients, but the contents are different. Consequently, when I commit the changes on client A and then pull from the remote on client B, the modifications could be lost. So here are my questions:
1. Do you have any evidence or use cases showing that this situation is rare?
2. What is the probability of encountering this risk? Have any tests been conducted?
3. Why do we only use second-level precision for the
and not something more precise like nanoseconds? (Unix time in nanoseconds only overflows an int64 for dates before the year 1678 or after 2262.)
Thanks a lot! :)
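To make the int64 remark concrete, here is a small sketch that converts the smallest and largest int64 nanosecond offsets from the Unix epoch into calendar dates. The helper name `ns_to_datetime` is just for illustration, and Python's `datetime` only keeps microseconds, so the last three digits are dropped.

```python
from datetime import datetime, timedelta, timezone

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ns_to_datetime(ns: int) -> datetime:
    """Convert nanoseconds since the Unix epoch to a UTC datetime.

    datetime has microsecond resolution, so sub-microsecond digits are dropped.
    """
    return EPOCH + timedelta(microseconds=ns // 1000)


earliest = ns_to_datetime(INT64_MIN)
latest = ns_to_datetime(INT64_MAX)

# The representable window is roughly 292 years on either side of 1970.
print("earliest:", earliest.isoformat())
print("latest:  ", latest.isoformat())
```

So nanosecond precision in an int64 would cover any realistic file modification time; range is not what rules it out.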
, it's either returned by OSS or calculated with MD5 on the lakeFS server. When we use
? Will it be used to compare? Thank you so much!
🙂 This is not a novel concept - in fact, utilities such as
use the same approach:
1. When lakectl local pulls a remote file, it also sets the file's local mtime to that of the server.
2. Timestamp resolution varies a lot between filesystem types (NTFS, VFAT, ext3, ...). A resolution of 1 second usually works on all of them.
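A rough sketch of the two points above, not lakectl's actual code: after a pull, the local mtime is set to the server's value, and a later comparison truncates to whole seconds. The function names `apply_remote_mtime` and `changed_locally` and the timestamp value are made up for the example.

```python
import os
import tempfile


def apply_remote_mtime(path: str, remote_mtime_s: int) -> None:
    """After pulling a file, stamp it with the server-side mtime (whole seconds)."""
    os.utime(path, (remote_mtime_s, remote_mtime_s))


def changed_locally(path: str, remote_mtime_s: int) -> bool:
    """Treat the file as modified if its mtime, truncated to seconds, differs."""
    return int(os.stat(path).st_mtime) != remote_mtime_s


# Demo with a throwaway file and a hypothetical server timestamp.
remote_mtime = 1693484280
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"pulled content")
    demo_path = f.name

apply_remote_mtime(demo_path, remote_mtime)
unchanged_after_pull = not changed_locally(demo_path, remote_mtime)

# A later local edit moves the mtime forward, which the comparison picks up.
os.utime(demo_path, (remote_mtime + 5, remote_mtime + 5))
detected_edit = changed_locally(demo_path, remote_mtime)

os.remove(demo_path)
print(unchanged_after_pull, detected_edit)
```

An mtime-only check like this cannot see an edit that leaves the timestamp identical to the second, which is the case raised in the question.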
WEI HOPE3n 08/31/2023, 3:39 PM
Niro 09/01/2023, 12:32 AM
lingyu zhang 09/01/2023, 2:22 AM
), this is not required as you're not expected to work with the entire billion-object repository on local storage (you're much more likely to be bottlenecked by something else before hitting lakeFS' limits...)
2. As explained, the ETag is used to calculate identity, which serves 2 purposes:
   a. To ensure data consistency, by comparing the ETag with the one provided by the underlying filesystem. This ensures lakeFS commits are tamper-proof.
   b. To enable efficient diffing and merging, by constructing a merkle-tree-like structure on top of the records. This is what makes change calculations relative to the size of the change (as opposed to DVC, for example, where it's linear in the total size of the data).
Hope this helps!
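To illustrate point (b), here is a toy sketch of hash-tree diffing. It is not lakeFS's actual data structure. Each directory's hash is derived from its children's hashes, so a diff can skip any subtree whose hash matches. The nested-dict layout, the `build` and `diff` helpers, and the ETag values are all invented for the example.

```python
import hashlib


def _h(data: str) -> str:
    return hashlib.sha256(data.encode()).hexdigest()


def build(tree):
    """Turn a nested dict (leaves are ETag strings) into (hash, children) nodes."""
    if isinstance(tree, str):
        return (_h(tree), None)  # leaf: identity comes from the ETag
    children = {name: build(sub) for name, sub in sorted(tree.items())}
    combined = "".join(f"{name}:{node[0]};" for name, node in children.items())
    return (_h(combined), children)


def diff(a, b, prefix=""):
    """List paths that differ, descending only into subtrees whose hashes differ."""
    if a is not None and b is not None and a[0] == b[0]:
        return []  # identical subtree: skipped without visiting its contents
    if a is None or b is None or a[1] is None or b[1] is None:
        return [prefix or "/"]  # added, removed, or changed entry
    changed = []
    for name in sorted(set(a[1]) | set(b[1])):
        changed += diff(a[1].get(name), b[1].get(name), f"{prefix}/{name}")
    return changed


old = {"data": {"a.csv": "etag1", "b.csv": "etag2"}, "models": {"m.bin": "etag3"}}
new = {"data": {"a.csv": "etag1", "b.csv": "etag9"}, "models": {"m.bin": "etag3"}}

changed_paths = diff(build(old), build(new))
print(changed_paths)  # ['/data/b.csv']
```

Here `models` is never descended into because its hash is unchanged, which is why the work tracks the size of the change and not the size of the repository.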