lingyu zhang (08/31/2023, 12:18 PM)
It looks like `lakectl local` uses a file's `sizeBytes` and `modifiedTime` when comparing local files with remote commits. However, I believe there may be some risks associated with this approach. For instance, if two clients have different system times and both edit the same file, say `a.txt`, the `sizeBytes` and `modifiedTime` could be the same on both clients while the contents differ. Consequently, when I commit the changes on client A and then pull from the remote on client B, the modifications on B could be lost. So here are my questions:
1. Do you have any evidence or use cases showing that this situation is rare?
2. What is the probability of encountering this risk? Have any tests been conducted?
3. Why do we only use second-level precision for `modifiedTime` and not something more precise like nanoseconds? (Unix time in nanoseconds fits in an int64 for all dates between roughly 1678 and 2262.)
Thanks a lot! :)
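To make the failure mode above concrete, here is a minimal Go sketch of the kind of size-plus-mtime heuristic the question describes. `RemoteStat` and `looksUnchanged` are illustrative names, not lakectl's actual internals; the point is only that two different edits which happen to match on both fields would be skipped as "unchanged":

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// RemoteStat is a hypothetical stand-in for the metadata recorded in a
// commit (field names are illustrative, not the actual lakeFS API).
type RemoteStat struct {
	SizeBytes    int64
	ModifiedTime time.Time // second precision
}

// looksUnchanged mirrors the heuristic described above: a file is
// assumed unchanged if its size and its mtime (truncated to whole
// seconds) match the remote record. Two different files that happen
// to share both values would be treated as identical — exactly the
// risk raised in the question.
func looksUnchanged(path string, remote RemoteStat) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	sameSize := info.Size() == remote.SizeBytes
	sameTime := info.ModTime().Truncate(time.Second).
		Equal(remote.ModifiedTime.Truncate(time.Second))
	return sameSize && sameTime, nil
}

func main() {
	remote := RemoteStat{SizeBytes: 5, ModifiedTime: time.Unix(1693000000, 0)}
	unchanged, err := looksUnchanged("a.txt", remote)
	if err != nil {
		fmt.Println("stat failed:", err)
		return
	}
	fmt.Println("skipped as unchanged:", unchanged)
}
```

Note that because the mtime is compared at one-second granularity, a nanosecond-precision clock would not by itself eliminate the collision; it would only narrow the window in which it can occur.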
lingyu zhang
One more question: the `etag` in `ObjectStats` is either returned by the OSS or calculated via MD5 on the lakeFS server. When do we use the `etag`? Will it be used for comparison?
Thank you so much!
Oz Katz
Regarding `lakectl local` 🙂 this is not a novel concept - in fact, utilities such as `aws sync` use the same approach:
1. When `lakectl local` pulls a remote file locally, it also sets the file's local mtime to that of the server (see the sketch below).
2. Timestamp resolution varies a lot between filesystem types (NTFS, vFAT, ext3, ...); 1-second precision usually works on all of them.
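To illustrate point 1, here is a hedged Go sketch of that mtime-stamping trick: after writing the downloaded bytes, the file is stamped with the server's timestamp, so later comparisons do not depend on the local clock. `pullFile` and its parameters are illustrative placeholders, not lakectl's actual code:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// pullFile writes the downloaded bytes and then stamps the local file
// with the *server's* mtime, so subsequent size+mtime comparisons are
// made against the server clock rather than the local one.
func pullFile(path string, data []byte, serverMtime time.Time) error {
	if err := os.WriteFile(path, data, 0o644); err != nil {
		return err
	}
	// Truncate to whole seconds: the coarsest resolution that is
	// portable across filesystem types (ext3, vFAT, NTFS, ...).
	t := serverMtime.Truncate(time.Second)
	return os.Chtimes(path, t, t) // sets atime and mtime
}

func main() {
	serverMtime := time.Unix(1693480680, 0).UTC()
	if err := pullFile("a.txt", []byte("hello"), serverMtime); err != nil {
		fmt.Println("pull failed:", err)
		return
	}
	info, _ := os.Stat("a.txt")
	fmt.Println("local mtime now:", info.ModTime().UTC())
}
```

Because the server is the single source of truth for the timestamp, the skewed-clocks scenario from the question largely disappears: a locally edited file gets a locally generated mtime, which will almost always differ from the server-stamped one.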
Oz Katz
1. For the local use case (`lakectl local`), this is not required, as you're not expected to work with an entire billion-object repository on local storage (you're much more likely to be bottlenecked by something else before hitting lakeFS' limits...)
2. As explained, the ETag is used to calculate an object's identity, which serves 2 purposes:
a. ensure data consistency - by comparing the ETag with the one provided by the underlying filesystem. This is what makes lakeFS commits tamper-proof
b. to enable efficient diffing and merging, by constructing a merkle-tree-like structure on top of the records. This is what makes the cost of calculating changes proportional to the size of the change (as opposed to DVC, for example, where it's linear in the total size of the data)
Hope this helps!
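As an illustration of purpose (a), here is a small Go sketch that recomputes the MD5 of the local bytes and compares it against a committed ETag. The expected value below is just md5("hello") for demonstration; note also that real S3 ETags for multipart uploads are not a plain MD5:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// verifyETag recomputes the MD5 of the file's content and compares it
// with the ETag recorded in the commit. A mismatch means the content
// changed even if size and mtime did not.
func verifyETag(path, expectedETag string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return false, err
	}
	return hex.EncodeToString(h.Sum(nil)) == expectedETag, nil
}

func main() {
	ok, err := verifyETag("a.txt", "5d41402abc4b2a76b9719d911017c592") // md5("hello")
	if err != nil {
		fmt.Println("verify failed:", err)
		return
	}
	fmt.Println("content matches committed ETag:", ok)
}
```

Purpose (b) builds on the same identities: because they feed a merkle-tree-like structure, a diff or merge only needs to walk the subtrees whose hashes changed, which is why the cost tracks the size of the change rather than the size of the repository.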