# help
lingyu zhang
Hello everyone! Currently in lakeFS, we just use the `sizeBytes` and `modifiedTime` of a file when comparing local files with remote commits. However, I believe there may be some risks with this approach. For instance, if two clients have different system times and both edit the same file, say `a.txt`, the `sizeBytes` and `modifiedTime` could be the same on both clients even though the contents differ. Consequently, when I commit the changes on client A and then pull from the remote on client B, the modifications on B could be lost (there is a small sketch of this scenario at the end of this message). So here are my questions:
1. Do you have any evidence or user cases showing this situation is rare?
2. What is the probability of hitting this risk? Have any tests been conducted?
3. Why do we only use second-level precision for the `modifiedTime` and not something more precise like nanoseconds? (Unix time in nanoseconds overflows an int64 only for dates before the year 1678 or after 2262.)
Thanks a lot! :)
By the way, I noticed the `etag` in `ObjectStats`; it's either returned by OSS or calculated with MD5 on the lakeFS server. When do we use the `etag`? Will it be used for comparison? Thank you so much!
Barak Amar
Hi @lingyu zhang, I would not recommend using timestamps to detect changes. You can use `lakectl local` to help you sync local data changes with the remote. It uses the object metadata to store checksum information, not timestamps. The command reference can be found here: https://docs.lakefs.io/reference/cli.html#lakectl-local There is also a great blog by our @Niro that explains the usage: https://lakefs.io/blog/scalable-data-version-control-getting-the-best-of-both-worlds-with-lakefs/
About the etag: its value has a different meaning depending on how the object was uploaded. For example, when the data is uploaded using the S3 gateway's multipart upload, the final etag will be different from uploading the same data with a single put object.
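Roughly, the difference looks like this (an S3-style sketch for illustration only, not the lakeFS server code):

```go
// Sketch of why the same bytes can get different ETags: a single PUT
// typically yields the plain MD5 of the body, while a multipart upload
// yields the MD5 of the concatenated part MD5s plus a part count.
package main

import (
	"crypto/md5"
	"fmt"
)

// singlePutETag mimics the ETag of a simple PUT: hex MD5 of the body.
func singlePutETag(body []byte) string {
	sum := md5.Sum(body)
	return fmt.Sprintf("%x", sum)
}

// multipartETag mimics an S3-style multipart ETag: MD5 over the binary
// MD5s of each part, suffixed with the number of parts.
func multipartETag(body []byte, partSize int) string {
	var concat []byte
	parts := 0
	for off := 0; off < len(body); off += partSize {
		end := off + partSize
		if end > len(body) {
			end = len(body)
		}
		sum := md5.Sum(body[off:end])
		concat = append(concat, sum[:]...)
		parts++
	}
	final := md5.Sum(concat)
	return fmt.Sprintf("%x-%d", final, parts)
}

func main() {
	data := []byte("the same bytes, uploaded two different ways")
	fmt.Println("single PUT:", singlePutETag(data))
	fmt.Println("multipart: ", multipartETag(data, 16))
}
```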
Oz Katz
I believe @lingyu zhang is referring to the implementation of `lakectl local` 🙂 This is not a novel concept; in fact, utilities such as `aws sync` use the same approach:
1. When `lakectl local` pulls a remote file locally, it also sets its local mtime to that of the server.
2. Timestamp resolution varies a lot between filesystem types (ntfs, vfat, ext3, ...). 1 second usually works on all of them.
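To make point 1 concrete, a simplified Go sketch of the idea (not the actual `lakectl local` code; the helper names are made up):

```go
// Simplified sketch of the sync idea described above: after downloading,
// stamp the local file with the server's mtime, then later treat
// "same size + same mtime (to the second)" as "unchanged".
package main

import (
	"fmt"
	"os"
	"time"
)

// afterDownload stamps the local file with the remote object's mtime so
// that later comparisons against server metadata are meaningful.
func afterDownload(path string, remoteMtime time.Time) error {
	return os.Chtimes(path, remoteMtime, remoteMtime)
}

// unchanged compares size plus mtime truncated to whole seconds, a
// resolution that works across filesystem types (ntfs, vfat, ext3, ...).
func unchanged(path string, remoteSize int64, remoteMtime time.Time) (bool, error) {
	st, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	return st.Size() == remoteSize &&
		st.ModTime().Truncate(time.Second).Equal(remoteMtime.Truncate(time.Second)), nil
}

func main() {
	remote := time.Unix(1700000000, 0)
	if err := afterDownload("a.txt", remote); err != nil {
		fmt.Println("chtimes failed:", err)
		return
	}
	same, _ := unchanged("a.txt", 10, remote)
	fmt.Println("unchanged:", same)
}
```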
Barak Amar
Thanks @Oz Katz, my bad, it doesn't use the object metadata; it syncs the object timestamp from/to local and remote.
WEI HOPE3n
Thanks @Barak Amar @Oz Katz! I also have some questions regarding this topic.
1. As far as I know, Iterative's DVC uses a checksum-based solution to compare changes. Why doesn't lakeFS use the same checksum approach as DVC for change comparison? Is it solely because of performance concerns with MD5, or are there other factors involved?
2. What is the purpose of including the `etag` value in the `ValueRecord` of lakeFS's range? Is it primarily to ensure data consistency, or does it serve another purpose? If it is not used for data consistency, how else does lakeFS maintain data consistency?
Thanks a lot!
Niro
@Barak Amar credit where credit is due: the `lakectl local` blog belongs to @Oz Katz 😀
lingyu zhang
You’ve been very helpful! Thanks very much🫶
Oz Katz
@WEI HOPE3n just to make sure we address your questions:
1. The ETag / checksum is used on the lakeFS server to calculate an object's identity. Identity is used for efficient diffs on the server (for example, if you use the diff/merge APIs available in lakeFS). This is how lakeFS scales to billions of objects. Typically, when working locally (as with `lakectl local`), this is not required, as you're not expected to work with the entire billion-object repository on local storage (you're much more likely to be bottlenecked by something else before hitting lakeFS' limits...).
2. As explained, ETag is used to calculate identity, which serves two purposes:
   a. To ensure data consistency, by comparing the etag with the one provided by the underlying filesystem. This ensures lakeFS commits are tamper-proof.
   b. To enable efficient diffing and merging by constructing a merkle-tree-like structure on top of the records. This is what makes change calculations relative to the size of the change (as opposed to DVC, for example, where it's linear in the total size of the data).
Hope this helps!
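A conceptual Go sketch of point 2b (deliberately simplified, not lakeFS's actual graveler code): ranges that share an identity are skipped wholesale, so only changed ranges are walked and the diff cost tracks the size of the change.

```go
// Conceptual sketch: each range of records carries an identity derived
// from its contents, so a diff can skip whole ranges whose identities
// match and only descend into ranges that differ.
package main

import "fmt"

type Record struct {
	Key      string
	Identity string // e.g. derived from the object's ETag/checksum
}

type Range struct {
	Identity string // identity of the whole range, over its records
	Records  []Record
}

// diff returns keys whose identity differs between two versions,
// comparing ranges position by position for simplicity.
func diff(left, right []Range) []string {
	var changed []string
	for i := range left {
		if i < len(right) && left[i].Identity == right[i].Identity {
			continue // identical range: skipped without touching its records
		}
		rightRecs := map[string]string{}
		if i < len(right) {
			for _, r := range right[i].Records {
				rightRecs[r.Key] = r.Identity
			}
		}
		for _, l := range left[i].Records {
			if rightRecs[l.Key] != l.Identity {
				changed = append(changed, l.Key)
			}
		}
	}
	return changed
}

func main() {
	a := []Range{{Identity: "r1", Records: []Record{{"a.txt", "x"}}}, {Identity: "r2", Records: []Record{{"b.txt", "y"}}}}
	b := []Range{{Identity: "r1", Records: []Record{{"a.txt", "x"}}}, {Identity: "r2'", Records: []Record{{"b.txt", "y2"}}}}
	fmt.Println("changed:", diff(a, b)) // only the second range is examined
}
```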