#help

HT

07/09/2023, 8:26 AM
Hi, does the lakeFS S3 interface provide some sort of checksum for each object?

Ariel Shaqed (Scolnicov)

07/09/2023, 8:39 AM
Hi @HT, Yes, lakeFS provides ETags, which are for most purposes checksums of the object. ETags are part of the S3 API.
You can read a lot more about ETags on AWS S3 over here.

HT

07/09/2023, 8:47 AM
So lakeFS follows that specification, which, from what I understand, means the presence and content of the ETag depend on how the file was uploaded? For example, a big file that triggers a multipart upload will not have the checksum in the ETag?

Ariel Shaqed (Scolnicov)

07/09/2023, 8:50 AM
Yes. Unfortunately, multipart pretty much means you cannot use the ETag as a checksum of the whole object.
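To make the multipart point concrete, here is a minimal sketch of how the ETag is commonly observed to be formed on AWS S3: a plain MD5 hex digest for a single-part upload, and an MD5 of the concatenated per-part MD5 digests plus a `-<part count>` suffix for a multipart upload. This is an assumption based on observed AWS behavior, not a documented guarantee, and lakeFS or encrypted buckets may behave differently. The function name `expected_etag` and the `part_size` parameter are made up for illustration.

```python
import hashlib
from typing import Optional


def expected_etag(data: bytes, part_size: Optional[int] = None) -> str:
    """Return the ETag that AWS S3 is commonly observed to assign to `data`.

    ASSUMPTION: this mirrors observed behavior only; it is not a documented contract.
    """
    if part_size is None:
        # Single-part upload: the ETag is the hex MD5 of the whole object.
        return hashlib.md5(data).hexdigest()

    # Multipart upload: MD5 each part, then MD5 the concatenated raw digests.
    part_digests = [
        hashlib.md5(data[offset:offset + part_size]).digest()
        for offset in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

The same bytes produce different ETags depending on the part size, which is why the ETag alone cannot tell you whether a multipart object matches a local file unless you also know how it was split.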

HT

07/09/2023, 9:09 AM
I wonder how rclone manages to not re-upload the same file when using the --checksum flag...

Ariel Shaqed (Scolnicov)

07/09/2023, 9:10 AM
I figured it out a while ago. Basically, they put another header on the object, and then S3 (both AWS and lakeFS) gives it back on a HEAD request. It still means rclone has to scan a huge file before copying it.
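A rough sketch of that idea, under stated assumptions: the client hashes the local file, stores the digest as user metadata at upload time, and later compares it with whatever metadata a HEAD request returns. The metadata key `md5chksum` with a base64-encoded MD5 value is my understanding of what rclone uses, so treat it as an assumption. The helper names `local_md5_b64` and `needs_upload` are hypothetical, and `remote_metadata` stands in for the user-metadata dict a client library would return (for example the `Metadata` field of a boto3 `head_object` response).

```python
import base64
import hashlib

# ASSUMPTION: user-metadata key holding a base64-encoded MD5 of the object.
MD5_META_KEY = "md5chksum"


def local_md5_b64(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Stream the file in chunks and return its MD5 digest, base64-encoded."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return base64.b64encode(md5.digest()).decode("ascii")


def needs_upload(path: str, remote_metadata: dict) -> bool:
    """Return True when the stored checksum is missing or differs from the local one."""
    stored = remote_metadata.get(MD5_META_KEY)
    return stored is None or stored != local_md5_b64(path)
```

Note that `local_md5_b64` still reads every byte of the local file, which is the scanning cost mentioned above.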

HT

07/09/2023, 9:22 AM
Oh, so lakeFS will store that header, and I can retrieve it later using something like fsspec.
Re: scanning huge files, I believe not much can be done... Either you use a checksum or the modification time, which can be quite unreliable, depending on the use case...

Ariel Shaqed (Scolnicov)

07/09/2023, 9:34 AM
Yup. The fact that it's understandable doesn't make it fun.
But: sometimes the ETag is good enough!

HT

07/09/2023, 9:36 AM
Thanks for the help 😊