# help
h
Hi, does lakefs S3 interface provide some sort of checksum for each object?
a
Hi @HT, Yes, lakeFS provides ETags, which are for most purposes checksums of the object. ETags are part of the S3 API.
You can read a lot more about ETags on AWS S3 over here.
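For example, here's a minimal sketch of reading an object's ETag through the lakeFS S3 gateway with boto3 (the endpoint, repository, branch, and key below are placeholders):
```python
# Fetch an object's ETag from the lakeFS S3 gateway with a plain HEAD request.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # your lakeFS endpoint
    aws_access_key_id="AKIA...",                # lakeFS access key
    aws_secret_access_key="...",                # lakeFS secret key
)

# In the S3 gateway the "bucket" is the repository, and the key is
# prefixed with the branch name.
resp = s3.head_object(Bucket="my-repo", Key="main/path/to/object.parquet")
print(resp["ETag"])  # e.g. '"d41d8cd98f00b204e9800998ecf8427e"'
```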
h
So lakeFS follows that specification? From what I understand, the presence and content of the ETag will depend on how the file was uploaded. For example, a big file that triggers a multipart upload will not have the checksum in the ETag?
a
Yes. Unfortunately, multipart pretty much means you cannot use the ETag as a checksum.
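To illustrate why: on AWS, a multipart ETag is the MD5 of the concatenated per-part digests plus a part count, not an MD5 of the whole file. The sketch below assumes AWS semantics and a fixed part size; how lakeFS reports multipart ETags may differ.
```python
# Reproduce the AWS-style multipart ETag: MD5 of the concatenated binary
# MD5 digests of each part, suffixed with the part count.
import hashlib

def multipart_etag(path, part_size=8 * 1024 * 1024):
    part_digests = []
    with open(path, "rb") as f:
        while chunk := f.read(part_size):
            part_digests.append(hashlib.md5(chunk).digest())
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"

# A single-part upload, by contrast, gets a plain MD5 as its ETag:
# hashlib.md5(open(path, "rb").read()).hexdigest()
```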
h
I wonder how rclone manages not to re-upload the same file when using the --checksum flag...
a
I figured it out a while ago. Basically, they put another header on the object, and then S3 (both AWS and lakeFS) gives it back on a HEAD request. It still means rclone has to scan the whole huge file before copying it.
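Roughly like this, as a sketch of the idea rather than rclone's exact implementation (the metadata key below is illustrative, not necessarily the one rclone writes):
```python
# Store your own checksum as user-defined metadata at upload time,
# then read it back with a HEAD request.
import hashlib
import boto3

s3 = boto3.client("s3", endpoint_url="https://lakefs.example.com")

with open("big.bin", "rb") as f:   # read into memory only for brevity
    body = f.read()
digest = hashlib.md5(body).hexdigest()  # the full-file scan rclone also has to do

s3.put_object(
    Bucket="my-repo",
    Key="main/big.bin",
    Body=body,
    Metadata={"md5chksum": digest},  # comes back as x-amz-meta-md5chksum
)

head = s3.head_object(Bucket="my-repo", Key="main/big.bin")
print(head["Metadata"]["md5chksum"])
```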
h
Oh, so lakeFS will store that header, and I can retrieve it later using something like fsspec.
Re scanning huge files, I believe not much can be done... Either you use a checksum or the modification time, which can be quite unreliable, depending on the use case...
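Something like this, I guess (assuming s3fs; exactly which keys `info()` surfaces can vary by version):
```python
# Read the ETag back through fsspec/s3fs pointed at the lakeFS endpoint.
import s3fs

fs = s3fs.S3FileSystem(
    key="AKIA...",
    secret="...",
    client_kwargs={"endpoint_url": "https://lakefs.example.com"},
)

info = fs.info("my-repo/main/big.bin")
print(info.get("ETag"))  # the S3/lakeFS ETag
print(info)              # the custom metadata may or may not appear here,
                         # depending on the s3fs version; a direct head_object
                         # (as in the boto3 sketch above) always returns it
                         # under "Metadata"
```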
a
Yup. The fact that it's understandable doesn't make it fun.
But: sometimes the ETag is good enough!
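For plain single-part uploads on AWS (no SSE-KMS), the ETag is just the hex MD5, so a comparison like the sketch below can skip a re-upload; worth double-checking that your lakeFS setup reports the same value.
```python
# Compare a local MD5 against the remote ETag to decide whether to upload.
import hashlib
import boto3

def local_md5(path, chunk_size=1024 * 1024):
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

s3 = boto3.client("s3", endpoint_url="https://lakefs.example.com")
etag = s3.head_object(Bucket="my-repo", Key="main/data.csv")["ETag"].strip('"')

if etag == local_md5("data.csv"):
    print("unchanged, skip upload")
```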
h
Thanks for the help 😊