CC
01/11/2023, 9:36 PM
Iddo Avneri
01/11/2023, 9:39 PM
CC
01/11/2023, 10:04 PM
Iddo Avneri
01/11/2023, 10:14 PM
CC
01/11/2023, 10:21 PM
Iddo Avneri
01/11/2023, 10:22 PM
Ariel Shaqed (Scolnicov)
01/12/2023, 6:59 AM
X-Amz-Meta-Md5chksum tag on lakefs_a, and then further Rclones should work just fine.
var matchMd5 = regexp.MustCompile(`^[0-9a-f]{32}$`)
so it will refuse to handle an ETag generated by a multipart upload.
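To make that concrete, here is a minimal sketch of how such a pattern treats the two ETags that show up in this thread. Only the `matchMd5` pattern is quoted from above; `isPlainMD5ETag` is a hypothetical helper of mine and not Rclone's actual code path, and stripping the surrounding quotes is an assumption about how the header value would be handled.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matchMd5 is the pattern quoted above: exactly 32 lowercase hex digits.
var matchMd5 = regexp.MustCompile(`^[0-9a-f]{32}$`)

// isPlainMD5ETag is a hypothetical helper (not part of Rclone). It strips the
// double quotes S3 puts around an ETag and reports whether what remains looks
// like a plain MD5 digest.
func isPlainMD5ETag(etag string) bool {
	return matchMd5.MatchString(strings.Trim(etag, `"`))
}

func main() {
	etags := []string{
		`"895bed95bff6a288b45c28697991cc08"`,   // ETag of the copied EU object
		`"e790610b43d3959ea3474ee6097ee2b4-2"`, // ETag of the multipart US object
	}
	for _, etag := range etags {
		fmt.Printf("%s usable as MD5: %v\n", etag, isPlainMD5ETag(etag))
	}
}
```

The "-2" suffix is enough to fail the match, so an ETag of that shape cannot be treated as an MD5.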
So I came up with this experiment. It appears to show that Rclone does not verify ETags for S3 copies. No lakeFS in sight.
I started by uploading 10 MiB of data to my US bucket:
❯ aws s3 cp /tmp/10m s3://treeverse-ariels-test-us/test/rclone/10m
Let's rclone sync it over to my EU bucket! `-vv` means to give lots of debug information...
❯ aws s3api head-object --bucket treeverse-ariels-test --key test/rclone/10m/10m
{
"AcceptRanges": "bytes",
"LastModified": "Thu, 12 Jan 2023 08:26:03 GMT",
"ContentLength": 10485760,
"ETag": "\"895bed95bff6a288b45c28697991cc08\"",
"ContentType": "binary/octet-stream",
"Metadata": {}
}
❯ rclone sync -vv --checksum --s3-use-multipart-etag=true s3://treeverse-ariels-test-us/test/rclone/10m s3://treeverse-ariels-test/test/rclone/10m
<7>DEBUG : rclone: Version "1.60.1" starting with parameters ["rclone" "sync" "-vv" "--checksum" "--s3-use-multipart-etag=true" "s3://treeverse-ariels-test-us/test/rclone/10m" "s3://treeverse-ariels-test/test/rclone/10m"]
<7>DEBUG : rclone: systemd logging support activated
<7>DEBUG : Creating backend with remote "s3://treeverse-ariels-test-us/test/rclone/10m"
<7>DEBUG : Using config file from "/home/ariels/.config/rclone/rclone.conf"
<7>DEBUG : s3: detected overridden config - adding "{5gJq8}" suffix to name
<7>DEBUG : fs cache: adding new entry for parent of "s3://treeverse-ariels-test-us/test/rclone/10m", "s3{5gJq8}:treeverse-ariels-test-us/test/rclone"
<7>DEBUG : Creating backend with remote "s3://treeverse-ariels-test/test/rclone/10m"
<7>DEBUG : s3: detected overridden config - adding "{5gJq8}" suffix to name
<5>NOTICE: S3 bucket treeverse-ariels-test path test/rclone: Switched region to "eu-central-1" from "us-east-1"
<7>DEBUG : pacer: low level retry 1/2 (error BucketRegionError: incorrect region, the bucket is not in 'us-east-1' region at endpoint '', bucket is in 'eu-central-1' region
status code: 301, request id: 64Y2G17VXMKHFPA6, host id: lmJhpdYGGgFn10pdzK2/oiF4jY7MVqVolaE7/lNzpvz+AQZddc158KkisiuXnyEV7UsNlZM8zQU=)
<7>DEBUG : pacer: Rate limited, increasing sleep to 10ms
<7>DEBUG : pacer: Reducing sleep to 0s
<7>DEBUG : fs cache: renaming cache item "s3://treeverse-ariels-test/test/rclone/10m" to be canonical "s3{5gJq8}:treeverse-ariels-test/test/rclone/10m"
<7>DEBUG : 10m: Need to transfer - File not found at Destination
<7>DEBUG : 10m: Src hash empty - aborting Dst hash check
<6>INFO : 10m: Copied (server-side copy)
<6>INFO :
Transferred: 10 MiB / 10 MiB, 100%, 0 B/s, ETA -
Transferred: 1 / 1, 100%
Elapsed time: 6.5s
<7>DEBUG : 11 go routines active
The key line is "`10m: Src hash empty - aborting Dst hash check`". Indeed, as expected, the US object has a multipart upload ETag suffix "-2" (10 MiB is 2 parts, because a part is 5 MiB), and no auxiliary Rclone metadata, because it was generated by `aws s3 cp`, not `rclone`:
❯ aws s3api head-object --bucket treeverse-ariels-test-us --key test/rclone/10m
{
"AcceptRanges": "bytes",
"LastModified": "Thu, 12 Jan 2023 08:21:41 GMT",
"ContentLength": 10485760,
"ETag": "\"e790610b43d3959ea3474ee6097ee2b4-2\"",
"ContentType": "binary/octet-stream",
"Metadata": {}
}
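For anyone wondering where the "-2" comes from, here is a rough sketch of the commonly observed scheme for multipart ETags: MD5 each part, MD5 the concatenation of those digests, then append the part count. This is my understanding of S3's behaviour rather than something verified in this thread, `multipartETag` is a hypothetical helper, and the 5 MiB part size and all-zero payload are stand-ins, so the digest printed will not match the object above.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// multipartETag is a hypothetical helper sketching the commonly observed
// multipart ETag scheme: hex(MD5(MD5(part1) || MD5(part2) || ...)) + "-" + N.
// It returns "" when partSize is not positive.
func multipartETag(data []byte, partSize int) string {
	if partSize <= 0 {
		return ""
	}
	var digests []byte
	parts := 0
	for start := 0; start < len(data); start += partSize {
		end := start + partSize
		if end > len(data) {
			end = len(data)
		}
		sum := md5.Sum(data[start:end])
		digests = append(digests, sum[:]...)
		parts++
	}
	final := md5.Sum(digests)
	return fmt.Sprintf("%s-%d", hex.EncodeToString(final[:]), parts)
}

func main() {
	// 10 MiB of zeros stands in for /tmp/10m (assumption: real content differs).
	data := make([]byte, 10<<20)

	// With an assumed 5 MiB part size this yields a "-2" suffix.
	fmt.Println("multipart-style ETag:", multipartETag(data, 5<<20))

	// A single-part upload of the same bytes would carry the plain MD5 instead.
	plain := md5.Sum(data)
	fmt.Println("plain MD5:           ", hex.EncodeToString(plain[:]))
}
```

The point is that the multipart value is not the MD5 of the object, so it cannot be compared against a plain MD5 computed elsewhere without knowing the part boundaries.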
Over on the other bucket, we have a different ETag, presumably because rclone asked S3 to perform a server-side copy:
❯ aws s3api head-object --bucket treeverse-ariels-test --key test/rclone/10m/10m
{
"AcceptRanges": "bytes",
"LastModified": "Thu, 12 Jan 2023 08:26:03 GMT",
"ContentLength": 10485760,
"ETag": "\"895bed95bff6a288b45c28697991cc08\"",
"ContentType": "binary/octet-stream",
"Metadata": {}
}
Server-side copy changes ETags; see, for instance, this localstack issue.
I also tried to copy to a local minio on my machine, which prevents server-side copy. Again, it claims no hash check:
❯ rclone sync -vv --checksum --s3-use-multipart-etag=true s3://treeverse-ariels-test-us/test/rclone/10m minio://foo/test/rclone
<7>DEBUG : rclone: Version "1.60.1" starting with parameters ["rclone" "sync" "-vv" "--checksum" "--s3-use-multipart-etag=true" "s3://treeverse-ariels-test-us/test/rclone/10m" "minio://foo/test/rclone"]
<7>DEBUG : rclone: systemd logging support activated
<7>DEBUG : Creating backend with remote "s3://treeverse-ariels-test-us/test/rclone/10m"
<7>DEBUG : Using config file from "/home/ariels/.config/rclone/rclone.conf"
<7>DEBUG : s3: detected overridden config - adding "{5gJq8}" suffix to name
<7>DEBUG : fs cache: adding new entry for parent of "s3://treeverse-ariels-test-us/test/rclone/10m", "s3{5gJq8}:treeverse-ariels-test-us/test/rclone"
<7>DEBUG : Creating backend with remote "minio://foo/test/rclone"
<7>DEBUG : minio: detected overridden config - adding "{5gJq8}" suffix to name
<7>DEBUG : fs cache: renaming cache item "minio://foo/test/rclone" to be canonical "minio{5gJq8}:foo/test/rclone"
<7>DEBUG : 10m: Need to transfer - File not found at Destination
<7>DEBUG : 10m: Src hash empty - aborting Dst hash check
<6>INFO : 10m: Copied (new)
<6>INFO :
Transferred: 10 MiB / 10 MiB, 100%, 496.503 KiB/s, ETA 0s
Transferred: 1 / 1, 100%
Elapsed time: 20.5s
<7>DEBUG : 7 go routines active
I am sorry, I see no way forward for us here until we understand how to make Rclone verify hashes.
`rclone check`. For 2 distinct files, I get:
❯ rclone check s3://treeverse-ariels-test-us/test/rclone{,-2}/
<5>NOTICE: S3 bucket treeverse-ariels-test-us path test/rclone-2: 0 differences found
<5>NOTICE: S3 bucket treeverse-ariels-test-us path test/rclone-2: 1 hashes could not be checked
<5>NOTICE: S3 bucket treeverse-ariels-test-us path test/rclone-2: 1 matching files
And for 2 identical files, I get the same output:
❯ rclone check s3://treeverse-ariels-test-us/test/rclone/10m s3eu:test/rclone/
<5>NOTICE: S3 bucket treeverse-ariels-test path test: Switched region to "eu-central-1" from "us-east-1"
<5>NOTICE: S3 bucket treeverse-ariels-test path test/rclone: 0 differences found
<5>NOTICE: S3 bucket treeverse-ariels-test path test/rclone: 1 hashes could not be checked
<5>NOTICE: S3 bucket treeverse-ariels-test path test/rclone: 1 matching files
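A toy model of why both runs print the same thing, assuming the check compares sizes first and only compares hashes when both sides can supply one. This is a sketch of what the output above suggests, not Rclone's code; `object`, `outcome`, and `compare` are hypothetical names.

```go
package main

import "fmt"

// object is a hypothetical stand-in for what a checker knows about one file.
type object struct {
	size int64
	hash string // empty when no usable hash is available
}

type outcome string

const (
	differs         outcome = "differs"
	matchUnverified outcome = "matching, hash could not be checked"
	matchVerified   outcome = "matching, hash verified"
)

// compare models the assumed decision: sizes must agree, and hashes are only
// compared when both sides have one.
func compare(src, dst object) outcome {
	if src.size != dst.size {
		return differs
	}
	if src.hash == "" || dst.hash == "" {
		return matchUnverified
	}
	if src.hash != dst.hash {
		return differs
	}
	return matchVerified
}

func main() {
	const tenMiB = 10 << 20

	// Two same-sized objects with no usable hash: content is never compared,
	// so distinct and identical files are indistinguishable.
	fmt.Println(compare(object{size: tenMiB}, object{size: tenMiB}))

	// With hashes on both sides, a content difference would be caught.
	fmt.Println(compare(
		object{size: tenMiB, hash: "aaaa"},
		object{size: tenMiB, hash: "bbbb"},
	))
}
```

Under that model, two same-sized files with no comparable hash always land in "matching, could not be checked", whether or not their bytes agree.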
Really sorry, but if we cannot figure out how to do it on S3, I will not be able to do it on lakeFS either.
CC
01/12/2023, 2:51 PM
Ariel Shaqed (Scolnicov)
01/12/2023, 3:02 PM