# help
c
Hi, not sure if this is an issue with lakeFS, rclone, or something on my side. I'm using rclone to copy from one lakeFS instance to another, like this:
rclone sync --checksum --s3-use-multipart-etag=true lakefs_a://repo1/tag1 lakefs_b://repo1/main/
But I'm getting an error from rclone:
Failed to copy: multipart upload corrupted: Etag differ: expecting 1234567890abcdef1234567890abcdef-51 but got 1234567890abcdef1234567890abcdef
In other words, it's comparing a 32-character ETag that includes the appended chunk count to the same ETag WITHOUT the appended chunk count. Of course they don't match. This happens when the file being copied is above the size threshold for multipart uploads. If I drop the --s3-use-multipart-etag=true parameter it doesn't fail, but I'm not sure how much validation is done on the copy. When I do "lakectl fs stat" on the file on lakefs_a, the "Checksum" includes the appended chunk count.
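For reference, this is my rough understanding of how that suffixed checksum is produced for a multipart upload (just a sketch, nothing lakeFS- or rclone-specific; the 5 MiB part size is an assumption, the real value depends on the uploader's chunk size):
Copy code
// Sketch of how an S3-style multipart ETag is put together: MD5 each part,
// then MD5 the concatenated per-part digests, then append "-<part count>".
// The 5 MiB part size is an assumption; it depends on the uploader's settings.
package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"os"
)

const partSize = 5 * 1024 * 1024

func multipartETag(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	var partDigests []byte
	parts := 0
	buf := make([]byte, partSize)
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			sum := md5.Sum(buf[:n]) // per-part MD5
			partDigests = append(partDigests, sum[:]...)
			parts++
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break
		}
		if err != nil {
			return "", err
		}
	}
	final := md5.Sum(partDigests) // MD5 of the concatenated part MD5s
	return fmt.Sprintf("%x-%d", final, parts), nil
}

func main() {
	etag, err := multipartETag("/tmp/bigfile") // hypothetical local file
	if err != nil {
		panic(err)
	}
	// Prints something like 1234567890abcdef1234567890abcdef-51; the 32 hex
	// characters are a digest of digests, not the MD5 of the whole file.
	fmt.Println(etag)
}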
i
Thanks for sharing, Chuck.
What is your rclone version?
Let me consult with some teammates regarding this. Some information that would help is the versions (rclone, lakectl), so we can look at the specific configuration when we investigate.
c
The lakeFS client is 0.86.0; rclone is version 1.60.0.
The rclone config file is:
[lakefs_a]
type = s3
provider = Other
env_auth = false
access_key_id = REMOVED
secret_access_key = REMOVED
endpoint = http://1.2.3.4:8000
no_check_bucket = true

[lakefs_b]
type = s3
provider = Other
env_auth = false
access_key_id = REMOVED
secret_access_key = REMOVED
region = us-east-1
endpoint = http://5.6.7.8:8000
no_check_bucket = true
i
Very helpful. Thank you.
c
Is "provider = Other" kosher? Your examples tend to say "provider = AWS", but rclone errors out with that (possibly because 2 different regions are involved?).
i
I’m honestly not 100% sure and will consult with Ariel. I think this is sufficient information to look into it.
a
Thanks, @CC, for the excellent report and all the accompanying information! I am investigating.
S3 multipart ETags are wonky (this is a good recent explanation), and it can be frustrating to track ETags on S3-compatible object stores: rclone actually documents that ETag doesn't work for S3, it has this line for "qingstor", this issue which seems stuck, this issue about ETag not working, etc. lakeFS copies ETags during import, so it makes sense that imported objects share the same ETag string. Appending the chunk count to the computed ETag is a really strange move for a checksum.
I will check:
1. Can we include the chunk count on S3 multipart uploads?
2. Assuming we can, will it be safe to do so for our existing users?
I should have at least a plan by 16:00 UTC today.
As a workaround, if your repo import is not very large: you might use Rclone for the import, too. It should add the magic X-Amz-Meta-Md5chksum metadata on lakefs_a, and then further Rclone copies should work just fine.
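If you go that route and want to confirm it took, here is a rough, untested sketch (placeholder bucket/key, and the lakefs_a endpoint from your config) of reading that metadata back through the lakeFS S3 gateway with the AWS SDK for Go:
Copy code
// Untested sketch: HEAD an object via the lakefs_a S3 gateway and look for the
// md5 metadata that Rclone should have attached (the X-Amz-Meta-Md5chksum header).
// Bucket, key, and endpoint are placeholders; credentials come from the environment.
package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg, func(o *s3.Options) {
		o.BaseEndpoint = aws.String("http://1.2.3.4:8000") // lakefs_a endpoint from the rclone config
		o.UsePathStyle = true                              // address the gateway path-style
	})

	out, err := client.HeadObject(context.TODO(), &s3.HeadObjectInput{
		Bucket: aws.String("repo1"),
		Key:    aws.String("tag1/path/to/object"), // hypothetical object path
	})
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("ETag:", aws.ToString(out.ETag))
	for k, v := range out.Metadata {
		// Metadata key case can vary between stores, so compare loosely.
		if strings.EqualFold(k, "md5chksum") {
			fmt.Println("Rclone md5 metadata:", v)
		}
	}
}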
Actually, I'm clearly not an Rclone expert. But from reading the above issues and browsing the code, I suspect that Rclone cannot handle ETags that arise from multipart uploads to S3 at all! First, some code. Rclone ignores any ETag that does not match this line, so it will refuse to handle an ETag generated by a multipart upload:
Copy code
var matchMd5 = regexp.MustCompile(`^[0-9a-f]{32}$`)
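To see it concretely (my own quick check, not rclone code): the suffixed ETag from the error above fails that pattern, while a plain 32-character digest passes:
Copy code
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as rclone's matchMd5 above.
var matchMd5 = regexp.MustCompile(`^[0-9a-f]{32}$`)

func main() {
	fmt.Println(matchMd5.MatchString("1234567890abcdef1234567890abcdef"))    // true: plain MD5
	fmt.Println(matchMd5.MatchString("1234567890abcdef1234567890abcdef-51")) // false: multipart ETag is ignored
}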
So I came up with this experiment. It appears to show that Rclone does not verify ETags for S3 copies, no lakeFS in sight. I started by uploading 10 MiB of data to my US bucket:
Copy code
❯ aws s3 cp /tmp/10m s3://treeverse-ariels-test-us/test/rclone/10m
Let's rclone sync it over to my EU bucket! `-vv` means to give lots of debug information...
Copy code
❯ rclone sync -vv --checksum --s3-use-multipart-etag=true s3://treeverse-ariels-test-us/test/rclone/10m s3://treeverse-ariels-test/test/rclone/10m
<7>DEBUG : rclone: Version "1.60.1" starting with parameters ["rclone" "sync" "-vv" "--checksum" "--s3-use-multipart-etag=true" "s3://treeverse-ariels-test-us/test/rclone/10m" "s3://treeverse-ariels-test/test/rclone/10m"]
<7>DEBUG : rclone: systemd logging support activated
<7>DEBUG : Creating backend with remote "s3://treeverse-ariels-test-us/test/rclone/10m"
<7>DEBUG : Using config file from "/home/ariels/.config/rclone/rclone.conf"
<7>DEBUG : s3: detected overridden config - adding "{5gJq8}" suffix to name
<7>DEBUG : fs cache: adding new entry for parent of "s3://treeverse-ariels-test-us/test/rclone/10m", "s3{5gJq8}:treeverse-ariels-test-us/test/rclone"
<7>DEBUG : Creating backend with remote "s3://treeverse-ariels-test/test/rclone/10m"
<7>DEBUG : s3: detected overridden config - adding "{5gJq8}" suffix to name
<5>NOTICE: S3 bucket treeverse-ariels-test path test/rclone: Switched region to "eu-central-1" from "us-east-1"
<7>DEBUG : pacer: low level retry 1/2 (error BucketRegionError: incorrect region, the bucket is not in 'us-east-1' region at endpoint '', bucket is in 'eu-central-1' region
        status code: 301, request id: 64Y2G17VXMKHFPA6, host id: lmJhpdYGGgFn10pdzK2/oiF4jY7MVqVolaE7/lNzpvz+AQZddc158KkisiuXnyEV7UsNlZM8zQU=)
<7>DEBUG : pacer: Rate limited, increasing sleep to 10ms
<7>DEBUG : pacer: Reducing sleep to 0s
<7>DEBUG : fs cache: renaming cache item "s3://treeverse-ariels-test/test/rclone/10m" to be canonical "s3{5gJq8}:treeverse-ariels-test/test/rclone/10m"
<7>DEBUG : 10m: Need to transfer - File not found at Destination
<7>DEBUG : 10m: Src hash empty - aborting Dst hash check
<6>INFO  : 10m: Copied (server-side copy)
<6>INFO  : 
Transferred:           10 MiB / 10 MiB, 100%, 0 B/s, ETA -
Transferred:            1 / 1, 100%
Elapsed time:         6.5s

<7>DEBUG : 11 go routines active
The key line is "`10m: Src hash empty - aborting Dst hash check`". Indeed, as expected, the US object has a multipart upload ETag with a "-2" suffix (10 MiB is 2 parts, because a part is 5 MiB), and no auxiliary Rclone metadata, because it was generated by `aws s3 cp` and not `rclone`:
Copy code
❯ aws s3api head-object --bucket treeverse-ariels-test-us --key test/rclone/10m
{
    "AcceptRanges": "bytes",
    "LastModified": "Thu, 12 Jan 2023 08:21:41 GMT",
    "ContentLength": 10485760,
    "ETag": "\"e790610b43d3959ea3474ee6097ee2b4-2\"",
    "ContentType": "binary/octet-stream",
    "Metadata": {}
}
Over on the other bucket, we have a different ETag, presumably because rclone asked S3 to perform a server-side copy:
Copy code
❯ aws s3api head-object --bucket treeverse-ariels-test --key test/rclone/10m/10m
{
    "AcceptRanges": "bytes",
    "LastModified": "Thu, 12 Jan 2023 08:26:03 GMT",
    "ContentLength": 10485760,
    "ETag": "\"895bed95bff6a288b45c28697991cc08\"",
    "ContentType": "binary/octet-stream",
    "Metadata": {}
}
Server-side copy changes ETags; see for instance this localstack issue. I also tried copying to a local MinIO on my machine, which rules out server-side copy. Again, rclone skips the hash check:
Copy code
❯ rclone sync -vv --checksum --s3-use-multipart-etag=true s3://treeverse-ariels-test-us/test/rclone/10m minio://foo/test/rclone
<7>DEBUG : rclone: Version "1.60.1" starting with parameters ["rclone" "sync" "-vv" "--checksum" "--s3-use-multipart-etag=true" "s3://treeverse-ariels-test-us/test/rclone/10m" "minio://foo/test/rclone"]
<7>DEBUG : rclone: systemd logging support activated
<7>DEBUG : Creating backend with remote "s3://treeverse-ariels-test-us/test/rclone/10m"
<7>DEBUG : Using config file from "/home/ariels/.config/rclone/rclone.conf"
<7>DEBUG : s3: detected overridden config - adding "{5gJq8}" suffix to name
<7>DEBUG : fs cache: adding new entry for parent of "s3://treeverse-ariels-test-us/test/rclone/10m", "s3{5gJq8}:treeverse-ariels-test-us/test/rclone"
<7>DEBUG : Creating backend with remote "minio://foo/test/rclone"
<7>DEBUG : minio: detected overridden config - adding "{5gJq8}" suffix to name
<7>DEBUG : fs cache: renaming cache item "minio://foo/test/rclone" to be canonical "minio{5gJq8}:foo/test/rclone"
<7>DEBUG : 10m: Need to transfer - File not found at Destination
<7>DEBUG : 10m: Src hash empty - aborting Dst hash check
<6>INFO  : 10m: Copied (new)
<6>INFO  : 
Transferred:           10 MiB / 10 MiB, 100%, 496.503 KiB/s, ETA 0s
Transferred:            1 / 1, 100%
Elapsed time:        20.5s

<7>DEBUG : 7 go routines active
I am sorry, I see no way forward for us here until we understand how to make Rclone verify hashes.
Final message for today, I promise! This is not just about the checksum comparison that decides whether a sync is needed; I also tried `rclone check`. For 2 distinct files, I get:
Copy code
❯ rclone check s3://treeverse-ariels-test-us/test/rclone{,-2}/
<5>NOTICE: S3 bucket treeverse-ariels-test-us path test/rclone-2: 0 differences found
<5>NOTICE: S3 bucket treeverse-ariels-test-us path test/rclone-2: 1 hashes could not be checked
<5>NOTICE: S3 bucket treeverse-ariels-test-us path test/rclone-2: 1 matching files
And for 2 identical files, I get the same output:
Copy code
❯ rclone check s3://treeverse-ariels-test-us/test/rclone/10m s3eu:test/rclone/
<5>NOTICE: S3 bucket treeverse-ariels-test path test: Switched region to "eu-central-1" from "us-east-1"
<5>NOTICE: S3 bucket treeverse-ariels-test path test/rclone: 0 differences found
<5>NOTICE: S3 bucket treeverse-ariels-test path test/rclone: 1 hashes could not be checked
<5>NOTICE: S3 bucket treeverse-ariels-test path test/rclone: 1 matching files
Really sorry, but if we cannot figure out how to do it on S3, I will not be able to do it on lakeFS either.
c
Thanks for that research. I see several takeaways:
1. Using rclone across the board may be the answer I need.
2. Using -vv for a couple of simple tests should at least tell me if it's using the hash.
3. From what I've seen on the web, I gather that rclone checksums each chunk of a multipart upload, even if it doesn't checksum the chunks as a set, but that isn't an option for downloads (which weakens the net result, of course).
👍🏼 1
a
Yup! The whole situation with Rclone and S3 and ETag is deeply unsatisfying at an engineering level.