# help
m
Hey, we are using an S3-compatible bucket, and when we try to upload data with the "--direct" flag, we get the following error
upload to backing store: MissingRegion: could not find region configuration
I believe this can be simply solved by setting ``blockstore.s3.region``, but what should we set it to when we are not using AWS directly?
e
Hi @Matija Teršek, may I ask which S3-compatible provider you are using?
I believe you can bypass this check by setting the config parameter blockstore.s3.discover_bucket_region to false
m
Backblaze. Thanks, will give it a try 🙂
e
Sure, np
m
So it does seem we do have this set up already:
LAKEFS_BLOCKSTORE_S3_DISCOVER_BUCKET_REGION=false
e
You'd also need to set a few other parameters. You can reference this part of the docs - it's for on-prem, but should be the same for S3-compatible blockstores
a
Is this error message from lakectl perhaps? If so, you may need to configure the AWS region in an environment variable, probably AWS_REGION but perhaps AWS_DEFAULT_REGION. Basically, in direct mode your client environment has to work with the AWS CLI.
Sorry, crossed messages with @Elad Lachmi. Try his suggestion first! 😬
m
We do have it set up like in the on-prem installation: force path style, discover region set to false, and the connection with S3 storage works. But using --direct on the command line causes the error, yeah. So for --direct mode, the AWS CLI is also required?
e
You're using lakectl, right?
m
Exactly
a
Sorry for the confusing messages. You'll need a client-side configuration that can access AWS directly to use direct mode. There may be a way using the new presigned URLs feature that won't require any additional client-side configuration. I'll need to get to a keyboard to verify, unless @Barak Amar can beat me to it?!
b
Hi
Pre-sign is available for upload and download - replace --direct with --pre-sign.
It will not require any client-side credentials; the client will get a link to download/upload the data directly to the bucket.
@Matija Teršek which version are you using? pre-sign was added recently.
m
Hey, we are using 0.90.1
b
You will need v0.91.0
a
Both lakeFS server and lakectl. Brand new.
c
Testing it out now and benchmarking (it seems to work). I know @Ariel Shaqed (Scolnicov) said this:
"That passes data directly to s3, and only performs metadata operations on lakeFS."
And the docs say this:
write directly to backing store (faster but requires more credentials)
But I'm still a bit confused about what the difference is between using --direct / --pre-sign vs. without them. Why should it be faster?
e
With "regular" upload, the data flow looks something like this: local machine --> lakeFS --> S3 The up side is that you don't need the local machine to have credentials to the backing store and you can still upload files
With "--direct", the data flow looks something like this: local machine --> S3 then local machine --> lakeFS (but only for the purposes of registering the metadata) This is faster, but requires the client to have credentials for the backing store
and --pre-sign is more or less the best of both worlds: the client gets a pre-signed URL from lakeFS and uploads directly to S3.
So it's a direct upload like --direct, but it also doesn't require the client to have access to the backing store, like a "regular" upload.
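For a rough picture of what that pre-sign flow looks like under the hood, here is a minimal sketch (not lakeFS code; plain boto3 against a placeholder S3-compatible endpoint, with made-up bucket, key and region values): the side holding the bucket credentials generates a presigned PUT URL for one object, and the client then uploads to that URL with an ordinary HTTP PUT.

```python
# Sketch of the pre-sign data flow. Endpoint, region, bucket, key and file name
# are placeholders; with --pre-sign, the lakeFS server plays the "credential
# holder" role below on your behalf.
import boto3
import requests

# -- Side with bucket credentials (what the lakeFS server does for you) --
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example-s3-compatible.com",  # placeholder endpoint
    region_name="us-west-1",                               # placeholder region
)
presigned_url = s3.generate_presigned_url(
    ClientMethod="put_object",
    Params={"Bucket": "example-bucket", "Key": "datasets/images/0001.jpg"},
    ExpiresIn=3600,  # the URL stays valid for one hour
)

# -- Client side (no bucket credentials, only the URL) --
with open("0001.jpg", "rb") as f:
    resp = requests.put(presigned_url, data=f)
resp.raise_for_status()
```

Note that each presigned URL is bound to a single request against a single object key, which is why one link gets generated per file.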
c
Great thank you @Elad Lachmi, I understand
e
Sure, np
Feel free to reach out if you have any other questions
c
So after timing with --pre-sign, it seems like it still took over 2 hours to upload 2.5 GB
e
Interesting. Thanks for taking the time to benchmark! We'll need to look into that.
For context - I'm not sure you said how long it took to upload directly to s3 using the SDK, but I do believe you mentioned you benchmarked that as well
m
Looking at the reverse proxy logs, it seems that for each file a link is generated that lets you upload a single file. Not sure if that's possible at all with the S3 API, but perhaps it might make more sense to get a link for the entire folder?
I think with Backblaze directly it takes approximately 10 minutes
e
Unfortunately I don't know of S3 support for presigned URLs for anything except a single object. Generally presigning is for a particular request, and it cannot be changed. We will of course investigate.
If direct uploads are fast, we are probably running into a metadata rate limitation rather than a data rate limitation. Would you feel comfortable sharing any of the following information?
• How many files are we uploading? I saw 5000 on some previous conversation, but that sounds really slow so I'm making sure.
• Which database are we using for KV: Postgres / DynamoDB / ...?
• Where is the lakeFS server located? Where is the database? Where is the bucket?
• Could you share the lakeFS config file? We obviously don't need any of the secrets, so please scrub anything under auth, secret_access_key, session_token, dynamodb.aws_secret_access_key, encrypt.secret_key. If you can, please send a scrubbed version directly to me.
c
I saw 5000 on some previous conversation, but that sounds really slow so I'm making sure.
To answer this one - it's 5,000 images and 5,000 JSON files, so 10,000 total
For upload without --pre-sign, we didn't let it run all the way, but the tqdm estimate for looping through the files was 2-4 hours. This was without --recursive, but the test with --pre-sign had the recursive flag
e
Any additional information you can comfortably share would be helpful and we'll look into it
Thank you
c
I'll refer to @Matija Teršek for the other bullet points
m
Bucket is US west, DB and server are both east. We use the following env variables (values not provided where secret):
LAKEFS_DATABASE_POSTGRES_TYPE=postgres
LAKEFS_BLOCKSTORE_TYPE=s3
LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true
LAKEFS_BLOCKSTORE_S3_DISCOVER_BUCKET_REGION=false
LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID
LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY
LAKEFS_DATABASE_CONNECTION_STRING
LAKEFS_AUTH_ENCRYPT_SECRET_KEY
LAKEFS_DATABASE_POSTGRES_MAX_IDLE_CONNECTIONS=10
LAKEFS_DATABASE_POSTGRES_MAX_OPEN_CONNECTIONS=22
One potential issue that comes to mind on our side is the reverse proxy that we are using on top of lakeFS.
v
Hello @Matija Teršek! Thanks for sharing your configs. I'll let @Conor Simmons and the lakeFS team take a look at this.
a
Bucket is US west, DB and server are both east.
Ouch, this may be an issue: lakeFS is best used when the DB, server and bucket are in the same region. It would be ideal if you could test with the bucket and the client in the same region as well. If you can, you really should run the bucket, DB and server close to one another. Everything from now on assumes that for some reason you cannot.
If I had to guess, I would say that the issue is metadata, not data. You are uploading 10,000 files IIUC, but only 2.5 GiB in total, and that takes you over 2 hours. You're clearly not running out of data throughput anywhere, so I guess the cross-region metadata operations on lakeFS are slowing everything down. Where is the client located?
A direct upload requires 2 API calls on lakeFS, and a long-running API call on the bucket. If you're running the client near the bucket on the west coast, each API call is at least the inter-region RTT -- this is on average 62ms (us-east-1 to us-west-1 or us-west-2) -- and you'll be limited to fewer than 8 uploads per second in the best case. If you need to reconnect the client, that's another round-trip. Ideally you would not be running lakeFS far from the data store; doing it like this will be slow and expensive because of all the cross-region transfers.
If all this is true, one way to speed up uploads would be to perform them massively in parallel. Since lakeFS is not busy in this case, we could easily perform many hundreds of these uploads in parallel, maybe add some jitter before LinkPhysicalAddress calls, and I would expect a nice speedup.
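To illustrate that last suggestion, here is a rough client-side sketch of fanning the uploads out in parallel with some jitter. The `lakectl fs upload -s <local path> <lakefs uri>` invocation, the repo/branch names, the local `data/` directory and the worker count are all assumptions to adapt to your setup.

```python
# Sketch: run per-file uploads across many workers so the per-request
# round-trips overlap instead of being paid one after another.
import random
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def upload_one(local_path: Path) -> None:
    # Small random jitter so hundreds of requests don't hit the server in lock-step.
    time.sleep(random.uniform(0.0, 0.2))
    dest = f"lakefs://example-repo/main/{local_path.as_posix()}"  # placeholder repo/branch
    subprocess.run(
        ["lakectl", "fs", "upload", "--pre-sign", "-s", str(local_path), dest],
        check=True,
    )

files = [p for p in Path("data").rglob("*") if p.is_file()]

with ThreadPoolExecutor(max_workers=128) as pool:
    # .result() re-raises any failed upload instead of silently dropping it.
    for future in [pool.submit(upload_one, p) for p in files]:
        future.result()
```

Spawning one lakectl process per file is crude, but it keeps the sketch to commands that already appear in this thread; the same fan-out pattern applies if you drive the uploads through an SDK instead.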
m
Thanks! Will be trying out a us-east server soon. We've tried removing the proxy and the performance is similar, so we ruled that out.
Hey, we've tried removing the proxy and also moving the bucket to the US East. The proxy doesn't have an effect at all, while the client in the east reduced the upload from ~2h to ~1.2h, which is still significantly more than when using the b2 CLI directly. We will look into just abstracting the latter and seeing whether we can make it work like that directly.