# help
m
Hey, we are using an S3-compatible bucket, and when we try to upload data with the "--direct" flag, we get the following error
upload to backing store: MissingRegion: could not find region configuration
I believe this can be simply solved by setting ``blockstore.s3.region``, but what should we set it to when we are not using AWS directly?
e
Hi @Matija Teršek, may I ask which S3-compatible provider you are using?
I believe you can bypass this check by setting the config parameter blockstore.s3.discover_bucket_region to false
m
Backblaze. Thanks, will give it a try 🙂
e
Sure, np
m
So it does seem we do have this set up already:
LAKEFS_BLOCKSTORE_S3_DISCOVER_BUCKET_REGION=false
e
You'd also need to set a few other parameters. You can reference this part of the docs - it's for on-prem, but should be the same for S3-compatible blockstores
a
Is this error message from lakectl perhaps? If so, you may need to configure the AWS region in an environment variable, probably AWS_REGION but perhaps AWS_DEFAULT_REGION. Basically, in direct mode your client environment has to work with the AWS CLI.
Sorry, crossed messages with @Elad Lachmi. Try his suggestion first! 😬
m
We do have it set up like in the on-prem installation: force path style, discover region set to false, and the connection with S3 storage works. But using --direct on the command line causes the error, yeah. So for --direct mode, the AWS CLI is also required?
e
You're using lakectl, right?
m
Exactly
a
Sorry for the confusing messages. You'll need a client-side configuration that can access AWS directly to use direct mode. There may be a way using the new presigned URLs feature that won't require any additional client-side configuration. I'll need to get to a keyboard to verify, unless @Barak Amar can beat me to it?!
b
Hi
Pre-sign is available for upload and download - replace --direct with --pre-sign.
It will not require any client-side credentials; the client will get a link to download/upload the data directly to the bucket.
@Matija Teršek which version are you using? pre-sign was added recently.
m
Hey, we are using 0.90.1
b
You will need v0.91.0
a
Both lakeFS server and lakectl. Brand new.
c
Testing it out now and benchmarking (it seems to work). I know @Ariel Shaqed (Scolnicov) said this:
"That passes data directly to s3, and only performs metadata operations on lakeFS."
And the docs say this:
write directly to backing store (faster but requires more credentials)
But I'm still a bit confused about what the difference is between using --direct / --pre-sign vs. without them. Why should it be faster?
e
With "regular" upload, the data flow looks something like this: local machine --> lakeFS --> S3 The up side is that you don't need the local machine to have credentials to the backing store and you can still upload files
With "--direct", the data flow looks something like this: local machine --> S3 then local machine --> lakeFS (but only for the purposes of registering the metadata) This is faster, but requires the client to have credentials for the backing store
and --pre-sign is more or less the best of both worlds: the client gets a pre-signed URL from lakeFS and uploads directly to S3.
So it's a direct upload like --direct, but it also doesn't require the client to have access to the backing store, like a "regular" upload.
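For a rough picture of what that pre-sign flow looks like under the hood, here is a minimal sketch (not lakeFS code; plain boto3 against a placeholder S3-compatible endpoint, with made-up bucket, key and region values): the side holding the bucket credentials generates a presigned PUT URL for one object, and the client then uploads to that URL with an ordinary HTTP PUT.

```python
# Sketch of the pre-sign data flow. Endpoint, region, bucket, key and file name
# are placeholders; with --pre-sign, the lakeFS server plays the "credential
# holder" role below on your behalf.
import boto3
import requests

# -- Side with bucket credentials (what the lakeFS server does for you) --
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example-s3-compatible.com",  # placeholder endpoint
    region_name="us-west-1",                               # placeholder region
)
presigned_url = s3.generate_presigned_url(
    ClientMethod="put_object",
    Params={"Bucket": "example-bucket", "Key": "datasets/images/0001.jpg"},
    ExpiresIn=3600,  # the URL stays valid for one hour
)

# -- Client side (no bucket credentials, only the URL) --
with open("0001.jpg", "rb") as f:
    resp = requests.put(presigned_url, data=f)
resp.raise_for_status()
```

Note that each presigned URL is bound to a single request against a single object key, which is why one link gets generated per file.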
c
Great thank you @Elad Lachmi, I understand
e
Sure, np
Feel free to reach out if you have any other questions
c
So after timing with --pre-sign, it seems like it still took over 2 hours to upload 2.5 GB
e
Interesting. Thanks for taking the time to benchmark! We'll need to look into that.
For context - I'm not sure you said how long it took to upload directly to s3 using the SDK, but I do believe you mentioned you benchmarked that as well
m
Looking at the reverse proxy logs, it seems that for each file a link is generated that lets you upload a single file. Not sure if that's possible at all with the S3 API, but perhaps it might make more sense to get a link for the entire folder?
I think with Backblaze directly it takes approximately 10 minutes
e
Unfortunately I don't know of S3 support for presigned URLs for anything except a single object. Generally presigning is for a particular request, and it cannot be changed. We will of course investigate.
If direct uploads are fast, we are probably running into a metadata rate limitation rather than a data rate limitation. Would you feel comfortable sharing any of the following information?
• How many files are we uploading? I saw 5000 on some previous conversation, but that sounds really slow so I'm making sure.
• Which database are we using for KV: Postgres / DynamoDB / ...?
• Where is the lakeFS server located? Where is the database? Where is the bucket?
• Could you share the lakeFS config file? We obviously don't need any of the secrets, so please scrub anything under auth, secret_access_key, session_token, dynamodb.aws_secret_access_key, encrypt.secret_key. If you can, please send a scrubbed version directly to me.
c
I saw 5000 on some previous conversation, but that sounds really slow so I'm making sure.
To answer this one - it's 5,000 images and 5,000 JSON files, so 10,000 total
For upload without --pre-sign, we didn't let it run all the way, but the tqdm estimate for looping through the files was 2-4 hours. This was without --recursive, but the test with --pre-sign had the recursive flag
e
Any additional information you can comfortably share would be helpful and we'll look into it
Thank you
c
I'll refer to @Matija Teršek for the other bullet points
m
Bucket is US west, DB and server are both east. We use the following env variables (values not provided where secret):
LAKEFS_DATABASE_POSTGRES_TYPE=postgres
LAKEFS_BLOCKSTORE_TYPE=s3
LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true
LAKEFS_BLOCKSTORE_S3_DISCOVER_BUCKET_REGION=false
LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID
LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY
LAKEFS_DATABASE_CONNECTION_STRING
LAKEFS_AUTH_ENCRYPT_SECRET_KEY
LAKEFS_DATABASE_POSTGRES_MAX_IDLE_CONNECTIONS=10
LAKEFS_DATABASE_POSTGRES_MAX_OPEN_CONNECTIONS=22
One potential issue that comes to mind on our side is the reverse proxy that we are using on top of lakeFS.
v
Hello @Matija Teršek! Thanks for sharing your configs. I'll let @Conor Simmons and the lakeFS team take a look at this.
a
Bucket is US west, DB and server are both east.
Ouch, this may be an issue: lakeFS is best used when the DB, server and bucket are in the same region. It would be ideal if you could test with the bucket and the client in the same region as well. If you can, you really should run the bucket, DB and server close to one another. Everything from now on assumes that for some reason you cannot.
If I had to guess, I would say that the issue is metadata, not data. You are uploading 10,000 files IIUC, but only 2.5 GiB in total, and that takes you over 2 hours. You're clearly not running out of data throughput anywhere, so I guess the cross-region metadata operations on lakeFS are slowing everything down. Where is the client located?
A direct upload requires 2 API calls on lakeFS, and a long-running API call on the bucket. If you're running the client near the bucket on the west coast, each API call is at least the inter-region RTT -- this is on average 62ms (us-east-1 to us-west-1 or us-west-2) -- and you'll be limited to fewer than 8 uploads per second in the best case. If you need to reconnect the client, that's another round-trip. Ideally you would not be running lakeFS far from the data store; doing it like this will be slow and expensive because of all the cross-region transfers.
If all this is true, one way to speed up uploads would be to perform them massively in parallel. Since lakeFS is not busy in this case, we could easily perform many hundreds of these uploads in parallel, maybe add some jitter before LinkPhysicalAddress calls, and I would expect a nice speedup.
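To illustrate that last suggestion, here is a rough client-side sketch of fanning the uploads out in parallel with some jitter. The `lakectl fs upload -s <local path> <lakefs uri>` invocation, the repo/branch names, the local `data/` directory and the worker count are all assumptions to adapt to your setup.

```python
# Sketch: run per-file uploads across many workers so the per-request
# round-trips overlap instead of being paid one after another.
import random
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def upload_one(local_path: Path) -> None:
    # Small random jitter so hundreds of requests don't hit the server in lock-step.
    time.sleep(random.uniform(0.0, 0.2))
    dest = f"lakefs://example-repo/main/{local_path.as_posix()}"  # placeholder repo/branch
    subprocess.run(
        ["lakectl", "fs", "upload", "--pre-sign", "-s", str(local_path), dest],
        check=True,
    )

files = [p for p in Path("data").rglob("*") if p.is_file()]

with ThreadPoolExecutor(max_workers=128) as pool:
    # .result() re-raises any failed upload instead of silently dropping it.
    for future in [pool.submit(upload_one, p) for p in files]:
        future.result()
```

Spawning one lakectl process per file is crude, but it keeps the sketch to commands that already appear in this thread; the same fan-out pattern applies if you drive the uploads through an SDK instead.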
m
Thanks! Will be trying out a us-east server soon. We've tried removing the proxy and the performance is similar, so we ruled that out.
Hey, we've tried removing the proxy and also moving the bucket to the US East. The proxy doesn't have an effect at all, while the client in the east reduced the upload from ~2h to ~1.2h, which is still significantly more than when using the b2 CLI directly. We will look into just abstracting the latter and seeing whether we can make it work like that directly.