# help
j
So, good news/bad news on setting fs.lakefs.delete.bulk_size to 50. It did seem to fix the bulk delete problem, but now it's failing after that in a different place. Also, there are a few exceptions at the beginning with this message; not sure what it means:
```
Caused by: java.io.IOException: Failed to finish writing to presigned link. No ETag found.
```
I think the operation was retried and it succeeded cuz my job didn't fail on those. It only failed on the timeouts.
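For reference, a minimal sketch of how a setting like this could be passed from Spark to the lakeFS Hadoop FileSystem, assuming the standard `spark.hadoop.*` prefix for forwarding Hadoop properties; the property name is the one from the message above, and the app name is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: forward the bulk-delete setting discussed above to the
// lakeFS Hadoop FileSystem via Spark's spark.hadoop.* prefix.
// The property name comes from this thread; "bulk-delete-poc" is a placeholder.
val spark = SparkSession.builder()
  .appName("bulk-delete-poc")
  .config("spark.hadoop.fs.lakefs.delete.bulk_size", "50")
  .getOrCreate()
```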
i
That’s good to hear @Joe M. I don’t see anything from the lakeFS logs, and the Spark logs are all over the place. Can you share the exact stack trace of the Spark failure? Also, did you manage to run more than a single lakeFS server?
j
lakefs server logs
The first error started here
It's the same socket timeout error we've been seeing, just this time while calling "ListObjects".
We discussed how we would get multiple lakeFS servers running: basically we'll have two or more lakeFS server ECS tasks with an ALB set up in front of them, routing traffic across all of them. This is a last resort at this point; I really don't want to have to run multiple boxes for this one proof-of-concept test.
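To make the ALB idea concrete, a rough sketch of the client side, assuming the lakeFS Hadoop FileSystem reads its API endpoint from an fs.lakefs.endpoint-style property; the ALB hostname is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: point all lakeFS API traffic at the ALB fronting the ECS tasks,
// so the two (or more) lakeFS servers share the load transparently.
// "lakefs-alb.internal.example.com" is a placeholder for the ALB DNS name,
// and fs.lakefs.endpoint is assumed to be the endpoint property used by the
// lakeFS Hadoop FileSystem.
val spark = SparkSession.builder()
  .appName("lakefs-behind-alb")
  .config("spark.hadoop.fs.lakefs.endpoint", "http://lakefs-alb.internal.example.com/api/v1")
  .getOrCreate()
```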
i
Hey Joe, running multiple servers is essential for any HA setup. Moreover, I'm not sure I fully understand the value of doing a stress-test PoC on a single server. Is running a single server a requirement for some reason?
j
Right, this is the question I keep asking. I'm not sure what is considered a stress test for this product. Like, which part of what we're doing are you saying is stressing lakeFS? Is it the number of partitions? The amount of data? Both? Or something else?
i
I said it's the amount of objects written, together with the fact that it's a single server. Considering that we estimated roughly 20M API calls over 2-3 hours, I'd expect several minutes of >10k req/sec to that server. That's a lot; while I've seen many installations handle bigger bursts, they were never relying on a single server for that.
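For context, a back-of-envelope version of that estimate, using the numbers from the message above; how unevenly the calls cluster is an assumption:

```scala
// Back-of-envelope check of the load estimate above.
object LoadEstimate extends App {
  val totalCalls   = 2.0e7            // ~20M API calls (from the thread)
  val windowSecs   = 2.5 * 3600       // 2-3 hour window, midpoint
  val avgReqPerSec = totalCalls / windowSecs
  println(f"average rate: $avgReqPerSec%.0f req/sec") // ≈ 2,200 req/sec
  // The calls are not spread evenly: list/delete/commit-heavy phases bunch them up,
  // which is how a single server can see several minutes of >10k req/sec peaks.
}
```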
j
ah, ok. I missed that part of the discussion.