We are seeing quite often S3 timeouts when we are doing many lakeFS #help

Ion

03/14/2024, 8:44 AM

We quite often see S3 timeouts when doing many concurrent runs. We're not sure whether they are caused by lakeFS or by timeouts from the underlying Azure Storage. Any suggestions or ideas?

Itai Admi

03/14/2024, 10:07 AM

It could be a lot of things, from network throughput to DB throttling. What’s your setup?

Ion

03/14/2024, 10:42 AM

We have lakeFS deployed on k8s, and we use ADLS Gen2 as our object store, which sits behind a VNet.

Amit Kesarwani

03/15/2024, 12:38 AM

@Ion
1. How many concurrent runs were you running when you received timeouts?
2. Is lakeFS deployed on Azure cloud or on-prem?
3. Are you using a Postgres database or Cosmos DB?
4. What tool/language are you using to access the data in the lakeFS repo?
5. Are you using S3A Gateway, pre-signed URL or lakeFS File System (lakeFSFS)?

Can you please provide lakeFS logs?

👍 1

Ion

03/15/2024, 7:25 AM

1. Around 40 concurrent steps
2. Azure cloud
3. PostgreSQL DB in Kubernetes (we are still waiting on an Azure PostgreSQL DB)
4. Python high-level API
5. S3 gateway, since we use delta-rs to write data

I would need to check the logs and see what I can share.

Ion

03/15/2024, 10:02 AM

This is the error log I see in lakeFS:

2024-03-15T105654+01:00 time="2024-03-15T095654Z" level=error msg="could not write response body for object" func="pkg/gateway/operations.(*GetObject).Handle" file="build/pkg/gateway/operations/getobject.go:133" error="context canceled" host=lakefs matched_host=false method=GET operation_id=get_object path="<redacted>/table/event_year_month_clt=202312/column=<redacted>/part-00001-a8242922-5748-4e3c-9057-d5569a655ab4-c000.snappy.parquet" ref=main-step-jobid-14c42f40-ebcd-4771-869e-80f0161290f4-table repository=silver request_id=5cb40861-5e99-4dc2-ac7f-4b5d424c0a85 service_name=s3_gateway user=admin

Iddo Avneri

03/15/2024, 1:12 PM

I’m going by this previous message. Context cancellation at this point is usually caused by the client disconnecting. Any chance you have more logs from the request context on the server side, or a captured error on the client side?

Stefan Verbruggen

03/15/2024, 2:12 PM

Hi @Iddo Avneri, on the request side we see: Generic S3 error: request or response body error: operation timed out

Itai Admi

03/15/2024, 2:52 PM

Hey @Stefan Verbruggen & @Ion, requests that are cancelled under load can have several root causes:
1. Your LB (Azure Application Gateway?) is cancelling the requests because its configured timeout is too low.
2. The lakeFS instances run on k8s VMs with too little network bandwidth for your usage, or there are too few lakeFS replicas.
3. Postgres requests are taking too long (unlikely for your use case) because of too few connections, not enough resources on the DB itself, etc.
4. Poor ADLS Gen2 performance.

I would explore metrics from the LB, k8s/VMs, Postgres, and ADLS Gen2 to narrow down the problem, or just increase the resources until it’s fixed. lakeFS also exposes metrics under <lakefs_url>/metrics that you can fetch and analyze. This is slightly outdated; we now also have kv_request_duration_seconds for DB latency.
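As a rough sketch of what fetching and analyzing those metrics could look like: the snippet below pulls the text from <lakefs_url>/metrics and computes a mean latency per label set for kv_request_duration_seconds. It assumes the endpoint serves the standard Prometheus text format and that the metric is exposed as a histogram or summary with _sum and _count series. The helper names and the labels in the sample are made up for illustration, not taken from lakeFS.

```python
import re
from collections import defaultdict
from urllib.request import urlopen

# Metric name mentioned in the thread; the _sum/_count suffixes are an
# assumption based on the usual Prometheus histogram/summary conventions.
METRIC = "kv_request_duration_seconds"

# Matches: name{label="x",...} value   (labels are optional)
_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)


def fetch_metrics(base_url, timeout=10.0):
    """Download the raw metrics text from <base_url>/metrics."""
    with urlopen(base_url.rstrip("/") + "/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def mean_latency_by_labels(metrics_text, metric=METRIC):
    """Return {label string: mean seconds per request} from _sum / _count."""
    sums = defaultdict(float)
    counts = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _LINE.match(line)
        if m is None:
            continue
        labels = m.group("labels") or ""
        value = float(m.group("value"))
        if m.group("name") == metric + "_sum":
            sums[labels] += value
        elif m.group("name") == metric + "_count":
            counts[labels] += value
    return {k: sums[k] / counts[k] for k in counts if counts[k] > 0}


# Hypothetical sample; real label names and values will differ.
SAMPLE = """\
# TYPE kv_request_duration_seconds histogram
kv_request_duration_seconds_bucket{operation="Get",type="postgres",le="0.005"} 3
kv_request_duration_seconds_sum{operation="Get",type="postgres"} 1.5
kv_request_duration_seconds_count{operation="Get",type="postgres"} 3
"""

if __name__ == "__main__":
    # Against a live server: mean_latency_by_labels(fetch_metrics("<lakefs_url>"))
    for labels, mean in mean_latency_by_labels(SAMPLE).items():
        print(f"{labels}: {mean:.3f}s")
```

Note that _sum and _count are cumulative since process start, so to see latency during a burst of concurrent runs you would take the difference between two scrapes rather than a single snapshot.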

❤️ 1

Iddo Avneri

03/15/2024, 3:16 PM

Thanks for your help on this @Itai Admi!

Stefan Verbruggen

03/15/2024, 3:16 PM

Thanks for the pointers Itai we will investigate this!
