We are seeing quite often S3 timeouts when we are ...
# help
i
We quite often see S3 timeouts when doing many concurrent runs. We're not sure whether they are caused by lakeFS or by timeouts from the underlying Azure Storage. Any suggestions or ideas?
i
It could be a lot of things, from network throughput to DB throttling. What’s your setup?
i
We have lakeFS deployed on k8s, and we use ADLS Gen2 as our object store, which sits behind a VNet.
a
@Ion
1. How many concurrent runs were you running when you received the timeouts?
2. Is lakeFS deployed on Azure cloud or on-prem?
3. Are you using a Postgres database or Cosmos DB?
4. What tool/language are you using to access the data in the lakeFS repo?
5. Are you using the S3 gateway, pre-signed URLs, or the lakeFS File System (lakeFSFS)?
Can you please provide the lakeFS logs?
👍 1
i
1. Around 40 concurrent steps
2. Azure cloud
3. Postgres in Kubernetes (we are still waiting on an Azure PostgreSQL database)
4. Python high-level API
5. S3 gateway, since we use delta-rs to write data
I would need to check the logs and see what I can share.
This is the error log I see in lakeFS:
2024-03-15T10:56:54+01:00 time="2024-03-15T09:56:54Z" level=error msg="could not write response body for object" func="pkg/gateway/operations.(*GetObject).Handle" file="build/pkg/gateway/operations/getobject.go:133" error="context canceled" host=lakefs matched_host=false method=GET operation_id=get_object path="<redacted>/table/event_year_month_clt=202312/column=<redacted>/part-00001-a8242922-5748-4e3c-9057-d5569a655ab4-c000.snappy.parquet" ref=main-step-jobid-14c42f40-ebcd-4771-869e-80f0161290f4-table repository=silver request_id=5cb40861-5e99-4dc2-ac7f-4b5d424c0a85 service_name=s3_gateway user=admin
i
I’m going by this previous message. Context cancellation at this point is usually caused by the client disconnecting. Any chance you have more logs from the request context on the server side, or can capture an error on the client side?
s
Hi @Iddo Avneri, on the request side we see Generic S3 error: request or response body error: operation timed out
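If the timeout is being raised by the client itself, one thing to try is giving the client more time per request. The following is a minimal sketch, assuming delta-rs forwards the `timeout` and `connect_timeout` keys in `storage_options` to its underlying HTTP client; the helper name `with_timeouts` and the default values are hypothetical and should be checked against the delta-rs version in use.

```python
from typing import Dict


def with_timeouts(
    base_options: Dict[str, str],
    timeout_s: int = 120,
    connect_timeout_s: int = 30,
) -> Dict[str, str]:
    """Return a copy of base_options with client timeout settings added.

    Assumption: the "timeout" and "connect_timeout" keys are accepted as
    duration strings such as "120s". The input dict is not modified.
    """
    if timeout_s <= 0 or connect_timeout_s <= 0:
        raise ValueError("timeouts must be positive")
    options = dict(base_options)
    options["timeout"] = f"{timeout_s}s"
    options["connect_timeout"] = f"{connect_timeout_s}s"
    return options
```

The returned dict would be passed as `storage_options` to the delta-rs read and write calls, alongside whatever endpoint and credential options are already in use for the S3 gateway. A longer timeout only hides the symptom if the server side is the bottleneck, so it is worth trying together with the server-side checks.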
i
Hey @Stefan Verbruggen & @Ion, requests that are being cancelled during load could be the result of several root causes:
1. Your LB (Azure Application Gateway?) is cancelling the requests because its configured timeout is too low.
2. The lakeFS instances run on k8s with VMs whose network bandwidth is too low for your usage, or there are too few lakeFS replicas.
3. Postgres requests are taking too long (unlikely for your use case) due to too few connections, not enough resources on the DB itself, etc.
4. Poor ADLS Gen2 performance.
I would explore metrics from the LB, k8s/VMs, Postgres, and ADLS Gen2 to try and narrow down the problem, or just increase the resources until it’s fixed. lakeFS also exposes metrics under
<lakefs_url>/metrics
that you can fetch and analyze. This is slightly outdated; we now also have
kv_request_duration_seconds
for DB latency.
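A rough sketch of how that metric could be checked from a script, assuming the endpoint returns the standard Prometheus text format and that `kv_request_duration_seconds` is exposed with the usual `_sum` and `_count` series. The function names and the label names in the usage note are hypothetical.

```python
import urllib.request
from typing import Optional


def fetch_metrics(base_url: str, timeout: float = 10.0) -> str:
    """Download the raw metrics text from <base_url>/metrics."""
    url = f"{base_url.rstrip('/')}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def mean_duration(
    metrics_text: str, metric: str = "kv_request_duration_seconds"
) -> Optional[float]:
    """Average duration in seconds across all label sets of `metric`.

    Sums every <metric>_sum and <metric>_count sample and divides them.
    Returns None when no requests have been recorded.
    """
    total_sum = 0.0
    total_count = 0.0
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name not in (f"{metric}_sum", f"{metric}_count"):
            continue
        # The sample value is the first token after the label block, if any.
        tail = line[line.rfind("}") + 1:] if "{" in line else line[len(name):]
        value = float(tail.split()[0])
        if name.endswith("_sum"):
            total_sum += value
        else:
            total_count += value
    if total_count == 0:
        return None
    return total_sum / total_count
```

Calling `mean_duration(fetch_metrics("<lakefs_url>"))` before and during a load test gives a quick sense of whether DB latency rises with concurrency. The counters are cumulative since process start, so comparing two snapshots is more informative than reading a single value.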
❤️ 1
i
Thanks for your help on this @Itai Admi!
s
Thanks for the pointers, Itai, we will investigate this!