Hey all, I’ve noticed that when I roll out changes to my lakeFS server pods on k8s, some clients receive errors (presumably because they are still waiting for a response from a pod that was terminated). Is this a known issue? (From some reading, this seems to be a tricky k8s problem.)
Is there anything I can do to roll out changes without breaking connections with clients?
09/30/2023, 9:22 AM
I am not aware of such an issue. Can you please share your setup as well as how you perform the rolling update and the errors you encounter?
10/02/2023, 12:45 PM
Hey @Jacob, lakeFS respects the SIGTERM signal k8s uses to flag that the container will exit soon. For the lakeFS Cloud version, we manage the rollout ourselves and don’t encounter any errors. I find this guide pretty useful for understanding the k8s shutdown process.
10/02/2023, 1:11 PM
@Itai Admi Thanks for the pointers! Should we be configuring the probes ourselves? I noticed that the lakeFS Helm chart just uses the default k8s values.
10/02/2023, 3:12 PM
It depends on your particular use case, e.g. what the clients are doing when they fail. If I had to guess, it’s probably not the probes. As you mentioned, it seems like a pod is terminated in the middle of some clients’ operations while a rollout is in progress. That can happen if your clients perform long operations, like transferring big files in a single request, so the default k8s grace period may simply be too short for your usage. If that’s the case, you can either:
1. Break the long operations into smaller pieces (multipart uploads, reading directly from the object store, etc.), or
2. Increase the k8s grace period for terminating pods.
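For option 2, the grace period is set in the pod spec. A minimal sketch, assuming a standard Deployment (the field names are stock k8s; the name, image, and the 600-second value are illustrative, sized to your longest expected client operation):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lakefs
spec:
  template:
    spec:
      # Time between SIGTERM and SIGKILL during pod termination.
      # Default is 30 seconds; raise it so long-running client
      # requests can finish draining during a rollout.
      terminationGracePeriodSeconds: 600
      containers:
        - name: lakefs
          image: treeverse/lakefs  # illustrative
```

If you deploy via the Helm chart, the same field can usually be overridden through the chart's values rather than by editing the Deployment directly.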