# help
j
Hello, so starting a new thread here. I ran another test, and it failed in the same spot in the bulk delete part of Spark. This time, I upped the size of the lakeFS server to 4 vCPUs and 8 GB of RAM. The lakeFS server looked really good this time: CPU usage maxed out at 43% (as opposed to pegged at 100% on all previous runs), and the Postgres DB read/write latencies were both under 5 ms. The only thing I can think of to try at this point is to turn on that Hadoop failure-retry setting to try to recover from these bulk delete failures. I'd hate to do that, though, because it potentially hides the real problem.
a
I understand your reluctance to allow Spark jobs to succeed when they failed to clean up their mess. Until we resolve the issue with the slowness of multiple uncommitted deletions, I am afraid we will need to suggest workarounds. I will, however, point out that Spark can generate lots of garbage on its object store; a production system does need to collect this garbage or live with it. Here is one instance, a question on Microsoft Azure.
j
Hi, so I'm going to give it another go, setting mapreduce.fileoutputcommitter.failures.attempts to 100.
jumping lakefs 1
Sigh... setting spark.sparkContext._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.failures.attempts", "100") did not fix the issue; it died in the same place.
stderr.gz,stdout.gz
I've got one more in me. Going to set mapreduce.fileoutputcommitter.cleanup-failures.ignored to true. After this, I'm out of ideas.
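For reference, this is roughly how I'm applying both settings (just a sketch; the app name is a placeholder, but the property names and values are the ones from this thread):
# PySpark sketch: set both FileOutputCommitter properties on the job's Hadoop configuration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-job").getOrCreate()  # placeholder app name
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("mapreduce.fileoutputcommitter.failures.attempts", "100")
hadoop_conf.set("mapreduce.fileoutputcommitter.cleanup-failures.ignored", "true")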
Well, that run went a little longer, but now the entire lakeFS server went down and is unresponsive, and the job failed.
stdout.gz,stderr.gz
lakeFS server, RDS monitoring, and RDS performance statistics, respectively
Also, there were a lot of HTTP 500s thrown this time.
The only thing I can't measure is the disk performance on the lakeFS server. But it's a local ephemeral SSD, so it's writing as fast as it can.
But I imagine there's not a lot of local disk work.
a
Hi Joe, thanks for sending that data! Today is a holiday here, and I'll go over it tomorrow morning.
👍 1
i
@Joe M, looking at the logs it seems like there are a lot of errors like this one:
Caused by: io.lakefs.hadoop.shade.sdk.ApiException: Message: Unauthorized
HTTP response code: 401
HTTP response body: {"message":"error authenticating request"}
Assuming you didn't change anything with lakeFS IAM or the job creds, this is pretty odd. Can you share the lakeFS server logs during the run?
j
Yeah, I didn't change anything. The code is just sitting there running the Spark partition command (and another thread is in a loop, committing periodically).
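To give a picture of that committer thread, it's roughly this shape (just a sketch; commit_to_lakefs() is a stand-in for the real lakeFS commit call, and the 5-minute interval is illustrative):
import threading

def commit_to_lakefs():
    # Stand-in: in the real job this wraps the lakeFS commit API call for our branch.
    pass

def periodic_committer(stop_event, interval_seconds=300):
    # Commit on a fixed interval until the main thread signals the Spark job is done.
    while not stop_event.wait(interval_seconds):
        commit_to_lakefs()

stop = threading.Event()
threading.Thread(target=periodic_committer, args=(stop,), daemon=True).start()
# ... the Spark partition command runs here on the main thread ...
stop.set()  # tell the committer to stop once the job finishes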
I'll pull the server logs.
But like I said, the server was completely unresponsive, so that could be a red herring; something probably timed out during the auth check of a particular call.
👍 1
i
Can you share the server configuration too?
j
000000.gz,config.yaml
FYI, in the lakeFS server startup shell script, I'm adding the auth information via environment variables:
export LAKEFS_AUTH_ENCRYPT_SECRET_KEY="${serverHash}"
export LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING="postgres://${postgresUsername}:${postgresPassword}@${postgresHost}:${postgresPort}/postgres"
Those are coming from Secrets Manager secrets.
Also, in the config.yaml, the REGION placeholder is getting replaced with the AWS region in the startup script.
i
A few observations: 1. It seems like lakeFS is receiving a signal to exit from the controller. Do you know why ECS stopped lakeFS? Health checks?
2024-06-11T17:52:03.210Z received SIGTERM
2024-06-11T17:52:03.210Z sending SIGINT to LakeFS process: 68
2024-06-11T17:52:03.210Z waiting for lakefs to shutdown
2024-06-11T17:52:03.210Z Shutting down...
2024-06-11T17:52:03.222Z lakefs still running, waiting...
2. With the data we see, I lean towards the number of connections to the DB being the problem. If you want to experiment with a non-HA setup, I think you should increase the number of open connections to Postgres, which defaults to 25. What's the max your Postgres can handle? I would try at least 100 and see where it takes you.
j
After I terminated the EMR cluster that was doing the partitioning (because the box became unresponsive), I shut down the lakeFS server ECS task. So that would have been after everything failed and became unresponsive.
Once the EMR cluster completely terminated, the lakeFS UI eventually did start to respond again, before I shut it down.
OK, I can try upping the DB connection count.
What is the correct YAML format for the database.postgres.max_open_connections field? Also, the docs say connection_max_lifetime is a duration type, but what does that mean in YAML: connection_max_lifetime: 3h or connection_max_lifetime: "3h"?
Looks like the former (unquoted 3h) worked.
i
It would be something like:
listen_address: "0.0.0.0:8000"

database:
  type: "postgres"
  postgres:
    max_open_connections: 100

logging:
  format: text
  level: INFO
  output: "-"

blockstore:
  type: s3
  s3:
    region: __REGION__
j
From the log: database.postgres.connection_max_lifetime=3h0m0s
i
I guess changing connection_max_lifetime is worth a shot too,
although reconnecting every 5m should be okay too.
j
Yeah, I'm thinking any millisecond we can save by not having to reconnect is better. Since the majority of this operation happens over 2 hours, I'd rather not have the connections reopened.
Anyway, I'm getting set up to rerun this now; I'll let you know how it goes.
I need to check the Postgres side and see if it has a max connections limit configured, and if so, make sure it exceeds the client value of 100 I set.
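For reference, I'll check it with something like this (a sketch; psycopg2 assumed, and the DSN is a placeholder for the real Secrets Manager values):
import psycopg2

conn = psycopg2.connect("postgresql://USER:PASSWORD@POSTGRES_HOST:5432/postgres")  # placeholder DSN
with conn.cursor() as cur:
    cur.execute("SHOW max_connections;")  # server-side limit; needs to exceed the client's 100
    print("max_connections =", cur.fetchone()[0])
    cur.execute("SELECT count(*) FROM pg_stat_activity;")  # connections currently in use
    print("current connections =", cur.fetchone()[0])
conn.close()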
Argh, I just saw this while I was in RDS. This recommendation was generated after the last test, when I set the ignore-cleanup-failures setting to true.
image.png
I'm not sure why the FileOutputCommitter retry and cleanup-failures settings had no effect.
The stack trace showed that the connection exception thrown from your lakeFS code originated in a call made by the FileOutputCommitter class, so it should have caught it and retried.
Then again, maybe it did retry and failed all 100 times I gave it... and so it still wound up failing.