I have a weird bug with our LakeFS hosted in Azure...
# help
h
I have a weird bug with our LakeFS hosted in Azure:
Copy code
$ lakectl --verbose commit lakefs://hieu-test/main --message "add file"
Branch: lakefs://hieu-test/main
It takes minutes for the command to return, while the web interface says that the commit has been done.
The --verbose flag is not doing anything. How do I get more verbose output in order to diagnose the issue?
g
Hi @HT, just let me make sure I'm getting it right: you're committing via lakectl; when going to the UI you see the commit immediately, but the lakectl command only returns after minutes (with success). Is that correct?
h
yes
I got some logs from our server:
Copy code
time="2023-05-31T03:35:59Z" level=error msg="Post-commit hook failed" func="pkg/graveler.(*Graveler).Commit" file="build/pkg/graveler/graveler.go:1916" error="updating commit ID: run id 5iutn73meojc73d2gmfg: postgres get: context canceled" host=<http://ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io|ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io> method=POST operation_id=Commit path=/api/v1/repositories/[REDACTED]/branches/main/commits pre_run_id=5iutn73meojc73d2gmfg request_id=16232042-ada5-4d62-a499-1ff37e943d9c run_id=5iutmobmeojc73d2gmg0 service_name=rest_api user=hieut
feels like the connection between the lakeFS VM and the Postgres database is degraded somehow ...
g
OK, I was just about to ask for logs. Thanks 😃
h
and now things are back to normal!
commits return right away
maybe a blip in the Azure space?!
among others:
Copy code
time="2023-05-31T01:22:14Z" level=error msg="could not update metadata" func="pkg/gateway/operations.(*PathOperation).finishUpload" file="build/pkg/gateway/operations/operation_utils.go:51" error="postgres set: timeout: context canceled" host=<http://ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io|ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io> matched_host=false method=PUT operation_id=put_object path=[REDACTED] ref=main repository=XXXX request_id=2332a8c1-7854-49dc-9ee6-60d03c5e64c0 service_name=s3_gateway user=hieut
Copy code
time="2023-05-31T01:22:14Z" level=error msg="could not write request body to block adapter" func=pkg/gateway/operations.handlePut file="build/pkg/gateway/operations/putobject.go:263" error="context canceled" host=<http://ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io|ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io> matched_host=false method=PUT operation_id=put_object path=XXXXX ref=main repository=XXXX request_id=00aae4e9-acc8-42ac-852d-c7055c5ca4df service_name=s3_gateway user=hieut
g
Are these connected? They are from 2 hours earlier, IIUC
h
I am stress testing our server, so I am doing a lot of commits and repo creations
g
OK
h
those are random errors that I grabbed
there are heaps of them
g
It’s very helpful
h
and now that things are back to normal, no more errors are popping up in the logs
we did not touch anything on the infrastructure side
g
Can you please look at the metrics/logs in postgres?
h
and the only thing I did was to pause the test for like 10min
g
It looks like you are losing the connection to Postgres once in a while
h
our Container App:
looking at the database now ...
image.png
12h30-ish I restarted the transfer. 12h28: a bunch of errors. 13h22: another bunch. The doubled CPU usage: I restarted the transfer with 64 threads in rclone. 14h30 is when I started to observe the commit not returning.
It looks like you are losing the connection to Postgres once in a while
It does look like that
g
By the way, the reason that you saw the commit in the UI right away is that the commit succeeded, but the post-commit hook failed (to write to postgres).
h
does it have long-term consequences?
or inconsistencies?
g
Well, if you’re constantly failing to write to the database, many bad things can happen. It’s not an ideal way to run; I suggest looking further into it and finding the root cause.
I would start by changing the log level to DEBUG; that would show more information, such as the time each operation takes
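(A minimal sketch of how that could be done for a Container App deployment like the one described later in this thread; the app and resource-group names are placeholders, but LAKEFS_LOGGING_LEVEL is the actual lakeFS setting that appears in the config below.)
Copy code
# Hypothetical: raise lakeFS log verbosity on the Azure Container App, then let it restart
az containerapp update \
  --name ti-dev-aca-03-12-lakefs \
  --resource-group <resource-group> \
  --set-env-vars LAKEFS_LOGGING_LEVEL=DEBUG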
h
Looking at yesterday's logs, I have:
Copy code
time="2023-05-30T09:02:41Z" level=error msg="could not write request body to block adapter" func=pkg/gateway/operations.handlePut file="build/pkg/gateway/operations/putobject.go:263" error="context canceled" host=<http://ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io|ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io> matched_host=false method=PUT operation_id=put_object path=XXXX ref=main repository=XXXX request_id=81b5e6bc-afc8-402b-a108-474521c8fd1c service_name=s3_gateway user=hieut
Is this database related or Blob storage?
OK. We need to keep an eye on this, as it looks like it's not a one-off!
g
This one is Blob storage, but it can happen if you cancelled a request during upload
🙏 1
n
Off the top of my head, since you are stress testing, I would check the number of connections to the DB and whether it has reached the connection pool limit. It could be that we're waiting for a connection to free up
👍 1
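(A rough way to check that on the Postgres side, assuming the Flexible Server is reachable with psql; the user is a placeholder, while the host and database names come from the connection string shown later in this thread.)
Copy code
# Count current connections and compare against the server's configured limit
psql "host=ti-dev-postgres-03-12-lakefs.postgres.database.azure.com dbname=lakefs user=<admin> sslmode=require" \
  -c "SELECT count(*) AS connections FROM pg_stat_activity;" \
  -c "SHOW max_connections;"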
h
Did not think about that! Thanks!
@Niro Can you explain a bit more about the number of connections? We are using PostgreSQL Flexible Server. The metric "Number of connection pools" is empty
(currently having the commit lag ... no errors popping up yet on the lakeFS server ...)
we get 504 Gateway Timeout about 4 min after the commit command ...
n
@HT Hi, the timeout probably comes from the Azure load balancer, which has a hard limit of 4 minutes. Can you provide the lakeFS logs around this time?
h
My upload/commit logs show timeouts starting from around 6pm; the server log only shows errors at 20h47, which does match one of the timeout errors in the upload script
the errors at 20h47 are all:
Copy code
time="2023-05-31T08:47:01Z" level=error msg="could not write request body to block adapter" func=pkg/gateway/operations.handlePut file="build/pkg/gateway/operations/putobject.go:263" error="context canceled" host=<http://ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io|ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io> matched_host=false method=PUT operation_id=put_object path=XXXX.xml ref=main repository=capture-20211018-141231879861-clone071 request_id=94047de4-f9c4-4e28-ab3f-cff76101768a service_name=s3_gateway user=hieut
n
Is this the time when you got the 504 error?
h
yes, got one at that time. But got a bunch more
image.png
n
Yes - so it makes sense: the context gets canceled due to the timeout. What I need to see are the lakeFS logs around 202913 - the time you performed the commit itself. I understand you set up hooks?
h
I understand you set up hooks?
I don't think so .... we just deployed the default setup.
The weird thing is that I don't have any other logs on the lakeFS server side than the bunch at 8:47
n
What about the timestamp I mentioned?
h
my Azure logs query and results:
n
I'd like to focus on a specific occurrence. Do you have logs for this specific timeframe?
Also, another question: can you roughly tell me how many files you are committing in this scenario?
h
For the 202913.747 commit: 70626 files (most of them are 1KB)
n
@HT Thanks. I still suspect the number of connections was exhausted. Can you please query the lakeFS metrics endpoint while running this scenario and look at the pgx metrics:
curl <lakefs_url>/metrics | grep ^pgxpool
h
Copy code
pgxpool_acquire_count{db_name="kv"} 2.7428454e+07
pgxpool_acquire_duration_ns{db_name="kv"} 1.227152849835e+12
pgxpool_acquired_conns{db_name="kv"} 1
pgxpool_canceled_acquire_count{db_name="kv"} 30
pgxpool_constructing_conns{db_name="kv"} 0
pgxpool_empty_acquire{db_name="kv"} 158220
pgxpool_idle_conns{db_name="kv"} 24
pgxpool_max_conns{db_name="kv"} 25
pgxpool_total_conns{db_name="kv"} 25
it started happening after I changed rclone from 32 threads to 64 ...
I will switch back to 32 threads overnight and deal with the issue tomorrow ...
Follow-up: now we have
• log level DEBUG
• rclone at 64 threads
Copy code
LAKEFS_DATABASE_POSTGRES_MAX_OPEN_CONNECTIONS=200
LAKEFS_DATABASE_POSTGRES_MAX_IDLE_CONNECTIONS=200
LAKEFS_LOGGING_LEVEL=DEBUG
• commit is stalling ....
• /metrics:
Copy code
pgxpool_acquire_count{db_name="kv"} 1771
pgxpool_acquire_duration_ns{db_name="kv"} 453377
pgxpool_acquired_conns{db_name="kv"} 0
pgxpool_canceled_acquire_count{db_name="kv"} 0
pgxpool_constructing_conns{db_name="kv"} 0
pgxpool_empty_acquire{db_name="kv"} 0
pgxpool_idle_conns{db_name="kv"} 200
pgxpool_max_conns{db_name="kv"} 200
pgxpool_total_conns{db_name="kv"} 200
because it was not saying anything interesting in the log, we decided to bump the log level to TRACE and .... we cannot reproduce the issue x_x ....
i
Thank you @HT. Let’s see if this reoccurs.
h
commit is stalling now. What happened: I needed to reboot my PC (independently of lakeFS). So I Ctrl+C'd rclone running 64 threads, rebooted the PC, then started uploading again. And then, as soon as rclone finished, the commit happened and stalled
we are digging through the mountain of logs ... and trying to filter out the logs related to the uploaded files ....
i
Thank you. Please share if you get any new insight. We will continue investigating on our side as well in the next couple of days. If you wish, send the logs for us to review.
h
Timeline from the client/upload side, on the on-prem computer:
Copy code
2023-06-01 14:17:59.926 : rclone start copying
2023-06-01 14:18:05.489 : lakectl commit 
2023-06-01 14:22:05.675 : timeout 504
2023-06-01 14:22:06.228 : check if new repo exists and lakectl repo create new repo
2023-06-01 14:22:06.844 : rclone copying new files
Below are ALL the logs on the lakeFS server between 021804 and 022207
What you can see from the lakeFS logs is that there are no errors, and there is a 4-minute gap with no log lines, which matches the 4 minutes that the lakectl commit took to return (with the 504 error)
(note: you may want to re-order by time in the csv file. I just found that Azure provides the csv with some time-order mismatches!!)
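(If the export needs re-sorting: a sketch, assuming the timestamp sits in the first CSV column and the file is called logs.csv — both hypothetical.)
Copy code
# Keep the header row, sort the rest by the (assumed) first timestamp column
{ head -n 1 logs.csv; tail -n +2 logs.csv | sort -t, -k1,1; } > logs.sorted.csv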
I have here another interesting log. The context:
• From about 003000UTC, the commit command starts to stall and time out on the lakectl side.
• ...
• 51722: commit stalls and times out
• 52124: commit takes about 50s
• 52216: commit takes 1s
• Subsequent commits take 1s ....
We did not touch anything on either the infrastructure side or the upload/lakectl side. I extracted logs between 051722 and 052217.78. All logs were extracted, without filtering. Then I labelled the logs the best that I could. It would be great if someone could have a look at the attached csv. The cycle of commands from the client side is as follows:
• lakectl branch list: check if the repo already exists or not
• lakectl repo create: create a new repo
• rclone: copy files to the main branch
• lakectl diff: check if there are any files to commit
• lakectl commit
• New cycle starts: lakectl branch list, lakectl repo create, ....
and now it's stalling again 🫣
g
Hi @HT, I looked into the log. It looks like you're running multiple parallel uploads and commits at the same time; this is not the suggested way to use lakeFS. You may have multiple writers, but you should have a single committer per branch. lakeFS supports multiple committers (per branch), but stressing it can result in commits that take a long time, which is what happens in your case.
h
no, it's single-threaded. Only rclone does the copy in parallel
here is the upload script:
Copy code
#!/bin/bash

# lakefsServer="ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io"
lakefsStorageBackEnd="tidevstorage0312lakefs.blob.core.windows.net/hieu-test"
srcDir=/data/hieu/deleteme/stressTest/clones


nThread=32

log=$(realpath $BASH_SOURCE).log
if [ -n "$2" ]; then
	log=$(realpath $2)
    
fi

function echoLog {
    echo "$(date +'%Y-%m-%d %T.%3N') | $@" | tee -a $log
}

function run {
    echoLog "> $@"
    eval "$@"
    res=$?
    echo 

    return $res
}

function createRepo() {
    name=$1
    lakectl branch list lakefs://$name &> /dev/null

    if [ $? -eq 0 ]; then
        return 0
    fi 

    run lakectl repo create lakefs://$name https://$lakefsStorageBackEnd/$name
    return $?
}

function upload() {
    srcPath=$1; shift
    repo=$1; shift 

    run rclone copy $srcPath sandbox:$repo/main/ --transfers=$nThread -P --checksum
    return $?
}

function commit() {
    repo=$1;
    toCommit=$(lakectl diff lakefs://$repo/main | sed '1d' | wc -l)
    if [ $toCommit -eq 0 ]; then
        echoLog "Nothing to commit. Continuing .."
        return 0
    fi
    run "lakectl commit lakefs://$repo/main --message \"Initial commit\""
    return $?
}

function main() {
    opt=$1
    pushd $srcDir
    repos=$(find . -maxdepth 1 -type d -name "*_clone*"  | sed 's|./||g' | sort $opt)
    for repoOrg in $repos
    do
        repo=$(echo $repoOrg | tr '[:upper:]' '[:lower:]' | tr '_' '-')
        done=".$repo.commit"
        if [ -f $done ]; then
            echoLog "Skipping $repoOrg."
            continue
        fi
        
        echoLog "Processing $repo"
        
        createRepo $repo && \
        upload $repoOrg $repo && \
        commit $repo && \
        run touch $done
                 
    done
    
}

main $1 2>> $log
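(For context, a hypothetical invocation of the script above — the file name is made up; $1 is handed straight to sort as an ordering option and $2 optionally overrides the log path.)
Copy code
# Process the clone directories in reverse order and log to a custom file
./upload.sh -r /tmp/stress-upload.log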
g
I understand
h
We checked the CPU usage on the lakeFS server: very idle. We checked the database: it looks like it is busy when it needs to handle the rclone copy. My question is: what is happening when lakectl commit stalls? Neither the lakeFS server nor the database seems to do anything during that time ...
g
Got confused because the log isn’t sorted by date for some reason
h
I thought about a network issue, but rclone can still pump data over ....
g
looking into it now
h
is there any chance the issue is in lakectl commit itself?
as in, it not finishing some handshake with the server?
there are currently about 10M files and 2000 repos on our lakeFS server. Not sure if this is relevant ...
g
From looking at the log, it seems that the commit runs and at some point it just gets stuck
once it gets stuck there are no logs for 4 minutes (that's when we get the timeout)
I will try to reproduce it on my side
Creating an environment running lakeFS on an Azure Container App connected to Azure Postgres
Will update you in a few hours
h
Thanks. It took us some time before it happened.
using rclone with 64 threads instead of 32 seems to make it happen faster
g
Thanks!
How many commits did it take for it to happen? (roughly)
h
I would say under 100
🙏 1
a couple of tens maybe?
knowing that each of my commits has a couple of tens of thousands of small files
we are using the container image v0.100.0
g
Hi @HT, I wasn’t able to reproduce yet, I will give it another shot tomorrow and update you
👍 1
h
Even with 2 replicas we still have this stalling happening after 3h30 ....
g
Hi @HT, Sorry but I didn’t test it yet, I will test it and update you on Sunday
h
Thank you for all your effort !!
Here are some specs of our Container App:
Copy code
"configuration": {
            "secrets": null,
            "activeRevisionsMode": "Single",
            "ingress": {
                "fqdn": "<http://ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io|ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io>",
                "external": true,
                "targetPort": 8000,
                "exposedPort": 0,
                "transport": "Auto",
                "traffic": [
                    {
                        "weight": 100,
                        "latestRevision": true
                    }
                ],
                "customDomains": null,
                "allowInsecure": false,
                "ipSecurityRestrictions": null,
                "corsPolicy": null,
                "clientCertificateMode": null,
                "stickySessions": null
            },
            "registries": null,
            "dapr": null,
            "maxInactiveRevisions": null,
            "service": null
        },
        "template": {
            "revisionSuffix": "",
            "containers": [
                {
                    "image": "treeverse/lakefs:0.100.0",
                    "name": "lakefs-server",
                    "env": [
                        {
                            "name": "LAKEFS_DATABASE_TYPE",
                            "value": "postgres"
                        },
                        {
                            "name": "LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING",
                            "value": "postgres://************@ti-dev-postgres-03-12-lakefs.postgres.database.azure.com:5432/lakefs?sslmode=require"
                        },
                        {
                            "name": "LAKEFS_AUTH_ENCRYPT_SECRET_KEY",
                            "value": "************************"
                        },
                        {
                            "name": "LAKEFS_BLOCKSTORE_TYPE",
                            "value": "azure"
                        },
                        {
                            "name": "LAKEFS_BLOCKSTORE_AZURE_STORAGE_ACCOUNT",
                            "value": "<https://tidevstorage0312lakefs.blob.core.windows.net/>"
                        },
                        {
                            "name": "LAKEFS_DATABASE_POSTGRES_MAX_OPEN_CONNECTIONS",
                            "value": "200"
                        },
                        {
                            "name": "LAKEFS_DATABASE_POSTGRES_MAX_IDLE_CONNECTIONS",
                            "value": "200"
                        },
                        {
                            "name": "LAKEFS_LOGGING_LEVEL",
                            "value": "TRACE"
                        }
                    ],
                    "resources": {
                        "cpu": 1,
                        "memory": "2Gi",
                        "ephemeralStorage": "4Gi"
                    },
                    "probes": []
                }
            ],
            "initContainers": null,
            "scale": {
                "minReplicas": 1,
                "maxReplicas": 4,
                "rules": [
                    {
                        "name": "cpuscalerule",
                        "custom": {
                            "type": "cpu",
                            "metadata": {
                                "type": "Utilization",
                                "value": "60"
                            }
                        }
                    }
                ]
            },
            "volumes": null,
            "serviceBinds": null
        },
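(The spec above can be re-dumped on demand; a sketch with the app name taken from the hostname in the logs and a placeholder resource group.)
Copy code
# Dump the Container App definition, including the env and scale settings shown above
az containerapp show \
  --name ti-dev-aca-03-12-lakefs \
  --resource-group <resource-group> \
  --output json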
Our database spec:
Copy code
{
  "sku": {
    "name": "Standard_D2s_v3",
    "tier": "GeneralPurpose"
  },
  "systemData": {
    "createdAt": "2023-05-19T03:58:00.7694907Z"
  },
  "properties": {
    "authConfig": {
      "activeDirectoryAuth": "Disabled",
      "passwordAuth": "Enabled"
    },
    "dataEncryption": {
      "type": "SystemManaged"
    },
    "fullyQualifiedDomainName": "<http://ti-dev-postgres-03-12-lakefs.postgres.database.azure.com|ti-dev-postgres-03-12-lakefs.postgres.database.azure.com>",
    "version": "14",
    "minorVersion": "7",
    "administratorLogin": "******",
    "state": "Ready",
    "availabilityZone": "1",
    "storage": {
      "storageSizeGB": 256
    },
    "backup": {
      "backupRetentionDays": 7,
      "geoRedundantBackup": "Disabled",
      "earliestRestoreDate": "2023-05-27T04:06:03.1274861+00:00"
    },
    "network": {
      "publicNetworkAccess": "Enabled"
    },
    "highAvailability": {
      "mode": "Disabled",
      "state": "NotEnabled"
    },
    "maintenanceWindow": {
      "customWindow": "Disabled",
      "dayOfWeek": 0,
      "startHour": 0,
      "startMinute": 0
    },
    "replicationRole": "Primary",
    "replicaCapacity": 5
  },
  "location": "Australia East",
  "id": "********",
  "name": "ti-dev-postgres-03-12-lakefs",
  "type": "Microsoft.DBforPostgreSQL/flexibleServers"
}
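(And the equivalent sketch for pulling the Flexible Server spec above; the resource group is again a placeholder.)
Copy code
# Dump the PostgreSQL Flexible Server definition shown above
az postgres flexible-server show \
  --name ti-dev-postgres-03-12-lakefs \
  --resource-group <resource-group> \
  --output json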
Also, in NZ, this coming Monday is a public holiday for us 😉
so no rush 😄
🙏 1
i
Enjoy the long weekend. Thanks for that context!
👍 1
h
I tried to reproduce this issue using a lakeFS server inside the demo container, on one of our on-prem machines: I cannot reproduce the issue. Must be related to networking somewhere in Azure ...