I have a weird bug with our LakeFS hosted in Azure...
# help
h
I have a weird bug with our LakeFS hosted in Azure:
Copy code
$ lakectl --verbose commit lakefs://hieu-test/main --message "add file"
Branch: lakefs://hieu-test/main
It takes minutes for the command to return, while the web interface says that the commit has been done.
The --verbose flag is not doing anything. How do I get more verbose output in order to diagnose the issue?
g
Hi @HT, just let me make sure I'm getting it right: you're committing via lakectl; when going to the UI you see the commit immediately, but the lakectl command only returns after minutes (with success). Is that correct?
h
yes
I got some logs from our server:
Copy code
time="2023-05-31T03:35:59Z" level=error msg="Post-commit hook failed" func="pkg/graveler.(*Graveler).Commit" file="build/pkg/graveler/graveler.go:1916" error="updating commit ID: run id 5iutn73meojc73d2gmfg: postgres get: context canceled" host=<http://ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io|ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io> method=POST operation_id=Commit path=/api/v1/repositories/[REDACTED]/branches/main/commits pre_run_id=5iutn73meojc73d2gmfg request_id=16232042-ada5-4d62-a499-1ff37e943d9c run_id=5iutmobmeojc73d2gmg0 service_name=rest_api user=hieut
feels like the connection between the lakeFS VM and the Postgres database is degraded somehow ...
g
OK, I was just about to ask for logs. Thanks 😃
h
and now things are back to normal!
commits return right away
maybe a blip in the Azure space?!
among others:
Copy code
time="2023-05-31T01:22:14Z" level=error msg="could not update metadata" func="pkg/gateway/operations.(*PathOperation).finishUpload" file="build/pkg/gateway/operations/operation_utils.go:51" error="postgres set: timeout: context canceled" host=<http://ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io|ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io> matched_host=false method=PUT operation_id=put_object path=[REDACTED] ref=main repository=XXXX request_id=2332a8c1-7854-49dc-9ee6-60d03c5e64c0 service_name=s3_gateway user=hieut
Copy code
time="2023-05-31T01:22:14Z" level=error msg="could not write request body to block adapter" func=pkg/gateway/operations.handlePut file="build/pkg/gateway/operations/putobject.go:263" error="context canceled" host=<http://ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io|ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io> matched_host=false method=PUT operation_id=put_object path=XXXXX ref=main repository=XXXX request_id=00aae4e9-acc8-42ac-852d-c7055c5ca4df service_name=s3_gateway user=hieut
g
Are these connected? They are from 2 hours earlier, IIUC
h
I am stress testing our server, so I am doing a lot of commits and repo creations
g
OK
h
those are random errors that I grabbed
there are heaps of them
g
It’s very helpful
h
and now that things are back to normal, no more errors are popping up in the logs
we did not touch anything on the infrastructure side
g
Can you please look at the metrics/logs in postgres?
h
and the only thing I did was to pause the test for like 10min
g
It looks like you are losing the connection to Postgres once in a while
h
our Container App:
looking at the database now ...
image.png
12h30-ish I restarted the transfer. 12h28: a bunch of errors. 13h22: another bunch. The doubled CPU usage: I restarted the transfer with 64 threads in rclone. 14h30 is when I started to observe the commit not returning.
It looks like you are losing the connection to Postgres once in a while
It does look like that
g
By the way, the reason that you saw the commit in the UI right away is that the commit succeeded, but the post-commit hook failed (to write to postgres).
h
does it have long-term consequences?
or inconsistencies?
g
Well, if you’re constantly failing to write to the database, many bad things can happen. It’s not an ideal way to run; I suggest looking further into it and finding the root cause.
I would start by changing the log level to DEBUG; that would show more information, such as the time each operation takes
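(A minimal sketch of how that could be done for a Container App deployment like the one described later in this thread; the app and resource-group names are placeholders, but LAKEFS_LOGGING_LEVEL is the actual lakeFS setting that appears in the config below.)
Copy code
# Hypothetical: raise lakeFS log verbosity on the Azure Container App, then let it restart
az containerapp update \
  --name ti-dev-aca-03-12-lakefs \
  --resource-group <resource-group> \
  --set-env-vars LAKEFS_LOGGING_LEVEL=DEBUG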
h
Looking at yesterday's logs, I have:
Copy code
time="2023-05-30T09:02:41Z" level=error msg="could not write request body to block adapter" func=pkg/gateway/operations.handlePut file="build/pkg/gateway/operations/putobject.go:263" error="context canceled" host=<http://ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io|ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io> matched_host=false method=PUT operation_id=put_object path=XXXX ref=main repository=XXXX request_id=81b5e6bc-afc8-402b-a108-474521c8fd1c service_name=s3_gateway user=hieut
Is this database related or Blob storage?
OK. We need to keep an eye on this, as it looks like it's not a one-off!
g
This one is Blob storage, but it can happen if you cancelled a request during upload
🙏 1
n
Off the top of my head, since you are stress testing, I would check the number of connections to the DB and whether it has reached the connection pool limit. It could be that we're waiting for a connection to free up
👍 1
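(A rough way to check that on the Postgres side, assuming the Flexible Server is reachable with psql; the user is a placeholder, while the host and database names come from the connection string shown later in this thread.)
Copy code
# Count current connections and compare against the server's configured limit
psql "host=ti-dev-postgres-03-12-lakefs.postgres.database.azure.com dbname=lakefs user=<admin> sslmode=require" \
  -c "SELECT count(*) AS connections FROM pg_stat_activity;" \
  -c "SHOW max_connections;"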
h
Did not think about that! Thanks!
@Niro Can you explain a bit more about the number of connections? We are using PostgreSQL Flexible Server. The metric "Number of connection pools" is empty
(currently having the commit lag ... no errors popping up yet on the lakeFS server ...)
we get 504 Gateway Timeout about 4 min after the commit command ...
n
@HT Hi, the timeout probably comes from the Azure load balancer, which has a hard limit of 4 minutes. Can you provide the lakeFS logs around this time?
h
My upload/commit logs show timeouts starting from around 6pm; the server log only shows errors at 20h47, which does match one of the timeout errors in the upload script
the errors at 20h47 are all:
Copy code
time="2023-05-31T08:47:01Z" level=error msg="could not write request body to block adapter" func=pkg/gateway/operations.handlePut file="build/pkg/gateway/operations/putobject.go:263" error="context canceled" host=<http://ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io|ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io> matched_host=false method=PUT operation_id=put_object path=XXXX.xml ref=main repository=capture-20211018-141231879861-clone071 request_id=94047de4-f9c4-4e28-ab3f-cff76101768a service_name=s3_gateway user=hieut
n
Is this the time when you got the 504 error?
h
yes, got one at that time. But got a bunch more
image.png
n
Yes - so it makes sense: the context gets canceled due to the timeout. What I need to see are the lakeFS logs around 202913 - the time you performed the commit itself. I understand you set up hooks?
h
I understand you set up hooks?
I don't think so .... we just deployed the default setup.
The weird thing is that I don't have any other logs on the lakeFS server side than the bunch at 8:47
n
What about the timestamp I mentioned?
h
my Azure logs query and results:
n
I'd like to focus on a specific occurrence. Do you have logs for this specific timeframe?
Also, another question: can you roughly tell me how many files you are committing in this scenario?
h
For the 202913.747 commit: 70626 files (most of them are 1KB)
n
@HT Thanks. I still suspect the number of connections was exhausted. Can you please query the lakeFS metrics endpoint while running this scenario and look at the pgx metrics:
curl <lakefs_url>/metrics | grep ^pgxpool
h
Copy code
pgxpool_acquire_count{db_name="kv"} 2.7428454e+07
pgxpool_acquire_duration_ns{db_name="kv"} 1.227152849835e+12
pgxpool_acquired_conns{db_name="kv"} 1
pgxpool_canceled_acquire_count{db_name="kv"} 30
pgxpool_constructing_conns{db_name="kv"} 0
pgxpool_empty_acquire{db_name="kv"} 158220
pgxpool_idle_conns{db_name="kv"} 24
pgxpool_max_conns{db_name="kv"} 25
pgxpool_total_conns{db_name="kv"} 25
it started happening after I changed rclone from 32 threads to 64 ...
I will switch back to 32 threads overnight and deal with the issue tomorrow ...
Follow-up: now we have
• log level DEBUG
• rclone at 64 threads
Copy code
LAKEFS_DATABASE_POSTGRES_MAX_OPEN_CONNECTIONS=200
LAKEFS_DATABASE_POSTGRES_MAX_IDLE_CONNECTIONS=200
LAKEFS_LOGGING_LEVEL=DEBUG
• commit is stalling ....
• /metrics:
Copy code
pgxpool_acquire_count{db_name="kv"} 1771
pgxpool_acquire_duration_ns{db_name="kv"} 453377
pgxpool_acquired_conns{db_name="kv"} 0
pgxpool_canceled_acquire_count{db_name="kv"} 0
pgxpool_constructing_conns{db_name="kv"} 0
pgxpool_empty_acquire{db_name="kv"} 0
pgxpool_idle_conns{db_name="kv"} 200
pgxpool_max_conns{db_name="kv"} 200
pgxpool_total_conns{db_name="kv"} 200
because it was not saying anything interesting in the log, we decided to bump the log level to TRACE and .... we cannot reproduce the issue x_x ....
i
Thank you @HT. Let’s see if this reoccurs.
h
commit is stalling now. What happened: I needed to reboot my PC (independently of lakeFS). So I Ctrl+C'd rclone running 64 threads, rebooted the PC, then started uploading again. And then, as soon as rclone finished, the commit happened and stalled
we are digging through the mountain of logs ... and trying to filter out the logs related to the uploaded files ....
i
Thank you. Please share if you get any new insight. We will continue investigating on our side as well in the next couple of days. If you wish, send the logs for us to review.
h
Timeline from the client/upload side, on the on-prem computer:
Copy code
2023-06-01 14:17:59.926 : rclone start copying
2023-06-01 14:18:05.489 : lakectl commit 
2023-06-01 14:22:05.675 : timeout 504
2023-06-01 14:22:06.228 : check if new repo exists and lakectl repo create new repo
2023-06-01 14:22:06.844 : rclone copying new files
Below are ALL the logs on the lakeFS server between 021804 and 022207
What you can see from the lakeFS logs is that there are no errors, and there is a 4-minute gap with no log lines, which matches the 4 minutes that the lakectl commit took to return (with the 504 error)
(note: you may want to re-order by time in the csv file. I just found that Azure provides the csv with some time-order mismatches!!)
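(If the export needs re-sorting: a sketch, assuming the timestamp sits in the first CSV column and the file is called logs.csv — both hypothetical.)
Copy code
# Keep the header row, sort the rest by the (assumed) first timestamp column
{ head -n 1 logs.csv; tail -n +2 logs.csv | sort -t, -k1,1; } > logs.sorted.csv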
I have here another interesting log. The context:
• From about 003000UTC, the commit command starts to stall and time out on the lakectl side.
• ...
• 51722: commit stalls and times out
• 52124: commit takes about 50s
• 52216: commit takes 1s
• Subsequent commits take 1s ....
We did not touch anything on either the infrastructure side or the upload/lakectl side. I extracted logs between 051722 and 052217.78. All logs were extracted, without filtering. Then I labelled the logs the best that I could. It would be great if someone could have a look at the attached csv. The cycle of commands from the client side is as follows:
• lakectl branch list: check if the repo already exists or not
• lakectl repo create: create a new repo
• rclone: copy files to the main branch
• lakectl diff: check if there are any files to commit
• lakectl commit
• New cycle starts: lakectl branch list, lakectl repo create, ....
and now it's stalling again 🫣
g
Hi @HT, I looked into the log. It looks like you're running multiple parallel uploads and commits at the same time; this is not the suggested way to use lakeFS. You may have multiple writers, but you should have a single committer per branch. lakeFS supports multiple committers (per branch), but stressing it can result in commits that take a long time, which is what happens in your case.
h
no, it's single-threaded. Only rclone does the copy in parallel
here is the upload script:
Copy code
#!/bin/bash

# lakefsServer="ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io"
lakefsStorageBackEnd="tidevstorage0312lakefs.blob.core.windows.net/hieu-test"
srcDir=/data/hieu/deleteme/stressTest/clones


nThread=32

log=$(realpath $BASH_SOURCE).log
if [ -n "$2" ]; then
	log=$(realpath $2)
    
fi

function echoLog {
    echo "$(date +'%Y-%m-%d %T.%3N') | $@" | tee -a $log
}

function run {
    echoLog "> $@"
    eval "$@"
    res=$?
    echo 

    return $res
}

function createRepo() {
    name=$1
    lakectl branch list lakefs://$name &> /dev/null

    if [ $? -eq 0 ]; then
        return 0
    fi 

    run lakectl repo create lakefs://$name https://$lakefsStorageBackEnd/$name
    return $?
}

function upload() {
    srcPath=$1; shift
    repo=$1; shift 

    run rclone copy $srcPath sandbox:$repo/main/ --transfers=$nThread -P --checksum
    return $?
}

function commit() {
    repo=$1;
    toCommit=$(lakectl diff lakefs://$repo/main | sed '1d' | wc -l)
    if [ $toCommit -eq 0 ]; then
        echoLog "Nothing to commit. Continuing .."
        return 0
    fi
    run "lakectl commit lakefs://$repo/main --message \"Initial commit\""
    return $?
}

function main() {
    opt=$1
    pushd $srcDir
    repos=$(find . -maxdepth 1 -type d -name "*_clone*"  | sed 's|./||g' | sort $opt)
    for repoOrg in $repos
    do
        repo=$(echo $repoOrg | tr '[:upper:]' '[:lower:]' | tr '_' '-')
        done=".$repo.commit"
        if [ -f $done ]; then
            echoLog "Skipping $repoOrg."
            continue
        fi
        
        echoLog "Processing $repo"
        
        createRepo $repo && \
        upload $repoOrg $repo && \
        commit $repo && \
        run touch $done
                 
    done
    
}

main $1 2>> $log
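(For context, a hypothetical invocation of the script above — the file name is made up; $1 is handed straight to sort as an ordering option and $2 optionally overrides the log path.)
Copy code
# Process the clone directories in reverse order and log to a custom file
./upload.sh -r /tmp/stress-upload.log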
g
I understand
h
We checked the CPU usage on the lakeFS server: very idle. We checked the database: it looks like it is busy when it needs to handle the rclone copy. My question is: what is happening when lakectl commit stalls? Neither the lakeFS server nor the database seems to do anything during that time ...
g
Got confused because the log isn’t sorted by date for some reason
h
I thought about a network issue, but rclone can still pump data over ....
g
looking into it now
h
is there any chance the issue is in lakectl commit itself?
as in, it not finishing some handshake with the server?
there are currently about 10M files and 2000 repos on our lakeFS server. Not sure if this is relevant ...
g
From looking at the log, it seems that the commit runs and at some point it just gets stuck
once it gets stuck there are no logs for 4 minutes (that's when we get the timeout)
I will try to reproduce it on my side
Creating an environment running lakeFS on an Azure Container App connected to Azure Postgres
Will update you in a few hours
h
Thanks. It took us some time before it happened.
using rclone with 64 threads instead of 32 seems to make it happen faster
g
Thanks!
How many commits did it take for it to happen? (roughly)
h
I would say under 100
🙏 1
a couple of tens maybe?
knowing that each of my commits has a couple of tens of thousands of small files
we are using the container image v0.100.0
g
Hi @HT, I wasn’t able to reproduce yet, I will give it another shot tomorrow and update you
👍 1
h
Even with 2 replicas we still have this stalling happening after 3h30 ....
g
Hi @HT, Sorry but I didn’t test it yet, I will test it and update you on Sunday
h
Thank you for all your effort !!
Here are some specs of our Container App:
Copy code
"configuration": {
            "secrets": null,
            "activeRevisionsMode": "Single",
            "ingress": {
                "fqdn": "<http://ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io|ti-dev-aca-03-12-lakefs.proudplant-6acee3df.australiaeast.azurecontainerapps.io>",
                "external": true,
                "targetPort": 8000,
                "exposedPort": 0,
                "transport": "Auto",
                "traffic": [
                    {
                        "weight": 100,
                        "latestRevision": true
                    }
                ],
                "customDomains": null,
                "allowInsecure": false,
                "ipSecurityRestrictions": null,
                "corsPolicy": null,
                "clientCertificateMode": null,
                "stickySessions": null
            },
            "registries": null,
            "dapr": null,
            "maxInactiveRevisions": null,
            "service": null
        },
        "template": {
            "revisionSuffix": "",
            "containers": [
                {
                    "image": "treeverse/lakefs:0.100.0",
                    "name": "lakefs-server",
                    "env": [
                        {
                            "name": "LAKEFS_DATABASE_TYPE",
                            "value": "postgres"
                        },
                        {
                            "name": "LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING",
                            "value": "postgres://************@ti-dev-postgres-03-12-lakefs.postgres.database.azure.com:5432/lakefs?sslmode=require"
                        },
                        {
                            "name": "LAKEFS_AUTH_ENCRYPT_SECRET_KEY",
                            "value": "************************"
                        },
                        {
                            "name": "LAKEFS_BLOCKSTORE_TYPE",
                            "value": "azure"
                        },
                        {
                            "name": "LAKEFS_BLOCKSTORE_AZURE_STORAGE_ACCOUNT",
                            "value": "<https://tidevstorage0312lakefs.blob.core.windows.net/>"
                        },
                        {
                            "name": "LAKEFS_DATABASE_POSTGRES_MAX_OPEN_CONNECTIONS",
                            "value": "200"
                        },
                        {
                            "name": "LAKEFS_DATABASE_POSTGRES_MAX_IDLE_CONNECTIONS",
                            "value": "200"
                        },
                        {
                            "name": "LAKEFS_LOGGING_LEVEL",
                            "value": "TRACE"
                        }
                    ],
                    "resources": {
                        "cpu": 1,
                        "memory": "2Gi",
                        "ephemeralStorage": "4Gi"
                    },
                    "probes": []
                }
            ],
            "initContainers": null,
            "scale": {
                "minReplicas": 1,
                "maxReplicas": 4,
                "rules": [
                    {
                        "name": "cpuscalerule",
                        "custom": {
                            "type": "cpu",
                            "metadata": {
                                "type": "Utilization",
                                "value": "60"
                            }
                        }
                    }
                ]
            },
            "volumes": null,
            "serviceBinds": null
        },
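(The spec above can be re-dumped on demand; a sketch with the app name taken from the hostname in the logs and a placeholder resource group.)
Copy code
# Dump the Container App definition, including the env and scale settings shown above
az containerapp show \
  --name ti-dev-aca-03-12-lakefs \
  --resource-group <resource-group> \
  --output json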
Our database spec:
Copy code
{
  "sku": {
    "name": "Standard_D2s_v3",
    "tier": "GeneralPurpose"
  },
  "systemData": {
    "createdAt": "2023-05-19T03:58:00.7694907Z"
  },
  "properties": {
    "authConfig": {
      "activeDirectoryAuth": "Disabled",
      "passwordAuth": "Enabled"
    },
    "dataEncryption": {
      "type": "SystemManaged"
    },
    "fullyQualifiedDomainName": "<http://ti-dev-postgres-03-12-lakefs.postgres.database.azure.com|ti-dev-postgres-03-12-lakefs.postgres.database.azure.com>",
    "version": "14",
    "minorVersion": "7",
    "administratorLogin": "******",
    "state": "Ready",
    "availabilityZone": "1",
    "storage": {
      "storageSizeGB": 256
    },
    "backup": {
      "backupRetentionDays": 7,
      "geoRedundantBackup": "Disabled",
      "earliestRestoreDate": "2023-05-27T04:06:03.1274861+00:00"
    },
    "network": {
      "publicNetworkAccess": "Enabled"
    },
    "highAvailability": {
      "mode": "Disabled",
      "state": "NotEnabled"
    },
    "maintenanceWindow": {
      "customWindow": "Disabled",
      "dayOfWeek": 0,
      "startHour": 0,
      "startMinute": 0
    },
    "replicationRole": "Primary",
    "replicaCapacity": 5
  },
  "location": "Australia East",
  "id": "********",
  "name": "ti-dev-postgres-03-12-lakefs",
  "type": "Microsoft.DBforPostgreSQL/flexibleServers"
}
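(And the equivalent sketch for pulling the Flexible Server spec above; the resource group is again a placeholder.)
Copy code
# Dump the PostgreSQL Flexible Server definition shown above
az postgres flexible-server show \
  --name ti-dev-postgres-03-12-lakefs \
  --resource-group <resource-group> \
  --output json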
Also, in NZ, this coming Monday is a public holiday for us 😉
so no rush 😄
🙏 1
i
Enjoy the long weekend. Thanks for that context!
👍 1
h
I tried to reproduce this issue using a lakeFS server inside the demo container, on one of our on-prem machines: I cannot reproduce the issue. Must be related to networking somewhere in Azure ...