# help
y
Hi all, I have this simple file stat call
```
file_info = the_lake._client.objects_api.stat_object(repository=repository, ref=branch, path=location)
```
where `the_lake._client` is the lakeFS Python client, and I get a constant error:
```
MaxRetryError: HTTPConnectionPool(host='localhost', port=8989): Max retries exceeded with url: /api/v1/repositories/sem-open-alex-kg/refs/main/objects/stat?path=works/rks_updated_date%3D2023-09-24_part_029.trig (Caused by ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)))
```
My lakeFS server is running in k8s. I also increased the committed cache storage space, ... but the server fails silently.
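For context, the call above boils down to something like this - a minimal sketch using the generated lakefs_sdk client (the `_client` attribute exposes its ObjectsApi); the credentials are placeholders and the host is my port-forwarded endpoint:
```python
# Minimal sketch of the stat call above using the generated lakefs_sdk client.
# The host points at the port-forwarded server; credentials are placeholders.
import lakefs_sdk

configuration = lakefs_sdk.Configuration(
    host="http://localhost:8989/api/v1",
    username="<access-key-id>",       # placeholder
    password="<secret-access-key>",   # placeholder
)

with lakefs_sdk.ApiClient(configuration) as api_client:
    objects = lakefs_sdk.ObjectsApi(api_client)
    file_info = objects.stat_object(
        repository="sem-open-alex-kg",
        ref="main",
        path="works/rks_updated_date=2023-09-24_part_029.trig",
    )
    print(file_info.size_bytes, file_info.checksum)
```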
o
From the error message, it seems like the client is attempting to connect to a lakeFS server on `localhost:8989` - is the client also running on K8s? Communication between containers almost never goes through localhost...
y
Yeah, I have port-forwarded to the pod.
I can make requests to other files and get the expected response on this setup; just this one fails.
I am running the server with trace-level logs now to see if there's anything useful logged.
I was able to locate the file:
```
/data/sem-open-alex-kg/data $ ls -al ghhngfq95o94e9k8vht0/cnlor0i95o94e9k8vi6g
-rw-r--r--    1 lakefs   lakefs   2653159424 Mar  8 22:17 ghhngfq95o94e9k8vht0/cnlor0i95o94e9k8vi6g
```
and the size in bytes seems to be the same as the bytes my client was expecting:
```
WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection broken: IncompleteRead(2653159424 bytes read, 178329620 more expected)', IncompleteRead(2653159424 bytes read, 178329620 more expected))': /api/v1/repositories/sem-open-alex-kg/refs/main/objects?path=works/rks_updated_date%3D2023-09-24_part_029.trig
```
I don't know why the client expects there's more data coming?
Here's one of the trace logs from lakeFS for one of the failed requests as well:
```
time="2024-04-23T18:00:19Z" level=debug msg="performing API action" func="pkg/api.(*Controller).LogAction" file="build/pkg/api/controller.go:5171" class=api_server client=lakefs-python-sdk/1.8.0 host="localhost:8989" method=GET name=get_object operation_id=GetObject path="/api/v1/repositories/sem-open-alex-kg/refs/main/objects?path=works/rks_updated_date%3D2023-09-24_part_029.trig" ref=main repository=sem-open-alex-kg request_id=52be71d8-bbea-43e5-ba69-6c6367211986 service=api_gateway service_name=rest_api source_ip="127.0.0.1:59904" source_ref="" user=admin user_id=admin
time="2024-04-23T18:00:19Z" level=trace msg="dispatched execute result" func="pkg/batch.(*Executor).Run.func2" file="build/pkg/batch/executor.go:148" key="GetRepository:sem-open-alex-kg" waiters=1
time="2024-04-23T18:00:19Z" level=trace msg="dispatched execute result" func="pkg/batch.(*Executor).Run.func2" file="build/pkg/batch/executor.go:148" key="GetBranch:sem-open-alex-kg:main" waiters=1
time="2024-04-23T18:00:19Z" level=trace msg="opened locally" func="pkg/pyramid.(*TierFS).Open" file="build/pkg/pyramid/tier_fs.go:234" filename=8051f568642a632184732127b01ee9c0534245049de8b1f9ed27d86d861ad00e host="localhost:8989" method=GET module=pyramid namespace="<local://sem-open-alex-kg>" ns_path=sem-open-alex-kg/ operation_id=GetObject path="/api/v1/repositories/sem-open-alex-kg/refs/main/objects?path=works/rks_updated_date%3D2023-09-24_part_029.trig" request_id=52be71d8-bbea-43e5-ba69-6c6367211986 service_name=rest_api user=admin
time="2024-04-23T18:00:19Z" level=trace msg="opened locally" func="pkg/pyramid.(*TierFS).Open" file="build/pkg/pyramid/tier_fs.go:234" filename=cc0ac558d6f10a54aa8275ac01f0be5293754a0929a49785305fb1eaf5efedd3 host="localhost:8989" method=GET module=pyramid namespace="<local://sem-open-alex-kg>" ns_path=sem-open-alex-kg/ operation_id=GetObject path="/api/v1/repositories/sem-open-alex-kg/refs/main/objects?path=works/rks_updated_date%3D2023-09-24_part_029.trig" request_id=52be71d8-bbea-43e5-ba69-6c6367211986 service_name=rest_api user=admin
time="2024-04-23T18:00:19Z" level=trace msg="get repo sem-open-alex-kg ref main path works/rks_updated_date=2023-09-24_part_029.trig: &{CommonLevel:false Path:works/rks_updated_date=2023-09-24_part_029.trig PhysicalAddress:data/ghhngfq95o94e9k8vht0/cnlor0i95o94e9k8vi6g CreationDate:2024-03-08 22:17:50.787824751 +0000 UTC Size:2831489044 Checksum:6c0ede77c9bba3f2565dbc22f4ace652 Metadata:map[] Expired:false AddressType:1 ContentType:application/octet-stream}" func="pkg/logging.(*logrusEntryWrapper).Tracef" file="build/pkg/logging/logger.go:304" service=api_gateway
time="2024-04-23T18:00:48Z" level=trace msg="[DEBUG] POST <https://stats.lakefs.io/events>" func="pkg/logging.(*logrusEntryWrapper).Tracef" file="build/pkg/logging/logger.go:304" service=stats_collector
```
It opens two files (?), which seems weird (?)...
I just checked another file that works; it seems like the two "opened locally" messages are there for that one too...
o
Pyramid would typically cache metaranges and ranges on local storage, so it's probably not related to the actual file being read. The trace you sent is for a GetObject operation, not a stat object. This is a big file too - is there a chance a proxy along the way is cutting the request midway due to size?
y
I don't think so (?), the port-forward AFAIK should just stream (?)
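One way to sanity-check that might be to stream the object over the same port-forward outside the SDK and count the bytes that actually arrive - a stdlib-only sketch, with placeholder credentials:
```python
# Stream the object through the same port-forwarded endpoint and count the
# bytes received, bypassing the SDK's retries. Credentials are placeholders.
import base64
import urllib.request

url = (
    "http://localhost:8989/api/v1/repositories/sem-open-alex-kg/refs/main/objects"
    "?path=works/rks_updated_date%3D2023-09-24_part_029.trig"
)
token = base64.b64encode(b"<access-key-id>:<secret-access-key>").decode()
request = urllib.request.Request(url, headers={"Authorization": "Basic " + token})

received = 0
with urllib.request.urlopen(request) as response:
    while chunk := response.read(1 << 20):  # 1 MiB at a time
        received += len(chunk)
print(received)  # compare against the 2831489044 bytes the server advertises
```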
You are correct, sir, I did send logs for GET. I did the stat call and it works properly, except the result is weird:
```
2831489044
```
Is that supposed to be equal to the actual file size?
🤘 1
and the diff between the size reported by the lakeFS server and the actual size on disk is exactly the number of bytes the client reports missing
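Putting the numbers from the stat result, the on-disk file, and the IncompleteRead warning side by side:
```python
# The three sizes that appear in this thread, side by side.
stat_size = 2_831_489_044   # size_bytes reported by stat_object / the trace log
disk_size = 2_653_159_424   # size of the physical object on the local storage
missing   = 178_329_620     # "more expected" in urllib3's IncompleteRead warning

assert stat_size - disk_size == missing
print(stat_size - disk_size)  # 178329620 - the gap is exactly the missing bytes
```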
o
How was this file written to lakeFS?
y
I think it was via the S3 interface.
o
I wonder if the file on disk is intact but the lakeFS metadata is wrong, or if the file somehow got cut midway and the metadata is correct.
y
Is there a way to fix it? Should I try uploading the file again?
Also, what would be a good way to avoid it?
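If re-uploading ends up being the fix, one cheap guard might be to compare the local file size against what stat_object reports right after the upload - a sketch with placeholder credentials and a placeholder local path:
```python
# After re-uploading, verify that lakeFS reports the same size as the local file.
import os

import lakefs_sdk

configuration = lakefs_sdk.Configuration(
    host="http://localhost:8989/api/v1",
    username="<access-key-id>",       # placeholder
    password="<secret-access-key>",   # placeholder
)
local_path = "/local/works/rks_updated_date=2023-09-24_part_029.trig"  # placeholder

with lakefs_sdk.ApiClient(configuration) as api_client:
    stat = lakefs_sdk.ObjectsApi(api_client).stat_object(
        repository="sem-open-alex-kg",
        ref="main",
        path="works/rks_updated_date=2023-09-24_part_029.trig",
    )

local_size = os.path.getsize(local_path)
if stat.size_bytes != local_size:
    raise RuntimeError(
        f"size mismatch: lakeFS reports {stat.size_bytes}, local file is {local_size}"
    )
```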
o
I’ve never seen this happen. In fact, it should never happen. Can you tell what the file size should be? Let's figure out if the write failed or the metadata is somehow wrong.
y
Ah OK, I have to contact the group which uploaded it in the first place and get the actual size. It might be a minute before I get answers back, but I will be back on this thread 🙂
👍 1
Hi @Oz Katz, I have the actual file size: it's about 3.84 GB, which is almost a GB more than what's in the metadata store ...
o
Thanks - can you please share the following:
1. What underlying storage are you using? A local directory? An S3 bucket? Something else?
2. Was the file uploaded using the S3 Gateway (i.e. boto3 or something similar), or was it imported?
y
Hi @Oz Katz, the underlying storage is a local directory (an NFS mount, to be more specific) and the file was uploaded using the AWS CLI:
```
aws s3 cp /local/works s3://sem-open-alex-kg/main/works --endpoint="https://...." --profile lakefs --recursive
```
and the logs from the aws process look like the attached image. But one thing: in the logs the AWS CLI reports 1.7 TiB (which I think is about 1.8 TB?) of data, and the local volume on the server is about 1.5 TB in size. Not sure if that would be relevant (I would assume it would just error out if there were no disk space)..
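It might also help to compare what the S3 Gateway itself reports for this object (HeadObject) against the stat result from the API - a boto3 sketch, with a placeholder endpoint and placeholder credentials:
```python
# Compare the size the S3 Gateway reports (HeadObject) with the size from the
# lakeFS API stat call. Endpoint URL and credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://<lakefs-s3-gateway-endpoint>",  # placeholder
    aws_access_key_id="<access-key-id>",                  # placeholder
    aws_secret_access_key="<secret-access-key>",          # placeholder
)

head = s3.head_object(
    Bucket="sem-open-alex-kg",
    Key="main/works/rks_updated_date=2023-09-24_part_029.trig",
)
print(head["ContentLength"], head["ETag"])  # compare against stat_object's size_bytes / checksum
```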