# help
y
Hi all, I have this simple file stat call
```
file_info = the_lake._client.objects_api.stat_object(repository=repository, ref=branch, path=location)
```
where `the_lake._client` is the lakeFS Python client, and I get a constant error:
```
MaxRetryError: HTTPConnectionPool(host='localhost', port=8989): Max retries exceeded with url: /api/v1/repositories/sem-open-alex-kg/refs/main/objects/stat?path=works/rks_updated_date%3D2023-09-24_part_029.trig (Caused by ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)))
```
My lakeFS server is running in k8s. I also increased the committed cache storage space, ... but the server fails silently.
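For context, the call above boils down to something like this - a minimal sketch using the generated lakefs_sdk client (the `_client` attribute exposes its ObjectsApi); the credentials are placeholders and the host is my port-forwarded endpoint:
```python
# Minimal sketch of the stat call above using the generated lakefs_sdk client.
# The host points at the port-forwarded server; credentials are placeholders.
import lakefs_sdk

configuration = lakefs_sdk.Configuration(
    host="http://localhost:8989/api/v1",
    username="<access-key-id>",       # placeholder
    password="<secret-access-key>",   # placeholder
)

with lakefs_sdk.ApiClient(configuration) as api_client:
    objects = lakefs_sdk.ObjectsApi(api_client)
    file_info = objects.stat_object(
        repository="sem-open-alex-kg",
        ref="main",
        path="works/rks_updated_date=2023-09-24_part_029.trig",
    )
    print(file_info.size_bytes, file_info.checksum)
```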
o
From the error message, it seems like the client is attempting to connect to a lakeFS server on `localhost:8989` - is the client also running on K8s? Communication between containers almost never goes through localhost...
y
Yeah, I have port-forwarded to the pod.
I can make requests to other files and get the expected response on this setup; just this one fails.
I am running the server with trace-level logs now to see if there's anything useful logged.
I was able to locate the file:
```
/data/sem-open-alex-kg/data $ ls -al ghhngfq95o94e9k8vht0/cnlor0i95o94e9k8vi6g
-rw-r--r--    1 lakefs   lakefs   2653159424 Mar  8 22:17 ghhngfq95o94e9k8vht0/cnlor0i95o94e9k8vi6g
```
and the size in bytes seems to be the same as the bytes my client was expecting:
```
WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection broken: IncompleteRead(2653159424 bytes read, 178329620 more expected)', IncompleteRead(2653159424 bytes read, 178329620 more expected))': /api/v1/repositories/sem-open-alex-kg/refs/main/objects?path=works/rks_updated_date%3D2023-09-24_part_029.trig
```
I don't know why the client expects there's more data coming?
Here's one of the trace logs from lakeFS for one of the failed requests as well:
```
time="2024-04-23T18:00:19Z" level=debug msg="performing API action" func="pkg/api.(*Controller).LogAction" file="build/pkg/api/controller.go:5171" class=api_server client=lakefs-python-sdk/1.8.0 host="localhost:8989" method=GET name=get_object operation_id=GetObject path="/api/v1/repositories/sem-open-alex-kg/refs/main/objects?path=works/rks_updated_date%3D2023-09-24_part_029.trig" ref=main repository=sem-open-alex-kg request_id=52be71d8-bbea-43e5-ba69-6c6367211986 service=api_gateway service_name=rest_api source_ip="127.0.0.1:59904" source_ref="" user=admin user_id=admin
time="2024-04-23T18:00:19Z" level=trace msg="dispatched execute result" func="pkg/batch.(*Executor).Run.func2" file="build/pkg/batch/executor.go:148" key="GetRepository:sem-open-alex-kg" waiters=1
time="2024-04-23T18:00:19Z" level=trace msg="dispatched execute result" func="pkg/batch.(*Executor).Run.func2" file="build/pkg/batch/executor.go:148" key="GetBranch:sem-open-alex-kg:main" waiters=1
time="2024-04-23T18:00:19Z" level=trace msg="opened locally" func="pkg/pyramid.(*TierFS).Open" file="build/pkg/pyramid/tier_fs.go:234" filename=8051f568642a632184732127b01ee9c0534245049de8b1f9ed27d86d861ad00e host="localhost:8989" method=GET module=pyramid namespace="<local://sem-open-alex-kg>" ns_path=sem-open-alex-kg/ operation_id=GetObject path="/api/v1/repositories/sem-open-alex-kg/refs/main/objects?path=works/rks_updated_date%3D2023-09-24_part_029.trig" request_id=52be71d8-bbea-43e5-ba69-6c6367211986 service_name=rest_api user=admin
time="2024-04-23T18:00:19Z" level=trace msg="opened locally" func="pkg/pyramid.(*TierFS).Open" file="build/pkg/pyramid/tier_fs.go:234" filename=cc0ac558d6f10a54aa8275ac01f0be5293754a0929a49785305fb1eaf5efedd3 host="localhost:8989" method=GET module=pyramid namespace="<local://sem-open-alex-kg>" ns_path=sem-open-alex-kg/ operation_id=GetObject path="/api/v1/repositories/sem-open-alex-kg/refs/main/objects?path=works/rks_updated_date%3D2023-09-24_part_029.trig" request_id=52be71d8-bbea-43e5-ba69-6c6367211986 service_name=rest_api user=admin
time="2024-04-23T18:00:19Z" level=trace msg="get repo sem-open-alex-kg ref main path works/rks_updated_date=2023-09-24_part_029.trig: &{CommonLevel:false Path:works/rks_updated_date=2023-09-24_part_029.trig PhysicalAddress:data/ghhngfq95o94e9k8vht0/cnlor0i95o94e9k8vi6g CreationDate:2024-03-08 22:17:50.787824751 +0000 UTC Size:2831489044 Checksum:6c0ede77c9bba3f2565dbc22f4ace652 Metadata:map[] Expired:false AddressType:1 ContentType:application/octet-stream}" func="pkg/logging.(*logrusEntryWrapper).Tracef" file="build/pkg/logging/logger.go:304" service=api_gateway
time="2024-04-23T18:00:48Z" level=trace msg="[DEBUG] POST <https://stats.lakefs.io/events>" func="pkg/logging.(*logrusEntryWrapper).Tracef" file="build/pkg/logging/logger.go:304" service=stats_collector
```
It opens two files (?), which seems weird (?)...
I just checked another file that works; it seems like the two "opened locally" messages are there for that one too...
o
Pyramid would typically cache metaranges and ranges on local storage, so it's probably not related to the actual file being read. The trace you sent is for a GetObject operation, not a stat object. This is a big file too - is there a chance a proxy along the way is cutting the request midway due to size?
y
I don't think so (?), the port-forward AFAIK should just stream (?)
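One way to sanity-check that might be to stream the object over the same port-forward outside the SDK and count the bytes that actually arrive - a stdlib-only sketch, with placeholder credentials:
```python
# Stream the object through the same port-forwarded endpoint and count the
# bytes received, bypassing the SDK's retries. Credentials are placeholders.
import base64
import urllib.request

url = (
    "http://localhost:8989/api/v1/repositories/sem-open-alex-kg/refs/main/objects"
    "?path=works/rks_updated_date%3D2023-09-24_part_029.trig"
)
token = base64.b64encode(b"<access-key-id>:<secret-access-key>").decode()
request = urllib.request.Request(url, headers={"Authorization": "Basic " + token})

received = 0
with urllib.request.urlopen(request) as response:
    while chunk := response.read(1 << 20):  # 1 MiB at a time
        received += len(chunk)
print(received)  # compare against the 2831489044 bytes the server advertises
```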
You are correct, sir, I did send logs for GET. I did the stat call and it works properly, except the result is weird:
```
2831489044
```
Is that supposed to be equal to the actual file size?
🤘 1
and the diff between the size reported by the lakeFS server and the actual size on disk is exactly the number of bytes the client reports missing
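Putting the numbers from the stat result, the on-disk file, and the IncompleteRead warning side by side:
```python
# The three sizes that appear in this thread, side by side.
stat_size = 2_831_489_044   # size_bytes reported by stat_object / the trace log
disk_size = 2_653_159_424   # size of the physical object on the local storage
missing   = 178_329_620     # "more expected" in urllib3's IncompleteRead warning

assert stat_size - disk_size == missing
print(stat_size - disk_size)  # 178329620 - the gap is exactly the missing bytes
```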
o
How was this file written to lakeFS?
y
I think it was via the S3 interface.
o
I wonder if the file on disk is intact but the lakeFS metadata is wrong, or if the file somehow got cut midway and the metadata is correct.
y
Is there a way to fix it? Should I try uploading the file again?
Also, what would be a good way to avoid it?
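If re-uploading ends up being the fix, one cheap guard might be to compare the local file size against what stat_object reports right after the upload - a sketch with placeholder credentials and a placeholder local path:
```python
# After re-uploading, verify that lakeFS reports the same size as the local file.
import os

import lakefs_sdk

configuration = lakefs_sdk.Configuration(
    host="http://localhost:8989/api/v1",
    username="<access-key-id>",       # placeholder
    password="<secret-access-key>",   # placeholder
)
local_path = "/local/works/rks_updated_date=2023-09-24_part_029.trig"  # placeholder

with lakefs_sdk.ApiClient(configuration) as api_client:
    stat = lakefs_sdk.ObjectsApi(api_client).stat_object(
        repository="sem-open-alex-kg",
        ref="main",
        path="works/rks_updated_date=2023-09-24_part_029.trig",
    )

local_size = os.path.getsize(local_path)
if stat.size_bytes != local_size:
    raise RuntimeError(
        f"size mismatch: lakeFS reports {stat.size_bytes}, local file is {local_size}"
    )
```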
o
I’ve never seen this happen. In fact, it should never happen. Can you tell what the file size should be? Let's figure out if the write failed or the metadata is somehow wrong.
y
Ah OK, I have to contact the group which uploaded it in the first place and get the actual size. It might be a minute before I get answers back, but I will be back on this thread 🙂
👍 1
Hi @Oz Katz, I have the actual file size: it's about 3.84 GB, which is almost a GB more than what's in the metadata store ...
o
Thanks - can you please share the following:
1. What underlying storage are you using? A local directory? An S3 bucket? Something else?
2. Was the file uploaded using the S3 Gateway (i.e. boto3 or something similar), or was it imported?
y
Hi @Oz Katz, the underlying storage is a local directory (an NFS mount, to be more specific) and the file was uploaded using the AWS CLI:
```
aws s3 cp /local/works s3://sem-open-alex-kg/main/works --endpoint="https://...." --profile lakefs --recursive
```
and the logs from the aws process look like the attached image. But one thing: in the logs the AWS CLI reports 1.7 TiB (which I think is about 1.8 TB?) of data, and the local volume on the server is about 1.5 TB in size. Not sure if that would be relevant (I would assume it would just error out if there were no disk space)..
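It might also help to compare what the S3 Gateway itself reports for this object (HeadObject) against the stat result from the API - a boto3 sketch, with a placeholder endpoint and placeholder credentials:
```python
# Compare the size the S3 Gateway reports (HeadObject) with the size from the
# lakeFS API stat call. Endpoint URL and credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://<lakefs-s3-gateway-endpoint>",  # placeholder
    aws_access_key_id="<access-key-id>",                  # placeholder
    aws_secret_access_key="<secret-access-key>",          # placeholder
)

head = s3.head_object(
    Bucket="sem-open-alex-kg",
    Key="main/works/rks_updated_date=2023-09-24_part_029.trig",
)
print(head["ContentLength"], head["ETag"])  # compare against stat_object's size_bytes / checksum
```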