James Daus
10/05/2023, 11:34 PM
com.databricks.sql.ui.FileReadException: Error while reading file lakefs://<path>.c000.snappy.parquet. java.io.EOFException
I looked through the logs, and there are many messages like this mentioning a "Not Found" error:
time="2023-10-05T23:18:55Z" level=debug msg="Not found" func="<http://github.com/treeverse/lakefs/pkg/api.(*Controller).handleAPIErrorCallback|github.com/treeverse/lakefs/pkg/api.(*Controller).handleAPIErrorCallback>" file="/lakeFS/pkg/api/controller.go:2068" error="not found" host="<ourdomain>:8000" method=GET operation_id=StatObject path="/api/v1/repositories/test1/refs/E2E_20231005110009958/objects/stat?path=providerfeeds%2Ftransportation&user_metadata=false&presign=false" request_id=14e66e44-a26a-4b75-9d17-c4040636b062 service=api_gateway service_name=rest_api user=admin
Itai Admi
10/06/2023, 3:00 PM
Those "Not found" entries are expected: Spark calls StatObject on paths to check whether they exist, so during normal operation of Spark `404`s are expected.
The error in the attached image looks like the problem. Is it a Delta table? Is it imported? Do you know if the imported data was changed outside of lakeFS? (lakeFS keeps pointers to the imported data; if it was moved/changed/deleted in the origin location, lakeFS will not be able to manage it.)
Generally every piece of information on the data would help us to reproduce it and fix:
1. Is the storage namespace of the repo gen v1/gen v2?
2. Same question for the imported data.
3. Format that is being read and written.
James Daus
10/06/2023, 4:24 PM
3. a. df.write.format("geoparquet").mode("overwrite").partitionBy("theme","type").save(output_path)
   b. Note: output_path is the folder we will now import from
4. The import looks like so:
   a. importresponse = client.import_api.import_start(
          repository=repository,
          branch=branch,
          import_creation=ImportCreation(
              paths=[
                  ImportLocation(
                      type="common_prefix",  # common_prefix
                      path=softlink_import_path,
                      destination=dest,
                  ),
              ],
              commit=CommitCreation(
                  message="Uploading new data version for {} into lakefs".format(feedname),
                  metadata={'using': 'import_api.import_start'},
              ),
          ),
      )
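For reference, import_start runs asynchronously, so the branch should only be read after the import completes. Below is a minimal polling sketch, assuming the same lakefs_client Python SDK used above; the import_status call and its id/completed/error fields should be verified against the SDK version in use:

import time

# importresponse is the result of import_api.import_start(...) above.
while True:
    status = client.import_api.import_status(
        repository=repository,
        branch=branch,
        id=importresponse.id,
    )
    if status.error:
        raise RuntimeError("lakeFS import failed: {}".format(status.error))
    if status.completed:
        break
    time.sleep(2)  # wait before polling the import status again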
This error is occurring when running this line: df = sedona.read.parquet("lakefs://test1/E2E_20231006043002742/providerfeeds/admins/")
The read works fine on the original data location. The failure is consistently on a _committed file, which I can read fine via Postman and which is present in the UI.
Itai Admi
10/06/2023, 8:17 PM
1. … output_path in 3b.
2. Is the _committed file of size 0? (See the stat sketch after this message.) I suspect it’s a different occurrence of 6718, although you seem to be reading from the lakeFS API, and not the lakeFS S3 gateway.
3. As a possible workaround (and valuable debug info), can you try retrying the above using the lakeFS S3 gateway? It would mean replacing the failing line with
df = sedona.read.parquet("s3a://test1/E2E_20231006043002742/providerfeeds/admins/")
   a. It would mean you would have to configure Spark to work with the s3a scheme by setting the endpoint, access key and secret key. Configuration settings are described here.
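As an aside, here is a minimal sketch for checking item 2 (the size and contents of the failing _committed object) through the lakeFS API, assuming the same lakefs_client Python SDK; the repository, ref and the _committed_<id> path below are placeholders to substitute with the actual failing object:

# Stat the object to see its size and checksum (placeholder repo/ref/path).
stat = client.objects_api.stat_object(
    repository="test1",
    ref="E2E_20231006043002742",
    path="providerfeeds/admins/_committed_<id>",
)
print(stat.size_bytes, stat.checksum)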
James Daus
10/06/2023, 9:00 PM
2. The _committed file contains: {"added":["<name>.snappy.parquet","<name2>.snappy.parquet"],"removed":[]}
3. Will do this and let you know shortly, thanks
Also, we managed to change Spark settings to get rid of the _committed file entirely; I'll let you know if this fixes the issue.
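James doesn't say which settings were changed. Purely as an illustration, the _committed/_started marker files come from Databricks' DBIO transactional commit protocol, and one commonly cited way to avoid them is to fall back to the open-source Spark commit protocol; the exact keys below are assumptions and should be verified against the Databricks documentation:

# Assumed settings: switch from DBIO back to the OSS Spark commit protocol
# and skip _SUCCESS markers; verify these keys against the Databricks docs.
spark.conf.set(
    "spark.sql.sources.commitProtocolClass",
    "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol",
)
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")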
Itai Admi
10/06/2023, 9:22 PM
James Daus
10/06/2023, 10:08 PM
Retrying via s3a gives: org.apache.spark.SparkException: Exception thrown in awaitResult: null
Itai Admi
10/07/2023, 7:30 PM
Make sure Spark is configured to use the s3a scheme with lakeFS through its S3 gateway. The S3 Gateway is the layer in lakeFS responsible for the compatibility with S3. It implements a compatible subset of the S3 API to ensure most data systems can use lakeFS as a drop-in replacement for S3. You can set it like this:
• `fs.s3a.access.key`: lakeFS access key
• `fs.s3a.secret.key`: lakeFS secret key
• `fs.s3a.endpoint`: lakeFS S3-compatible API endpoint (e.g. https://example-org.us-east-1.lakefscloud.io)
• `fs.s3a.path.style.access`: true
Then, every data access using the s3a scheme, like s3a://repo/branch/some/path, would reach your lakeFS instance's S3 gateway.
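A minimal PySpark sketch of the configuration above, assuming a self-hosted lakeFS endpoint; the https://<ourdomain>:8000 address and the credential placeholders are assumptions to replace with your own values (the read could equally go through sedona.read.parquet):

from pyspark.sql import SparkSession

# Point the s3a filesystem at the lakeFS S3 gateway using the settings above.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs-access-key-id>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs-secret-access-key>")
    .config("spark.hadoop.fs.s3a.endpoint", "https://<ourdomain>:8000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Every s3a://<repo>/<ref>/<path> access now goes through the lakeFS S3 gateway.
df = spark.read.parquet("s3a://test1/E2E_20231006043002742/providerfeeds/admins/")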