Hello and thanks for the support! This is a new issue, separate from my previous thread. We are using a fully Azure-based deployment, with CosmosDB as the K/V store, and after running large imports we are hitting the attached read error in Spark. We also hit this error once on a rerun of the pipeline:
com.databricks.sql.ui.FileReadException: Error while reading file lakefs://<path>.c000.snappy.parquet. java.io.EOFException
I looked through the logs, and there are many messages like this mentioning a "Not Found" error:
time="2023-10-05T23:18:55Z" level=debug msg="Not found" func="github.com/treeverse/lakefs/pkg/api.(*Controller).handleAPIErrorCallback" file="/lakeFS/pkg/api/controller.go:2068" error="not found" host="<ourdomain>:8000" method=GET operation_id=StatObject path="/api/v1/repositories/test1/refs/E2E_20231005110009958/objects/stat?path=providerfeeds%2Ftransportation&user_metadata=false&presign=false" request_id=14e66e44-a26a-4b75-9d17-c4040636b062 service=api_gateway service_name=rest_api user=admin
Hey James, during a Spark run many executors are scheduled, and they have some non-trivial ways of communicating with one another and checking state. They keep creating and deleting markers and use
to check if they exist. So during normal operation of Spark, `404`s are expected. The error in the attached image looks like the actual problem. Is it a Delta table? Is it imported? Do you know if the imported data was changed outside of lakeFS? (lakeFS keeps pointers to the imported data; if it was moved, changed, or deleted at the origin, lakeFS will not be able to manage it.) Generally, every piece of information about the data would help us reproduce and fix this:
1. Is the storage namespace of the repo gen v1/gen v2?
2. Same question for the imported data.
3. Format that is being read and written.
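Since the expected `404`s can drown out everything else in the lakeFS log, a small filter can help when scanning it. The following is only a sketch: it assumes the log lines are in the `key=value` format shown earlier in this thread, and the function names (`parse_logfmt`, `is_expected_probe`, `unexpected_lines`) are made up for illustration. It treats a line as an expected probe only when it is a debug-level `StatObject` call that ended in "not found".

```python
import shlex


def parse_logfmt(line):
    """Split a key=value log line into a dict; quoted values may contain spaces."""
    fields = {}
    # shlex.split raises ValueError on unbalanced quotes; callers may want to catch it.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def is_expected_probe(fields):
    """True for debug-level StatObject lookups that simply found nothing."""
    return (
        fields.get("level") == "debug"
        and fields.get("operation_id") == "StatObject"
        and fields.get("error") == "not found"
    )


def unexpected_lines(lines):
    """Keep only the lines that are not routine existence checks."""
    return [line for line in lines if not is_expected_probe(parse_logfmt(line))]
```

Running `unexpected_lines` over the log should leave only the entries worth reading by hand.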
Hi Itai, thanks for your reply. This error is occurring when trying to read data that has been imported successfully.
1. The import is happening within the same container in gen 2 storage, from a different folder. The storage namespace used in the lakeFS repo looks like this: https://domain.blob.core.windows.net/test1/
2. Currently this data is being imported from this same container.
3. The format of the data is geoparquet. It is first being copied into the Azure storage like so: a.
importresponse = client.import_api.import_start(
    repository=repository,
    branch=branch,
    import_creation=ImportCreation(
        paths=[
            ImportLocation(
                type="common_prefix",
                path=softlink_import_path,
                destination=dest,
            ),
        ],
        commit=CommitCreation(
            message="Uploading new data version for {} into lakefs".format(feedname),
            metadata={"using": "import_api.import_start"},
        ),
    ),
)
This error is occurring when running this line:
df = sedona.read.parquet("lakefs://test1/E2E_20231006043002742/providerfeeds/admins/")
The read works fine on the original data location. The failure is consistently a
file, although I can read it fine via Postman and it is there in the UI.
Our code runs successfully once, but after re-running our pipeline it consistently fails until we restart the service.
This FileReadException java.io.EOFException is also still occurring on some reruns.
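One quick way to narrow down an `EOFException` like this is to check whether the bytes actually served for the failing object form a complete Parquet file. A minimal sketch, assuming you have already downloaded the object's bytes by whatever means you prefer: a plain (unencrypted) Parquet file starts and ends with the 4-byte magic `PAR1`, so an empty, truncated, or non-Parquet object fails this check. The helper names are hypothetical.

```python
PARQUET_MAGIC = b"PAR1"
# Smallest possible layout: leading magic + 4-byte footer length + trailing magic.
MIN_PARQUET_SIZE = 12


def looks_like_parquet(data: bytes) -> bool:
    """True if the bytes have the Parquet magic at both ends."""
    return (
        len(data) >= MIN_PARQUET_SIZE
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )


def describe(data: bytes) -> str:
    """Give a short, human-readable verdict for a downloaded object."""
    if len(data) == 0:
        return "empty object"
    if looks_like_parquet(data):
        return "has Parquet magic at both ends"
    if data[:4] == PARQUET_MAGIC:
        return "starts like Parquet but the tail is missing or damaged"
    return "not a Parquet file"
```

This only inspects the framing bytes, so a file that passes can still be corrupt inside; it is meant to separate "short or wrong object" from deeper reader problems.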
Thank you James, the debug info is super helpful. 1. Can you please confirm that the imported data wasn’t removed from the origin, i.e. from the
in 3b. 2. Is
file of size 0? I suspect it’s a different occurrence of 6718, although you seem to be reading from the lakeFS API, and not the lakeFS S3 gateway.
3. As a possible workaround (and valuable debug info), can you retry the above using the lakeFS S3 gateway? It would mean replacing the failing line with
df = sedona.read.parquet("s3a://test1/E2E_20231006043002742/providerfeeds/admins/")
a. It would mean you would have to configure Spark to work with the s3a scheme by setting the endpoint, access and secret keys. Configuration settings described here
1. Confirming yes, no changes are made to imported data origin. 2. It is not size 0, it is usually around 200B and holds just data like this:
3. Will do this and let you know shortly, thanks. Also, we managed to change Spark settings to get rid of the _committed file entirely; I'll let you know if this fixes the issues.
Thank you! We'll look deeper and send an update by early next week
Hi Itai, we are using ADLS Gen2 blob storage as our storage backend. Would we be able to use the s3a endpoint? We are currently getting Access Denied. Also, we are now getting this error after removing _committed:
org.apache.spark.SparkException: Exception thrown in awaitResult: null
Yes, you would be able to use the Spark `s3a` scheme with lakeFS through its S3 gateway. The S3 gateway is the layer in lakeFS responsible for compatibility with S3. It implements a compatible subset of the S3 API to ensure most data systems can use lakeFS as a drop-in replacement for S3. You can set it like this:
• `fs.s3a.access.key`: lakeFS access key
• `fs.s3a.secret.key`: lakeFS secret key
• `fs.s3a.endpoint`: lakeFS S3-compatible API endpoint (e.g. https://example-org.us-east-1.lakefscloud.io)
• `fs.s3a.path.style.access`:
Then, every data access using the `s3a` scheme would reach your lakeFS instance's S3 gateway.
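The settings above could be applied from PySpark roughly as follows. This is a sketch under a few assumptions: the helper names are hypothetical, the `spark.hadoop.` prefix is the standard way to pass Hadoop options through Spark configuration, and `path.style.access` is set to `"true"` because S3-compatible endpoints usually need it (the value is not spelled out in the message above). On a managed platform where a session already exists, these options may need to go into the cluster configuration instead.

```python
def lakefs_s3a_options(endpoint, access_key, secret_key):
    """Build the Spark options that point the s3a scheme at a lakeFS S3 gateway."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Assumption: path-style addressing, as commonly required by S3-compatible services.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def build_session(options, app_name="lakefs-s3a-check"):
    """Create (or reuse) a SparkSession with the given options applied."""
    # Imported lazily so lakefs_s3a_options can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With a session built this way, the read from earlier in the thread would look like `build_session(opts).read.parquet("s3a://test1/E2E_20231006043002742/providerfeeds/admins/")`, where `opts` comes from `lakefs_s3a_options` with your own lakeFS endpoint and keys.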