# help
j
Hello and thanks for the support! This is a new issue after my previous thread. We are using a fully Azure-based deployment, with CosmosDB for K/V, and after running large imports we are running into the attached read error in Spark. We have also hit the following error once on a rerun of the pipeline:
com.databricks.sql.io.FileReadException: Error while reading file lakefs://<path>.c000.snappy.parquet. java.io.EOFException
I looked through the logs, and there are many messages like this mentioning a "Not Found" error:
time="2023-10-05T23:18:55Z" level=debug msg="Not found" func="github.com/treeverse/lakefs/pkg/api.(*Controller).handleAPIErrorCallback" file="/lakeFS/pkg/api/controller.go:2068" error="not found" host="<ourdomain>:8000" method=GET operation_id=StatObject path="/api/v1/repositories/test1/refs/E2E_20231005110009958/objects/stat?path=providerfeeds%2Ftransportation&user_metadata=false&presign=false" request_id=14e66e44-a26a-4b75-9d17-c4040636b062 service=api_gateway service_name=rest_api user=admin
i
Hey James, during Spark runs, Spark schedules many executors, which have some non-trivial ways of communicating with one another. They keep creating and deleting marker objects and use `StatObject` to check whether they exist, so during normal operation of Spark `404`s are expected. The error in the attached image looks like the actual problem. Is it a Delta table? Is it imported? Do you know if the imported data was changed outside of lakeFS? (lakeFS keeps pointers to the imported data; if it was moved/changed/deleted in its original location, lakeFS will not be able to manage it.) Generally, every piece of information about the data would help us reproduce and fix this:
1. Is the storage namespace of the repo Gen1 or Gen2 storage?
2. Same question for the imported data.
3. What format is being read and written?
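To illustrate, here is a minimal sketch (assuming the lakefs_client Python SDK and placeholder endpoint/credentials) of the kind of existence check that produces those expected 404s:

```python
import lakefs_client
from lakefs_client.client import LakeFSClient
from lakefs_client.exceptions import NotFoundException

# Hypothetical endpoint and credentials -- substitute your own.
configuration = lakefs_client.Configuration(host="https://<ourdomain>:8000")
configuration.username = "<access_key_id>"
configuration.password = "<secret_access_key>"
client = LakeFSClient(configuration)

# Spark executors probe marker objects with StatObject; a missing marker simply
# surfaces as a 404 (the "not found" lines in the debug log), which is harmless.
try:
    client.objects_api.stat_object(
        repository="test1",
        ref="E2E_20231005110009958",
        path="providerfeeds/transportation",  # path taken from the logged request
    )
except NotFoundException:
    pass  # expected during normal Spark operation
```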
j
Hi Itai, thanks for your reply. This error is occurring when trying to read data that has been imported successfully.
1. The import is happening within the same container in Gen2 storage, from a different folder. The storage namespace used in the lakeFS repo looks like this: https://domain.blob.core.windows.net/test1/
2. Currently this data is being imported from this same container.
3. The format of the data is geoparquet. It is first being copied into the Azure storage like so:
   a. `df.write.format("geoparquet").mode("overwrite").partitionBy("theme", "type").save(output_path)`
   b. Note: `output_path` is the folder we will now import from.
4. The import looks like this:
   a.
```python
importresponse = client.import_api.import_start(
    repository=repository,
    branch=branch,
    import_creation=ImportCreation(
        paths=[
            ImportLocation(
                type="common_prefix",  # common_prefix
                path=softlink_import_path,
                destination=dest,
            ),
        ],
        commit=CommitCreation(
            message="Uploading new data version for {} into lakefs".format(feedname),
            metadata={"using": "import_api.import_start"},
        ),
    ),
)
```
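For completeness, a minimal sketch of how the import can be awaited with the same SDK, assuming `import_api.import_status` is available in this client version (variable names follow the snippet above):

```python
import time

# Poll until the import completes (a production version should also inspect
# the returned status for an error field and give up after a timeout).
status = client.import_api.import_status(repository, branch, importresponse.id)
while not status.completed:
    time.sleep(2)
    status = client.import_api.import_status(repository, branch, importresponse.id)
```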
This error is occurring when running this line:
df = sedona.read.parquet("lakefs://test1/E2E_20231006043002742/providerfeeds/admins/")
The read works fine on the original data location. The failure is consistently on a `_committed` file, even though I can read it fine via Postman and it is there in the UI.
Our code runs successfully once, but after re-running our pipeline it consistently fails until we restart the service.
The FileReadException (java.io.EOFException) is also still occurring on some reruns.
i
Thank you James, the debug info is super helpful.
1. Can you please confirm that the imported data wasn’t removed from the origin, i.e. from the `output_path` in 3b?
2. Is the `_committed` file of size 0? I suspect it’s a different occurrence of 6718, although you seem to be reading through the lakeFS API and not the lakeFS S3 gateway.
3. As a possible workaround (and valuable debug info), can you try retrying the above using the lakeFS S3 gateway? It would mean replacing the failing line with
   `df = sedona.read.parquet("s3a://test1/E2E_20231006043002742/providerfeeds/admins/")`
   a. It would mean you would have to configure Spark to work with the s3a scheme by setting the endpoint, access key and secret key. The configuration settings are described here.
j
1. Confirming yes, no changes are made to the imported data's origin.
2. It is not size 0; it is usually around 200 B and holds just data like this:
   `{"added":["<name>.snappy.parquet","<name2>.snappy.parquet"],"removed":[]}`
3. Will do this and let you know shortly, thanks.
Also, we managed to change Spark settings to get rid of the _committed file entirely; I'll let you know if this fixes the issues.
i
Thank you! We'll look deeper and send an update by early next week
j
Hi Itai, we are using Azure Blob Storage (ADLS Gen2) as our storage backend. Would we be able to use the s3a endpoint? We are currently getting Access Denied. Also, we are now getting this error after removing the _committed files:
org.apache.spark.SparkException: Exception thrown in awaitResult: null
i
Yes, you would be able to use the Spark `s3a` scheme with lakeFS through its S3 gateway. The S3 gateway is the layer in lakeFS responsible for compatibility with S3: it implements a compatible subset of the S3 API so that most data systems can use lakeFS as a drop-in replacement for S3. You can set it up like this:
• `fs.s3a.access.key`: lakeFS access key
• `fs.s3a.secret.key`: lakeFS secret key
• `fs.s3a.endpoint`: lakeFS S3-compatible API endpoint (e.g. https://example-org.us-east-1.lakefscloud.io)
• `fs.s3a.path.style.access`: `true`
Then every data access using the `s3a` scheme, like `s3a://repo/branch/some/path`, would reach your lakeFS instance's S3 gateway.
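A minimal sketch of what that could look like in PySpark, assuming a self-hosted lakeFS endpoint at https://<ourdomain>:8000 and placeholder credentials (plain `spark.read` is used here for illustration; your pipeline reads via Sedona):

```python
from pyspark.sql import SparkSession

# Point s3a at the lakeFS S3 gateway instead of real S3.
spark = (
    SparkSession.builder
    .appName("lakefs-s3a-example")
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs_access_key_id>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs_secret_access_key>")
    .config("spark.hadoop.fs.s3a.endpoint", "https://<ourdomain>:8000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Reads through the gateway use the s3a scheme: s3a://<repository>/<ref>/<path>
df = spark.read.parquet("s3a://test1/E2E_20231006043002742/providerfeeds/admins/")
df.printSchema()
```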