# help
j
Hello and thanks for the support! This is a new issue after my previous thread. We are using a fully Azure-based deployment, with CosmosDB for K/V, and after running large imports we are running into the attached read error in Spark. We have also hit the following error once on a rerun of the pipeline:
com.databricks.sql.io.FileReadException: Error while reading file lakefs://<path>.c000.snappy.parquet. java.io.EOFException
I looked through the logs, and there are many messages like this mentioning a "Not Found" error:
time="2023-10-05T23:18:55Z" level=debug msg="Not found" func="github.com/treeverse/lakefs/pkg/api.(*Controller).handleAPIErrorCallback" file="/lakeFS/pkg/api/controller.go:2068" error="not found" host="<ourdomain>:8000" method=GET operation_id=StatObject path="/api/v1/repositories/test1/refs/E2E_20231005110009958/objects/stat?path=providerfeeds%2Ftransportation&user_metadata=false&presign=false" request_id=14e66e44-a26a-4b75-9d17-c4040636b062 service=api_gateway service_name=rest_api user=admin
i
Hey James, during Spark runs, Spark schedules many executors, which have some non-trivial ways of communicating with one another. They keep creating and deleting marker objects and use `StatObject` to check whether they exist, so during normal operation of Spark `404`s are expected. The error in the attached image looks like the actual problem. Is it a Delta table? Is it imported? Do you know if the imported data was changed outside of lakeFS? (lakeFS keeps pointers to the imported data; if it was moved/changed/deleted in its original location, lakeFS will not be able to manage it.) Generally, every piece of information about the data would help us reproduce and fix this:
1. Is the storage namespace of the repo Gen1 or Gen2 storage?
2. Same question for the imported data.
3. What format is being read and written?
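To illustrate, here is a minimal sketch (assuming the lakefs_client Python SDK and placeholder endpoint/credentials) of the kind of existence check that produces those expected 404s:

```python
import lakefs_client
from lakefs_client.client import LakeFSClient
from lakefs_client.exceptions import NotFoundException

# Hypothetical endpoint and credentials -- substitute your own.
configuration = lakefs_client.Configuration(host="https://<ourdomain>:8000")
configuration.username = "<access_key_id>"
configuration.password = "<secret_access_key>"
client = LakeFSClient(configuration)

# Spark executors probe marker objects with StatObject; a missing marker simply
# surfaces as a 404 (the "not found" lines in the debug log), which is harmless.
try:
    client.objects_api.stat_object(
        repository="test1",
        ref="E2E_20231005110009958",
        path="providerfeeds/transportation",  # path taken from the logged request
    )
except NotFoundException:
    pass  # expected during normal Spark operation
```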
j
Hi Itai, thanks for your reply. This error is occurring when trying to read data that has been imported successfully.
1. The import is happening within the same container in Gen2 storage, from a different folder. The storage namespace used in the lakeFS repo looks like this: https://domain.blob.core.windows.net/test1/
2. Currently this data is being imported from this same container.
3. The format of the data is geoparquet. It is first being copied into the Azure storage like so:
   a. `df.write.format("geoparquet").mode("overwrite").partitionBy("theme", "type").save(output_path)`
   b. Note: `output_path` is the folder we will now import from.
4. The import looks like this:
   a.
```python
importresponse = client.import_api.import_start(
    repository=repository,
    branch=branch,
    import_creation=ImportCreation(
        paths=[
            ImportLocation(
                type="common_prefix",  # common_prefix
                path=softlink_import_path,
                destination=dest,
            ),
        ],
        commit=CommitCreation(
            message="Uploading new data version for {} into lakefs".format(feedname),
            metadata={"using": "import_api.import_start"},
        ),
    ),
)
```
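For completeness, a minimal sketch of how the import can be awaited with the same SDK, assuming `import_api.import_status` is available in this client version (variable names follow the snippet above):

```python
import time

# Poll until the import completes (a production version should also inspect
# the returned status for an error field and give up after a timeout).
status = client.import_api.import_status(repository, branch, importresponse.id)
while not status.completed:
    time.sleep(2)
    status = client.import_api.import_status(repository, branch, importresponse.id)
```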
This error is occurring when running this line:
df = sedona.read.parquet("lakefs://test1/E2E_20231006043002742/providerfeeds/admins/")
The read works fine on the original data location. The failure is consistently on a `_committed` file, even though I can read it fine via Postman and it is there in the UI.
Our code runs successfully once, but after re-running our pipeline it consistently fails until we restart the service.
The FileReadException (java.io.EOFException) is also still occurring on some reruns.
i
Thank you James, the debug info is super helpful.
1. Can you please confirm that the imported data wasn’t removed from the origin, i.e. from the `output_path` in 3b?
2. Is the `_committed` file of size 0? I suspect it’s a different occurrence of 6718, although you seem to be reading through the lakeFS API and not the lakeFS S3 gateway.
3. As a possible workaround (and valuable debug info), can you try retrying the above using the lakeFS S3 gateway? It would mean replacing the failing line with
   `df = sedona.read.parquet("s3a://test1/E2E_20231006043002742/providerfeeds/admins/")`
   a. It would mean you would have to configure Spark to work with the s3a scheme by setting the endpoint, access key and secret key. The configuration settings are described here.
j
1. Confirming yes, no changes are made to the imported data's origin.
2. It is not size 0; it is usually around 200 B and holds just data like this:
   `{"added":["<name>.snappy.parquet","<name2>.snappy.parquet"],"removed":[]}`
3. Will do this and let you know shortly, thanks.
Also, we managed to change Spark settings to get rid of the _committed file entirely; I'll let you know if this fixes the issues.
i
Thank you! We'll look deeper and send an update by early next week
j
Hi Itai, we are using Azure Blob Storage (ADLS Gen2) as our storage backend. Would we be able to use the s3a endpoint? We are currently getting Access Denied. Also, we are now getting this error after removing the _committed files:
org.apache.spark.SparkException: Exception thrown in awaitResult: null
i
Yes, you would be able to use the Spark `s3a` scheme with lakeFS through its S3 gateway. The S3 gateway is the layer in lakeFS responsible for compatibility with S3: it implements a compatible subset of the S3 API so that most data systems can use lakeFS as a drop-in replacement for S3. You can set it up like this:
• `fs.s3a.access.key`: lakeFS access key
• `fs.s3a.secret.key`: lakeFS secret key
• `fs.s3a.endpoint`: lakeFS S3-compatible API endpoint (e.g. https://example-org.us-east-1.lakefscloud.io)
• `fs.s3a.path.style.access`: `true`
Then every data access using the `s3a` scheme, like `s3a://repo/branch/some/path`, would reach your lakeFS instance's S3 gateway.
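A minimal sketch of what that could look like in PySpark, assuming a self-hosted lakeFS endpoint at https://<ourdomain>:8000 and placeholder credentials (plain `spark.read` is used here for illustration; your pipeline reads via Sedona):

```python
from pyspark.sql import SparkSession

# Point s3a at the lakeFS S3 gateway instead of real S3.
spark = (
    SparkSession.builder
    .appName("lakefs-s3a-example")
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs_access_key_id>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs_secret_access_key>")
    .config("spark.hadoop.fs.s3a.endpoint", "https://<ourdomain>:8000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Reads through the gateway use the s3a scheme: s3a://<repository>/<ref>/<path>
df = spark.read.parquet("s3a://test1/E2E_20231006043002742/providerfeeds/admins/")
df.printSchema()
```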