# help
j
Hi guys! We're importing folders containing parquet from Azure Blob (see below) using `common_prefix`, but upon import an additional "file" with the same name as every folder is created (see below). When we try to run
df = sedona.read.parquet(lakefspath)
we are met with this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 75.0 failed 4 times, most recent failure: Lost task 0.3 in stage 75.0: org.apache.spark.SparkException: Exception thrown in awaitResult: [CANNOT_READ_FILE_FOOTER] Could not read footer for file: <lakefs://test/testrepo1/providerfeeds/places1/theme=places1>
Note: after deleting this extra file (and the equivalent extra files at every level of nesting), we are able to read.
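As a side note, the "extra files" described above behave like zero-byte objects whose keys collide with folder prefixes. A minimal, hypothetical sketch of how one might detect such marker objects from a listing (the keys and sizes below are illustrative, not taken from the actual repository):

```python
# Hypothetical sketch: find zero-byte "marker" objects whose key collides with
# a folder prefix, i.e. some other object key starts with key + "/".
# Keys and sizes are made-up examples, not real repository contents.

def find_folder_markers(objects):
    """Given (key, size) pairs, return keys of zero-byte objects that
    share a name with a folder."""
    keys = [k for k, _ in objects]
    markers = []
    for key, size in objects:
        if size == 0 and any(
            other.startswith(key + "/") for other in keys if other != key
        ):
            markers.append(key)
    return markers

listing = [
    ("providerfeeds/places1/theme=places1", 0),                    # marker
    ("providerfeeds/places1/theme=places1/part-0.parquet", 1024),  # real data
]
print(find_folder_markers(listing))  # → ['providerfeeds/places1/theme=places1']
```

Deleting exactly these marker keys would correspond to the manual cleanup described above.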
a
Are you using the adls or the blob subdomain in the URL when importing? For example: https://<my-account>.adls.core.windows.net/path/to/import/
j
Using blob, like so: https://{}.blob.core.windows.net/{}/{}/{}/
a
I think adls resolves the issue
So, try adls
j
That worked perfectly, thanks
a
👍
n
@James Daus Please note that the "adls" subdomain should be used in the import source URL only when the source is an ADLS Gen2 storage account. It is in fact a hint to lakeFS that lets it choose the correct way to list the storage objects; it is not a valid URL when creating a repository or when importing from a Blob Storage account.
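To illustrate the point above, here is a hypothetical helper showing how the subdomain acts purely as a listing hint: "adls" for ADLS Gen2 sources, "blob" otherwise. The account, container, and path names are placeholders:

```python
# Hypothetical helper: pick the subdomain hint for a lakeFS import source URL.
# "adls" is only a hint for ADLS Gen2 accounts; Blob Storage keeps "blob".
# Account/container/path values are illustrative placeholders.

def import_source_url(account: str, container: str, path: str, adls_gen2: bool) -> str:
    subdomain = "adls" if adls_gen2 else "blob"
    return f"https://{account}.{subdomain}.core.windows.net/{container}/{path}"

print(import_source_url("myaccount", "data", "providerfeeds/", adls_gen2=True))
# → https://myaccount.adls.core.windows.net/data/providerfeeds/
```

The "adls" form would only be used at import time as a hint, never as the repository's storage namespace.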
👍 1
j
Thanks for the replies @Amit Kesarwani and @Niro! Unfortunately, it looks like the read is still sometimes failing, especially on larger loads. Is this error related?
We also got this error on one of the rerun fails: com.databricks.sql.ui.FileReadException: Error while reading file lakefs://<path>.c000.snappy.parquet. java.io.EOFException
a
@James Daus I don’t think this error is related. Can you please report this error in a new thread and with additional information? I don’t know about this error but somebody else will assist you.