# help
j
Hi guys! We're importing folders containing parquet from Azure Blob (see below) using `common_prefix`, but upon import an additional "file" with the same name as every folder is created (see below). When we try to run
df = sedona.read.parquet(lakefspath)
we are met with this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 75.0 failed 4 times, most recent failure: Lost task 0.3 in stage 75.0: org.apache.spark.SparkException: Exception thrown in awaitResult: [CANNOT_READ_FILE_FOOTER] Could not read footer for file: <lakefs://test/testrepo1/providerfeeds/places1/theme=places1>
Note: after deleting this extra file (and the equivalent extra files at every level of nesting), we are able to read.
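As a side note, the "extra files" described above behave like zero-byte objects whose keys collide with folder prefixes. A minimal, hypothetical sketch of how one might detect such marker objects from a listing (the keys and sizes below are illustrative, not taken from the actual repository):

```python
# Hypothetical sketch: find zero-byte "marker" objects whose key collides with
# a folder prefix, i.e. some other object key starts with key + "/".
# Keys and sizes are made-up examples, not real repository contents.

def find_folder_markers(objects):
    """Given (key, size) pairs, return keys of zero-byte objects that
    share a name with a folder."""
    keys = [k for k, _ in objects]
    markers = []
    for key, size in objects:
        if size == 0 and any(
            other.startswith(key + "/") for other in keys if other != key
        ):
            markers.append(key)
    return markers

listing = [
    ("providerfeeds/places1/theme=places1", 0),                    # marker
    ("providerfeeds/places1/theme=places1/part-0.parquet", 1024),  # real data
]
print(find_folder_markers(listing))  # → ['providerfeeds/places1/theme=places1']
```

Deleting exactly these marker keys would correspond to the manual cleanup described above.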
a
Are you using the adls or the blob subdomain in the URL when importing? For example: https://<my-account>.adls.core.windows.net/path/to/import/
j
Using blob, like so: https://{}.blob.core.windows.net/{}/{}/{}/
a
I think adls resolves the issue
So, try adls
j
That worked perfectly, thanks
a
👍
n
@James Daus Please note that the "adls" subdomain should be used in the import source URL only when the source is an ADLS Gen2 storage account. It is in fact a hint to lakeFS that lets it choose the correct way to list the storage objects; it is not a valid URL when creating a repository or when importing from a Blob Storage account.
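To illustrate the point above, here is a hypothetical helper showing how the subdomain acts purely as a listing hint: "adls" for ADLS Gen2 sources, "blob" otherwise. The account, container, and path names are placeholders:

```python
# Hypothetical helper: pick the subdomain hint for a lakeFS import source URL.
# "adls" is only a hint for ADLS Gen2 accounts; Blob Storage keeps "blob".
# Account/container/path values are illustrative placeholders.

def import_source_url(account: str, container: str, path: str, adls_gen2: bool) -> str:
    subdomain = "adls" if adls_gen2 else "blob"
    return f"https://{account}.{subdomain}.core.windows.net/{container}/{path}"

print(import_source_url("myaccount", "data", "providerfeeds/", adls_gen2=True))
# → https://myaccount.adls.core.windows.net/data/providerfeeds/
```

The "adls" form would only be used at import time as a hint, never as the repository's storage namespace.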
👍 1
j
Thanks for the replies @Amit Kesarwani and @Niro! Unfortunately, it looks like the read is still sometimes failing, especially on larger loads. Is this error related?
We also got this error on one of the rerun fails: com.databricks.sql.ui.FileReadException: Error while reading file lakefs://<path>.c000.snappy.parquet. java.io.EOFException
a
@James Daus I don’t think this error is related. Can you please report this error in a new thread and with additional information? I don’t know about this error but somebody else will assist you.