# help
j
My question is half why did it not write out the imported files, and half what is the expected behavior here for imported files? It seems kind of weird to export files from a repo that were imported and are already physically present in another S3 location. Also of note, the spark-submit command did not output any errors to the console.
i
Hey @Joe M, that’s an interesting one and I’m sure more will encounter issues around exporting imported objects. I’ll dump the binary file content here:
```
Unable to copy file <s3a://ct-rawdata-ddd8ba83-f4f5-4f29-a4dd-b428791af2cd/ct-raw-export/badger1/2024/05/15/1716089427/readings.seg1-small.csv.gz> from source <s3://ct-rawdata-ddd8ba83-f4f5-4f29-a4dd-b428791af2cd/backfill/badger1/2024/05/15/readings.seg1-small.csv.gz>: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
Unable to copy file <s3a://ct-rawdata-ddd8ba83-f4f5-4f29-a4dd-b428791af2cd/ct-raw-export/badger1/2024/05/15/1716088435/readings.seg0-small.csv.gz> from source <s3://ct-rawdata-ddd8ba83-f4f5-4f29-a4dd-b428791af2cd/backfill/badger1/2024/05/15/readings.seg0-small.csv.gz>: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
```
It seems like the imported objects were not exported because the spark-submit runtime is unfamiliar with the `s3` protocol. Here's the code line that fails:
```
org.apache.hadoop.fs.FileUtil.copy(srcPath.getFileSystem(conf), srcPath, dstFS, dstPath, false, conf)
```
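To illustrate where that exception comes from, here's a minimal standalone sketch (not lakeFS code; the bucket and key are made up, and the success path assumes hadoop-aws is on the classpath and, on some Hadoop versions, credentials that can reach the bucket):
```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object SchemeCheck extends App {
  val conf = new Configuration()

  // Hadoop picks the FileSystem class for a URI from the fs.<scheme>.impl
  // config key. Hadoop 3 ships no default for fs.s3.impl, so resolving an
  // s3:// URI fails with UnsupportedFileSystemException, as in the log above:
  try FileSystem.get(new URI("s3://some-bucket/some/key"), conf)
  catch { case e: Exception => println(e) }

  // Pointing the scheme at the S3A implementation (what the Spark property
  // below does) makes the same lookup resolve to S3AFileSystem instead:
  conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  val fs = FileSystem.get(new URI("s3://some-bucket/some/key"), conf)
  println(fs.getClass.getName) // org.apache.hadoop.fs.s3a.S3AFileSystem
}
```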
I think the following config should resolve the issue:
```
spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
```
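If you're launching the export via spark-submit, you can pass it on the command line; something like this (the trailing placeholder stands in for the rest of your existing export command):
```
spark-submit \
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  <rest of your existing export command>
```
Spark copies every `spark.hadoop.*` property into the Hadoop Configuration, which is how the `fs.s3.impl` key becomes visible to the `FileUtil.copy` call above.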
Also make sure that Spark has the appropriate creds to copy the objects.
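If you end up needing static keys, the S3A properties look roughly like this (placeholders only; on EC2/EMR the default S3A credential provider chain can usually pick up instance-profile credentials without setting any of these):
```
spark-submit \
  --conf spark.hadoop.fs.s3a.access.key=<aws-access-key-id> \
  --conf spark.hadoop.fs.s3a.secret.key=<aws-secret-access-key> \
  ...
```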
Regarding your question on the expected behavior: IMO the premise of export is "I want the file structure that exists in lakeFS under lakefs://repo/branch/some1/path1/ to exist in s3://bucket/some2/path2/". To me, the origin of the objects is not particularly important; maybe a more advanced configuration could skip imported objects.
I also opened this issue to call this out in our documentation.
j
thanks, i'll add that property to spark and try again. and yeah, i agree with your version of what export should do. it just seemed like an odd scenario the more i thought about it, so i wanted to ask what the expected functionality was.
👍 1
finally got back around to testing this. adding that configuration property fixed the export failure. the data was exported into the expected structure on s3. thanks for the help.
o
Hi @Joe M, thanks for the update. Glad it worked well for you!