# help
j
My question is half why did it not write out the imported files, and half what is the expected behavior here for imported files? It seems kind of weird to export files from a repo that were imported and are already physically present in another S3 location. Also of note, the spark-submit command did not output any errors to the console.
i
Hey @Joe M, that’s an interesting one and I’m sure more will encounter issues around exporting imported objects. I’ll dump the binary file content here:
```
Unable to copy file <s3a://ct-rawdata-ddd8ba83-f4f5-4f29-a4dd-b428791af2cd/ct-raw-export/badger1/2024/05/15/1716089427/readings.seg1-small.csv.gz> from source <s3://ct-rawdata-ddd8ba83-f4f5-4f29-a4dd-b428791af2cd/backfill/badger1/2024/05/15/readings.seg1-small.csv.gz>: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
Unable to copy file <s3a://ct-rawdata-ddd8ba83-f4f5-4f29-a4dd-b428791af2cd/ct-raw-export/badger1/2024/05/15/1716088435/readings.seg0-small.csv.gz> from source <s3://ct-rawdata-ddd8ba83-f4f5-4f29-a4dd-b428791af2cd/backfill/badger1/2024/05/15/readings.seg0-small.csv.gz>: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
```
It seems like the imported objects were not exported because the spark-submit runtime is unfamiliar with the `s3` protocol. Here's the code line that fails:
```
org.apache.hadoop.fs.FileUtil.copy(srcPath.getFileSystem(conf), srcPath, dstFS, dstPath, false, conf)
```
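To illustrate where that exception comes from, here's a minimal standalone sketch (not lakeFS code; the bucket and key are made up, and the success path assumes hadoop-aws is on the classpath and, on some Hadoop versions, credentials that can reach the bucket):
```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object SchemeCheck extends App {
  val conf = new Configuration()

  // Hadoop picks the FileSystem class for a URI from the fs.<scheme>.impl
  // config key. Hadoop 3 ships no default for fs.s3.impl, so resolving an
  // s3:// URI fails with UnsupportedFileSystemException, as in the log above:
  try FileSystem.get(new URI("s3://some-bucket/some/key"), conf)
  catch { case e: Exception => println(e) }

  // Pointing the scheme at the S3A implementation (what the Spark property
  // below does) makes the same lookup resolve to S3AFileSystem instead:
  conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  val fs = FileSystem.get(new URI("s3://some-bucket/some/key"), conf)
  println(fs.getClass.getName) // org.apache.hadoop.fs.s3a.S3AFileSystem
}
```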
I think the following config should resolve the issue:
```
spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
```
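If you're launching the export via spark-submit, you can pass it on the command line; something like this (the trailing placeholder stands in for the rest of your existing export command):
```
spark-submit \
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  <rest of your existing export command>
```
Spark copies every `spark.hadoop.*` property into the Hadoop Configuration, which is how the `fs.s3.impl` key becomes visible to the `FileUtil.copy` call above.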
Also make sure that Spark has the appropriate creds to copy the objects.
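If you end up needing static keys, the S3A properties look roughly like this (placeholders only; on EC2/EMR the default S3A credential provider chain can usually pick up instance-profile credentials without setting any of these):
```
spark-submit \
  --conf spark.hadoop.fs.s3a.access.key=<aws-access-key-id> \
  --conf spark.hadoop.fs.s3a.secret.key=<aws-secret-access-key> \
  ...
```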
Regarding your question on the expected behavior: IMO the premise of export is "I want the file structure that exists in lakeFS under lakefs://repo/branch/some1/path1/ to exist in s3://bucket/some2/path2/". To me, the origin of the objects is not particularly important; maybe a more advanced configuration could skip imported objects.
I also opened this issue to call this out in our documentation.
j
thanks, i'll add that property to spark and try again. and yeah, i agree with your version of what export should do. it just seemed like an odd scenario the more i thought about it, so i wanted to ask what the expected functionality was.
👍 1
finally got back around to testing this. adding that configuration property fixed the export failure. the data was exported into the expected structure on s3. thanks for the help.
o
Hi @Joe M, thanks for the update. Glad it worked well for you!