I'm trying to test multipart uploads to our S3 gateway using Spark, as part of Nessie. Making Spark actually perform multipart uploads is proving difficult: it requires big files, and our containers are not so big.
So I want to be lazy and run a program that exercises the S3A Hadoop filesystem directly, rather than going through Spark. Then I can simply write a 20MiB file with 5MiB parts, read it back, and be done with it.
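A minimal sketch of what I have in mind (endpoint, credentials, and the repo/branch/path are placeholders, and I'm assuming hadoop-aws on the classpath; exact config keys and defaults may vary by Hadoop version):

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MultipartSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point S3A at the S3 gateway (all values here are placeholders).
        conf.set("fs.s3a.endpoint", "http://localhost:8000");
        conf.set("fs.s3a.access.key", "<access-key>");
        conf.set("fs.s3a.secret.key", "<secret-key>");
        conf.set("fs.s3a.path.style.access", "true");
        // Force multipart: 5MiB is the smallest part size S3 allows,
        // so a 20MiB file should upload as 4 parts.
        conf.set("fs.s3a.multipart.size", "5242880");
        conf.set("fs.s3a.multipart.threshold", "5242880");

        Path path = new Path("s3a://example-repo/main/multipart-test");
        FileSystem fs = path.getFileSystem(conf);

        // Write 20MiB in 1MiB chunks.
        byte[] chunk = new byte[1 << 20];
        Arrays.fill(chunk, (byte) 'a');
        try (FSDataOutputStream out = fs.create(path, true)) {
            for (int i = 0; i < 20; i++) {
                out.write(chunk);
            }
        }

        // Read it back and verify length and contents.
        long len = fs.getFileStatus(path).getLen();
        if (len != 20L << 20) {
            throw new AssertionError("unexpected length: " + len);
        }
        byte[] buf = new byte[chunk.length];
        try (FSDataInputStream in = fs.open(path)) {
            for (int i = 0; i < 20; i++) {
                in.readFully(buf);
                if (!Arrays.equals(buf, chunk)) {
                    throw new AssertionError("content mismatch at MiB " + i);
                }
            }
        }
        System.out.println("multipart round-trip OK");
    }
}
```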
Pros: Short, non-brittle (small changes don't make the test silently stop testing what it should), and easy to write.
Cons: Doesn't actually test Spark performing multipart uploads to the lakeFS S3 gateway, merely the S3A Hadoop filesystem. A Spark writer might behave differently, e.g. due to arbitrary changes it makes to the S3A config or similar.
I believe it still makes sense: the risks we accept are relatively minor, and most failures would surface regardless unless we managed to hit the exact combination that triggers the hypothetical failure (e.g. something like
https://github.com/treeverse/lakeFS/issues/2429).
WDYT?