Trying to test multipart upload to our gateway using Spark a lakeFS #dev

Ariel Shaqed (Scolnicov)

10/17/2021, 7:01 AM

Trying to test multipart upload to our gateway using Spark as part of Nessie. I'm having a lot of trouble making Spark run multipart uploads to our S3 gateway. It's tough because we need big files and our containers are not so big.

I want to be lazy and run a program that exercises the S3a hadoopfs directly rather than going through Spark. Then I can easily write a 20 MiB file with 5 MiB parts, read it back, and be done with it.

Pros: Short, non-brittle (small changes don't make the test silently stop testing what it should), and easy to write.

Cons: Doesn't actually test Spark performing multipart uploads to the lakeFS S3 gateway, merely the S3a hadoopfs. Some Spark writer might behave differently from direct S3a use, due to arbitrary changes that it might make to the S3a config or similar.

I believe it still makes sense: the cons that a Spark-based test would overcome are relatively minor; most such failures would slip through regardless, unless the test managed to hit the exact combination that triggers the hypothetical failure (e.g. something like https://github.com/treeverse/lakeFS/issues/2429). WDYT?
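To make the 20 MiB / 5 MiB arithmetic concrete, here is a minimal sketch of how such a file splits into parts. It is not the S3a implementation; plan_parts is a hypothetical helper, and the only external facts it relies on are the S3 limits of a 5 MiB minimum size for every part except the last and at most 10,000 parts per upload.

```python
MIB = 1024 * 1024
S3_MIN_PART_SIZE = 5 * MIB  # S3 minimum for every part except the last
S3_MAX_PARTS = 10_000       # S3 maximum number of parts in one upload


def plan_parts(file_size, part_size):
    """Return (offset, length) pairs, one per part (hypothetical helper)."""
    if file_size <= 0:
        raise ValueError("file size must be positive")
    if part_size < S3_MIN_PART_SIZE:
        raise ValueError("part size is below the 5 MiB S3 minimum")
    parts = [
        (offset, min(part_size, file_size - offset))
        for offset in range(0, file_size, part_size)
    ]
    if len(parts) > S3_MAX_PARTS:
        raise ValueError("too many parts for one upload")
    return parts


if __name__ == "__main__":
    parts = plan_parts(20 * MIB, 5 * MIB)
    print(len(parts))  # 4
```

A 20 MiB file is therefore about the smallest convenient payload that still produces several full-size parts, which is why the containers need room for it.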

Itai Admi

10/17/2021, 7:16 AM

Asking the obvious: is “big” configurable in Spark? Assuming you already checked that, your plan sounds fine to me. We’re testing our gateway receiving multipart uploads, not the Spark driver itself. Even if you make it work, there is always the possibility that someone out there is using or configuring Spark in some different way that isn’t tested.

Ariel Shaqed (Scolnicov)

10/17/2021, 7:27 AM

Sure, "big" is configurable: I can set fs.s3a.multipart.threshold. The thing is, it gets difficult to create files that have partitions that big; Spark likes to hold stuff in memory. The minimal part size is 5 MiB, and it's getting sticky to write stuff for some reason. (I could debug and figure out a way to fit everything I want to do into Spark executors. Just not sure it's a good use of time.)
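For reference, a rough sketch of how the relevant settings could be passed to Spark. The Hadoop S3A property names fs.s3a.multipart.threshold, fs.s3a.multipart.size and fs.s3a.endpoint are real, as is Spark's spark.hadoop. prefix for forwarding Hadoop properties; the helper functions, the example endpoint and the chosen 5 MiB values are assumptions for illustration only.

```python
MIB = 1024 * 1024


def s3a_multipart_conf(endpoint, part_size=5 * MIB, threshold=5 * MIB):
    """Build Hadoop S3A settings as property name -> string value."""
    return {
        "fs.s3a.endpoint": endpoint,                   # gateway URL (assumed)
        "fs.s3a.multipart.size": str(part_size),       # size of each part, bytes
        "fs.s3a.multipart.threshold": str(threshold),  # multipart threshold, bytes
    }


def as_spark_submit_args(hadoop_conf):
    """Turn Hadoop properties into spark-submit --conf arguments."""
    args = []
    for key, value in sorted(hadoop_conf.items()):
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


if __name__ == "__main__":
    # Hypothetical endpoint; substitute the real gateway address.
    conf = s3a_multipart_conf("http://lakefs.example:8000")
    print(" ".join(as_spark_submit_args(conf)))
```

The same dictionary could be applied directly to a Hadoop configuration object in the component test, so both approaches would exercise identical settings.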

Ariel Shaqed (Scolnicov)

10/17/2021, 7:29 AM

This is basically a system test (Spark writing MPUs) vs. a component test (hadoopfs writing MPU). Component test is smaller, more controlled, less brittle, more general. System test tests a single but more realistic scenario.
