Hi <!here> , i am researching managed Hadoop offerings. I'm curious, from your experience, what are the culprits of EMR and S3 solutions? This is what I have learned so far.
EMR Pain:
Upgrading Hadoop/Spark/Hive/Presto etc., Requires a new EMR Release, which upgrades often more than what you need. – this can trigger a full migration. Some docker images are supported, but not all of them. EMR bootstrap actions seems like an interesting solution for installing system dependencies
1. Installing system dependencies doesn't always work as expected with bootstrap.
2. Spot - sometimes AWS will kill all your spot instances and cause clusters and jobs to fail (to happen infrequently enough that it is hard to justify paying for on-demand instances, but frequently enough that it causes a significant and draining level of operational toil).
3. Configuring Ranger Authorization requires installing it outside of EMR cluster and creates conflicts with AWS internal record reader Authorization. Conflicting security rules – can be resolved but require more efforts.
S3 Pain:
1. S3 Consistency used to be a pain and was solved with strong read after write which made EMRFS obsolete. File consistency is not a pain anymore, but lack of atomic directories operations is a pain – copying a large set of parquet files (10k) and having multiple readers and writers at the same time can be a challenge when looking at the directories as a whole and not a file by file consistency.
2. AWS will often return HTTP 503 Slow Down errors for request rates that are much, much lower than their advertised limits.
3. You are supposed to code clients with backoffs/retries adequate to absorb arbitrary levels of HTTP 503 Slow Down until they finish scaling (which is entirely unobservable and could be minutes, hours, or days).
4. Many readers to table with 1K parquet files and one update – update is not an atomic action and readers might read the wrong data.
5. Storage optimization – optimize performance - no optimization at a table level.