Adi Polak

09/07/2022, 11:37 PM
Hi <!here>, I am researching managed Hadoop offerings. I'm curious, from your experience, what are the pain points of EMR and S3 solutions? This is what I have learned so far.

EMR pain:
1. Upgrading Hadoop/Spark/Hive/Presto etc. requires a new EMR release, which often upgrades more than what you actually need – this can trigger a full migration. Some Docker images are supported, but not all of them. EMR bootstrap actions seem like an interesting solution for installing system dependencies, but installing system dependencies doesn't always work as expected with bootstrap.
2. Spot – sometimes AWS will kill all your spot instances and cause clusters and jobs to fail (this happens infrequently enough that it is hard to justify paying for on-demand instances, but frequently enough that it causes a significant and draining level of operational toil).
3. Configuring Ranger authorization requires installing it outside of the EMR cluster and creates conflicts with AWS's internal record reader authorization. The conflicting security rules can be resolved, but it requires more effort.

S3 pain:
1. S3 consistency used to be a pain and was solved with strong read-after-write, which made EMRFS obsolete. File consistency is no longer a pain, but the lack of atomic directory operations is – copying a large set of parquet files (10k) while multiple readers and writers are active at the same time is a challenge when you need consistency for the directory as a whole and not file by file.
2. AWS will often return HTTP 503 Slow Down errors at request rates far below the advertised limits.
3. You are supposed to code clients with backoffs/retries adequate to absorb arbitrary levels of HTTP 503 Slow Down until S3 finishes scaling (which is entirely unobservable and could take minutes, hours, or days) – see the retry sketch below.
4. Many readers on a table with 1K parquet files and one update – the update is not an atomic action, so readers might read the wrong data.
5. Storage and performance optimization – there is no optimization at the table level.
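To illustrate point 3 under S3 pain, here is a minimal, hypothetical sketch of the backoff/retry handling clients end up writing around S3 copies. The bucket and key names are placeholders and the retry limits are illustrative, not tuned; it assumes boto3.

```python
# Hypothetical sketch: hardening an S3 copy against HTTP 503 SlowDown.
# Bucket/key names are placeholders; retry limits are illustrative, not tuned.
import time
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    # Built-in adaptive retries help, but are often not enough on their own.
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

def copy_with_backoff(src_bucket, src_key, dst_bucket, dst_key, max_tries=8):
    """Copy a single object, sleeping exponentially on SlowDown responses."""
    for attempt in range(max_tries):
        try:
            s3.copy_object(
                Bucket=dst_bucket,
                Key=dst_key,
                CopySource={"Bucket": src_bucket, "Key": src_key},
            )
            return
        except ClientError as e:
            if e.response["Error"]["Code"] != "SlowDown":
                raise
            time.sleep(min(2 ** attempt, 60))  # back off, capped at 60s
    raise RuntimeError(f"still throttled after {max_tries} attempts: {dst_key}")
```

Even with this in place, the throttling window is unobservable from the client side, which is exactly the operational toil described above.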

Vino

09/07/2022, 11:58 PM
@Adi Polak Another interesting problem we faced is troubleshooting failed Spark jobs on EMR. In my experience, we had an Airflow EMR operator to spin up and terminate an EMR cluster on demand, so the Airflow DAG looks something like: emr_spin_up_task >> spark_etl_task >> emr_terminate_task. When the Spark job fails, the EMR logs end up in an S3 directory and the cluster has already been terminated by Airflow. So troubleshooting a failed Spark job by mining through logs when the cluster is no longer available is a pain point as well.
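A hypothetical sketch of that DAG shape, assuming the apache-airflow-providers-amazon package (import paths vary across provider versions); task ids, the log bucket, and JOB_FLOW_OVERRIDES are placeholders:

```python
# Hypothetical Airflow DAG mirroring the spin-up >> run >> terminate pattern above.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor
from airflow.utils.trigger_rule import TriggerRule

JOB_FLOW_OVERRIDES = {
    "Name": "etl-cluster",
    "ReleaseLabel": "emr-6.7.0",
    # After termination this S3 prefix is the only place the logs survive.
    "LogUri": "s3://my-emr-logs/etl/",
    # ... instance groups, applications, etc. elided ...
}

SPARK_STEPS = [{
    "Name": "spark_etl",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],
    },
}]

with DAG("emr_etl", start_date=datetime(2022, 9, 1), schedule_interval=None) as dag:
    emr_spin_up_task = EmrCreateJobFlowOperator(
        task_id="emr_spin_up_task", job_flow_overrides=JOB_FLOW_OVERRIDES
    )
    spark_etl_task = EmrAddStepsOperator(
        task_id="spark_etl_task",
        job_flow_id=emr_spin_up_task.output,
        steps=SPARK_STEPS,
    )
    wait_for_step = EmrStepSensor(
        task_id="wait_for_step",
        job_flow_id=emr_spin_up_task.output,
        step_id="{{ task_instance.xcom_pull('spark_etl_task')[0] }}",
    )
    emr_terminate_task = EmrTerminateJobFlowOperator(
        task_id="emr_terminate_task",
        job_flow_id=emr_spin_up_task.output,
        # Terminate even when the Spark step fails -- which is exactly why the
        # cluster is gone by the time you start debugging from the S3 logs.
        trigger_rule=TriggerRule.ALL_DONE,
    )
    emr_spin_up_task >> spark_etl_task >> wait_for_step >> emr_terminate_task
```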

Ariel Shaqed (Scolnicov)

09/08/2022, 6:25 AM
Hi Adi, The planned LakeFSOutputCommitter will directly solve your points 1 and 4 about S3 (it is literally atomic). Points 2 and 3 cannot be resolved in a system that runs on top of S3, but it will reduce metadata operations 2x-3x during writes, which should really help. I am currently collecting user wins for this proposal in this PR, in order to make the case for implementing it. Would you mind posting these issues there, so that I can include them? Thanks!
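For readers following along: this is not the proposed LakeFSOutputCommitter itself, just a rough sketch of the branch-then-merge pattern that makes a multi-file write appear atomic to readers. It assumes lakeFS's S3 gateway, a repository named "analytics", and lakectl on the PATH; all endpoints, repo, branch, and file names are placeholders.

```python
# Hypothetical sketch: write many files to an isolated lakeFS branch, then
# publish them to readers atomically via a single merge commit.
import subprocess
import boto3

LAKEFS_ENDPOINT = "https://lakefs.example.com"  # placeholder lakeFS S3 gateway
s3 = boto3.client("s3", endpoint_url=LAKEFS_ENDPOINT)  # credentials are lakeFS keys

# 1. Create an isolated branch for this write.
subprocess.run(
    ["lakectl", "branch", "create",
     "lakefs://analytics/etl-run-42", "--source", "lakefs://analytics/main"],
    check=True,
)

# 2. Write the (possibly 10k) parquet files to the branch; readers of main see nothing yet.
for part in ["part-0000.parquet", "part-0001.parquet"]:
    s3.upload_file(part, "analytics", f"etl-run-42/tables/events/{part}")

# 3. Commit and merge: the merge lands as one commit, so readers of main switch
#    from the old version of the table to the new one atomically.
subprocess.run(["lakectl", "commit", "lakefs://analytics/etl-run-42",
                "-m", "etl run 42"], check=True)
subprocess.run(["lakectl", "merge",
                "lakefs://analytics/etl-run-42", "lakefs://analytics/main"],
               check=True)
```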

Adi Polak

09/08/2022, 10:10 PM
done.

Ariel Shaqed (Scolnicov)

09/09/2022, 5:52 AM
THANKS :-)