# help
Just to follow up: it was definitely my box that was the bottleneck lol. I ran an EMR cluster with just one task node as a test and ran the same PySpark code (more or less, it needed some small changes). It partitioned and wrote out all 30M rows by year, month, day, and hour in 6 hours. Note this was outside of lakeFS completely. I'm assuming using lakeFS as a proxy to S3 wouldn't slow it down too much. Regardless, for this initial backfill there's no reason to do that. We'll just process all the data we have (scaling the EMR cluster as needed), write directly to S3, and then do a one-time import into lakeFS.
👍 1
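In case it helps anyone else, here's roughly the kind of partitioned write I mean. This is a minimal sketch, not the actual job: the column name `event_time` and the S3 paths are placeholders I made up.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# sketch of the backfill write described above; paths and the event_time
# column are hypothetical stand-ins for the real dataset
spark = SparkSession.builder.appName("backfill-partitioned-write").getOrCreate()

df = spark.read.parquet("s3://source-bucket/raw/")  # placeholder source path

# derive the partition columns from a timestamp column
df = (df
      .withColumn("year",  F.year("event_time"))
      .withColumn("month", F.month("event_time"))
      .withColumn("day",   F.dayofmonth("event_time"))
      .withColumn("hour",  F.hour("event_time")))

# write straight to S3, partitioned by year/month/day/hour;
# the one-time lakeFS import happens afterwards as a separate step
(df.write
   .mode("overwrite")
   .partitionBy("year", "month", "day", "hour")
   .parquet("s3://dest-bucket/backfill/"))
```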