Hello. We're trying to figure out our backfill strategy for a one-time massive data import. This is also related to the incremental updates we'll run after the backfill is done. Let me preface this by saying I know this was far from the ideal way to test this, but it was done as a benchmark to see what kind of performance we'd get.
For my testing I used the lakeFS examples GitHub repo, ran the data-lineage.ipynb example, and modified the Python code to fit our test case. This was all done locally with Docker Compose (as explained in the examples' instructions) on my Linux development box (Windows/WSL, Ubuntu 22.04). The containers involved were lakeFS, MinIO, and the examples image.
What we're after here are best practices for loading terabytes of data as a backfill process. The test seemed to go well until the end.
Our first test was with a 2 GB file containing 30 million rows. These were our steps:
• We loaded the file into the ingestion branch and committed it.
• I merged the ingestion branch into the transformation branch.
• I then modified the Spark code to partition the data into a new folder in the transformation branch of the lakeFS repo, with the partitions derived from a timestamp in each row. The timestamp was in the format 2018-10-10T170015.0000000Z. I set up Spark to partition by year, month, day, and hour extracted from the timestamp (roughly as in the sketch after this list).
• I had to adjust the Spark and JVM memory settings to get the partitioning step to succeed. Partitioning into Spark's temporary folder took about 2 hours to run, which was not bad considering it was my local box.
• The problem started when the partitioning completed and Spark began writing all the data into the partitioned folder structure in lakeFS. The Spark commit operation started and kept running, writing files for over 16 hours, at which point I killed it, because clearly this was not going to work.
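For concreteness, the partitioned write looked roughly like this. This is a minimal sketch, not our actual code: the column name (`event_time`), file format, paths, repo name, and credentials are placeholders, and the S3A settings mirror what the samples stack configures for the lakeFS S3 gateway.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal sketch of the partitioned write; paths, column names, and
# credentials are placeholders, not our real values.
spark = (
    SparkSession.builder.appName("backfill-partitioning")
    # Point S3A at the lakeFS S3 gateway, as the samples stack does.
    .config("spark.hadoop.fs.s3a.endpoint", "http://lakefs:8000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs-access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs-secret-key>")
    .getOrCreate()
)

# Read the raw backfill file from the transformation branch
# (format and path are placeholders).
df = spark.read.option("header", "true").csv(
    "s3a://example-repo/transformation/raw/backfill.csv"
)

# The timestamp is a fixed-width string like 2018-10-10T170015.0000000Z,
# so the partition columns can be sliced straight out of it.
df = (
    df.withColumn("year", F.substring("event_time", 1, 4))
      .withColumn("month", F.substring("event_time", 6, 2))
      .withColumn("day", F.substring("event_time", 9, 2))
      .withColumn("hour", F.substring("event_time", 12, 2))
)

# Write one folder per (year, month, day, hour) combination into the
# transformation branch.
(
    df.write.partitionBy("year", "month", "day", "hour")
      .mode("overwrite")
      .parquet("s3a://example-repo/transformation/partitioned/")
)
```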
Basically, the structure of the data in the test load file was:
• 8 years of data for approximately 200 unique keys (one of the columns in the file).
• 8 years of data basically means Spark created (if my math is correct) 50,880 partition folders, and only 200 entities' worth of data had to be spread across all of those folders.
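One way to double-check that folder count before kicking off the write, continuing the sketch above (`entity_key` is a placeholder name for the key column):

```python
# Each distinct (year, month, day, hour) combination becomes one output folder.
n_folders = df.select("year", "month", "day", "hour").distinct().count()
n_keys = df.select("entity_key").distinct().count()
n_rows = df.count()

# A quick way to see how many rows land in each folder on average.
print(f"{n_folders} partition folders, {n_keys} unique keys, "
      f"~{n_rows // max(n_folders, 1)} rows per folder on average")
```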
It's unclear where the bottleneck is. All the data was going to the MinIO server; watching the monitoring page in the MinIO UI, all that was happening was Spark creating/putting files to it as fast as it could. My system is fairly beefy and all SSDs. It's unclear whether deploying this in AWS and writing to real S3 would be any faster; our Ops guy says the performance would not be dramatically different there.
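One knob we have not ruled out on our side is the output committer: the step that ran for 16+ hours was Spark's commit phase, which by default relies on renames that object stores emulate with copies. The Spark cloud-integration docs describe S3A committers that avoid this; a sketch of those settings is below. We have not verified whether they behave correctly behind the lakeFS S3 gateway or MinIO, so treat this as an assumption to test, not a recommendation.

```python
from pyspark.sql import SparkSession

# Sketch of the S3A committer settings from the Spark cloud-integration docs.
# Whether these are safe/effective against the lakeFS S3 gateway is unverified.
spark = (
    SparkSession.builder.appName("backfill-partitioning")
    .config("spark.hadoop.fs.s3a.committer.name", "directory")
    .config(
        "spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    )
    .config(
        "spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    )
    .getOrCreate()
)
```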
I realize the problem here is not necessarily lakeFS, as it's performing its function correctly, just responding to and forwarding the S3A requests to MinIO.
How would you suggest we alter the structure of the data in the backfill load files to get better performance through the whole ingestion process through lakeFS? The ultimate goal is to get the data generated in the partitioned structure into the transformation branch, then commit it and merge to main (roughly the commit/merge step sketched below).
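For reference, the final commit-and-merge step we have in mind looks roughly like this, using the lakefs_client package from the samples repo. The endpoint, credentials, and repo/branch names are placeholders, and the method names may differ in newer lakeFS SDK versions.

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# Placeholders: endpoint and credentials come from the local docker compose stack.
configuration = lakefs_client.Configuration(host="http://localhost:8000")
configuration.username = "<lakefs-access-key>"
configuration.password = "<lakefs-secret-key>"
client = LakeFSClient(configuration)

# Commit the partitioned output on the transformation branch...
client.commits.commit(
    repository="example-repo",
    branch="transformation",
    commit_creation=models.CommitCreation(message="Backfill: partitioned load"),
)

# ...then merge the transformation branch into main.
client.refs.merge_into_branch(
    repository="example-repo",
    source_ref="transformation",
    destination_branch="main",
)
```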