# help
j
Hello. We're trying to figure out our backfill strategy for a one-time massive data import. This also relates to the incremental updates we'll run after the backfill is done. Let me preface this by saying I know this was far from the ideal way to test, but it was done as a benchmark to see what kind of performance we'd get. I used the lakeFS examples GitHub repo, ran the data-lineage.ipynb example, and modified the Python code to fit our test case. This was all done locally with docker compose (as explained in the examples' instructions), which ran a few Docker images on my development box (Windows/WSL/Ubuntu 22.04). The images involved were lakeFS, MinIO, and the examples image. What we're after here is best practices for loading TBs of data as a backfill process.
The process seemed to go well until the end. Our first test was with a 2 GB file containing 30 million rows. These were our steps:
• We loaded the file into the ingestion branch and committed it.
• I merged the ingestion branch into the transformation branch.
• I then modified the Spark code to partition the data into a new folder in the transformation branch of the lakeFS repo, with partitions derived from a timestamp in each row (format: 2018-10-10T170015.0000000Z). I set Spark up to partition by year, month, day, and hour from the timestamp. (There's a rough sketch of this write just below.)
• I had to adjust the Spark and JVM memory for the partitioning step to succeed. Partitioning into Spark's temporary folder took about 2 hours, which was not bad considering it was my local box.
• The problem started when the partitioning completed and Spark began writing all the data into the partitioned folder structure in lakeFS. The Spark commit operation started and kept writing files for over 16 hours, at which point I killed it, because clearly this was not going to work.
Basically, the structure of the data in the test load file was:
• 8 years of data for approximately 200 unique keys (one of the columns in the file). 8 years of data means Spark created (if my math is correct) 50,880 partition folders, across which only 200 entities' worth of data had to be spread.
It's unclear where the bottleneck is. All the data was going to the MinIO server; watching the monitoring page in the MinIO UI, all that was happening was Spark creating/putting files to it as fast as it could. My system is fairly beefy, and it's all SSDs. It's unclear whether deploying this in AWS and writing to real S3 would be any faster; our Ops guy says the performance would not be dramatically different there. I realize the problem here is not necessarily lakeFS, since it's performing its function correctly and just forwarding the s3a requests to MinIO.
How would you suggest we alter the structure of the data in the backfill load files to get better performance through the whole ingestion process through lakeFS? The ultimate goal was to get the data generated in the partitioned structure into the transformation branch, commit it, and merge to main.
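For concreteness, here's a rough PySpark sketch of what that write looked like. The repo/branch paths, the raw file format, and the column names (`event_ts`, `key`) are placeholders for illustration, not the actual notebook code:

```python
# Rough sketch of the partitioned write (PySpark). Paths, raw file format, and
# column names ("event_ts", "key") are placeholders, not the real notebook code.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("backfill-test").getOrCreate()

# Read the 2 GB / 30M-row load file from the ingestion branch (assuming CSV here).
df = spark.read.csv("s3a://example-repo/ingestion/raw/backfill.csv", header=True)

# Timestamps look like 2018-10-10T170015.0000000Z; slice the partition columns out
# by position to avoid fractional-second parsing issues.
df = (df
      .withColumn("year",  F.substring("event_ts", 1, 4).cast("int"))
      .withColumn("month", F.substring("event_ts", 6, 2).cast("int"))
      .withColumn("day",   F.substring("event_ts", 9, 2).cast("int"))
      .withColumn("hour",  F.substring("event_ts", 12, 2).cast("int")))

# Hour-level partitioning: ~50k folders over 8 years, each holding only a few
# hundred rows of the 200-key sample -- the small-files pattern described above.
(df.write
   .partitionBy("year", "month", "day", "hour")
   .mode("overwrite")
   .parquet("s3a://example-repo/transformation/partitioned/"))
```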
o
That’s pretty high cardinality for a partition key. Would your queries often filter down to specific hours? (My quick napkin calculation says you’d end up with very small partitions, ~500 rows each, which is going to hurt performance in most query engines.) My recommendation is to reduce partition cardinality, down to the day or even month level. Regardless, to debug issues like this I’d go for elimination: try reading/writing directly to MinIO to take lakeFS out of the path, or try a “real” lakeFS setup (perhaps a free trial on lakeFS Cloud, or a tuned Kubernetes install?).
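Roughly something like this for the elimination test (a sketch only; the MinIO endpoint, credentials, and bucket name are placeholders, and `df` is the dataframe from the sketch above): point s3a straight at MinIO instead of the lakeFS gateway, and drop to day-level partitions.

```python
# Sketch of the elimination test: the same write, but (a) day-level partitions
# instead of hourly, and (b) s3a pointed straight at a MinIO bucket so lakeFS is
# out of the path. Endpoint, credentials, and bucket name are placeholders.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.endpoint", "http://minio:9000")
hconf.set("fs.s3a.access.key", "minioadmin")
hconf.set("fs.s3a.secret.key", "minioadmin")
hconf.set("fs.s3a.path.style.access", "true")

(df.write
   .partitionBy("year", "month", "day")   # ~2,900 folders for 8 years vs ~50k hourly
   .mode("overwrite")
   .parquet("s3a://plain-minio-bucket/partitioned/"))
```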
i
Hi @Joe M, happy to try and help here. I honestly don’t know how much sense it makes to troubleshoot a 30M-row backfill on a local docker compose environment that isn’t meant to handle load. Specifically, I don’t know how well MinIO and Spark behave on that setup (regardless of lakeFS). If you feel this is the only way to test this, let’s jump on a call and investigate together. Otherwise, do you have a Spark / Databricks environment you can use? If so, I would suggest spinning up a lakeFS Cloud environment and running the same test.
j
Thanks for the feedback. We'll ultimately need to break the data down by hour, but technically I guess we don't have to physically partition down to the hour; we could partition to the day and roll up by the hour in the queries we'll run against the data. Also, that was just a small sample of the data. Ultimately there will be millions of unique keys with their metadata spread across the partitions; we limited it to just 200 in the test file because that's the format we got the data in.
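For the "partition by day, roll up by hour at query time" idea, something like this sketch (reusing the placeholder column names and an assumed day-partitioned output path from the earlier sketches):

```python
# Read a day-partitioned copy and aggregate to hourly buckets at query time.
# Path and column names ("event_ts", "key") are placeholders, as before.
from pyspark.sql import functions as F

daily = spark.read.parquet("s3a://example-repo/transformation/partitioned-by-day/")

hourly = (daily
    .filter((F.col("year") == 2018) & (F.col("month") == 10) & (F.col("day") == 10))
    .withColumn("hour", F.substring("event_ts", 12, 2).cast("int"))
    .groupBy("key", "hour")
    .count()
    .orderBy("hour"))

hourly.show()
```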
👍 1
And yeah, I figured the cardinality was the issue.
o
Yeah, I agree with @Iddo Avneri - the samples are just that - samples. Not sure I would trust that setup for benchmarking of any sort.
j
Yeah, my local testing with the Docker sample images was just meant as a way to try running something before attempting to set up resources in AWS to do the real work; that's my next step. The one obvious problem with the local setup was that all 3 Docker images were running in the same WSL instance, and thus on the same SSD, so everything was reading and writing the same drive.
Plus, that sample Python notebook I used was more or less doing exactly what I needed to do.
i
FWIW, it makes me happy that it was a useful way to learn :)
j
🙂 it helped a lot
☝️ 1
heart lakefs 1