# help
Carlton Ramsey
Hi, new to lakeFS. Are there any articles in the docs that talk about limits and recommendations? I have a process where I start with an 18 GB file. Today we use SSIS to reduce the dataset to around a million records. My initial thought is to start the pipeline by importing the 18 GB pipe-delimited file, then enrich it and, during that process, convert it into Parquet files. The original file and the enriched Parquet files could be filtered by state, or even at a lower level if needed. What I'm unsure about is whether I will run into any limits if I keep the files whole. Would it be a best practice to use smaller files instead, and if so, at what size?
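To make the convert-to-Parquet step concrete, here is a minimal sketch using pandas with the pyarrow engine, assuming the file can be processed in chunks; the paths, the chunk size, the `state` column name, and the `enrich` placeholder are illustrative assumptions, not details from the actual SSIS process:

```python
# Minimal sketch (assumed paths, chunk size, and a "state" column).
# Stream the 18 GB pipe-delimited file in chunks so it never sits fully in memory,
# enrich each chunk, and write Parquet partitioned by state.
import pandas as pd

SOURCE = "raw/input.psv"        # hypothetical 18 GB pipe-delimited file
TARGET = "enriched/parquet"     # hypothetical output directory (local, S3, or lakeFS path)

def enrich(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for whatever enrichment SSIS performs today.
    return df

for chunk in pd.read_csv(SOURCE, sep="|", chunksize=1_000_000):
    enriched = enrich(chunk)
    # partition_cols creates one directory per state; each call adds new files.
    enriched.to_parquet(TARGET, engine="pyarrow", partition_cols=["state"], index=False)
```

Reading back a single state is then just a matter of pointing the reader at that partition directory.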
Offir Cohen
Hi @Carlton Ramsey and welcome to the lake! Are you asking for best practices for migrating from SSIS to a data lake?
Carlton Ramsey
Hi @Offir Cohen, thank you. I'm more interested in limits and best practices for working with large files in lakeFS.
Ariel Shaqed (Scolnicov)
Hi @Carlton Ramsey! lakeFS itself doesn't really care about data, only metadata. But... the underlying storage and network layers usually do!
• S3 really wants you to upload complete objects that are < 5 GiB. Any more than that and you need to use multipart uploads.
• Many data processing frameworks work nicely with large numbers of objects. I usually find that splitting objects into pieces of around 10 MiB works well (anything between 2 MiB and 50 MiB will probably give similar results). You want a large number of pieces to distribute work across, but you also don't want very small pieces.
So I would probably go with multiple objects. Of course this will also depend on what software framework you use and on the specifics of your architecture.
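As a rough illustration of the "many medium-sized objects" suggestion, the sketch below rewrites a single large Parquet file into size-capped pieces with pyarrow; the paths and the 250,000-row cap are assumptions you would tune until objects land in roughly the 2 MiB to 50 MiB range:

```python
# Sketch (assumed paths and row cap): split one large Parquet file into many
# medium-sized objects by capping the number of rows written per output file.
import pyarrow.dataset as ds

source = ds.dataset("enriched/one_big_file.parquet", format="parquet")

ds.write_dataset(
    source,
    "enriched/parquet_small",      # output directory holding many smaller files
    format="parquet",
    max_rows_per_file=250_000,     # tune until files come out near ~10 MiB
    max_rows_per_group=250_000,    # keep row groups aligned with the file cap
    existing_data_behavior="overwrite_or_ignore",
)
```

For any objects that do still exceed 5 GiB, managed S3 clients (for example boto3's transfer helpers) typically switch to multipart uploads automatically, so the size cap is mostly about giving downstream frameworks enough pieces to parallelize over.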
Carlton Ramsey
@Ariel Shaqed (Scolnicov) Thank you.