# help
Carlton Ramsey
Hi, new to lakeFS. Are there any articles in the docs that talk about limits and recommendations? I have a process where I start with an 18 GB file. Today we use SSIS to reduce the dataset to around a million records. My initial thought is to start the pipeline by importing the 18 GB pipe-delimited file, then enrich it and, during that process, convert it into Parquet files. The original file and the enriched Parquet files could be filtered by state, or even at a lower level if needed. What I'm unsure about is whether I will run into any limits if I keep the files whole. Would it be a best practice to use smaller files instead, and if so, at what size?
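To make the convert-to-Parquet step concrete, here is a minimal sketch using pandas with the pyarrow engine, assuming the file can be processed in chunks; the paths, the chunk size, the `state` column name, and the `enrich` placeholder are illustrative assumptions, not details from the actual SSIS process:

```python
# Minimal sketch (assumed paths, chunk size, and a "state" column).
# Stream the 18 GB pipe-delimited file in chunks so it never sits fully in memory,
# enrich each chunk, and write Parquet partitioned by state.
import pandas as pd

SOURCE = "raw/input.psv"        # hypothetical 18 GB pipe-delimited file
TARGET = "enriched/parquet"     # hypothetical output directory (local, S3, or lakeFS path)

def enrich(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for whatever enrichment SSIS performs today.
    return df

for chunk in pd.read_csv(SOURCE, sep="|", chunksize=1_000_000):
    enriched = enrich(chunk)
    # partition_cols creates one directory per state; each call adds new files.
    enriched.to_parquet(TARGET, engine="pyarrow", partition_cols=["state"], index=False)
```

Reading back a single state is then just a matter of pointing the reader at that partition directory.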
Offir Cohen
Hi @Carlton Ramsey and welcome to the lake! Are you asking for best practices for migrating from SSIS to a data lake?
Carlton Ramsey
Hi @Offir Cohen, thank you. I'm more interested in limits and best practices for working with large files in lakeFS.
Ariel Shaqed (Scolnicov)
Hi @Carlton Ramsey! lakeFS itself doesn't really care about data, only metadata. But... the underlying storage and network layers usually do!
• S3 really wants you to upload complete objects that are < 5 GiB. Any more than that and you need to use multipart uploads.
• Many data processing frameworks work nicely with large numbers of objects. I usually find that splitting objects into pieces of around 10 MiB works well (anything between 2 MiB and 50 MiB will probably give similar results). You want a large number of pieces to distribute work across, but you also don't want very small pieces.
So I would probably go with multiple objects. Of course this will also depend on what software framework you use and on the specifics of your architecture.
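As a rough illustration of the "many medium-sized objects" suggestion, the sketch below rewrites a single large Parquet file into size-capped pieces with pyarrow; the paths and the 250,000-row cap are assumptions you would tune until objects land in roughly the 2 MiB to 50 MiB range:

```python
# Sketch (assumed paths and row cap): split one large Parquet file into many
# medium-sized objects by capping the number of rows written per output file.
import pyarrow.dataset as ds

source = ds.dataset("enriched/one_big_file.parquet", format="parquet")

ds.write_dataset(
    source,
    "enriched/parquet_small",      # output directory holding many smaller files
    format="parquet",
    max_rows_per_file=250_000,     # tune until files come out near ~10 MiB
    max_rows_per_group=250_000,    # keep row groups aligned with the file cap
    existing_data_behavior="overwrite_or_ignore",
)
```

For any objects that do still exceed 5 GiB, managed S3 clients (for example boto3's transfer helpers) typically switch to multipart uploads automatically, so the size cap is mostly about giving downstream frameworks enough pieces to parallelize over.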
Carlton Ramsey
@Ariel Shaqed (Scolnicov) Thank you.