# help
j
Hi, I'm new to lakeFS and just getting started with it. I have a question about a bulk backfill strategy. Let's say I have 500GB of raw data that I want to start off with on a new ingestion branch, so it can start running through a data pipeline, and that data is sitting in an S3 bucket in our AWS account, already in the folder structure we intend to use. I assume the correct, and fastest, way to get this data into lakeFS is to "import" it to the new ingestion branch. Is that a correct statement? And if so, since import is a shallow copy, after I do the import will lakeFS change the structure of the S3 bucket so it's in lakeFS's internal format (i.e. a metadata folder and hash folders for the data)?
n
Hi @Joe M, you are correct: import is the right approach for your use case. lakeFS will not make any changes to the imported data in the source bucket; instead, new data will be written to the lakeFS repository's storage namespace. You can read more in this blog and our documentation.
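If it helps, here's roughly what that import looks like with the high-level lakeFS Python SDK. This is a minimal sketch: the repo, branch, and bucket names are placeholders for your setup, and it assumes your lakeFS endpoint and credentials are already configured (e.g. via environment variables or ~/.lakectl.yaml).

```python
import lakefs

# Point at the new ingestion branch (placeholder repo/branch names)
branch = lakefs.repository("example-repo").branch("ingest-raw")

# Zero-copy import: lakeFS records metadata that points at the existing
# objects under the source prefix; the source bucket itself is untouched.
importer = branch.import_data(commit_message="Backfill raw data") \
    .prefix("s3://source-bucket/raw/", destination="raw/")

# Run synchronously; this blocks until the import commit is created.
importer.run()
```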
j
Ah, OK, so if I understand correctly: when you create a repo, you give it an S3 location for the repo (the storage namespace). Then when you import the data, lakeFS basically just adds metadata under that repo location and points to the raw files in the original location the data was imported from.
I was missing the fact that the repo's S3 space is different from the import location.
n
Very accurate
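To make the distinction concrete: the storage namespace is set when the repo is created, and it's a separate S3 location from the bucket you import from. Another sketch with placeholder names, assuming the same SDK setup as above:

```python
import lakefs

# The storage namespace is where lakeFS writes its metadata (and any new
# objects you upload); it must be a different location from the source data.
repo = lakefs.repository("example-repo").create(
    storage_namespace="s3://lakefs-storage-bucket/example-repo",
    default_branch="main",
    exist_ok=True,
)

# Objects imported from s3://source-bucket/raw/ stay where they are and are
# only referenced by commit metadata stored under the namespace above.
```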
j
great thanks
n
YW :jumping-lakefs:
i
This could be very helpful as well @Joe M