# help
s
Hi, do we have any examples of copying from an S3 bucket into lakeFS?
• This describes copying data from S3 to local first and then to lakeFS. But I want to avoid the two-step copy and would like to go directly from S3 to lakeFS.
• This describes zero-copy imports, but we want actual physical copies.
• Will lakectl fs stage help us at all?
Is there a way to do `aws s3 cp` between two different endpoints?
i
Hey Sid! Out of curiosity, why would you like to copy the data if there’s an option not to? To answer your question, I think it depends on the volume and frequency of the copy. If it’s just a one-off, my personal favourite is rclone: you configure two endpoints, one for lakeFS and the other for the source (S3), and then the copy/sync operation should do exactly what you wanted.
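For reference, a minimal sketch of what that rclone setup could look like. The remote names, endpoint URL, repo, branch, paths, and credentials below are placeholders I've made up for illustration, not values from this thread:
```
# Hypothetical remote names, endpoint, and credentials; adjust to your environment.
# "source" is the plain S3 bucket, "lakefs" is lakeFS reached through its S3 gateway.
rclone config create source s3 provider AWS env_auth true
rclone config create lakefs s3 \
    provider Other \
    endpoint https://lakefs.example.com \
    access_key_id "<lakeFS access key>" \
    secret_access_key "<lakeFS secret key>" \
    no_check_bucket true

# Then a single command copies bucket -> repo/branch/path in one hop:
rclone sync source:my-bucket/logs/ lakefs:my-repo/main/logs/
```
Because the lakeFS S3 gateway presents each repository as a bucket, the target path is just `<repo>/<branch>/<path>`.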
s
The underlying data could potentially be overwritten/removed. We want to use lakeFS to archive the data in its original form. From my understanding, creating pointers to the data would not track changes to that data, right?
i
Right - if the imported data is gone in the original bucket, then lakeFS can’t retrieve it either.
a
Yeah, `aws s3` is annoying that way. I guess it's understandable why AWS would think there's only one AWS. Sounds like you might be better off using one of the integrations with copy tools. My favourite is RClone, or DistCp if you've got a lot of data.
s
Great, I can look into those tools. Are there any readily available examples?
a
Sorry for not being clear about this: those are links into our "Integrations" documentation, not to the tools themselves...
Note that (as requested) these will actually copy data. That can be a nice workout for your system... So in order of performance: RClone will be slowest (data flows through a single controller and lakeFS), DistCp faster (data flows through multiple executors and lakeFS), and DistCp with lakeFSFS fastest (data flows through multiple executors but only metadata hits lakeFS). The last will be the hardest to set up, though, so I'd only go there if I wanted to copy terabytes.
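To make the middle option concrete, here's a rough sketch of DistCp copying through the lakeFS S3 gateway. The repo name, endpoint, and credentials are placeholders, and the per-bucket s3a settings are one way to route only the lakeFS "bucket" (the repo) to the gateway while the source bucket stays on AWS:
```
# Hypothetical repo "my-repo" and lakeFS endpoint; per-bucket s3a overrides send
# writes for s3a://my-repo/... to the lakeFS S3 gateway instead of AWS.
hadoop distcp \
  -Dfs.s3a.bucket.my-repo.endpoint=https://lakefs.example.com \
  -Dfs.s3a.bucket.my-repo.access.key=<lakeFS access key> \
  -Dfs.s3a.bucket.my-repo.secret.key=<lakeFS secret key> \
  -Dfs.s3a.bucket.my-repo.path.style.access=true \
  s3a://source-bucket/logs/ \
  s3a://my-repo/main/logs/
```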
s
Ok yeah, for now it's very small data. It may scale up a bit in terms of higher frequency, but the overall size shouldn't be very much.
Are there limits on the ingest/commit frequency on the lakeFS side?
p
Hey Sid! Want to also share this blog post that covers in detail a few of the ways to import data to lakeFS: https://lakefs.io/3-ways-to-add-data-to-lakefs/
s
is there a way to generate the location in the underlying S3 bucket for where the lakeFS object would go? that is, can we register the metadata with lakeFS and then do an `aws s3 cp` from the original location directly into the underlying lakeFS S3 bucket?
it's my understanding that the lakeFS Hadoop filesystem does something similar to this to get around having to push all the writes through the lakeFS frontends
for some more context, we're setting up S3 bucket notifications to populate an SQS queue with newly uploaded files. every day there will be a few thousand files (mostly small log files and CSVs) which we'll want to copy into lakeFS for archival, as Sid described above
so each day we'll have, say, 2000 or so small files from a half-dozen different buckets that we want to centralize in lakeFS
the biggest reason we need the external data in lakeFS is to provide checkpoints on the state of the external data for ML-experiment reproducibility
a
Yeah, so we're really limited by the S3 protocol here in 2 ways. The first and most important is that there is no way to send a redirect (and even if there were, AWS signing would break). That's what leaves us with the lakeFSFS technique 🤷🏼. If you want that, you need to use a capable application. So we have our direct-access filesystem for Hadoop (Spark), and you can use DistCp with that. I think there's a "direct access" flag on `lakectl fs upload`, too (you'd still need to download from S3 first, of course). But I have to say I'd start with the S3 gateway and RClone; it just doesn't sound like enough data. And if I'm wrong, you still have 2 huge performance bumps available: go to DistCp, and go to DistCp/lakeFSFS.
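For completeness, a rough sketch of that two-step lakectl route, assuming the --direct flag mentioned above is available in your lakectl version; the bucket, repo, and paths are made-up placeholders:
```
# Hypothetical paths: pull the object down from S3 first...
aws s3 cp s3://source-bucket/logs/app.log /tmp/app.log

# ...then upload it into lakeFS. If your lakectl supports --direct, the data is written
# straight to the backing store and only the metadata goes through the lakeFS server.
lakectl fs upload --direct -s /tmp/app.log lakefs://my-repo/main/logs/app.log
```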
s
sounds good! eventually we might need to copy really large files (we get a few hundred multi-gigabyte pathology images per day), at which point we'll revisit this
a
Cool! (I work on our fit into the data ecosystem, so I have a vested interest in helping you pump as much data as possible here 😃)