# help
s
Hi, do we have any examples of copying from an S3 bucket into lakeFS?
• This describes copying data from S3 to local first and then to lakeFS. But I want to avoid the two-step copy and would like to go directly from S3 to lakeFS.
• This describes zero-copy imports, but we want actual physical copies.
• Will lakectl fs stage help us at all?
Is there a way to do `aws s3 cp` between two different endpoints?
i
Hey Sid! Out of curiosity, why would you like to copy the data if there’s an option not to? To answer your question, I think it depends on the volume and frequency of the copy. If it’s just a one-off, my personal favourite is rclone: you configure two endpoints, one for lakeFS and the other for the source (S3), and then the copy/sync operation should do exactly what you wanted.
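For reference, a minimal sketch of what that rclone setup could look like. The remote names, endpoint URL, repo, branch, paths, and credentials below are placeholders I've made up for illustration, not values from this thread:
```
# Hypothetical remote names, endpoint, and credentials; adjust to your environment.
# "source" is the plain S3 bucket, "lakefs" is lakeFS reached through its S3 gateway.
rclone config create source s3 provider AWS env_auth true
rclone config create lakefs s3 \
    provider Other \
    endpoint https://lakefs.example.com \
    access_key_id "<lakeFS access key>" \
    secret_access_key "<lakeFS secret key>" \
    no_check_bucket true

# Then a single command copies bucket -> repo/branch/path in one hop:
rclone sync source:my-bucket/logs/ lakefs:my-repo/main/logs/
```
Because the lakeFS S3 gateway presents each repository as a bucket, the target path is just `<repo>/<branch>/<path>`.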
s
The underlying data could potentially be overwritten/removed. We want to use lakeFS to archive the data in its original form. From my understanding, creating pointers to the data would not track changes to that data, right?
i
Right - if the imported data is gone in the original bucket, then lakeFS can’t retrieve it either.
a
Yeah, `aws s3` is annoying that way. I guess it's understandable why AWS would think there's only one AWS. Sounds like you might be better off using one of the integrations with copy tools. My favourite is RClone, or DistCp if you've got a lot of data.
s
Great, I can look into those tools. Are there any readily available examples?
a
Sorry for not being clear about this: those are links into our "Integrations" documentation, not to the tools themselves...
Note that (as requested) these will actually copy data. That can be a nice workout for your system... So in order of performance: RClone will be slowest (data flows through a single controller and lakeFS), DistCp faster (data flows through multiple executors and lakeFS), and DistCp with lakeFSFS fastest (data flows through multiple executors but only metadata hits lakeFS). The last will be the hardest to set up, though, so I'd only go there if I wanted to copy terabytes.
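To make the middle option concrete, here's a rough sketch of DistCp copying through the lakeFS S3 gateway. The repo name, endpoint, and credentials are placeholders, and the per-bucket s3a settings are one way to route only the lakeFS "bucket" (the repo) to the gateway while the source bucket stays on AWS:
```
# Hypothetical repo "my-repo" and lakeFS endpoint; per-bucket s3a overrides send
# writes for s3a://my-repo/... to the lakeFS S3 gateway instead of AWS.
hadoop distcp \
  -Dfs.s3a.bucket.my-repo.endpoint=https://lakefs.example.com \
  -Dfs.s3a.bucket.my-repo.access.key=<lakeFS access key> \
  -Dfs.s3a.bucket.my-repo.secret.key=<lakeFS secret key> \
  -Dfs.s3a.bucket.my-repo.path.style.access=true \
  s3a://source-bucket/logs/ \
  s3a://my-repo/main/logs/
```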
s
Ok yeah, for now it's very small data. It may scale up a bit in terms of higher frequency, but the overall size shouldn't be very much.
Are there limits on the ingest/commit frequency on the lakeFS side?
p
Hey Sid! Want to also share this blog post that covers in detail a few of the ways to import data to lakeFS: https://lakefs.io/3-ways-to-add-data-to-lakefs/
s
is there a way to generate the location in the underlying S3 bucket for where the lakeFS object would go? that is, can we register the metadata with lakeFS and then do an `aws s3 cp` from the original location directly into the underlying lakeFS S3 bucket?
it's my understanding that the lakeFS Hadoop filesystem does something similar to this to get around having to push all the writes through the lakeFS frontends
for some more context, we're setting up S3 bucket notifications to populate an SQS queue with newly uploaded files. every day there will be a few thousand files (mostly small log files and CSVs) which we'll want to copy into lakeFS for archival, as Sid described above
so each day we'll have, say, 2000 or so small files from a half-dozen different buckets that we want to centralize in lakeFS
the biggest reason we need the external data in lakeFS is to provide checkpoints on the state of the external data for ML-experiment reproducibility
a
Yeah, so we're really limited by the S3 protocol here in 2 ways. The first and most important is that there is no way to send a redirect (and even if there were, AWS signing would break). That's what leaves us with the lakeFSFS technique 🤷🏼. If you want that, you need to use a capable application. So we have our direct-access filesystem for Hadoop (Spark), and you can use DistCp with that. I think there's a "direct access" flag on `lakectl fs upload`, too (you'd still need to download from S3 first, of course). But I have to say I'd start with the S3 gateway and RClone; it just doesn't sound like enough data. And if I'm wrong, you still have 2 huge performance bumps available: go to DistCp, and go to DistCp/lakeFSFS.
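For completeness, a rough sketch of that two-step lakectl route, assuming the --direct flag mentioned above is available in your lakectl version; the bucket, repo, and paths are made-up placeholders:
```
# Hypothetical paths: pull the object down from S3 first...
aws s3 cp s3://source-bucket/logs/app.log /tmp/app.log

# ...then upload it into lakeFS. If your lakectl supports --direct, the data is written
# straight to the backing store and only the metadata goes through the lakeFS server.
lakectl fs upload --direct -s /tmp/app.log lakefs://my-repo/main/logs/app.log
```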
s
sounds good! eventually we might need to copy really large files (we get a few hundred multi-gigabyte pathology images per day), at which point we'll revisit this
a
Cool! (I work on our fit into the data ecosystem, so I have a vested interest in helping you pump as much data as possible here 😃)