Uploading vs. importing objects into lakeFS

Conor Simmons

01/31/2023, 6:38 PM
Hey, I have another more general question: do you have a recommendation for the most performant way to upload objects? Right now I am looping through the desired files, and the estimate for the COCO 2017 val set (5,000 images plus 5,000 small JSON files) is about 3 hours to upload

Iddo Avneri

01/31/2023, 6:41 PM
Hi @Conor Simmons - have you considered importing instead of uploading?

Conor Simmons

01/31/2023, 6:46 PM
Yes I've considered it. But I was under the impression that you can't really do version control with just importing. See this thread: https://lakefs.slack.com/archives/C02CV7MUV4G/p1674842235272179?thread_ts=1674838530.053799&cid=C02CV7MUV4G
I want to be able to reference an old commit, even if files are modified, deleted, etc. I was under the impression this can't be done with a zero copy import
And not just reference it but sync the data with rclone

Iddo Avneri

01/31/2023, 6:58 PM
You can version control with an import (that's actually the most common use). It is just that in your case, if I understand correctly, you want to later make changes directly on the object store (not through lakeFS) you imported from, and then import again. Correct? I'm assuming that whatever writes to the object store can't work directly against lakeFS after importing the data? i.e., import the data as a one-time thing and then work with the S3 gateway, as opposed to cloning over and over again?

Conor Simmons

01/31/2023, 7:00 PM
I’m assuming that whatever writes to the object store can’t work directly against lakeFS after importing the data?
I'm not sure what you mean by this

Iddo Avneri

01/31/2023, 7:02 PM
Usually, once you have imported the data, you would work directly against lakeFS for version control, as opposed to making modifications directly on S3 without lakeFS and then re-uploading / cloning / importing.

Conor Simmons

01/31/2023, 7:03 PM
I guess I'm confused by this phrase
work directly against lakeFS
Does this mean work with lakeFS? Or does "against" mean some other tool?
I'm having some trouble understanding what the S3 gateway is from the docs as well. Do you have any examples using this?

Iddo Avneri

01/31/2023, 8:33 PM
Sure!
This notebook, which is part of our samples, shows how to configure lakeFS and work against the data using the gateway. You can also learn about it in the recording of this webinar.
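For reference, a minimal sketch of what working against the S3 gateway can look like from Python with boto3 (the endpoint, credentials, repository, and branch names below are placeholders):

    import boto3

    # Point a plain S3 client at the lakeFS S3 gateway: the "bucket" is the lakeFS
    # repository, and the first path component of the key is the branch (or any ref).
    s3 = boto3.client(
        "s3",
        endpoint_url="https://lakefs.example.com",  # placeholder lakeFS endpoint
        aws_access_key_id="AKIAlakefsexample",      # placeholder lakeFS access key
        aws_secret_access_key="secret",             # placeholder lakeFS secret key
    )

    # Upload one image to the "main" branch of repository "coco"
    s3.upload_file("val2017/000000000139.jpg", "coco", "main/val2017/000000000139.jpg")

    # List what is on that branch
    resp = s3.list_objects_v2(Bucket="coco", Prefix="main/val2017/")
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])

The same client can read from a tag or commit ID by using it in place of the branch name in the key.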

Conor Simmons

02/01/2023, 12:38 AM
Doesn't this demo use client.objects.upload_object, which is an upload, not an import?
There's this too. But does it work for images?

Iddo Avneri

02/01/2023, 12:46 AM
It does. But it will work just the same (with the gateway) for an import
It works for images
The idea of the demo was to show how to configure the gateway and work directly with the object store via lakeFS

Conor Simmons

02/01/2023, 12:47 AM
So I shouldn't need pyspark then?

Iddo Avneri

02/01/2023, 12:48 AM
(not to partition images, but you can work with them the same as you would directly with the object store - via lakeFS)

Conor Simmons

02/01/2023, 12:49 AM

Iddo Avneri

02/01/2023, 12:50 AM
Specifically, Spark can work either with the gateway or via the lakeFS Hadoop Filesystem client (like the page you indicated).
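For illustration, a rough sketch of pointing Spark's S3A filesystem at the lakeFS gateway (endpoint, credentials, repository, and branch below are placeholders):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Route S3A traffic through the lakeFS S3 gateway (placeholder endpoint/keys)
        .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
        .config("spark.hadoop.fs.s3a.access.key", "AKIAlakefsexample")
        .config("spark.hadoop.fs.s3a.secret.key", "secret")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate()
    )

    # Read the annotations JSON from the "main" branch of repository "coco"
    df = spark.read.json("s3a://coco/main/annotations/instances_val2017.json", multiLine=True)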

Conor Simmons

02/01/2023, 12:51 AM
So Spark is inherently part of using the gateway, correct?

Iddo Avneri

02/01/2023, 12:52 AM
You can work with Spark either with the gateway or with the lakeFS Hadoop Filesystem client

Conor Simmons

02/01/2023, 12:52 AM
I'm wondering whether or not I need to use Spark

Iddo Avneri

02/01/2023, 12:52 AM
You don't have to use Spark to benefit from lakeFS
If you use Spark, those are the ways to configure it.

Conor Simmons

02/01/2023, 12:54 AM
Were you originally suggesting to use Spark? (here)

Iddo Avneri

02/01/2023, 12:55 AM
Let me try to put it another way (from the origin of the conversation as I understand it). When working with big data, you typically want the data to stay in place. When writing code, it is common to "check out" a local copy of the code to your desktop and work against that copy. However, when working with big data, it is unreasonable to copy terabytes (or more) of files locally to experiment, develop, test, or transform the data.

Working with lakeFS, the data stays in place and branch isolation is achieved via pointer manipulation (https://docs.lakefs.io/understand/model.html#objects), with the pointers referencing the data that stays on the object store. Meaning, you won't check out a local copy of the data, but rather create a separate branch and work against that branch. Once done, if you want to "push" the changes back, merge your isolated branch back.
I was suggesting that whatever accesses the data will do so via lakeFS, and not directly against the object store
👍 1
So the modifications (adding or deleting files) will be done via lakeFS.
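For illustration, a rough sketch of that branch-in-place flow with the lakeFS Python client (endpoint, credentials, and names below are placeholders; the exact client API may vary by version):

    import lakefs_client
    from lakefs_client import models
    from lakefs_client.client import LakeFSClient

    # Placeholder lakeFS endpoint and credentials
    conf = lakefs_client.Configuration(host="https://lakefs.example.com/api/v1")
    conf.username, conf.password = "AKIAlakefsexample", "secret"
    lakefs = LakeFSClient(conf)

    # Create an isolated branch: a zero-copy operation, only pointers are created
    lakefs.branches.create_branch(
        repository="coco",
        branch_creation=models.BranchCreation(name="experiment-1", source="main"),
    )

    # ... modify the data on "experiment-1" via lakeFS, then commit ...

    # Merge the isolated branch back into main when done ("pushing" the changes back)
    lakefs.refs.merge_into_branch(
        repository="coco", source_ref="experiment-1", destination_branch="main"
    )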

Conor Simmons

02/01/2023, 12:56 AM
whatever accesses the data, will do so, via lakeFS and not directly to the object store
got it, I am doing that already. But with upload
Is there a way to do it with import?

Iddo Avneri

02/01/2023, 12:56 AM
Yes. Just import
Import once - and then modify the files via lakeFS.
(as opposed to making changes on the object store and then reimporting)

Conor Simmons

02/01/2023, 12:58 AM
and then modify the files via lakeFS
What do you mean by modify in terms of CLI commands or Python SDK usage?
(and hopefully add, modify, delete, right?)

Iddo Avneri

02/01/2023, 1:00 AM
How do you work with your object store today? What changes the files there?

Conor Simmons

02/01/2023, 1:00 AM
I would say lakectl fs upload to add or modify, and lakectl fs rm to remove
I've also tried re-importing to get changes to go through
However, when working with big data, it is unreasonable to copy terabytes (or more) of files locally to experiment, develop, test or transform the data.
I understand your point here - it's not an ideal scenario. But what if we need to modify every file in our dataset, and then 3 months later we need to reproduce the training job that was done on a previous dataset commit for some important reason? We need to be able to fall back to the old dataset version to get the desired reproducibility, no? But we still want to keep the newest version for future experiments. I may be misunderstanding the use case of lakeFS

Iddo Avneri

02/01/2023, 1:06 AM
You are not misunderstanding the use case. I think we might be misaligned on the implementation 🙂
👍 1
You can easily access a historical commit in lakeFS and get the full data set as it was at the time of that commit.
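For illustration, a rough sketch of reading the dataset as it was at an older commit with the Python client (endpoint, credentials, commit ID, and paths below are placeholders):

    import lakefs_client
    from lakefs_client.client import LakeFSClient

    # Placeholder lakeFS endpoint and credentials
    conf = lakefs_client.Configuration(host="https://lakefs.example.com/api/v1")
    conf.username, conf.password = "AKIAlakefsexample", "secret"
    lakefs = LakeFSClient(conf)

    commit_id = "a1b2c3d4"  # placeholder: the commit the training job originally used

    # List the dataset exactly as it was at that commit
    listing = lakefs.objects.list_objects(repository="coco", ref=commit_id, prefix="val2017/")
    for entry in listing.results:
        print(entry.path, entry.size_bytes)

    # Download a single object from that commit (returns a file-like object)
    image = lakefs.objects.get_object(repository="coco", ref=commit_id, path="val2017/000000000139.jpg")

The same commit ID also works as the ref in an S3-gateway key, so tools like rclone can sync the historical version out if you need a local copy.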
What creates the files initially on the object store?
(regardless to lakeFS)

Conor Simmons

02/01/2023, 1:09 AM
You can easily access a historical commit in lakeFS and get the full data set as it was at the time of that commit.
Got it. I've done it with upload but if there's a more efficient way I'd love to try it.
What creates the files initially on the object store?
So, the files are being created locally. Previously, before importing, I synced them to S3 with b2 sync (Backblaze).

Iddo Avneri

02/01/2023, 1:10 AM
If you import once and then use lakectl to upload / delete as you called out, that will be available. That might be better in cases where the entire data set doesn't change on every run
c

Conor Simmons

02/01/2023, 1:16 AM
I also have rclone in my tool set, so maybe it's better to think about using that. You're then suggesting:
1. When initially creating the dataset, rclone sync to the S3 store. Say there are 100,000 images.
2. Import into lakeFS.
3. When adding, deleting, or modifying files in the dataset, use lakectl fs upload or lakectl fs rm.

Iddo Avneri

02/01/2023, 1:54 AM
I might be missing something - but I would say assuming the files are already on S3, you can just import as opposed to rclone (which will copy the files). I’ll sync with @Ariel Shaqed (Scolnicov) & @Elad Lachmi tomorrow to get the full context of the conversation
👍 1

Conor Simmons

02/01/2023, 1:55 AM
(For rclone I meant using it independent of lakeFS, so rclone from local storage to the object store. Same functionality as b2 sync.)

Iddo Avneri

02/01/2023, 1:56 AM
Ok.
That makes sense
So yes - the three steps you called out make sense.

Conor Simmons

02/01/2023, 1:58 AM
Ok I will try it. I was under the impression that with the 100,000 images I originally imported, if I modify them, I won't be able to go back to their original version. Maybe if I never modify them again in the S3 store directly?

Iddo Avneri

02/01/2023, 1:59 AM
Exactly - modify them via lakeFS (for example, with the lakectl command)
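For example, a minimal sketch of that "modify via lakeFS" step with the Python client, equivalent to lakectl fs upload / lakectl fs rm plus a commit (endpoint, credentials, and paths below are placeholders):

    import lakefs_client
    from lakefs_client import models
    from lakefs_client.client import LakeFSClient

    # Placeholder lakeFS endpoint and credentials
    conf = lakefs_client.Configuration(host="https://lakefs.example.com/api/v1")
    conf.username, conf.password = "AKIAlakefsexample", "secret"
    lakefs = LakeFSClient(conf)

    # Add or overwrite a file on a branch
    with open("val2017/000000000139.jpg", "rb") as f:
        lakefs.objects.upload_object(
            repository="coco", branch="main", path="val2017/000000000139.jpg", content=f
        )

    # Remove a file from the branch
    lakefs.objects.delete_object(repository="coco", branch="main", path="val2017/obsolete.jpg")

    # Commit, so this state becomes a version you can always go back to
    lakefs.commits.commit(
        repository="coco",
        branch="main",
        commit_creation=models.CommitCreation(message="Update val2017 images"),
    )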

Conor Simmons

02/01/2023, 2:01 AM
Got it. What if I want to modify all 100,000? Would a re-import make sense? Or should I never import again?

Iddo Avneri

02/01/2023, 2:02 AM
This goes back to your original question - if you are planning to delete files from the original import, you will lose the ability to go back to them, since an import only creates pointers to the files.
👍 1

Conor Simmons

02/01/2023, 2:03 AM
Ok I think I'm following then. But why should I import in the first place? To be quicker than upload?

Iddo Avneri

02/01/2023, 2:04 AM
Quicker and you don’t actually create a copy of the files.

Conor Simmons

02/01/2023, 2:04 AM
I thought upload could also be zero-copy

Iddo Avneri

02/01/2023, 2:08 AM
Upload physically uploads a file. You can branch later on and that will be a zero-copy operation. But the initial upload adds a file to the object storage.

Conor Simmons

02/01/2023, 2:09 AM
But by the object storage you just mean the S3 store right? Don't I need to do the same with import?
Don't I need to do the same with import?
e.g. using rclone

Iddo Avneri

02/01/2023, 2:10 AM
I understand you use rclone to bring the data to an object store unrelated to lakeFS. What I mean is: when you upload, the file will be placed on the object store in which the lakeFS repository sits. When you import, we will only create a pointer to the file, in its original place.

Conor Simmons

02/01/2023, 2:11 AM
an objet store unrelated to lakeFS.
the object store in which the lakeFS repository sits.
what if this is the same object store?

Iddo Avneri

02/01/2023, 2:12 AM

Conor Simmons

02/01/2023, 2:18 AM
I think I'm on the same understanding of import and upload... But my confusion is still about why upload is slower than rclone + import? It seems more user-friendly to just stick with uploading everything
And if I am creating my object store for the first time, I don't already have the data to import

Iddo Avneri

02/01/2023, 2:21 AM
I’ll sync with Ariel and Elad tomorrow to understand that part better.

Conor Simmons

02/01/2023, 2:21 AM
Ok - thanks for your help and have a good night!
rclone sync seems like a potential alternative to upload. With 2.5 GB of data I'm estimating ~1 hr upload time with rclone sync, vs. the 2-4 hrs I was estimating with lakectl fs upload. However, with an upload speed of 31 Mbps, it should theoretically be closer to 10 minutes?
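As a quick back-of-the-envelope check of that raw-transfer figure (using only the numbers above, ignoring per-request overhead):

    # 2.5 GB at 31 Mbps, raw transfer only
    size_bits = 2.5 * 8 * 10**9         # 2.5 GB ~= 20 gigabits
    seconds = size_bits / (31 * 10**6)  # ~= 645 s
    print(seconds / 60)                 # ~= 10.8 minutes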

Elad Lachmi

02/01/2023, 6:09 PM
Yeah, sounds about right, but that's the raw transfer. Where you might see a difference is in the request/response overhead and processing

Ariel Shaqed (Scolnicov)

02/01/2023, 7:01 PM
Hi @Conor Simmons,
One issue with the S3 gateway is, as you've noticed, that all data flows through lakeFS. That's because of the S3 protocol... there's no way around it with that protocol.

You might see better performance with what I like to call a "direct" upload, using the lakeFS API. To do that you'd copy your file over on S3 to some new name (typically that name will be a UUID), then call an appropriate lakeFS API that links an existing object on the S3 backing store to wherever you want on your lakeFS branch. This will work! How much speedup you'll see will depend on how much faster you can perform the copies on S3 than through the lakeFS S3 gateway. For huge numbers of files, you can probably parallelize this and win big... especially if it's also a lot of data to copy.

However, now you're managing the data objects yourself. That means lakeFS won't manage them. So, for instance, you won't be able to get any garbage collection from lakeFS, because we cannot safely delete what we do not control. I suspect there might be a feature request hiding in here somewhere, about importing and copying the backing store. Justifying it would probably need a lot of numbers.
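For illustration, a rough sketch of that copy-then-link approach, assuming the stage_object operation exposed by the Python client of that era (the exact API may differ; bucket, keys, endpoint, and credentials below are placeholders, and error handling is omitted):

    import uuid

    import boto3
    import lakefs_client
    from lakefs_client import models
    from lakefs_client.client import LakeFSClient

    # Placeholder lakeFS endpoint and credentials
    conf = lakefs_client.Configuration(host="https://lakefs.example.com/api/v1")
    conf.username, conf.password = "AKIAlakefsexample", "secret"
    lakefs = LakeFSClient(conf)

    s3 = boto3.client("s3")  # talks to S3 directly, not through the lakeFS gateway

    # 1. Copy the existing object to a new, unique key on S3 (server-side copy)
    src_key = "raw/val2017/000000000139.jpg"
    dst_key = f"self-managed/{uuid.uuid4()}"
    s3.copy_object(
        Bucket="my-bucket",
        CopySource={"Bucket": "my-bucket", "Key": src_key},
        Key=dst_key,
    )
    head = s3.head_object(Bucket="my-bucket", Key=dst_key)

    # 2. Link that physical address to a logical path on a lakeFS branch.
    #    lakeFS only records a pointer; it does not manage (or garbage-collect) the object.
    lakefs.objects.stage_object(
        repository="coco",
        branch="main",
        path="val2017/000000000139.jpg",
        object_stage_creation=models.ObjectStageCreation(
            physical_address=f"s3://my-bucket/{dst_key}",
            checksum=head["ETag"].strip('"'),
            size_bytes=head["ContentLength"],
        ),
    )

Parallelizing the copy-and-link pairs across files is where the potential speedup over the gateway comes from.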
As an alternative, if your files don't start out on S3 but rather are written to it as part of your pipeline, then you might win by uploading directly to lakeFS. Now you can do a direct upload to a path controlled by lakeFS, and most of my above objections vanish. If this seems relevant, you might benchmark lakectl fs upload --direct for copying from your local disk to lakeFS. That passes data directly to S3 and only performs metadata operations on lakeFS.
🙏 1

Conor Simmons

02/01/2023, 7:07 PM
Thanks, I appreciate all the input. I will take a look into some of these options
🙏 1

Ariel Shaqed (Scolnicov)

02/01/2023, 7:10 PM
Sorry about the number of options. It really is quite subtle, and each of the methods has unique advantages. I wish I could just tell you "this is the best way", but it really depends on so many parameters.
👍 1
👍🏻 1