Seungchan Lee

04/25/2023, 3:58 PM
Hi, it looks like the API allows for only one file upload at a time - what's the best practice for uploading an image folder (i.e. with subfolders, multiple images, a CSV with an image index, etc.) via lakeFS?

Idan Novogroder

04/25/2023, 4:24 PM
Hi @Seungchan Lee, I can think of two alternatives you can use:
1. If you want to import data from S3, you can use the lakeFS import capability: https://docs.lakefs.io/howto/import.html
2. You can use the lakeFS S3 gateway, which supports multipart uploads (I'm assuming you are using S3, based on your last message): https://docs.lakefs.io/reference/s3.html
Please let me know if this was helpful or if I can help in any other way 🙏🏽.

Seungchan Lee

04/25/2023, 4:26 PM
OK thanks! Just a quick follow-up - if the lakeFS S3 gateway supports multipart uploads, why isn't there a client API for bulk upload?

Oz Katz

04/25/2023, 4:31 PM
Hey @Seungchan Lee - nice to meet you! I want to better understand what you're trying to do. Is it:
1. Upload a single file in smaller chunks, in parallel (i.e. a multipart upload)?
2. Upload multiple items concurrently?
Also, which client are you looking to use? an SDK? CLI? a framework such as Spark?

Seungchan Lee

04/25/2023, 4:32 PM
I'm trying to upload multiple items concurrently - imagine creating an image dataset in S3 and having to add to/update the dataset over time. We may use the SDK or CLI.
We will likely also extend the use case to something like Airbyte, where multiple data sources are merged and output to S3 periodically.

Oz Katz

04/25/2023, 4:36 PM
Got it! Thanks. Follow-up question: the list of files on S3 - is that a directory or common prefix that you'd like to have synced, or an arbitrary list of different files scattered throughout the bucket?
Phrased differently, would the input be a root location from which you'd like lakeFS to take everything, or an array of paths?

Seungchan Lee

04/25/2023, 4:37 PM
Each dataset we create will have a unique prefix, and we plan to match it to a lakeFS repo.
Directory structure in each dataset may differ depending on data type, etc.
Maybe using zero-copy import is a better solution here?
👍 1

Oz Katz

04/25/2023, 4:41 PM
Got it! Thanks again. I think this is a good use case for the zero-copy import functionality in lakeFS - it allows you to have lakeFS "point" to a set of objects in another location without actually having to copy the files into the lakeFS repository. If you import path foo/ and later import that same path again on the same lakeFS branch, lakeFS would actually show you in its diff the changes that occurred since the last import (what was removed from that directory, what was added, etc.). One thing to keep in mind - if you delete anything from foo/, naturally you won't be able to access it even from historical commits on the lakeFS side.

Seungchan Lee

04/25/2023, 4:43 PM
Yeah, this is what I'm a bit concerned about - if I change the underlying bucket (for example, delete some files), then lakeFS won't be able to roll back.
That removes a big part of the value of having lakeFS, doesn't it?

Oz Katz

04/25/2023, 4:43 PM
Copying in parallel is the alternative - you can use rclone / aws s3 sync / distcp or any other tool that allows parallel copying
If the input directory changes often, then yes, lakeFS would not be able to guarantee immutability.
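As an aside, the same parallel-copy idea can be sketched in Python with boto3 rather than one of the tools named above - a rough, hypothetical example in which the source bucket, prefix, lakeFS endpoint, credentials, repository, and branch are all placeholders:

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

# Source: regular AWS S3. Destination: lakeFS via its S3 gateway.
# All endpoints, credentials, and bucket/repo/branch names are placeholders.
src = boto3.client("s3")
dst = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
    aws_secret_access_key="...",
)

SRC_BUCKET, SRC_PREFIX = "my-datasets", "images-v1/"
DST_REPO, DST_BRANCH = "my-repo", "main"

def copy_key(key: str) -> None:
    # Stream the object from the source bucket and upload it into the lakeFS
    # repo, keeping the same relative path under the branch.
    body = src.get_object(Bucket=SRC_BUCKET, Key=key)["Body"]
    dst.upload_fileobj(body, DST_REPO, f"{DST_BRANCH}/{key[len(SRC_PREFIX):]}")

# List every object under the prefix, then copy them concurrently.
keys = [
    obj["Key"]
    for page in src.get_paginator("list_objects_v2").paginate(
        Bucket=SRC_BUCKET, Prefix=SRC_PREFIX
    )
    for obj in page.get("Contents", [])
]
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(copy_key, keys))
```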

Seungchan Lee

04/25/2023, 4:44 PM
But if I’m copying, I’m duplicating the dataset
Which is what lakeFS is trying to avoid too - in-place versioning

Oz Katz

04/25/2023, 4:45 PM
True - that's the tradeoff. Import doesn't require a copy but can't guarantee immutability; copying can guarantee immutability but requires storage.

Seungchan Lee

04/25/2023, 4:45 PM
Strange that lakeFS hasn't run into this issue before - I consider dataset creation and versioning a pretty basic use case.

Oz Katz

04/25/2023, 4:46 PM
Once the data is managed in lakeFS, you enjoy deduplication across branches (i.e. creating another branch doesn't copy the data again). You then define garbage collection rules to determine when files can get cleaned up

Seungchan Lee

04/25/2023, 4:46 PM
I understand, but without bulk upload, how does one create a large dataset via lakeFS in the first place?

Oz Katz

04/25/2023, 4:47 PM
A single lakeFS instance would let you upload many hundreds (and even thousands) of objects in parallel - same as S3 in that regard.
Parallelizing is up to the client, and in fact, most S3 clients support doing so out of the box.

Seungchan Lee

04/25/2023, 4:49 PM
So I'm still a bit confused - your recommendation is then to use the lakeFS S3 gateway?

Oz Katz

04/25/2023, 4:50 PM
Either use the S3 gateway with the tools suggested above, or do so programmatically with e.g. the lakeFS Python SDK - set up multiple threads or routines that perform uploads in parallel.
(With the Python SDK there's no need to use the S3 gateway; it would use the lakeFS REST API instead.)

Seungchan Lee

04/25/2023, 4:52 PM
OK, so with the SDK, basically use
client.objects.upload_object
but handle the concurrency myself.
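For reference, a minimal sketch of that approach with the lakeFS Python SDK (the lakefs_client package) - the endpoint, credentials, repository, branch, and local directory below are placeholders, and the exact client construction may vary slightly between SDK versions:

```python
import os
from concurrent.futures import ThreadPoolExecutor

import lakefs_client
from lakefs_client.client import LakeFSClient

# Placeholders - point these at your lakeFS server and credentials.
conf = lakefs_client.Configuration(
    host="https://lakefs.example.com/api/v1",
    username="AKIAIOSFODNN7EXAMPLE",  # lakeFS access key ID
    password="...",                   # lakeFS secret access key
)
client = LakeFSClient(conf)

REPO, BRANCH = "my-repo", "main"   # hypothetical repository and branch
LOCAL_DIR = "./dataset"            # local folder with images, CSV index, etc.

def upload_one(local_path: str) -> None:
    # Use the path relative to the dataset root as the object key.
    key = os.path.relpath(local_path, LOCAL_DIR).replace(os.sep, "/")
    with open(local_path, "rb") as f:
        client.objects.upload_object(
            repository=REPO, branch=BRANCH, path=key, content=f
        )

# Collect every file under the dataset folder and upload concurrently -
# the concurrency lives here, not in the SDK.
files = [
    os.path.join(root, name)
    for root, _, names in os.walk(LOCAL_DIR)
    for name in names
]
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(upload_one, files))
```

Committing the branch afterwards (via the commits API or lakectl) would then capture this upload as a dataset version.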

Oz Katz

04/25/2023, 4:53 PM
indeed

Seungchan Lee

04/25/2023, 4:53 PM
Or use the S3 gateway.
👍 1

Oz Katz

04/25/2023, 4:53 PM
I'm wondering - what would an ideal solution look like for you? What is lakeFS missing, in your eyes, to make that easier?

Seungchan Lee

04/25/2023, 4:54 PM
I think the S3 gateway API might work, but it's more that the quickstart tutorials do not provide any usable examples for this. If I were to create an image dataset using lakeFS, what's the best way to do it?
It'd be nice to have something like this documented.
Currently, I'm just using the typical workflow of allowing our app users to upload files from their local dir to our S3 bucket.
I wanted to use lakeFS to version it, but I wasn't sure how.
I'll try out the gateway option to replace our S3 upload code and see if that works.
Thanks!

Oz Katz

04/25/2023, 4:57 PM
Thanks for the great feedback! We've been discussing this use case a lot lately, and we do have some ideas on how to make it better. I'd love it if you could spare a few minutes sometime this week for a quick session? Would love to run some ideas by you 🙂

Seungchan Lee

04/25/2023, 4:58 PM
Oh sorry - one more question regarding the gateway
I’m looking at this page: https://docs.lakefs.io/reference/s3.html
But it's not clear to me how to use it.
Can you point me to an example or more documentation on this?
As for discussing this, sure, but I'm actually swamped this week - next week would work, Thu/Fri.
Maybe send me your Calendly link or something and we can chat

Oz Katz

04/25/2023, 5:00 PM
All you have to do is configure your S3 endpoint to point to your lakeFS server. See this example.

Seungchan Lee

04/25/2023, 5:00 PM
OK thank you

Oz Katz

04/25/2023, 5:00 PM
For each S3 client it might be configured a bit differently, but most of them allow overriding the endpoint URL.

Seungchan Lee

04/25/2023, 5:01 PM
Hmm, this one is the Python SDK though - how would I use multipart upload with this?
Oh, or using the boto section?

Oz Katz

04/25/2023, 5:01 PM
whenever works for you
Yea, I was referring to the boto example
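For context, a minimal boto3 sketch of that boto-style setup - the endpoint, credentials, repository, and branch below are placeholders. With the gateway, the "bucket" is the repository name and the key starts with the branch; boto3's upload_file switches to multipart automatically above the configured threshold:

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Point the S3 client at the lakeFS server instead of AWS S3.
# Endpoint and keys are placeholders - use your lakeFS URL and lakeFS credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
    aws_secret_access_key="...",
)

# Files above the threshold are uploaded as multipart, with parallel parts.
config = TransferConfig(multipart_threshold=8 * 1024 * 1024, max_concurrency=10)

s3.upload_file(
    Filename="local/images/cat_001.jpg",
    Bucket="my-repo",                # hypothetical lakeFS repository
    Key="main/images/cat_001.jpg",   # branch name followed by the object path
    Config=config,
)
```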

Seungchan Lee

04/25/2023, 5:02 PM
Got it, thanks.
How's 10am PT Thu next week (May 4)?
If you give me your email, I can send you an invite with a Google Meet link - we can leave the video off.

Oz Katz

04/25/2023, 5:14 PM
Do keep in mind I'm in GMT+3 timezone 🙂

Seungchan Lee

04/25/2023, 5:16 PM
Do you prefer another time?

Oz Katz

04/25/2023, 5:21 PM
will your 9am work?

Seungchan Lee

04/25/2023, 5:21 PM
Sure - let’s do 9am

Oz Katz

04/25/2023, 5:21 PM
cool, thanks!

Seungchan Lee

04/25/2023, 5:21 PM
No problem!
Invite sent - it’s from my personal email address
Let me know if you got it

Oz Katz

04/25/2023, 5:24 PM
I did! looking forward to talking 🙂
👍 1

Seungchan Lee

04/25/2023, 5:24 PM
Great- talk to you then!
🤘 1