# help
Hey, I need a way to use branches to access files on S3 and to be sure that files are shared between branches without having to do any copying. However, is there a way to do this without having to create a commit every time I add a file? In short, could I work with a dirty work tree like I would in Git? I guess I would have to create a commit before creating any new branch, right?
Hi! Yes, you would want to commit before branching out. Can you please share your use case? (i.e., why avoid committing?)
Hey @C. bon! You are correct. Branches in lakeFS share committed data, but uncommitted data is scoped per branch. This means you would have to commit in order to expose changes across branches. However, unlike Git (where changes are made to a local copy of the repo), multiple consumers can read/write to the same branch concurrently. Depending on your use case, this might be a sensible solution. Would love it if you could share the use case you had in mind!
Sure. In my use case I have a piece of software that stores data in immutable files. Those files get merged after some time to clean out old data, make room for new data, etc. What I have in mind is to be able to create branches, but I don't want my users to have to manage files, or my software to have to commit every time it creates new files. So the idea is to have a default branch when it starts, and then, when a new branch needs to be created, create a commit before branching. Thinking about it, I also want to do incremental backups; I could probably create one commit per incremental backup on each of my existing branches (I might not enable backup on all of my branches).
hmm, I see. I guess some of this could be automated with the lakeFS API: expose a function to consumers that, when called, commits and then branches from the existing branch. Alternatively, have your code commit every time it finishes running and creating new files?
does this make sense?
My consumers don't interact with files at all; they interact with my APIs. My software creates files on demand behind the scenes. That's why I'd like to avoid touching that part to commit files every time they get created/deleted, and instead commit only on certain occasions like snapshots and new branches. And yes, if lakeFS exposes an API, that's perfect. I suppose that's what the CLI uses?
exactly! there are sdks for Python and Java, but it’s a REST API with an OpenAPI spec that you can call from pretty much any language
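To make the "commit, then branch" flow concrete, here is a minimal sketch against the lakeFS REST API using only the Python standard library. The host, credentials, and repo/branch names are placeholders; the endpoint paths (`POST .../branches/{branch}/commits` and `POST .../branches`) follow the lakeFS OpenAPI spec, but check the spec for your server version before relying on them.

```python
import json
import urllib.request

BASE = "http://localhost:8000/api/v1"  # assumption: local lakeFS instance

def commits_url(base, repo, branch):
    # POST here turns the branch's uncommitted changes into a commit.
    return f"{base}/repositories/{repo}/branches/{branch}/commits"

def branches_url(base, repo):
    # POST here creates a new branch from a source ref.
    return f"{base}/repositories/{repo}/branches"

def post_json(url, payload, auth_header):
    # Small JSON-POST helper; `auth_header` is whatever scheme your
    # lakeFS deployment uses (e.g. basic auth with access/secret keys).
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": auth_header},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def commit_and_branch(repo, source, new_branch, message, auth_header):
    # 1. Commit whatever is currently uncommitted on `source`.
    post_json(commits_url(BASE, repo, source),
              {"message": message}, auth_header)
    # 2. Branch off the now-committed head of `source`.
    return post_json(branches_url(BASE, repo),
                     {"name": new_branch, "source": source}, auth_header)
```

A call like `commit_and_branch("my-repo", "main", "experiment-1", "snapshot before branching", auth)` matches the order discussed above: commit first, so the new branch sees all current files.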
Cool, @Oz Katz, it's really a cool project and I had a similar idea in the past. I would love to leverage it instead of having to reinvent the wheel 😉
🙂 happy to hear! hope this gave you a path forward.
Hey @Oz Katz, regarding this: let's say I have many users, and for each of them I manage a different repo in the same S3 bucket. Would that be a problem in terms of performance? I suppose not, given that most of the time I would push files but not manage repos at all until a branch has to be created. To remind you, what I have in mind is:
• create a repo, which would automatically create the main branch (I guess)
• push new files / delete files for some amount of time
• when a branch needs to be created:
    • update the main branch with the current folder content
    • create branch _new_branch_ off of main
So I can't see how performance would be impacted.
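The bookkeeping for this "commit only at branch-creation or backup time" policy can be very small. The class below is a hypothetical sketch of that logic with the actual lakeFS calls stubbed out; names like `BranchState` and the commit-message format are my own, not anything lakeFS prescribes.

```python
class BranchState:
    """Tracks one branch; commits only when a branch or backup is requested."""

    def __init__(self, repo, name):
        self.repo = repo
        self.name = name
        self.uncommitted = 0  # files written/deleted since the last commit

    def record_write(self):
        # Called by the storage layer after each file create/delete;
        # deliberately does NOT commit.
        self.uncommitted += 1

    def _commit(self, message):
        # Placeholder for the real lakeFS commit call.
        self.uncommitted = 0
        return message

    def branch_off(self, new_name):
        # Commit first, so the new branch sees all current files.
        if self.uncommitted:
            self._commit(f"snapshot before branching to {new_name}")
        return BranchState(self.repo, new_name)

    def backup(self, label):
        # One commit per incremental backup, per branch.
        return self._commit(f"backup: {label}")
```

With this shape, per-file writes stay cheap (a counter bump), and commits happen exactly at the two events listed above.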
@Vino
Hey @C. bon! Thanks for reaching out. Yes, the workflow you're thinking of is totally possible. 1. Create several repos from the same S3 bucket (but under different prefixes). Note that lakeFS doesn't allow creating multiple repos on the same bucket, because repos can't share a storage namespace. 2. When creating new branches, committing the current data to the main branch and then creating the new branch is the recommended approach. I don't see a performance bottleneck. However, I'll let @Oz Katz go over whether there are any limitations on the size of uncommitted data lakeFS can handle.
> Note that lakeFS doesn't allow creating multiple repos on the same bucket
You mean "doesn't allow creating multiple repos on the same (bucket, prefix) pair", right? Otherwise it contradicts what you just said before, where it's allowed with different prefixes.
Yep. Exactly. You can create new repos for every unique (bucket, prefix) pair.
Hey @C. bon! I agree with everything @Vino said 🙂 As long as prefixes within said bucket are unique per repository, you can host as many repos as you want on that same bucket; this should have no impact on performance.
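For the per-user layout discussed above, the storage namespace can be derived from one bucket plus a unique per-user prefix, so no two repos ever share a (bucket, prefix) pair. The helper below is a sketch; the bucket and naming scheme are assumptions, while the `name` / `storage_namespace` / `default_branch` fields follow the lakeFS repository-creation API.

```python
def user_repo_config(bucket, user_id):
    # One repo per user, each under a distinct prefix of the same bucket,
    # so no two repos share a (bucket, prefix) pair.
    return {
        "name": f"user-{user_id}",
        "storage_namespace": f"s3://{bucket}/user-{user_id}",
        "default_branch": "main",  # created automatically with the repo
    }
```

POSTing a payload like this to the repositories endpoint (`/api/v1/repositories` in the OpenAPI spec) would create the repo and its main branch in one step.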
Cool thanks for confirming that
happy to assist 🙂