
Yusuf Khan

12/07/2021, 5:02 PM
Hey 🙂 I'm going to be showing lakeFS to a couple of people today, and one question I had was: how can I show its rollback ability? For example:
1. I ingest (track) a blob storage container that has data.
2. I make a commit (on the main branch).
3. I add new data to the blob storage container.
4. I ingest again.
5. I commit again.
Now I want to revert back to the commit from step 2, which I'm doing with lakectl branch revert <repo> <commit id>. It seems to reset the state of the lakeFS tracking, but my blob storage still has that new data from step 3.

Yoni Augarten

12/07/2021, 5:06 PM
Hey @Yusuf Khan!
> ...but my blob storage still has that new data from step 3.
What exactly do you mean by that? Where are you seeing the data?

Yusuf Khan

12/07/2021, 5:09 PM
Let me make a fresh repo and share screen grabs and steps.

Yoni Augarten

12/07/2021, 5:14 PM
Perfect. Just to clarify: if you are referring to the underlying storage in Azure, the data will not be deleted from there. Since lakeFS needs the data to keep your history, it will not delete it from the storage. You can configure and run garbage collection to tell lakeFS to delete old files.

Yusuf Khan

12/07/2021, 5:37 PM
So this is a dummy blob container, it has this one file:
I ingest and commit those changes:
I create a new branch, and add a new file to the blob storage (let's pretend this new file is the result of some operation I was working on). I ingest to the new branch.
Then I commit, and merge back into main.
So now I have this in my log:
And my blob storage has file00 and file01.
How do I revert back so that my blob storage only has file00?

Yoni Augarten

12/07/2021, 5:45 PM
Please see my message above.
lakeFS does not delete objects from your underlying storage. To see the effect of your revert, you can use any S3-compatible client configured to use lakeFS, or the lakeFS UI.
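For example, here's a rough boto3 sketch of listing main after the revert (the endpoint, keys and repo name are placeholders for your installation):

import boto3

# Point boto3 at the lakeFS S3 gateway instead of AWS S3.
# The endpoint URL and keys below are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
    aws_secret_access_key="...",
)

# The repository acts as the bucket, and the branch (or commit ID)
# is the first component of the key.
resp = s3.list_objects_v2(Bucket="example-repo", Prefix="main/")
for obj in resp.get("Contents", []):
    print(obj["Key"])  # after the revert, only main/file00 shows up

Your Azure container will still hold both files, but the lakeFS view of main won't include file01.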

Yusuf Khan

12/07/2021, 5:51 PM
I see. If I don't want to hard delete, but let's say, for example, I want to work on some task using only the data from commit 1 or from some specific branch...
What is the equivalent of a git checkout <branch> or git checkout <commit> for lakeFS?
I'm having a bit of trouble tying this to a project workflow. If I'm working on an ML project and I want to get the data from branch X because it has some augmentations, how can I read from lakeFS into my repo? Assuming I'm on a VM or something.

Yoni Augarten

12/07/2021, 5:55 PM
In general, you can access objects from a specific commit, branch or tag using the path
s3://example-repo/example-ref/path/to/object
This is for S3-compatible clients like boto.
If you use the lakectl tool, the path is the same, with the scheme changing to lakefs://
You can find a lot of tools through which you can access your data with lakeFS in the docs: https://docs.lakefs.io/integrations/
Once you choose your client, I can guide you further with the integration
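For example, with boto3 the "checkout" is just the ref you put at the start of the key. A minimal sketch (endpoint, keys, paths and the commit ID are placeholders):

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # your lakeFS endpoint (placeholder)
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
    aws_secret_access_key="...",
)

# Same object, two different versions: the branch head and a pinned commit.
# "main" and the commit ID below are placeholders.
head = s3.get_object(Bucket="example-repo", Key="main/path/to/object")
pinned = s3.get_object(Bucket="example-repo", Key="64cf3cde.../path/to/object")
print(head["Body"].read()[:80])

Unlike git checkout, nothing is copied into a working directory; you just read through whichever ref you want.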

Yusuf Khan

12/07/2021, 6:20 PM
Ah I see. I have Databricks and Azure Blob Storage, but I guess my end client would just be an Azure VM.
So without lakeFS, on the Azure VM I'd use the Azure SDK to create a blob client and use it to download whatever files I need. I'd like to do the same, except use lakeFS to point to a version of the files I need. Am I understanding this correctly? Is that possible?

Yoni Augarten

12/07/2021, 6:28 PM
To use the Azure SDK, you would have to use a tool like https://github.com/gaul/s3proxy to interact with lakeFS via the S3 API.
Alternatively, you can use boto3 instead of the Azure SDK.

Yusuf Khan

12/08/2021, 3:38 AM
Ah interesting. I shared today with my group and everyone really loves the concept. I'd like to work out some of the details so that it can be used as similarly to git as possible. For the above, I'm looking at the s3proxy package, and I can also look into boto3 (I didn't realize it had Azure methods). Is there anything else you'd recommend?

Yoni Augarten

12/08/2021, 3:44 AM
Hey Yusuf, boto doesn't support Azure. You can use it to interact with lakeFS since lakeFS implements an S3-compatible API.
Regarding tools, it really depends on how your pipeline looks. Since your existing code uses the Azure SDK, I think boto is a good place to start.
I'm glad the presentation was successful 🙂

Yusuf Khan

12/08/2021, 3:58 AM
We don't use the Azure SDK formally, it's just an example of what we could use to interact with blob storage. We're still at an early stage, so we're building out our tooling.
And okay, so from a VM I'd use boto3 to control lakeFS, which would in turn operate against my Azure blob storage?

Yoni Augarten

12/08/2021, 4:03 AM
You will configure boto to operate on lakeFS (instead of S3). lakeFS will operate on your storage if needed
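As a rough sketch (endpoint, keys and paths are placeholders), the Azure SDK download step becomes:

import boto3

# boto3 talks to lakeFS's S3-compatible endpoint; lakeFS fetches the
# actual bytes from your Azure blob container behind the scenes.
# The endpoint URL and keys are placeholders for your installation.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
    aws_secret_access_key="...",
)

# Equivalent of blob_client.download_blob(): grab file00 as it exists
# on main. Swap "main" for a commit ID or tag to pin an exact version.
s3.download_file("example-repo", "main/file00", "/tmp/file00")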

Yusuf Khan

12/08/2021, 3:57 PM
Gotcha, thank you!