# help
u
Hey 🙂 I'm going to be showing lakeFS to a couple of people today, and one question I had was: how can I show its rollback ability? For example:
1. I ingest (track) a blob storage container that has data.
2. I make a commit (on the main branch).
3. I add new data to the blob storage container.
4. I ingest again.
5. I commit again.

Now I want to revert back to the commit from step 2, which I'm doing with `lakectl branch revert <repo> <commit id>`. It seems to reset the state of the lakeFS tracking, but my blob storage still has that new data from step 3.
u
Hey @Yusuf Khan!
> ...but my blob storage still has that new data from step 3.
What exactly do you mean by that? Where are you seeing the data?
u
Let me make a fresh repo and share screen grabs and steps.
u
Perfect. Just to clarify, if you are referring to the underlying storage in Azure, the data will not be deleted from there. Since lakeFS needs the data to keep your history, it will not delete it from the storage. You can configure and run garbage collection to tell lakeFS to delete old files.
u
I create a new branch, add a new file to the blob storage (let's pretend this new file is the result of some operation that I was working on), and ingest to the new branch.
u
Then I commit, and merge back into main
u
And my blob storage has file00 and file01
u
How do I revert back so that my blob storage only has file00?
u
Please see my message above
u
lakeFS does not delete objects from your underlying storage. To see the effect of your revert, you can use any S3 compatible client configured to use lakeFS, or the lakeFS UI
u
I see. Say I don't want to hard delete, but I want to work on some task using only the data from commit 1 or some specific branch.
u
What is the equivalent of a `git checkout <branch>` or `git checkout <commit>` for lakeFS?
u
I'm having a bit of trouble tying this to a project workflow. If I'm working on an ML project and I want to get the data from branch X because it has some augmentations, how can I read from lakeFS into my repo? (Assuming I'm in a VM or something.)
u
In general, you can access objects from a specific commit, branch or tag using the path
`s3://example-repo/example-ref/path/to/object`
u
This is for S3 compatible clients like boto.
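With boto3 it would look roughly like this (a minimal sketch — the endpoint URL, credentials, repository name and path are placeholders for your own setup):
```python
import boto3

# Point boto3 at the lakeFS S3 gateway instead of AWS S3.
# Endpoint and keys below are placeholders for your lakeFS installation.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="<lakeFS access key>",
    aws_secret_access_key="<lakeFS secret key>",
)

# Bucket = repository; the key prefix starts with the ref (branch, tag or commit ID).
resp = s3.list_objects_v2(Bucket="example-repo", Prefix="main/path/to/")
for obj in resp.get("Contents", []):
    print(obj["Key"])
```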
u
If you use the `lakectl` tool, the path will be the same, with the scheme changing to `lakefs://`.
u
You can find a lot of tools through which you can access your data with lakefs in the docs: https://docs.lakefs.io/integrations/
u
Once you choose your client, I can guide you further with the integration
u
Ah I see, I have Databricks and Azure Blob Storage. But I guess my end client would just be an Azure VM.
u
So without lakeFS, on the Azure VM I'd use the Azure SDK to create a blob client and use it to download whatever files I need. I'd like to do the same, except use lakeFS to point to a version of the files I need. Am I understanding this correctly? Is that possible?
u
To use the Azure SDK, you will have to use a tool like https://github.com/gaul/s3proxy to interact with lakefs via the S3 API
u
Alternatively, you can use boto3 instead of the Azure SDK
u
Ah interesting. I shared today with my group and everyone really loves the concept. I'd like to work out some of the details so that it can be used as similarly to git as possible. For the above, I'm looking at the s3proxy package, and I can also look into boto3 (I didn't realize it had Azure methods). Is there anything else you'd recommend?
u
Hey Yusuf, boto doesn't have Azure. You can use it to interact with lakeFS since lakeFS implements an S3 compatible API
u
Regarding tools, it really depends on how your pipeline looks. Since your existing code uses the Azure SDK, I think boto is a good place to start
u
I'm glad the presentation was successful 🙂
u
We don't use the Azure SDK formally, it's just an example of what we could use to interact with blob storage. We're still at an early stage, so we're building out our tooling.
u
And okay, so from a VM I'd use boto3 to control lakeFS, which would in turn operate against my Azure blob storage?
u
You will configure boto to operate on lakeFS (instead of S3). lakeFS will operate on your storage if needed
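Something like this rough sketch (same placeholder endpoint and credentials as above; `experiment-branch` and `file00` stand in for whatever ref and object you need):
```python
import boto3

# Same idea as before: boto3 talks to lakeFS, and lakeFS reads from your Azure Blob Storage.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",   # placeholder lakeFS endpoint
    aws_access_key_id="<lakeFS access key>",
    aws_secret_access_key="<lakeFS secret key>",
)

# Download file00 as it exists on a specific branch (a commit ID or tag works too).
s3.download_file("example-repo", "experiment-branch/file00", "/tmp/file00")
```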
u
Gotcha, thank you!