# help
u
Hey 🙂 I'm going to be showing lakeFS to a couple of people today, and one question I had was: how can I show its rollback ability? For example:
1. I ingest (track) a blob storage container that has data.
2. I make a commit (on the main branch).
3. I add new data to the blob storage container.
4. I ingest again.
5. I commit again.

Now I want to revert back to the commit from step 2, which I'm doing with `lakectl branch revert <repo> <commit id>`. It seems to reset the state of the lakeFS tracking, but my blob storage still has that new data from step 3.
u
Hey @Yusuf Khan!
> ...but my blob storage still has that new data from step 3.
What exactly do you mean by that? Where are you seeing the data?
u
Let me make a fresh repo and share screen grabs and steps.
u
Perfect. Just to clarify, if you are referring to the underlying storage in Azure, the data will not be deleted from there. Since lakeFS needs the data to keep your history, it will not delete it from the storage. You can configure and run garbage collection to tell lakeFS to delete old files.
u
I create a new branch, add a new file to the blob storage (let's pretend this new file is the result of some operation that I was working on), and ingest to the new branch.
u
Then I commit, and merge back into main
u
And my blob storage has file00 and file01
u
How do I revert back so that my blob storage only has file00?
u
Please see my message above
u
lakeFS does not delete objects from your underlying storage. To see the effect of your revert, you can use any S3 compatible client configured to use lakeFS, or the lakeFS UI
u
I see. Say I don't want to hard delete, but I want to work on some task using only the data from commit 1 or some specific branch.
u
What is the equivalent of a `git checkout <branch>` or `git checkout <commit>` for lakeFS?
u
I'm having a bit of trouble tying this to a project workflow. If I'm working on an ML project and I want to get the data from branch X because it has some augmentations, how can I read from lakeFS into my repo? (Assuming I'm in a VM or something.)
u
In general, you can access objects from a specific commit, branch or tag using the path
`s3://example-repo/example-ref/path/to/object`
u
This is for S3 compatible clients like boto.
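With boto3 it would look roughly like this (a minimal sketch — the endpoint URL, credentials, repository name and path are placeholders for your own setup):
```python
import boto3

# Point boto3 at the lakeFS S3 gateway instead of AWS S3.
# Endpoint and keys below are placeholders for your lakeFS installation.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="<lakeFS access key>",
    aws_secret_access_key="<lakeFS secret key>",
)

# Bucket = repository; the key prefix starts with the ref (branch, tag or commit ID).
resp = s3.list_objects_v2(Bucket="example-repo", Prefix="main/path/to/")
for obj in resp.get("Contents", []):
    print(obj["Key"])
```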
u
If you use the `lakectl` tool, the path will be the same, with the scheme changing to `lakefs://`.
u
You can find a lot of tools through which you can access your data with lakefs in the docs: https://docs.lakefs.io/integrations/
u
Once you choose your client, I can guide you further with the integration
u
Ah I see, I have Databricks and Azure Blob Storage. But I guess my end client would just be an Azure VM.
u
So without lakeFS, on the Azure VM I'd use the Azure SDK to create a blob client and use it to download whatever files I need. I'd like to do the same, except use lakeFS to point to a version of the files I need. Am I understanding this correctly? Is that possible?
u
To use the Azure SDK, you will have to use a tool like https://github.com/gaul/s3proxy to interact with lakefs via the S3 API
u
Alternatively, you can use boto3 instead of the Azure SDK
u
Ah interesting. I shared today with my group and everyone really loves the concept. I'd like to work out some of the details so that it can be used as similarly to git as possible. For the above, I'm looking at the s3proxy package, and I can also look into boto3 (I didn't realize it had Azure methods). Is there anything else you'd recommend?
u
Hey Yusuf, boto doesn't have Azure. You can use it to interact with lakeFS since lakeFS implements an S3 compatible API
u
Regarding tools, it really depends on how your pipeline looks. Since your existing code uses the Azure SDK, I think boto is a good place to start
u
I'm glad the presentation was successful 🙂
u
We don't use the Azure SDK formally, it's just an example of what we could use to interact with blob storage. We're still at an early stage, so we're building out our tooling.
u
And okay, so from a VM I'd use boto3 to control lakeFS, which would in turn operate against my Azure blob storage?
u
You will configure boto to operate on lakeFS (instead of S3). lakeFS will operate on your storage if needed
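Something like this rough sketch (same placeholder endpoint and credentials as above; `experiment-branch` and `file00` stand in for whatever ref and object you need):
```python
import boto3

# Same idea as before: boto3 talks to lakeFS, and lakeFS reads from your Azure Blob Storage.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",   # placeholder lakeFS endpoint
    aws_access_key_id="<lakeFS access key>",
    aws_secret_access_key="<lakeFS secret key>",
)

# Download file00 as it exists on a specific branch (a commit ID or tag works too).
s3.download_file("example-repo", "experiment-branch/file00", "/tmp/file00")
```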
u
Gotcha, thank you!