• Yusuf Khan

    8 months ago
    Hey 🙂 I'm going to be showing lakeFS to a couple of people today. One question I had: how can I show its rollback ability? For example:
    1. I ingest (track) a blob storage container that has data.
    2. I make a commit (on the main branch).
    3. I add new data to the blob storage container.
    4. I ingest again.
    5. I commit again.
    Now I want to revert back to the commit from step 2, which I'm doing with lakectl branch revert <repo> <commit id>. It seems to reset the state of the lakeFS tracking, but my blob storage still has the new data from step 3.
  • Yoni Augarten

    8 months ago
    Hey @Yusuf Khan!
    ...but my blob storage still has that new data from step 3.
    What exactly do you mean by that? Where are you seeing the data?
  • Yusuf Khan

    8 months ago
    Let me make a fresh repo and share screen grabs and steps.
  • Yoni Augarten

    8 months ago
    Perfect. Just to clarify: if you are referring to the underlying storage in Azure, the data will not be deleted from there. Since lakeFS needs the data to keep your history, it will not delete it from the storage. You can configure and run garbage collection to tell lakeFS to delete old files.
  • Yusuf Khan

    8 months ago
    So this is a dummy blob container; it has this one file:
  • I ingest and commit those changes:
  • I create a new branch and add a new file to the blob storage (let's pretend this new file is the result of some operation I was working on). I ingest to the new branch
  • Then I commit, and merge back into main
  • So now I have this in my log:
  • And my blob storage has file00 and file01
  • How do I revert back so that my blob storage only has file00?
  • Yoni Augarten

    8 months ago
    Please see my message above
  • lakeFS does not delete objects from your underlying storage. To see the effect of your revert, you can use any S3-compatible client configured to use lakeFS, or the lakeFS UI.
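  • For example, here's a minimal sketch of checking what the branch contains through lakeFS's S3 gateway with boto3 (the endpoint, credentials, and repository name below are placeholders, not values from this thread):

        import boto3

        # Point boto3 at the lakeFS S3 gateway instead of AWS S3.
        # Substitute your installation's endpoint and lakeFS access keys.
        s3 = boto3.client(
            "s3",
            endpoint_url="https://lakefs.example.com",
            aws_access_key_id="<lakefs-access-key-id>",
            aws_secret_access_key="<lakefs-secret-access-key>",
        )

        # Through the gateway, the bucket is the repository and the first
        # path segment is the ref (branch, tag, or commit ID).
        resp = s3.list_objects_v2(Bucket="example-repo", Prefix="main/")
        for obj in resp.get("Contents", []):
            print(obj["Key"])  # after the revert, only file00 should be listed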
  • Yusuf Khan

    8 months ago
    I see. Suppose I don't want to hard delete, but let's say I want to work on some task using only the data from commit 1 or some specific branch.
  • What is the equivalent of a git checkout <branch> or git checkout <commit> for lakeFS?
  • I'm having a bit of trouble tying this to a project workflow. If I'm working on an ML project and I want to get the data from branch X because it has some augmentations, how can I read from lakeFS into my repo? (Assuming I'm in a VM or something.)
  • Yoni Augarten

    8 months ago
    In general, you can access objects from a specific commit, branch or tag using the path
    s3://example-repo/example-ref/path/to/object
  • This is for S3-compatible clients like boto.
  • If you use the lakectl tool, the path will be the same, with the scheme changing to
    lakefs://
  • You can find many tools for accessing your data with lakeFS in the docs: https://docs.lakefs.io/integrations/
  • Once you choose your client, I can guide you further with the integration
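  • As a rough analogue of git checkout (a sketch reusing the placeholder endpoint and credentials from above; <commit-id> stands in for an ID from your commit log): instead of switching a working copy, you read objects at the ref you want:

        import boto3

        # Same placeholder lakeFS gateway client as in the earlier sketch.
        s3 = boto3.client(
            "s3",
            endpoint_url="https://lakefs.example.com",
            aws_access_key_id="<lakefs-access-key-id>",
            aws_secret_access_key="<lakefs-secret-access-key>",
        )

        # The ref segment pins what you read: a branch, a tag, or a commit ID.
        s3.download_file("example-repo", "main/path/to/object", "object-at-main")
        s3.download_file("example-repo", "<commit-id>/path/to/object", "object-at-commit")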
  • Yusuf Khan

    8 months ago
    Ah, I see. I have Databricks and Azure Blob Storage, but I guess my end client would just be an Azure VM.
  • So without lakeFS, on the Azure VM I'd use the Azure SDK, create a blob client, and use it to download whatever files I need. I'd like to do the same, except use lakeFS to point to a version of the files I need. Am I understanding this correctly? Is that possible?
  • Yoni Augarten

    8 months ago
    To use the Azure SDK, you will have to use a tool like https://github.com/gaul/s3proxy to interact with lakeFS via the S3 API.
  • Alternatively, you can use boto3 instead of the Azure SDK
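  • A sketch of that swap (same placeholder endpoint and keys as above; the branch and file names are hypothetical): the boto3 call plays the role of BlobClient.download_blob in the Azure SDK, with the first path segment selecting the version:

        import boto3

        # Placeholder lakeFS gateway client; substitute your endpoint and keys.
        s3 = boto3.client(
            "s3",
            endpoint_url="https://lakefs.example.com",
            aws_access_key_id="<lakefs-access-key-id>",
            aws_secret_access_key="<lakefs-secret-access-key>",
        )

        # Download a file as it exists on a given branch (or commit/tag).
        s3.download_file("example-repo", "experiment-branch/file01", "file01")

        # Or read the bytes directly, analogous to download_blob().readall().
        data = s3.get_object(Bucket="example-repo", Key="experiment-branch/file01")["Body"].read()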
  • Yusuf Khan

    8 months ago
    Ah, interesting. I shared today with my group and everyone really loves the concept. I'd like to work out some of the details so that it can be used as similarly to git as possible. For the above, I'm looking at the s3proxy package, and I can also look into boto3 (I didn't realize it had Azure methods). Is there anything else you'd recommend?
  • Yoni Augarten

    8 months ago
    Hey Yusuf, boto doesn't have Azure methods. You can use it to interact with lakeFS, since lakeFS implements an S3-compatible API.
  • Regarding tools, it really depends on how your pipeline looks. Since your existing code uses the Azure SDK, I think boto is a good place to start
  • I'm glad the presentation was successful 🙂
  • Yusuf Khan

    8 months ago
    We don't use the Azure SDK formally; it's just an example of what we could use to interact with blob storage. We're still at an early stage, so we're building out our tooling.
  • And okay, so from a VM I'd use boto3 to control lakeFS, which would in turn operate against my Azure blob storage?
  • Yoni Augarten

    8 months ago
    You will configure boto to operate on lakeFS (instead of S3). lakeFS will operate on your storage if needed
  • Yusuf Khan

    8 months ago
    Gotcha, thank you!