Danyil Butkovskyi

05/16/2023, 5:24 PM
Hi team, I'm doing some research on lakeFS for our project and have a few questions (I know they're silly, but I couldn't find anything online):
1. Is there any way to pull data from lakeFS into Azure storage?
2. Is there any way to import all containers from the storage account?

Jonathan Rosenberg

05/16/2023, 5:57 PM
Hi @Danyil Butkovskyi welcome to the lake :lakefs:
1. What do you mean by pulling data from lakeFS into Azure storage? Can you explain the use case?
2. You can zero-copy import your data from a given container using `lakectl import` or the UI, as explained here.

Danyil Butkovskyi

05/16/2023, 6:28 PM
Thanks for the response. Let me take a step back to how I understood it (please don't judge me if I say something stupid, I am just learning). lakeFS keeps a Git-like representation of my storage, in my case Azure Gen 2. Let's say I have container A and I run a transformation on it, but something goes wrong and I mess it up. In my understanding I would pull the version of the data from before this happened, right?
Here is another example I am trying to understand right now. I have storage containers A and A_test. In lakeFS I have a master branch (data from A) and branch_test. I run a transformation on A and save the result into A_test, then I import that into branch_test. It looks good to me, so I decide to merge branch_test into master, but now my master has data that is not in A. How do I update my container A with the most recent data?

Jonathan Rosenberg

05/16/2023, 6:31 PM
There are no stupid questions here 🙂 Ok, so I'll try to explain how it works:
lakeFS is an abstraction layer that sits in front of the storage, in your case Azure Gen 2. When you first start using lakeFS you need, just as you do with git, to create a repository. The repository is located in your container (or any sub "directory" in it). Now, instead of sending requests directly to Azure, you interact with the lakeFS repository and branch (which store the data in the container you defined the repository with). You can read more on the lakeFS data model here.
Now let's refer to your case. Let's say that you don't have any data in your container yet. You just have an empty container. That's container A. Now you need to configure a repository in lakeFS with a storage namespace that points to your container (A). That repo, again very much like git, will be created with a default branch (`main`/`master`). You can now branch out (on lakeFS; you don't need to take any direct action on your storage) to a `test` branch and run experiments on it. Any transformation that you apply on that branch is totally isolated from every other branch, so your data will not be harmed in any way. Needless to say, branching out is zero-copy, i.e. none of your data is copied. Any changes to your data are reflected only on the test branch. I advise you to follow the quickstart to get a feel for how it works.
Now, what if you already have data in a container and you want to import it to lakeFS? I.e. you have a container A with data in it, and you want lakeFS to manage it. Simply initializing lakeFS in the container won't work; you need to import the data. To import it, follow this guide. Let me emphasize that none of your original data is actually copied; rather, pointers to the data are created. If you have any further questions feel free to ask. Have a great time!
I would also advise you to watch this video. It's using AWS but the concepts are the same…
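The zero-copy behaviour described above (branches and imports as pointers rather than copies) can be pictured with a small toy model. This is only a sketch for intuition, not how lakeFS is implemented: `ZeroCopySketch`, `Repo`, `createBranch`, `write` and `importExisting` are hypothetical names made up for this illustration, and real lakeFS tracks commits and metadata that are left out here.

```scala
// Toy model only: illustrates pointer-based (zero-copy) branching.
// None of these names come from lakeFS itself.
object ZeroCopySketch {
  // A branch maps logical paths to physical object addresses in the container.
  type Branch = Map[String, String]

  final case class Repo(branches: Map[String, Branch]) {
    // Branching copies pointers only; no object in the container is duplicated.
    def createBranch(name: String, from: String): Repo =
      copy(branches = branches + (name -> branches(from)))

    // Writing on one branch repoints a path there and leaves other branches alone.
    def write(branch: String, path: String, address: String): Repo =
      copy(branches = branches + (branch -> (branches(branch) + (path -> address))))
  }

  // "Importing" registers pointers to objects that already sit in the container.
  def importExisting(objects: Seq[String], containerUri: String): Branch =
    objects.map(o => o -> s"$containerUri/$o").toMap
}
```

In this model, creating a `test` branch from `main` and then writing to `test` changes only the pointer held by `test`; `main` keeps pointing at the original object.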

Danyil Butkovskyi

05/16/2023, 7:45 PM
Thank you! I missed the point that I need to interact with the data via lakeFS now (I was trying to update it directly and wondering why it doesn't work 😕). It took me some time but I think I am getting it. To make it clear: I am using Databricks and right now I use mounted storage. I need to replace it with the lakeFS configuration (https://docs.lakefs.io/v0.52/integrations/databricks.html), correct? I will try a few things tomorrow, and if I have questions can I leave a reply here or should I start a new thread?

Jonathan Rosenberg

05/16/2023, 7:48 PM
You can leave a reply here… No problem

HT

05/16/2023, 9:57 PM
The way I see lakeFS:
• It's an S3 provider (that you configure behind the scenes to use AWS S3, Azure Blob, MinIO, etc.).
• It has a "snapshot" capability, not git versioning! A lakeFS merge is quite different from a git merge, as it will not try to merge the data together. A lakeFS merge basically takes this "snapshot", aka commit, as the latest version of a given branch, whereas a git merge takes the changes from both branches and tries to combine them.

Iddo Avneri

05/16/2023, 10:05 PM
@HT - not sure exactly what you mean by "not try to merge the data together". You are correct that we won't merge together changes within a file. But lakeFS merges will in fact take changes from both branches and merge them together.
The reason for the difference is the different use case. In git you have a smaller number of files and you are interested in single-line differences within those files. With big data, you have millions or billions of files, and you won't compare a specific line manually before a merge.

HT

05/16/2023, 10:07 PM
Let's consider file X, which differs between branch A and branch B. When you merge A into B, what happens to file X?

Iddo Avneri

05/16/2023, 10:07 PM
Exactly - that is within the same file - which is not the common use case for big data.

HT

05/16/2023, 10:08 PM
but with git, it will either merge or conflict

Iddo Avneri

05/16/2023, 10:08 PM
So will lakeFS…
Merge. Or conflict.
Here is a quick example:

HT

05/16/2023, 10:11 PM
What is the default merge strategy? Raise to the user?

Iddo Avneri

05/16/2023, 10:12 PM
Merge.gif

HT

05/16/2023, 10:23 PM
So by default a conflict will make your merge fail, right?

Iddo Avneri

05/16/2023, 10:27 PM
Right

HT

05/16/2023, 10:30 PM
Thanks @Iddo Avneri for clearing this up. lakeFS is a bit better than just a snapshot system. It just doesn't merge at the file-content level, but it does merge at the file level.

Iddo Avneri

05/16/2023, 10:42 PM
It is correct that we do not merge at the file-content level. The use case is usually merging data though :). Happy to help!
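The file-level merge behaviour agreed on above (changes from both branches are combined, and a file changed differently on both sides is a conflict that fails the merge) can be sketched as a three-way merge over object paths. This is a toy model under stated assumptions, not lakeFS's actual implementation: `ObjectMergeSketch`, `Snapshot` and `merge` are hypothetical names, each branch is reduced to a map from path to a content identifier, and `base` stands for the common ancestor of the two branches.

```scala
// Toy model of an object-level three-way merge; not lakeFS's actual implementation.
object ObjectMergeSketch {
  // A snapshot maps each object path to a content identifier (e.g. a checksum).
  type Snapshot = Map[String, String]

  // Merge `source` into `dest`, using their common ancestor `base`.
  // Left = conflicting paths (the merge fails); Right = the merged snapshot.
  def merge(base: Snapshot, source: Snapshot, dest: Snapshot): Either[Set[String], Snapshot] = {
    val paths = base.keySet ++ source.keySet ++ dest.keySet
    // A path conflicts when both sides changed it relative to base, in different ways.
    val conflicts = paths.filter { p =>
      val (b, s, d) = (base.get(p), source.get(p), dest.get(p))
      s != b && d != b && s != d
    }
    if (conflicts.nonEmpty) Left(conflicts)
    else Right(paths.flatMap { p =>
      // Take the source version if source changed the path; otherwise keep dest.
      val winner = if (source.get(p) != base.get(p)) source.get(p) else dest.get(p)
      winner.map(p -> _)
    }.toMap)
  }
}
```

File contents are never inspected here: whole objects are either taken from one side or reported as conflicts, which is the distinction from git's line-level merge discussed in the thread.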

HT

05/16/2023, 10:43 PM
I was going to use lakeFS to manage a dataset with images and annotations. The difficulty will be the annotation changes and merges ... We may need a different system for annotations ...

Iddo Avneri

05/16/2023, 10:57 PM
That's actually a pretty common use case. Would be happy to present to you a notebook going through this example.
I believe @Amit Kesarwani prepared something like that before.

HT

05/16/2023, 10:58 PM
Do you need to manually merge the annotation files before you do a lakeFS merge?

Amit Kesarwani

05/16/2023, 11:05 PM
@HT I will send you a direct message regarding managing datasets with images and annotations. The original question of this thread was different, so I don't want to mix multiple conversations here.
👍 1

Danyil Butkovskyi

05/17/2023, 5:37 PM
@Jonathan Rosenberg thank you again, lakeFS makes much more sense now. I have a few new questions.
1. When I try to write a stream I get this error: `com.databricks.tahoe.store.S3LockBasedLogStore doesn't support createAtomicIfAbsent(path: Path)`. Is there any way to use lakeFS with writeStream?
2. Is there a way to "export" a branch? For our use case a customer might want to export data, and I believe lakeFS doesn't have this capability, correct? So our best option is probably to run a Spark job that extracts the data into storage directly.
And btw, I just started, but I really enjoy working with lakeFS; it feels like something we needed for a long time :)

Iddo Avneri

05/17/2023, 6:09 PM
Regarding the second question: you can export in a few ways. There is a more in-depth overview of the file structure and the ways to export here.
☝️ 1

Jonathan Rosenberg

05/17/2023, 7:30 PM
@Danyil Butkovskyi I'm glad that you find lakeFS helpful and enjoyable! Wait till you find out about lakeFS hooks 😎 but one step at a time… Can you share your code and Spark configurations so that I'll have more context?

Danyil Butkovskyi

05/17/2023, 8:03 PM
@Jonathan Rosenberg sure, here it is. The problem occurs when using writeStream.
var df = spark.readStream.format("cloudFiles")
  .option("cloudFiles.useNotifications", "false")
  .option("cloudFiles.format", "binaryFile")
  .load(s"s3a://${repo}/${branch}/orig/original/")
  .select(toStrUDF($"content").alias("text"))
  .select(from_xml($"text", payloadSchema).alias("Return"))
  .writeStream.format("delta").outputMode("append")
  .option("checkpointLocation", s"s3a://${repo}/bronze/orig/original/checkpointLocation")
  .start(s"s3a://${repo}/bronze/orig/bronze/")

Jonathan Rosenberg

05/17/2023, 8:04 PM
Great, I'll also need the Spark configurations that you run it with.

Danyil Butkovskyi

05/17/2023, 8:14 PM
There's not much; it's pretty much default Databricks 12.2 LTS with Spark 3.3.2 and Scala 2.12.
spark.conf.set("spark.databricks.delta.logStore.crossCloud.fatal", "false")
spark.databricks.cluster.profile singleNode
spark.master local[*, 4]
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "key")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "key")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "my_server")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
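The `fs.s3a.*` settings above apply to every `s3a://` path in the session. Hadoop's S3A connector also accepts per-bucket overrides of the form `fs.s3a.bucket.<bucket>.<option>`, which can scope the lakeFS endpoint and credentials to one repository. The sketch below only builds those keys and the `s3a://<repo>/<branch>/<path>` form used in this thread; `LakeFsS3aConf`, `perRepo` and `path` are hypothetical helper names, and nothing here is claimed to resolve the `createAtomicIfAbsent` error discussed in the thread.

```scala
// Sketch: build per-bucket S3A settings for one lakeFS repository.
// The helper and its parameter names are illustrative, not part of any library.
object LakeFsS3aConf {
  def perRepo(repo: String, endpoint: String,
              accessKey: String, secretKey: String): Map[String, String] = {
    // Hadoop S3A per-bucket configuration prefix: fs.s3a.bucket.<bucket>.
    val prefix = s"fs.s3a.bucket.$repo."
    Map(
      s"${prefix}endpoint" -> endpoint,
      s"${prefix}access.key" -> accessKey,
      s"${prefix}secret.key" -> secretKey,
      s"${prefix}path.style.access" -> "true"
    )
  }

  // Paths in this thread take the form s3a://<repo>/<branch>/<path>.
  def path(repo: String, branch: String, key: String): String =
    s"s3a://$repo/$branch/${key.stripPrefix("/")}"
}
```

In a notebook, each resulting pair could be passed to `spark.sparkContext.hadoopConfiguration.set(key, value)`, the same call used in the message above, assuming the cluster honours per-bucket S3A options.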

Jonathan Rosenberg

05/17/2023, 8:15 PM
Ok great. Let me go over it and get back to you.
🙏 1
Can you try adding the following property to your job: `spark.databricks.delta.multiClusterWrites.enabled false`?

Danyil Butkovskyi

05/18/2023, 3:16 PM
Same thing, and I did check to make sure the property is actually set to false. Any other ideas I could try?

Jonathan Rosenberg

05/18/2023, 3:19 PM
Another config to try, instead of the one above: `spark.databricks.tahoe.logStore.aws.class io.delta.storage.S3SingleDriverLogStore`

Danyil Butkovskyi

05/18/2023, 5:28 PM
Same.

Jonathan Rosenberg

05/18/2023, 5:41 PM
Ok. So I guess I need a deeper dive into this… I'll update you.

Danyil Butkovskyi

05/18/2023, 5:45 PM
Whoa, I really appreciate it.

Jonathan Rosenberg

05/18/2023, 5:47 PM
sure thing
Hi @Danyil Butkovskyi, would you mind opening an issue about this error so that we can prioritize it appropriately?