# help
u
Hi, https://docs.lakefs.io/ suggests version-controlling files like this:
```
df = spark.read.parquet("s3a://my-repo/main-branch/collections/foo/")
```
I wonder what the implications are of having one branch per asset (table) vs. one centralized prefix per branch. What would you recommend? And how does the branch prefix map to a database schema (for discoverability reasons), i.e. when someone tries to read the data with plain Spark SQL, perhaps through Databricks' catalog?
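To make the two options concrete, here is a rough sketch of what I mean (the repo, branch, and table names are just placeholders):
```
# Option A: one branch per asset/table, each table versioned on its own branch
df_foo = spark.read.parquet("s3a://my-repo/foo-branch/foo/")
df_bar = spark.read.parquet("s3a://my-repo/bar-branch/bar/")

# Option B: one centralized branch, assets separated only by prefix
df_foo = spark.read.parquet("s3a://my-repo/main/collections/foo/")
df_bar = spark.read.parquet("s3a://my-repo/main/collections/bar/")
```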
u
Hey @Georg Heiler and welcome!
u
First, I'd recommend reading more about the lakeFS model; it might help you better understand the different components.
u
Mind sharing more about your specific use case?
u
You can also take a look at lakeFS use cases for example
u
If you are using Hive Metastore, for example, it's recommended to create a branch per schema
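As a rough illustration of what a branch-per-schema mapping could look like (the repo, branch, and schema names below are placeholders, not an official layout):
```
# Rough sketch of "one branch per schema" with a Hive-compatible metastore.
# The repo, branch, and schema names here are placeholders.
spark.sql("""
    CREATE DATABASE IF NOT EXISTS sales
    LOCATION 's3a://my-repo/sales-branch/'
""")
# Managed tables created in that schema then live under the branch prefix.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.orders (id BIGINT, amount DOUBLE)
    USING PARQUET
""")
```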
u
Hey @Georg Heiler, how is it going? Have you managed to find the right branching approach for you?
u
I am still exploring. You mean one branch per schema? Hmm, that is not really what I want to do. I would prefer to keep the schemata separate from the branches.
u
But this is still a very early stage exploration.
u
I would want to keep the schemata for the logical separation of data assets and use branching as a form of versioning.
u
Hey @Georg Heiler. One of the prominent items on our roadmap is to improve the experience of using lakeFS together with the metastore. The currently recommended approach is to use our
lakectl metastore copy
command to create a corresponding table pointing at your branch. If you make changes to the table's schema and want to merge them back, you run the same command in the opposite direction to update your main schema. For example, the following command will create the table for you:
```
lakectl metastore copy --from-schema default --from-table inventory --to-branch example_branch
```
After you make the changes, running the following command will merge them into your original table:
```
lakectl metastore copy --from-schema example_branch --from-table inventory --to-branch main --to-schema default
```
The first command chooses the destination schema and table name according to our suggested model, so be sure to read the docs if you want to override them.
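For instance, based on the second command above, the branch copy ends up at example_branch.inventory (a schema named after the branch, same table name), so querying it from Spark could look roughly like this:
```
# Rough sketch: querying the branch copy created by the first command.
# Per the second command above, the copy lives at example_branch.inventory.
branch_df = spark.sql("SELECT * FROM example_branch.inventory")

# The original table on main remains reachable under its usual name:
main_df = spark.sql("SELECT * FROM default.inventory")
```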
u
Like I said, we are actively looking to improve the lakeFS+metastore experience. While the
lakectl metastore
tool provides the basic capabilities, it is still far from perfect. Therefore, I'm happy to discuss the use case with you and see how lakeFS can improve.