# help
u
Hi, https://docs.lakefs.io/ suggests version-controlling files like this:
```
df = spark.read.parquet("s3a://my-repo/main-branch/collections/foo/")
```
I wonder what the implications are of having one branch per asset (table) vs. one centralized prefix per branch. What would you recommend? And how does the branch prefix map to a database schema (for discoverability reasons), i.e. when someone tries to read the data with plain Spark SQL, perhaps through Databricks' catalog?
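To make the two options concrete, here is a rough sketch of what I mean (the repo, branch, and table names are just placeholders):
```
# Option A: one branch per asset/table, each table versioned on its own branch
df_foo = spark.read.parquet("s3a://my-repo/foo-branch/foo/")
df_bar = spark.read.parquet("s3a://my-repo/bar-branch/bar/")

# Option B: one centralized branch, assets separated only by prefix
df_foo = spark.read.parquet("s3a://my-repo/main/collections/foo/")
df_bar = spark.read.parquet("s3a://my-repo/main/collections/bar/")
```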
u
Hey @Georg Heiler and welcome!
u
First, I'd recommend reading more about the lakeFS model; it might help you better understand the different components.
u
Mind sharing more about your specific use case?
u
You can also take a look at lakeFS use cases for example
u
If you are using Hive Metastore, for example, it's recommended to create a branch per schema
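As a rough illustration of what a branch-per-schema mapping could look like (the repo, branch, and schema names below are placeholders, not an official layout):
```
# Rough sketch of "one branch per schema" with a Hive-compatible metastore.
# The repo, branch, and schema names here are placeholders.
spark.sql("""
    CREATE DATABASE IF NOT EXISTS sales
    LOCATION 's3a://my-repo/sales-branch/'
""")
# Managed tables created in that schema then live under the branch prefix.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.orders (id BIGINT, amount DOUBLE)
    USING PARQUET
""")
```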
u
Hey @Georg Heiler, how is it going? Have you managed to find the right branching approach for you?
u
I am still exploring. You mean one branch per schema? Hmm, that is not really what I want to do. I would prefer to keep the schemata separate from the branches.
u
But this is still a very early stage exploration.
u
I would want to keep the schemata for the logical separation of data assets and use branching as a form of versioning.
u
Hey @Georg Heiler. One of the prominent items on our roadmap is to improve the experience of using lakeFS together with the metastore. The currently recommended approach is to use our
lakectl metastore copy
command to create a corresponding table pointing at your branch. If you make changes to the table's schema and want to merge them back, you run the same command in the opposite direction to update your main schema. For example, the following command will create the table for you:
```
lakectl metastore copy --from-schema default --from-table inventory --to-branch example_branch
```
After you make the changes, running the following command will merge them into your original table:
```
lakectl metastore copy --from-schema example_branch --from-table inventory --to-branch main --to-schema default
```
The first command chooses the destination schema and table name according to our suggested model, so be sure to read the docs if you want to override them.
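For instance, based on the second command above, the branch copy ends up at example_branch.inventory (a schema named after the branch, same table name), so querying it from Spark could look roughly like this:
```
# Rough sketch: querying the branch copy created by the first command.
# Per the second command above, the copy lives at example_branch.inventory.
branch_df = spark.sql("SELECT * FROM example_branch.inventory")

# The original table on main remains reachable under its usual name:
main_df = spark.sql("SELECT * FROM default.inventory")
```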
u
Like I said, we are actively looking to improve the lakeFS+metastore experience. While the
lakectl metastore
tool provides the basic capabilities, it is still far from perfect. Therefore, I'm happy to discuss the use case with you and see how lakeFS can improve.