When using the metastore copy command with Spark...
# help
s
When using the metastore copy command with Spark/Hive, can I assume that the created table is externally managed? I.e., if someone executed a drop table on a table in a DEV branch, it would only drop the metadata, not the actual data objects in lakeFS.
a
Hi Sid, could you please clarify the question? Specifically, do you refer to Hive tables created with some other application, or using our Hive integration?
s
With the Glue/Hive metastore
a
Thanks! Checking...
s
Thanks - I have a hard time imagining that it wouldn't be an external table, because the lakeFS branch operation already creates the data files. lakeFS lends itself to having externally managed tables.
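To illustrate, by the branch operation I mean something like this (repo/branch names here are just examples):
lakectl branch create lakefs://example-repo/dev-branch --source lakefs://example-repo/main
The new branch exposes all of main's data files immediately, without duplicating them.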
a
As you correctly mention, the tables probably should not own their data. Indeed, the example in https://docs.lakefs.io/integrations/glue_hive_metastore.html#copy uses
CREATE EXTERNAL TABLE
on the source table. I shall attempt to verify, but might have a definitive answer only tomorrow.
Sorry, false alarm. The metastore client does indeed simply copy the table type across. So it should be external, and then it stays external.
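For reference, the whole flow then looks roughly like this (repo/branch/table names are made up; the copy flags are the ones from the docs page linked above):
# create the source table on main as EXTERNAL, over a lakeFS path
hive -e "CREATE EXTERNAL TABLE my_table (id INT) LOCATION 's3a://example-repo/main/path/to/my_table'"
# copy its metadata to point at the dev branch; the EXTERNAL type is carried over
lakectl metastore copy --from-schema default --from-table my_table --to-branch dev-branch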
👍 1
Also opened https://github.com/treeverse/lakeFS/issues/2427 to improve the user experience by making it harder to copy a non-EXTERNAL table.
s
Ok cool, yeah we would actually want the ability to issue a
drop table
command on the dev schema and have it delete the dev data objects. I.e., it would be great if there were a way to create internal tables using the metastore copy, but I'm not sure how that would even work.
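Something like this, hypothetically (not supported today; schema and table names made up):
# if the copy on the dev branch were a managed (internal) table, then:
hive -e "DROP TABLE dev.my_table"
# ...would drop the metadata AND delete the objects under s3a://example-repo/dev-branch/path/to/my_table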
a
We are actively working on improving Hive support, but Hive metastore (rather than Hive) is our primary focus right now. IIUC your use-case is an actual Hive, right? Rather than just overloading on the metastore semantics. I'd be interested in learning more about it!
s
No, this is Hive metastore
a
Thinking about it, it might actually work, branches and all, with Hive metastore, because everything still goes through lakeFS. The Athena symlinks that we can generate would be dangerous... but Athena seems to have only EXTERNAL tables (https://docs.aws.amazon.com/athena/latest/ug/drop-table.html)... so we'd again be OK. I guess I would like to revise my original answer. It should work for internal tables too, and in a nice way: drop only the version on the branch, keep the object on other branches. I shall test it and get back to you on Monday, would that be OK?
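Concretely, the behavior I would hope for (a sketch; names made up):
# drop a managed table that was registered on the dev branch
hive -e "DROP TABLE dev.my_table"
# the delete goes through lakeFS, so only the dev branch's objects are removed:
lakectl fs ls lakefs://example-repo/main/path/to/my_table/   # main's objects are untouched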
s
We have a very SQL-first approach to data transformation in our organization. We want to create experimental/dev branches that are isolated from the master branch, but in order to manipulate the data in the newly created branches we need to create Hive metastore tables on top of them. And to use our particular dbt workflow those tables need to be Hive-managed tables, not external tables. The document describes our workaround for that.
o
Hey @Sid Senthilnathan - sorry for the late reply! I went through the document you've attached and the workflow you've described. I'm trying to summarize the actions we can take on our end that would make this workflow easier. Let me know if I got it right:
1. Support recursive deletion with
lakectl fs rm
2. Support creation of Hive-managed tables (as opposed to currently supporting only external tables)
3. Extend
lakectl metastore copy
to support schema-level copying, not only table-level copying
Did I get it right?
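For reference, the commands as they stand today (names made up):
# 1: currently removes a single object, not a whole prefix
lakectl fs rm lakefs://example-repo/dev-branch/path/to/object
# 3: currently copies one table at a time
lakectl metastore copy --from-schema default --from-table my_table --to-branch dev-branch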
s
Yes, all those seem right. I would add a regex-pattern-based selection to #3 as well. For #1 we've already opened a ticket.
More generally speaking, tighter integration with dbt is what we're looking for. Maybe a dbt-lakefs plugin where all the steps of branching, copying metastore data, etc. can be simplified. For example, we select a model to run and pass in a branch name, dbt produces a dependency graph of all upstream and downstream tables that need to be registered in the metastore, lakeFS branches and registers the metadata, etc. I understand this is ambitious, but it would be awesome for us and others, I am sure.
👍🏼 1
👍 1