The Future of Metadata After Hive Metastore

What were your thoughts during and following the roundtable on metadata and Hive Metastore? What did you find most interesting? What did you agree or disagree with? Let us know!

Let’s use this space to address some of the questions asked during the event that there wasn’t time to respond to.

Question from Claudius Li:
Do you see this [a multi-metastore future] more as query engines just supporting a bunch of metastores and being able to coordinate between them, or do you expect some sort of metastore aggregation technology (i.e., a single point of metadata)?

Question from Alex Landa:
Do you think cloud providers will adopt one of the open source solutions (Hudi, Iceberg, Delta) in a native manner, similar to Glue in AWS?

Question from Mehul Shah:
Does Hudi have an initial metastore?

Question from Pramit Mitra:
Is Databricks for cloud environments pretty much mainstream now, or do you see any major competitor (or exciting early-stage project) coming up?

Question from Denis Krivenko:
Do we have an open source solution that is mature enough to run Hive Metastore on Kubernetes? Ideally it should be a Helm chart. If it doesn’t exist, where do you think it could be created as a project?

This round table has been a very interesting and relevant conversation. I totally agree that there is an undeniable trend toward heterogeneous architectures for data lakes. IMHO, heterogeneity, openness, and the ability to mix and match different processing engines, data formats, file storage, and metadata catalogs are defining elements of data lakes (and lakehouses) going forward, and are what fundamentally differentiates them from a data warehouse such as Snowflake (I don’t agree with Snowflake’s self-declaration as a “data lake” solution).

The conversation focused mainly on technical metadata for data at rest, which is of course the focus of Hive Metastore and table formats like Iceberg. I would like to propose broadening the conversation in two directions:

  1. Data Governance: Let’s not just think about the future of Hive Metastore, but also about the future of Ranger- and Atlas-level metadata. I think there is a need and an opportunity for a next-generation metastore to also solve fine-grained access control (FGAC), data protection policy management, and data lineage management. I further think that table formats provide a good basis for reliable cross-engine enforcement of data governance. E.g., they could provide reliable crypto-enforcement through Parquet encryption (parquet-format/ at master · apache/parquet-format · GitHub).

  2. Data in motion: I think everyone would agree that there is a strong trend toward real-time analytics, which basically means processing and analyzing data before it is persisted into something like Iceberg, or even without persisting it at all. A future metastore needs to account for that and also support real-time tables for data in a Kafka topic. The question can be rephrased as: What is the future of Kafka Schema Registry, and how does it relate to Hive Metastore and its future? I think there are strong benefits in a combined metastore that handles tables on data at rest and in motion together. This way the metastore can also become the central control point for how data is persisted from in motion to a persistent table at rest, i.e., become a mechanism for data topologies.
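
To make the combined-metastore idea a bit more concrete, here is a minimal sketch of a catalog that registers tables at rest and in motion side by side and records which stream is persisted into which table. Everything in it is hypothetical and for illustration only: `TableEntry`, `UnifiedCatalog`, `link`, and `topology` are made-up names, not the API of Hive Metastore, Kafka Schema Registry, or any existing product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TableEntry:
    """One catalog entry (hypothetical model, not a real metastore schema)."""
    name: str
    kind: str               # "at_rest" or "in_motion"
    location: str           # e.g. a table path or a topic name
    schema: Dict[str, str]  # column name -> type name


@dataclass
class UnifiedCatalog:
    """Holds both kinds of tables plus the stream-to-table persistence links."""
    tables: Dict[str, TableEntry] = field(default_factory=dict)
    sinks: Dict[str, str] = field(default_factory=dict)  # stream -> at-rest table

    def register(self, entry: TableEntry) -> None:
        if entry.kind not in ("at_rest", "in_motion"):
            raise ValueError(f"unknown kind: {entry.kind}")
        self.tables[entry.name] = entry

    def link(self, stream: str, table: str) -> None:
        """Declare that an in-motion table is persisted into an at-rest table."""
        src, dst = self.tables[stream], self.tables[table]
        if src.kind != "in_motion" or dst.kind != "at_rest":
            raise ValueError("link must go from in_motion to at_rest")
        if src.schema != dst.schema:
            # Assumption for this sketch: schemas must match exactly.
            raise ValueError("schemas differ; persisting would need a mapping")
        self.sinks[stream] = table

    def topology(self) -> List[Tuple[str, str]]:
        """Return all (stream, table) persistence links, sorted by stream name."""
        return sorted(self.sinks.items())
```

Because both kinds of tables live in one place, the catalog can check schema compatibility at the moment a stream is linked to its persistent table, which is the "central control point" role described above.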

At some point in the round table, indexing was mentioned as part of a future metastore. I think there is a big opportunity and need to standardise indexing here. Either the metastore or the table formats (or both together) should embrace open data lake indexing frameworks such as Xskipper (GitHub - xskipper-io/xskipper: An Extensible Data Skipping Framework) to drive an indexing standard for open data lake architectures.
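
For readers less familiar with data skipping, the core idea can be sketched in a few lines: keep small per-file statistics (here, min/max per column) as metadata, and use them to rule out files before reading any data. This is a toy illustration under my own assumptions, not Xskipper's API; `FileStats`, `build_index`, and `files_to_scan` are hypothetical names, and the file contents are passed in as plain Python dicts instead of being read from Parquet.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FileStats:
    """Min/max metadata for one data file (hypothetical index entry)."""
    path: str
    min_max: Dict[str, Tuple[float, float]]  # column -> (min, max)


def build_index(files: Dict[str, List[Dict[str, float]]]) -> List[FileStats]:
    """Compute per-column min/max for each file from its rows."""
    index = []
    for path, rows in files.items():
        stats: Dict[str, Tuple[float, float]] = {}
        for row in rows:
            for col, value in row.items():
                lo, hi = stats.get(col, (value, value))
                stats[col] = (min(lo, value), max(hi, value))
        index.append(FileStats(path, stats))
    return index


def files_to_scan(index: List[FileStats], column: str,
                  low: float, high: float) -> List[str]:
    """Return files that may contain rows with low <= column <= high."""
    selected = []
    for entry in index:
        if column not in entry.min_max:
            selected.append(entry.path)  # no stats: cannot skip safely
            continue
        lo, hi = entry.min_max[column]
        if hi >= low and lo <= high:     # ranges overlap, file may match
            selected.append(entry.path)
    return selected
```

A query engine would call something like `files_to_scan(index, "temp", 8.0, 12.0)` during planning and only open the returned files. A standard for where this index lives (metastore, table format, or both) and how it is encoded is what would let different engines share it.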