# help
p
Hello! I'm trying to use lakeFS as my Iceberg REST catalog. Looking at the documentation, it seems that lakeFS provides an implementation of the Iceberg catalog. I am currently stuck trying to use the catalog from pyiceberg. Any suggestions or ideas? This is my pyiceberg.yaml file:
catalog:
    default:
        uri: http://lakefs:8000/my_repo
        s3.endpoint: http://minio:9000
        s3.access-key-id: admin
        s3.secret-access-key: password
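For reference, the yaml above maps one-to-one onto catalog properties that can also be passed to pyiceberg programmatically. A minimal sketch, assuming the hostnames and credentials from this thread (and that the endpoint actually speaks the REST catalog protocol):

```python
# Same settings as pyiceberg.yaml, expressed as catalog properties.
# The host names and credentials are the (assumed) values from this thread.
props = {
    "type": "rest",  # pyiceberg's REST catalog implementation
    "uri": "http://lakefs:8000/my_repo",
    "s3.endpoint": "http://minio:9000",
    "s3.access-key-id": "admin",
    "s3.secret-access-key": "password",
}

# With pyiceberg installed, loading it would look like:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("default", **props)
# load_catalog performs a REST handshake up front, which is exactly
# where this fails against lakeFS (it is not a REST catalog).
print(sorted(props))
```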
j
Hi @Paco Ibañez The current Iceberg catalog implementation is JVM-based, so it won't work with PyIceberg. Did you try running with Spark instead?
n
@Paco Ibañez PyIceberg currently supports a select number of catalog implementations:
PyIceberg currently has native support for REST, SQL, Hive, Glue and DynamoDB.
You can work with the lakeFS catalog using pyspark, though.
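For completeness, a Spark session pointed at the lakeFS Iceberg catalog is typically configured along these lines (a sketch based on the lakefs-iceberg documentation; the catalog name `lakefs` and repo `my_repo` are placeholders):

```
spark.sql.catalog.lakefs                org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lakefs.catalog-impl   io.lakefs.iceberg.LakeFSCatalog
spark.sql.catalog.lakefs.warehouse      lakefs://my_repo
spark.sql.catalog.lakefs.cache-enabled  false
```

Tables are then addressed with the branch in the table path, e.g. something like `lakefs.main.db.my_table`; the exact layout depends on your repo setup.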
p
Hello! Thank you very much for responding! I have not tried Spark yet. I am using pyiceberg because, starting in version 0.6.0, it supports writes, which for small use cases is really convenient (no Spark needed). The setup I currently have working uses Tabular's REST catalog implementation, which is also JVM-based but usable from pyiceberg (the Python client just makes API calls to the catalog service). I thought lakeFS was also implementing the REST catalog, which is in theory supported by pyiceberg. Am I mistaken? Is there a way to look at the lakeFS API spec? Are these endpoints available in lakeFS?
j
Hi @Paco Ibañez lakeFS's Iceberg catalog isn't a REST catalog but rather a wrapper around the Hadoop catalog. Although Tabular's catalog is JVM-based, it doesn't run as part of your executable (it's a standalone REST server), so that doesn't really matter. Would you mind sharing some context on your usage and scenario for Iceberg?
p
Ohh, I see. I'm using Iceberg to store time-series data for a POC. So far, using Tabular's catalog implementation, I am able to ingest data with Spark and also with Prefect using pyiceberg (without requiring a Spark cluster). I am currently exploring whether it is possible to replace Tabular's catalog with lakeFS or Nessie and still ingest from both Spark and Prefect.
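The REST-vs-wrapper distinction above is easy to check empirically: a catalog implementing the Iceberg REST spec answers `GET {uri}/v1/config`, and that handshake is the first thing pyiceberg does on load. A minimal probe, assuming the catalog URI from the yaml earlier in the thread:

```python
# Build the REST-spec config endpoint for a candidate catalog URI.
# lakeFS's Hadoop-wrapper catalog does not serve this endpoint, while a
# true REST catalog (such as Tabular's) does, so a probe tells them apart.
def rest_config_url(catalog_uri: str) -> str:
    return catalog_uri.rstrip("/") + "/v1/config"

url = rest_config_url("http://lakefs:8000/my_repo")
print(url)  # http://lakefs:8000/my_repo/v1/config

# With network access to the service, the actual probe would be:
#   import urllib.request
#   urllib.request.urlopen(url)  # a real REST catalog returns JSON config
```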