# help
s
Hello, does anyone have experience using Iceberg with Lakefs? We are trying to use the Iceberg Hadoop catalog instead of the regular Hive metastore to avoid the complexities of having to manage table metadata separate from the underlying filesystem. We are following the iceberg documentation and we've set these additional configurations:
```
--conf "spark.sql.catalog.feature=org.apache.iceberg.spark.SparkCatalog" \
--conf "spark.sql.catalog.feature.type=hadoop" \
--conf "spark.sql.catalog.feature.warehouse=lakefs://origin/" \
```
We expect that this will create a metastore with its root at `lakefs://origin/`, and then when we create feature branches we will be able to reference the Iceberg tables within them as `feature.branch_name.schema_name.table_name` (in Spark syntax). And indeed `SHOW tables in feature.sid-test.common` returns the tables we want, but attempting to query the table with `SELECT * FROM feature.sid-test.common.paige_dimension_ai_module LIMIT 10` returns a `table or view not found` error. This all appears to work fine when we set the warehouse path to a location on S3. Any ideas what could be going on? cc @Oz Katz @Sander Hartlage
a
Hi Sid, That sounds like a pretty complex deployment! Please let me see if I can help, but some patience may be needed. How is your Hive metastore pointed at your lakeFS instance? Our current documented approach is to point Hive at the S3 gateway implemented by lakeFS. This is the easier way to configure and use lakeFS, and I would recommend that you start with it. To do this you point Hive (or any other S3 client program) at your lakeFS instance by configuring the S3 endpoint URL. Then you use regular S3 URLs (or S3A URLs, in the Hadoop world) throughout. So your Iceberg tables backed by the Hive Metastore would receive S3A URLs from the metastore, and would also be configured to use the S3 endpoint URL provided by lakeFS. I would avoid the jump to using lakeFS FileSystem URLs, at least for now. Please let me know if you need more help, or how you get along!
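A rough sketch of the S3-gateway approach described here, assuming a hypothetical lakeFS endpoint `https://lakefs.example.com`, a repo named `origin`, and a `main` branch; the credentials and paths are placeholders:
```
# Sketch: point Hadoop's S3A client at the lakeFS S3 gateway,
# then use plain s3a:// URLs of the form s3a://<repo>/<branch>/<path>.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs-access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs-secret-key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Repo "origin", branch "main": both names are examples only.
df = spark.read.parquet("s3a://origin/main/some/table/path/")
```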
s
hey @Ariel Shaqed (Scolnicov), thanks for the response. we've been using lakefs with the hive metastore for a while now, but we're running into issues where the metastore table definitions are getting out of date with what's on the filesystem
we've recently switched to using the iceberg format, which allows us to read tables as paths. since the iceberg table metadata is externalized, we would effectively be able to cut out the hive metastore entirely
a
Cool! Could you share your current Iceberg configuration? Are you accessing lakeFS over the S3 gateway (and `s3a://...` URLs), or using LakeFSFileSystem (and `lakefs://...` URLs)?
s
we're using `s3a` but are currently working to get the `lakefs` filesystem configured
here's our current spark iceberg configuration (using `s3a`; credentials are in the `core-site.xml`)
```
spark.sql.catalog.feature: org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.feature.type: hadoop
spark.sql.catalog.feature.warehouse: "lakefs://{{ lakefs_repo_name }}/"
spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```
after we get our aws permissions worked out so we can read/write to the underlying lakefs s3 bucket, the configuration will be something like this:
```
spark.sql.catalog.feature: org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.feature.type: hadoop
spark.sql.catalog.feature.warehouse: "lakefs://{{ lakefs_repo_name }}/"
spark.sql.catalog.feature.io-impl: org.apache.iceberg.hadoop.HadoopFileIO
spark.sql.catalog.feature.hadoop.fs.s3a.endpoint: "https://s3.{{ lakefs_datalake_s3_region }}.amazonaws.com/"
spark.sql.catalog.feature.hadoop.fs.s3a.aws.credentials.provider: com.amazonaws.auth.InstanceProfileCredentialsProvider
spark.sql.catalog.feature.hadoop.fs.lakefs.impl: io.lakefs.LakeFSFileSystem
spark.sql.catalog.feature.hadoop.fs.lakefs.access.key: "{{ emr_lakefs_access_key }}"
spark.sql.catalog.feature.hadoop.fs.lakefs.secret.key: "{{ emr_lakefs_secret_access_key }}"
spark.sql.catalog.feature.hadoop.fs.lakefs.endpoint: "https://{{ lakefs_api_domain_name }}"
spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```
(this is ansible code that configures and launches the spark sql thriftserver; these values become `--conf` options when we start the process)
a
I agree that you should start off by using the S3 gateway on lakeFS, which is what your current config seems to be doing. But then I'm not sure that `lakefs://` URLs should appear anywhere at all!
s
we're setting `fs.lakefs.impl` in the `core-site` to use the `s3a` filesystem
so our `lakefs://` urls are essentially `s3a://`
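A sketch of what that scheme aliasing might look like when expressed as Spark Hadoop properties rather than `core-site.xml` entries; the endpoint is a placeholder and the exact behavior depends on the Hadoop/S3A version in use:
```
# Sketch of the aliasing described above: map the lakefs:// scheme onto
# Hadoop's S3A filesystem, with S3A pointed at the lakeFS S3 gateway so the
# repo name takes the place of the bucket.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Any lakefs:// URL is now served by the S3A client...
    .config("spark.hadoop.fs.lakefs.impl",
            "org.apache.hadoop.fs.s3a.S3AFileSystem")
    # ...which talks to the lakeFS S3 gateway instead of AWS S3.
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# lakefs://<repo>/<branch>/<path> resolves like the equivalent s3a:// URL.
df = spark.read.parquet("lakefs://origin/main/some/table/path/")
```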
a
Oh, neat! Thanks for explaining, now I get it.
I'm referring to your current configuration:
```
spark.sql.catalog.feature: org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.feature.type: hadoop
spark.sql.catalog.feature.warehouse: "lakefs://{{ lakefs_repo_name }}/"
spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```
IIUC. From where do you get the name of the branch to use on your lakeFS repo?
s
we would like the branch name to be part of the table name; so if there's a `feature1` branch in lakefs, we'd like to address the schemas and tables in that branch as `feature.feature1.schema.table`, which should correspond to directories under `lakefs://origin/feature1/`
that is, we'll have the default catalog pointed at the `master` branch in lakefs, and the `feature` catalog should expose all the branches
so `SELECT * FROM schema.table` should read from `lakefs://origin/master/schema/table`, and `SELECT * FROM feature.feature1.schema.table` should read from `lakefs://origin/feature1/schema/table`
the use case here is that the `master` branch is what our BI tools are pointed at, and the `feature` branches allow isolated development and access to historical snapshots
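One way to wire up the two-catalog layout described above; the catalog name `main` and the `spark.sql.defaultCatalog` setting are illustrative assumptions, not the poster's actual configuration:
```
# Hedged sketch: a default catalog rooted at the master branch, plus the
# "feature" catalog rooted at the repository so every branch is a namespace.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Default catalog: warehouse rooted at the master branch.
    .config("spark.sql.catalog.main", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.main.type", "hadoop")
    .config("spark.sql.catalog.main.warehouse", "lakefs://origin/master/")
    .config("spark.sql.defaultCatalog", "main")
    # Branch catalog: warehouse rooted at the repo, one namespace per branch.
    .config("spark.sql.catalog.feature", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.feature.type", "hadoop")
    .config("spark.sql.catalog.feature.warehouse", "lakefs://origin/")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

spark.sql("SELECT * FROM schema.table")                   # lakefs://origin/master/schema/table
spark.sql("SELECT * FROM feature.feature1.schema.table")  # lakefs://origin/feature1/schema/table
```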
a
Hi Sander, Everything seems really well-organized and good. I think I need logs at this stage, ideally from all stages (so Iceberg, lakeFS, maybe Hive), to try and see some more. What I would like to do is to try and set up something similar and try to grab logs directly. Sorry: I am afraid it won't be this week (however the good news is that our week starts Sunday...). But if you already have some logs that you could send, it might speed things up.
s
sounds good! we'll collect some logs on our end and send them over to you
we're also trying to test out the lakefs hadoop filesystem before the end of the week, so we'll let you know if we have any success using that
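For completeness, a hedged sketch of the lakeFS Hadoop FileSystem configuration the thread is moving toward, assembled from the properties already shown earlier; the endpoint and keys are placeholders, and the lakeFS client jar is assumed to be on the classpath:
```
# Sketch only: lakefs:// URLs handled by io.lakefs.LakeFSFileSystem, with the
# underlying bucket access still going through S3A.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
    .config("spark.hadoop.fs.lakefs.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.lakefs.access.key", "<lakefs-access-key>")
    .config("spark.hadoop.fs.lakefs.secret.key", "<lakefs-secret-key>")
    # Direct S3 access for object data, here via instance-profile credentials.
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.InstanceProfileCredentialsProvider")
    .getOrCreate()
)

df = spark.read.parquet("lakefs://origin/main/some/table/path/")
```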
a
Wow, certainly sounds like you have a lot of data access going on! And we're looking forward to analyzing some logs.
s
@Ariel Shaqed (Scolnicov) we did a lot more testing yesterday and today and we discovered that our error doesn't actually involve lakefs, and is caused by an iceberg misconfiguration between hive and spark. we're still figuring out the specifics, but for now, it looks like lakefs is working perfectly. one upside of all this debugging is that we've successfully integrated the `lakefs` hadoop filesystem, which is something that's been on our backlog for a while. thanks for all your help!