# help
s
Hello, does anyone have experience using Iceberg with Lakefs? We are trying to use the Iceberg Hadoop catalog instead of the regular Hive metastore to avoid the complexities of having to manage table metadata separate from the underlying filesystem. We are following the iceberg documentation and we've set these additional configurations:
```
--conf "spark.sql.catalog.feature=org.apache.iceberg.spark.SparkCatalog" \
--conf "spark.sql.catalog.feature.type=hadoop" \
--conf "spark.sql.catalog.feature.warehouse=lakefs://origin/" \
```
We expect that this will create a metastore with its root at `lakefs://origin/`, and then when we create feature branches we will be able to reference the Iceberg tables within them as `feature.branch_name.schema_name.table_name` (in Spark syntax). And indeed `SHOW tables in feature.sid-test.common` returns the tables we want, but attempting to query the table with `SELECT * FROM feature.sid-test.common.paige_dimension_ai_module LIMIT 10` returns a `table or view not found` error. This all appears to work fine when we set the warehouse path to a location on S3. Any ideas what could be going on? cc @Oz Katz @Sander Hartlage
a
Hi Sid, That sounds like a pretty complex deployment! Please let me see if I can help, but some patience may be needed. How is your Hive metastore pointed at your lakeFS instance? Our current documented approach is to point Hive at the S3 gateway implemented by lakeFS. This is the easier way to configure and use lakeFS, and I would recommend that you start with it. To do this you point Hive (or any other S3 client program) at your lakeFS instance by configuring the S3 endpoint URL. Then you use regular S3 URLs (or S3A URLs, in the Hadoop world) throughout. So your Iceberg tables backed by the Hive Metastore would receive S3A URLs from the metastore, and would also be configured to use the S3 endpoint URL provided by lakeFS. I would avoid the jump to using lakeFS FileSystem URLs, at least for now. Please let me know if you need more help, or how you get along!
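A rough sketch of the S3-gateway approach described here, assuming a hypothetical lakeFS endpoint `https://lakefs.example.com`, a repo named `origin`, and a `main` branch; the credentials and paths are placeholders:
```
# Sketch: point Hadoop's S3A client at the lakeFS S3 gateway,
# then use plain s3a:// URLs of the form s3a://<repo>/<branch>/<path>.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs-access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs-secret-key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Repo "origin", branch "main": both names are examples only.
df = spark.read.parquet("s3a://origin/main/some/table/path/")
```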
s
hey @Ariel Shaqed (Scolnicov), thanks for the response. we've been using lakefs with the hive metastore for a while now, but we're running into issues where the metastore table definitions are getting out of date with what's on the filesystem
we've recently switched to using the iceberg format, which allows us to read tables as paths. since the iceberg table metadata is externalized, we would effectively be able to cut out the hive metastore entirely
a
Cool! Could you share your current Iceberg configuration? Are you accessing lakeFS over the S3 gateway (and `s3a://...` URLs), or using LakeFSFileSystem (and `lakefs://...` URLs)?
s
we're using `s3a` but are currently working to get the `lakefs` filesystem configured
here's our current spark iceberg configuration (using `s3a`; credentials are in the `core-site.xml`)
```
spark.sql.catalog.feature: org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.feature.type: hadoop
spark.sql.catalog.feature.warehouse: "lakefs://{{ lakefs_repo_name }}/"
spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```
after we get our aws permissions worked out so we can read/write to the underlying lakefs s3 bucket, the configuration will be something like this:
```
spark.sql.catalog.feature: org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.feature.type: hadoop
spark.sql.catalog.feature.warehouse: "lakefs://{{ lakefs_repo_name }}/"
spark.sql.catalog.feature.io-impl: org.apache.iceberg.hadoop.HadoopFileIO
spark.sql.catalog.feature.hadoop.fs.s3a.endpoint: "https://s3.{{ lakefs_datalake_s3_region }}.amazonaws.com/"
spark.sql.catalog.feature.hadoop.fs.s3a.aws.credentials.provider: com.amazonaws.auth.InstanceProfileCredentialsProvider
spark.sql.catalog.feature.hadoop.fs.lakefs.impl: io.lakefs.LakeFSFileSystem
spark.sql.catalog.feature.hadoop.fs.lakefs.access.key: "{{ emr_lakefs_access_key }}"
spark.sql.catalog.feature.hadoop.fs.lakefs.secret.key: "{{ emr_lakefs_secret_access_key }}"
spark.sql.catalog.feature.hadoop.fs.lakefs.endpoint: "https://{{ lakefs_api_domain_name }}"
spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```
(this is ansible code that configures and launches the spark sql thriftserver; these values become `--conf` options when we start the process)
a
I agree that you should start off by using the S3 gateway on lakeFS, which is what your current config seems to be doing. But then I'm not sure that `lakefs://` URLs should appear anywhere at all!
s
we're setting `fs.lakefs.impl` in the `core-site` to use the `s3a` filesystem
so our `lakefs://` urls are essentially `s3a://`
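A sketch of what that scheme aliasing might look like when expressed as Spark Hadoop properties rather than `core-site.xml` entries; the endpoint is a placeholder and the exact behavior depends on the Hadoop/S3A version in use:
```
# Sketch of the aliasing described above: map the lakefs:// scheme onto
# Hadoop's S3A filesystem, with S3A pointed at the lakeFS S3 gateway so the
# repo name takes the place of the bucket.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Any lakefs:// URL is now served by the S3A client...
    .config("spark.hadoop.fs.lakefs.impl",
            "org.apache.hadoop.fs.s3a.S3AFileSystem")
    # ...which talks to the lakeFS S3 gateway instead of AWS S3.
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# lakefs://<repo>/<branch>/<path> resolves like the equivalent s3a:// URL.
df = spark.read.parquet("lakefs://origin/main/some/table/path/")
```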
a
Oh, neat! Thanks for explaining, now I get it.
I'm referring to your current configuration:
```
spark.sql.catalog.feature: org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.feature.type: hadoop
spark.sql.catalog.feature.warehouse: "lakefs://{{ lakefs_repo_name }}/"
spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```
IIUC. From where do you get the name of the branch to use on your lakeFS repo?
s
we would like the branch name to be part of the table name; so if there's a `feature1` branch in lakefs, we'd like to address the schemas and tables in that branch as `feature.feature1.schema.table`, which should correspond to directories under `lakefs://origin/feature1/`
that is, we'll have the default catalog pointed at the `master` branch in lakefs, and the `feature` catalog should expose all the branches
so `SELECT * FROM schema.table` should read from `lakefs://origin/master/schema/table`, and `SELECT * FROM feature.feature1.schema.table` should read from `lakefs://origin/feature1/schema/table`
the use case here is that the `master` branch is what our BI tools are pointed at, and the `feature` branches allow isolated development and access to historical snapshots
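One way to wire up the two-catalog layout described above; the catalog name `main` and the `spark.sql.defaultCatalog` setting are illustrative assumptions, not the poster's actual configuration:
```
# Hedged sketch: a default catalog rooted at the master branch, plus the
# "feature" catalog rooted at the repository so every branch is a namespace.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Default catalog: warehouse rooted at the master branch.
    .config("spark.sql.catalog.main", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.main.type", "hadoop")
    .config("spark.sql.catalog.main.warehouse", "lakefs://origin/master/")
    .config("spark.sql.defaultCatalog", "main")
    # Branch catalog: warehouse rooted at the repo, one namespace per branch.
    .config("spark.sql.catalog.feature", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.feature.type", "hadoop")
    .config("spark.sql.catalog.feature.warehouse", "lakefs://origin/")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

spark.sql("SELECT * FROM schema.table")                   # lakefs://origin/master/schema/table
spark.sql("SELECT * FROM feature.feature1.schema.table")  # lakefs://origin/feature1/schema/table
```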
a
Hi Sander, Everything seems really well-organized and good. I think I need logs at this stage, ideally from all stages (so Iceberg, lakeFS, maybe Hive), to try and see some more. What I would like to do is to try and set up something similar and try to grab logs directly. Sorry: I am afraid it won't be this week (however the good news is that our week starts Sunday...). But if you already have some logs that you could send, it might speed things up.
s
sounds good! we'll collect some logs on our end and send them over to you
we're also trying to test out the lakefs hadoop filesystem before the end of the week, so we'll let you know if we have any success using that
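For completeness, a hedged sketch of the lakeFS Hadoop FileSystem configuration the thread is moving toward, assembled from the properties already shown earlier; the endpoint and keys are placeholders, and the lakeFS client jar is assumed to be on the classpath:
```
# Sketch only: lakefs:// URLs handled by io.lakefs.LakeFSFileSystem, with the
# underlying bucket access still going through S3A.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
    .config("spark.hadoop.fs.lakefs.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.lakefs.access.key", "<lakefs-access-key>")
    .config("spark.hadoop.fs.lakefs.secret.key", "<lakefs-secret-key>")
    # Direct S3 access for object data, here via instance-profile credentials.
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.InstanceProfileCredentialsProvider")
    .getOrCreate()
)

df = spark.read.parquet("lakefs://origin/main/some/table/path/")
```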
a
Wow, certainly sounds like you have a lot of data access going on! And we're looking forward to analyzing some logs.
s
@Ariel Shaqed (Scolnicov) we did a lot more testing yesterday and today and we discovered that our error doesn't actually involve lakefs, and is caused by an iceberg misconfiguration between hive and spark. we're still figuring out the specifics, but for now, it looks like lakefs is working perfectly. one upside of all this debugging is that we've successfully integrated the `lakefs` hadoop filesystem, which is something that's been on our backlog for a while. thanks for all your help!