Sid Senthilnathan
04/28/2022, 1:11 PM
--conf "spark.sql.catalog.feature=org.apache.iceberg.spark.SparkCatalog" \
--conf "spark.sql.catalog.feature.type=hadoop" \
--conf "spark.sql.catalog.feature.warehouse=lakefs://origin/" \
We expect this to create a metastore rooted at lakefs://origin/, so that when we create feature branches we can reference the Iceberg tables within them as feature.branch_name.schema_name.table_name (`feature./branch_name.schema_name.table_name` in Spark syntax). And indeed `SHOW tables in feature./sid-test.common` returns the tables we want, but attempting to query the table with `SELECT * FROM feature./sid-test.common.paige_dimension_ai_module LIMIT 10` returns a "table or view not found" error.
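For reference, Spark SQL needs backtick quoting around any identifier part that contains characters such as a hyphen. The sketch below shows one way to build such a quoted name in Python. The helper names `quote_identifier` and `qualified_name` are hypothetical, and the branch is assumed to be named `sid-test`; whether the namespace needs a leading slash is not clear from the thread. This only illustrates the quoting and is not claimed to resolve the error.

```python
def quote_identifier(part: str) -> str:
    """Wrap one identifier part in backticks, doubling any embedded backtick."""
    return "`" + part.replace("`", "``") + "`"


def qualified_name(*parts: str) -> str:
    """Join separate identifier parts into a dotted, fully quoted name."""
    return ".".join(quote_identifier(p) for p in parts)


# Assumed branch name "sid-test"; schema and table names are taken from the message.
table = qualified_name("feature", "sid-test", "common", "paige_dimension_ai_module")
query = f"SELECT * FROM {table} LIMIT 10"
print(query)
```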
This all appears to work fine when we set the warehouse path to a location on S3. Any ideas what could be going on? cc @Oz Katz @Sander Hartlage
Ariel Shaqed (Scolnicov)
04/28/2022, 1:25 PM
Sander Hartlage
04/28/2022, 1:28 PM
Sander Hartlage
04/28/2022, 1:29 PM
Ariel Shaqed (Scolnicov)
04/28/2022, 1:41 PM
`s3a://...` URLs), or using LakeFSFileSystem (and `lakefs://...` URLs)?
Sander Hartlage
04/28/2022, 1:47 PM
`s3a` but are currently working to get the `lakefs` filesystem configured
Sander Hartlage
04/28/2022, 1:47 PM
`s3a`, credentials are in the `core-site.xml`)
Sander Hartlage
04/28/2022, 1:47 PM
spark.sql.catalog.feature: org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.feature.type: hadoop
spark.sql.catalog.feature.warehouse: "lakefs://{{ lakefs_repo_name }}/"
spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
Sander Hartlage
04/28/2022, 1:49 PM
Sander Hartlage
04/28/2022, 1:49 PM
spark.sql.catalog.feature: org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.feature.type: hadoop
spark.sql.catalog.feature.warehouse: "lakefs://{{ lakefs_repo_name }}/"
spark.sql.catalog.feature.io-impl: org.apache.iceberg.hadoop.HadoopFileIO
spark.sql.catalog.feature.hadoop.fs.s3a.endpoint: "https://s3.{{ lakefs_datalake_s3_region }}.amazonaws.com/"
spark.sql.catalog.feature.hadoop.fs.s3a.aws.credentials.provider: com.amazonaws.auth.InstanceProfileCredentialsProvider
spark.sql.catalog.feature.hadoop.fs.lakefs.impl: io.lakefs.LakeFSFileSystem
spark.sql.catalog.feature.hadoop.fs.lakefs.access.key: "{{ emr_lakefs_access_key }}"
spark.sql.catalog.feature.hadoop.fs.lakefs.secret.key: "{{ emr_lakefs_secret_access_key }}"
spark.sql.catalog.feature.hadoop.fs.lakefs.endpoint: "https://{{ lakefs_api_domain_name }}"
spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
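The catalog settings above are written as key/value pairs, while the first message passes them as `--conf` flags. A minimal sketch of converting one form to the other follows. The function name `to_conf_args`, the `spark-sql` launcher, and the literal repository name `origin` in place of the template variable are assumptions for illustration only.

```python
import shlex
from typing import Dict, List


def to_conf_args(settings: Dict[str, str]) -> List[str]:
    """Turn a mapping of Spark settings into a flat list of --conf arguments."""
    args: List[str] = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# A subset of the settings from the thread; "origin" stands in for the templated repo name.
settings = {
    "spark.sql.catalog.feature": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.feature.type": "hadoop",
    "spark.sql.catalog.feature.warehouse": "lakefs://origin/",
    "spark.sql.catalog.feature.hadoop.fs.lakefs.impl": "io.lakefs.LakeFSFileSystem",
}

conf_args = to_conf_args(settings)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(["spark-sql", *conf_args]))
```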
Sander Hartlage
04/28/2022, 1:50 PM
`--conf` options when we start the process)
Ariel Shaqed (Scolnicov)
04/28/2022, 2:38 PM
Sander Hartlage
04/28/2022, 2:39 PM
`fs.lakefs.impl` in the `core-site` to use the `s3a` filesystem
Sander Hartlage
04/28/2022, 2:39 PM
`lakefs://` URLs are essentially `s3a://`
Ariel Shaqed (Scolnicov)
04/28/2022, 3:06 PM
Ariel Shaqed (Scolnicov)
04/28/2022, 3:10 PM
```
spark.sql.catalog.feature: org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.feature.type: hadoop
spark.sql.catalog.feature.warehouse: "lakefs://{{ lakefs_repo_name }}/"
spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```
IIUC. From where do you get the name of the branch to use on your lakeFS repo?
Sander Hartlage
04/28/2022, 3:13 PM
`feature1` branch in lakefs, we'd like to address the schemas and tables in that branch as `feature.feature1.schema.table`, which should correspond to directories under `lakefs://origin/feature1/`
Sander Hartlage
04/28/2022, 3:13 PM
`master` branch in lakefs, and the `feature` catalog should expose all the branches
Sander Hartlage
04/28/2022, 3:14 PM
`SELECT * FROM schema.table` should read from `lakefs://origin/master/schema/table` and `SELECT * FROM feature.feature1.schema.table` should read from `lakefs://origin/feature1/schema/table`
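The mapping described above can be sketched in a few lines of Python, assuming (as Iceberg's Hadoop catalog does) that each namespace level becomes a directory under the warehouse root. The function name `table_location` is hypothetical, and the default catalog is assumed to use `lakefs://origin/master/` as its warehouse, which the thread implies but does not show. The naive split on `.` ignores backtick-quoted identifiers.

```python
def table_location(warehouse: str, identifier: str, catalog: str = "feature") -> str:
    """Map a dotted table identifier to the directory it should resolve to."""
    parts = identifier.split(".")
    # Drop the leading catalog name, if present; it selects the warehouse, not a directory.
    if parts[0] == catalog:
        parts = parts[1:]
    return warehouse.rstrip("/") + "/" + "/".join(parts)


# Feature catalog rooted at the repository: the branch is the first namespace level.
print(table_location("lakefs://origin/", "feature.feature1.schema.table"))

# Default catalog, assumed to be rooted at the master branch.
print(table_location("lakefs://origin/master/", "schema.table"))
```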
Sander Hartlage
04/28/2022, 3:15 PM
`master` branch is what our BI tools are pointed at, and the `feature` branches allow isolated development and access to historical snapshots
Ariel Shaqed (Scolnicov)
04/28/2022, 3:48 PM
Sander Hartlage
04/28/2022, 3:52 PM
Sander Hartlage
04/28/2022, 3:57 PM
Ariel Shaqed (Scolnicov)
04/28/2022, 4:00 PM
Sander Hartlage
04/29/2022, 8:32 PM
`lakefs` hadoop filesystem, which is something that's been on our backlog for a while. Thanks for all your help!