Michael Gaebel
10/11/2023, 1:59 PM
v1.metadata.json
org.apache.iceberg.exceptions.CommitFailedException: Failed to commit changes using rename: s3a://lakefs-poc/main/rl_dev_datastage_01_ma_snapshot/sys_audit_event/metadata/v1.metadata.json
(more stack trace in reply)
org.apache.iceberg.exceptions.CommitFailedException: Failed to commit changes using rename: s3a://lakefs-poc/main/rl_dev_datastage_01_ma_snapshot/sys_audit_event/metadata/v1.metadata.json
at org.apache.iceberg.hadoop.HadoopTableOperations.renameToFinal(HadoopTableOperations.java:378) ~[iceberg-spark-runtime-3.3_2.12-1.3.1.jar:?]
at org.apache.iceberg.hadoop.HadoopTableOperations.commit(HadoopTableOperations.java:162) ~[iceberg-spark-runtime-3.3_2.12-1.3.1.jar:?]
at io.lakefs.iceberg.LakeFSTableOperations.commit(LakeFSTableOperations.java:37) ~[lakefs-iceberg-0.1.3.jar:0.1.3]
at org.apache.iceberg.BaseTransaction.commitCreateTransaction(BaseTransaction.java:311) ~[iceberg-spark-runtime-3.3_2.12-1.3.1.jar:?]
at org.apache.iceberg.BaseTransaction.commitTransaction(BaseTransaction.java:290) ~[iceberg-spark-runtime-3.3_2.12-1.3.1.jar:?]
at org.apache.iceberg.spark.source.StagedSparkTable.commitStagedChanges(StagedSparkTable.java:34) ~[iceberg-spark-runtime-3.3_2.12-1.3.1.jar:?]
...
Caused by: java.io.FileNotFoundException: No such file or directory: lakefs://lakefs-poc/main/rl_dev_datastage_01_ma_snapshot/sys_audit_event/metadata/25193118-7546-49de-b229-ef0f039bc2d9.metadata.json
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3866) ~[hadoop-aws-3.3.3-amzn-0.jar:?]
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688) ~[hadoop-aws-3.3.3-amzn-0.jar:?]
at org.apache.hadoop.fs.s3a.S3AFileSystem.initiateRename(S3AFileSystem.java:1887) ~[hadoop-aws-3.3.3-amzn-0.jar:?]
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerRename(S3AFileSystem.java:1988) ~[hadoop-aws-3.3.3-amzn-0.jar:?]
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$rename$7(S3AFileSystem.java:1846) ~[hadoop-aws-3.3.3-amzn-0.jar:?]
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499) ~[hadoop-client-api-3.3.3-amzn-0.jar:?]
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444) ~[hadoop-client-api-3.3.3-amzn-0.jar:?]
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337) ~[hadoop-aws-3.3.3-amzn-0.jar:?]
at org.apache.hadoop.fs.s3a.S3AFileSystem.rename(S3AFileSystem.java:1844) ~[hadoop-aws-3.3.3-amzn-0.jar:?]
at io.lakefs.routerfs.RouterFileSystem.rename(RouterFileSystem.java:197) ~[hadoop-router-fs-hadoop-2.9.2-assembly-0.1.0.jar:?]
at org.apache.iceberg.hadoop.HadoopTableOperations.renameToFinal(HadoopTableOperations.java:368) ~[iceberg-spark-runtime-3.3_2.12-1.3.1.jar:?]
Isan Rivkin
10/11/2023, 2:24 PM
Michael Gaebel
10/11/2023, 2:25 PM
#General Spark configs
("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"),
("spark.sql.sources.partitionOverwriteMode", "dynamic"),
("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED"),
#LakeFS configuration for Iceberg
("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1,io.lakefs:lakefs-iceberg:v0.1.3,io.lakefs:hadoop-router-fs-hadoop-2.9.2-assembly:0.1.0"),
("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog"),
("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog"),
("spark.sql.catalog.lakefs.warehouse", f"lakefs://{lakefs_repo}"),
("spark.sql.catalog.lakefs.uri", lakefs_endpoint),
("spark.sql.catalog.lakefs.cache-enabled", "false"),
#LakeFS filesystem
("spark.hadoop.fs.s3a.impl", "io.lakefs.routerfs.RouterFileSystem"),
("spark.hadoop.routerfs.mapping.s3a.1.replace", f"s3a://{lakefs_repo}"),
("spark.hadoop.routerfs.mapping.s3a.1.with", f"lakefs://{lakefs_repo}"),
("spark.hadoop.routerfs.default.fs.s3a", "org.apache.hadoop.fs.s3a.S3AFileSystem"),
("spark.hadoop.fs.lakefs.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem"),
#LakeFS S3 access
(f"spark.hadoop.fs.s3a.bucket.{lakefs_repo}.endpoint", f"{lakefs_endpoint}"),
(f"spark.hadoop.fs.s3a.bucket.{lakefs_repo}.access.key", lakefs_access_key),
(f"spark.hadoop.fs.s3a.bucket.{lakefs_repo}.secret.key", lakefs_secret_key),
(f"spark.hadoop.fs.s3a.bucket.{lakefs_repo}.path.style.access", "true"),
#Regular S3 access
("spark.hadoop.fs.s3a.endpoint.region", "ca-central-1"),
("spark.hadoop.fs.s3a.endpoint", "<https://s3.ca-central-1.amazonaws.com>"),
("spark.hadoop.fs.s3a.path.style.access", "true"),
#Configs needed for Iceberg
("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"),
lakefs.commits.commit(repo.id, lakefs_branch, CommitCreation(
message=f"Initial load table 'lakefs.{lakefs_branch}.{target_database}.{table}' for schemas '{schemaList}'",
metadata={'author': "glue"}
))
where lakefs is the configured client and the repo is fetched from that client:
lakefs = LakeFSClient(lakefs_config)
repo = lakefs.repositories.get_repository(lakefs_repo)
Isan Rivkin
10/11/2023, 2:34 PM
You're committing with the Python client (lakefs.commits.commit), but the stack trace is Java / Spark / Iceberg - what's the connection between them?
Michael Gaebel
10/11/2023, 2:37 PM
Isan Rivkin
10/11/2023, 2:56 PM
("spark.hadoop.fs.s3a.impl", "io.lakefs.routerfs.RouterFileSystem"),
You should try ("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
Check this doc for more info
Michael Gaebel
10/11/2023, 3:04 PM
S3AFileSystem after routing...
at org.apache.hadoop.fs.s3a.S3AFileSystem.rename(S3AFileSystem.java:1844) ~[hadoop-aws-3.3.3-amzn-0.jar:?]
at io.lakefs.routerfs.RouterFileSystem.rename(RouterFileSystem.java:197) ~[hadoop-router-fs-hadoop-2.9.2-assembly-0.1.0.jar:?]
at
Isan Rivkin
10/11/2023, 3:27 PM
Michael Gaebel
10/11/2023, 3:30 PM
Isan Rivkin
10/11/2023, 3:30 PM
Jonathan Rosenberg
10/11/2023, 3:40 PM
#LakeFS S3 access
(f"spark.hadoop.fs.s3a.bucket.{lakefs_repo}.endpoint", f"{lakefs_endpoint}"),
(f"spark.hadoop.fs.s3a.bucket.{lakefs_repo}.access.key", lakefs_access_key),
(f"spark.hadoop.fs.s3a.bucket.{lakefs_repo}.secret.key", lakefs_secret_key),
(f"spark.hadoop.fs.s3a.bucket.{lakefs_repo}.path.style.access", "true"),
The renaming of the scheme from s3a to lakefs is what’s causing the problem:
Caused by: java.io.FileNotFoundException: No such file or directory: lakefs://lakefs-poc/main/rl_dev_datastage_01_ma_snapshot/sys_audit_event/metadata/25193118-7546-49de-b229-ef0f039bc2d9.metadata.json
Spark doesn’t know what to do with it (because no Filesystem was configured to handle it in the lakeFS catalog’s context). This is fine.
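As a toy sketch of the prefix rewrite that the routerfs.mapping settings configure (illustrative only, not the io.lakefs.routerfs source):

```python
# The routerfs.mapping.s3a.1 settings pair a "replace" prefix with a "with"
# prefix; a URI matching the prefix gets rewritten before being dispatched
# to a FileSystem implementation.
MAPPING = ("s3a://lakefs-poc", "lakefs://lakefs-poc")  # .replace -> .with

def route(uri):
    prefix, replacement = MAPPING
    if uri.startswith(prefix):
        return replacement + uri[len(prefix):]
    return uri

rewritten = route("s3a://lakefs-poc/main/db/tbl/metadata/v1.metadata.json")
# rewritten == "lakefs://lakefs-poc/main/db/tbl/metadata/v1.metadata.json"
```

The rename path thus ends up carrying the lakefs:// scheme, which (as described above) nothing in the catalog's context resolves back to the actual object, hence the FileNotFoundException.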
Can you please change your configurations to:
#General Spark configs
("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"),
("spark.sql.sources.partitionOverwriteMode", "dynamic"),
("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED"),
#LakeFS configuration for Iceberg
("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1,io.lakefs:lakefs-iceberg:v0.1.3"),
("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog"),
("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog"),
("spark.sql.catalog.lakefs.warehouse", f"lakefs://{lakefs_repo}"),
("spark.sql.catalog.lakefs.uri", lakefs_endpoint),
("spark.sql.catalog.lakefs.cache-enabled", "false"),
#LakeFS S3 access
(f"spark.hadoop.fs.s3a.bucket.{lakefs_repo}.endpoint", f"{lakefs_endpoint}"),
(f"spark.hadoop.fs.s3a.bucket.{lakefs_repo}.access.key", lakefs_access_key),
(f"spark.hadoop.fs.s3a.bucket.{lakefs_repo}.secret.key", lakefs_secret_key),
(f"spark.hadoop.fs.s3a.bucket.{lakefs_repo}.path.style.access", "true"),
#Regular S3 access
("spark.hadoop.fs.s3a.endpoint.region", "ca-central-1"),
("spark.hadoop.fs.s3a.endpoint", "<https://s3.ca-central-1.amazonaws.com>"),
("spark.hadoop.fs.s3a.path.style.access", "true"),
#Configs needed for Iceberg
("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"),
and test again?
Michael Gaebel
10/11/2023, 3:48 PM
Jonathan Rosenberg
10/11/2023, 3:49 PM