# help
u
Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: java.io.FileNotFoundException PUT 0-byte object on main/: com.amazonaws.services.s3.model.AmazonS3Exception: Not Found; request: PUT https://lakefs.k8s.yotpo.xyz dbt-chaim/main/ {} Hadoop 2.7.4, aws-sdk-java/1.11.655 Linux/5.4.0-1058-aws OpenJDK_64-Bit_Server_VM/25.275-b01 java/1.8.0_275 scala/2.12.10 kotlin/1.2.71 vendor/Azul_Systems,_Inc. com.amazonaws.services.s3.model.PutObjectRequest; Request ID: null, Extended Request ID: null, Cloud Provider: AWS, Instance ID: i-02fde599ab86b480c (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: null; S3 Extended Request ID: null), S3 Extended Request ID: null:404 Not Found);
u
Hey @Chaim Turkel, from the message I'm guessing the following is true:
1. you're attempting to either create a table or write to an existing one
2. https://lakefs.k8s.yotpo.xyz is your lakeFS endpoint
3. `dbt-chaim` is an existing repo in the lakeFS installation
4. `main` is an existing branch in the `dbt-chaim` repository
Is this true? Could you maybe share the code that resulted in this error?
u
also, if you could share the spark configuration you've set for the Spark cluster in Databricks, that would be really helpful
u
your assumptions are correct
u
CREATE DATABASE IF NOT EXISTS dbt_test1 LOCATION 's3a://dbt-chaim/main';
u
spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem
hoodie.compaction.payload.class org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
spark.hadoop.fs.s3a.connection.maximum 100
spark.ui.prometheus.enabled true
spark.hadoop.fs.s3a.acl.default BucketOwnerFullControl
spark.hadoop.fs.s3n.impl com.databricks.s3a.S3AFileSystem
spark.sql.legacy.parquet.datetimeRebaseModeInRead LEGACY
spark.hadoop.fs.s3a.canned.acl BucketOwnerFullControl
spark.hadoop.mapreduce.input.pathFilter.class org.apache.hudi.hadoop.HoodieROTablePathFilter
spark.databricks.conda.condaMagic.enabled true
spark.hadoop.fs.s3a.impl com.databricks.s3a.S3AFileSystem
spark.sql.hive.metastore.jars builtin
spark.sql.hive.metastore.version 2.3.7
spark.hadoop.hive.metastore.uris thrift://10.101.125.5:25701
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.bucket.dbt-chaim.endpoint https://lakefs.k8s.yotpo.xyz
spark.hadoop.fs.s3a.bucket.dbt-chaim.access.key ---
spark.hadoop.fs.s3a.bucket.dbt-chaim.aws.credentials.provider shaded.databricks.org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
spark.hadoop.fs.s3a.bucket.dbt-chaim.session.token lakefs
spark.hadoop.fs.s3a.bucket.dbt-chaim.secret.key ---
u
Thanks - let me look into it and try to reproduce
u
can you help me understand what the error might be, and how i can check it? is there a cli i can use to check, similar to what databricks runs?
u
From the looks of it, it seems that `CREATE DATABASE` tries to ensure the path used is writable: it attempts to write (see `PUT 0-byte object on main/` in the error you sent) to the destination you've passed. For some reason it gets back a 404 error, which usually happens when trying to write to a path or repo that doesn't exist. The s3a driver implementation you're using is the proprietary Databricks one (see `spark.hadoop.fs.s3a.impl com.databricks.s3a.S3AFileSystem` in the Spark config), so I'm not sure whether there are any good tools to debug it from the CLI. At the moment, I'm trying to recreate this setup and turn on debug logging on a lakeFS server: with `logging.level: DEBUG`, lakeFS should print a summary of every HTTP request it receives, so hopefully this will give me a clue about what their s3a implementation is attempting to do.
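Just to illustrate, a minimal sketch of the server-side setting I mean (an excerpt of a lakeFS configuration file; everything else stays as in your deployment):

# lakeFS server configuration (excerpt) - turn on request-level debug logging
logging:
  level: DEBUG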
u
ok
u
could you also share the `hive-site.xml` file you're using?
u
@Chaim Turkel can you please try replacing the `LOCATION` field with `s3a://dbt-chaim/main/test1`? There's perhaps an edge case related to creating tables or databases at the root of a branch (opening an issue for it regardless).
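i.e. something roughly like this (a sketch that keeps the database name from your earlier statement):

-- same CREATE DATABASE as before, with LOCATION pointing at a path under the branch rather than the branch root
CREATE DATABASE IF NOT EXISTS dbt_test1 LOCATION 's3a://dbt-chaim/main/test1';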
u
i had an issue with the hive-site.xml file. fixed that
u
now i get
u
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: java.io.IOException: Bucket dbt-chaim does not exist
at com.databricks.s3a.S3AFileSystem.initialize(S3AFileSystem.java:480)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:206)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:244)
at org.apache.spark.sql.execution.command.CreateDatabaseCommand.run(ddl.scala:84)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:235)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3728)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:203)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3726)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:235)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:105)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:102)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:676)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:671)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:672)
at com.databricks.backend.daemon.driver.SQLDriverLocal.$anonfun$executeSql$1(SQLDriverLocal.scala:91)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.immutable.List.map(List.scala:298)
at com.databricks.backend.daemon.driver.SQLDriverLocal.executeSql(SQLDriverLocal.scala:37)
at com.databricks.backend.daemon.driver.SQLDriverLocal.repl(SQLDriverLocal.scala:144)
at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$11(DriverLocal.scala:451)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:240)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:235)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:232)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:49)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:277)
u
how does it know to go to lakefs or aws?
u
Hive metastore uses the `hive-site.xml` to get the location (Hive metastore only validates access to the path). Databricks uses the s3a configurations.
u
It seems that when using `spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem`, the per-bucket configuration is not supported. Can you please try removing that configuration, just for validation?
u
Another option would be to not use per-bucket configuration:
spark.hadoop.fs.s3a.bucket.dbt-chaim.endpoint --> spark.hadoop.fs.s3a.endpoint
spark.hadoop.fs.s3a.bucket.dbt-chaim.access.key --> spark.hadoop.fs.s3a.access.key
spark.hadoop.fs.s3a.bucket.dbt-chaim.secret.key --> spark.hadoop.fs.s3a.secret.key
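i.e. roughly the following (a sketch reusing the values from the per-bucket settings you shared, secrets still redacted; the rest of your fs.s3a.* settings stay as they are):

spark.hadoop.fs.s3a.endpoint https://lakefs.k8s.yotpo.xyz
spark.hadoop.fs.s3a.access.key ---
spark.hadoop.fs.s3a.secret.key ---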
u
Another option for allowing access to both S3 and lakeFS (without the need for per-bucket configuration) would be to use our lakeFS Hadoop FileSystem.
u
where can i read about the different options?
u
this is probably a good starting point: https://docs.lakefs.io/integrations/spark.html
u
i decided to go with the Hadoop FileSystem solution. i added the lakeFS jar to Databricks. my config is:
spark.ui.prometheus.enabled true
spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl com.databricks.s3a.S3AFileSystem
spark.hadoop.fs.s3a.acl.default BucketOwnerFullControl
spark.hadoop.fs.s3a.canned.acl BucketOwnerFullControl
spark.hadoop.fs.s3a.connection.maximum 100
spark.hadoop.fs.s3a.impl com.databricks.s3a.S3AFileSystem
spark.hadoop.fs.s3a.path.style.access true
spark.sql.legacy.parquet.datetimeRebaseModeInRead LEGACY
spark.hadoop.mapreduce.input.pathFilter.class org.apache.hudi.hadoop.HoodieROTablePathFilter
spark.databricks.conda.condaMagic.enabled true
spark.sql.hive.metastore.jars builtin
spark.sql.hive.metastore.version 2.3.7
spark.hadoop.hive.metastore.uris thrift://---
spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
fs.lakefs.access.key ---
fs.lakefs.secret.key ---
fs.lakefs.endpoint ---
i am running:
CREATE DATABASE IF NOT EXISTS dbt_test1 LOCATION 'lakefs://dbt-chaim/main/test1';
the error is:
Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: java.io.IOException No FileSystem for scheme: lakefs);
u
Hi @Chaim Turkel, I believe you need to register the lakeFS jar. Let me look for an example and I'll get back to you :)
u
On a second look I think my first suggestion wasn't right, so I'll try to reproduce the error.
u
ok, waiting for your answer
u
i added the jar, and now i get the error:
u
Error in SQL statement: NoSuchMethodError: kotlin.collections.ArraysKt.copyInto([B[BIII)[B
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: java.lang.NoSuchMethodError: kotlin.collections.ArraysKt.copyInto([B[BIII)[B
at io.lakefs.shaded.okio.Segment.writeTo(Segment.kt:169)
at io.lakefs.shaded.okio.Segment.compact(Segment.kt:152)
at io.lakefs.shaded.okio.Buffer.write(Buffer.kt:1842)
at io.lakefs.shaded.okio.Buffer.read(Buffer.kt:1854)
at io.lakefs.shaded.okio.Buffer.writeAll(Buffer.kt:1642)
at io.lakefs.shaded.okio.Options$Companion.buildTrieRecursive(Options.kt:187)
at io.lakefs.shaded.okio.Options$Companion.buildTrieRecursive(Options.kt:174)
at io.lakefs.shaded.okio.Options$Companion.buildTrieRecursive$default(Options.kt:113)
at io.lakefs.shaded.okio.Options$Companion.of(Options.kt:72)
at io.lakefs.shaded.okhttp3.internal.Util.<clinit>(Util.kt:71)
at io.lakefs.shaded.okhttp3.internal.concurrent.TaskRunner.<clinit>(TaskRunner.kt:309)
at io.lakefs.shaded.okhttp3.ConnectionPool.<init>(ConnectionPool.kt:41)
at io.lakefs.shaded.okhttp3.ConnectionPool.<init>(ConnectionPool.kt:47)
at io.lakefs.shaded.okhttp3.OkHttpClient$Builder.<init>(OkHttpClient.kt:471)
at io.lakefs.clients.api.ApiClient.initHttpClient(ApiClient.java:117)
at io.lakefs.clients.api.ApiClient.initHttpClient(ApiClient.java:113)
at io.lakefs.clients.api.ApiClient.<init>(ApiClient.java:86)
at io.lakefs.clients.api.Configuration.<clinit>(Configuration.java:18)
at io.lakefs.LakeFSClient.<init>(LakeFSClient.java:32)
at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:67)
u
Which Databricks version do you have?
u
custom:custom-local__7.x-snapshot-scala2.12__unknown__head__8437e42__e60bf1b__toddgreenstein__2822c5f__format-2.lz4'
u
Great, thanks! We'll look at it.
u
thanks
u
is there a version you suggest i use?
u
new error
u
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: java.io.IOException Bucket dbt-chaim does not exist)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$withClient$2(HiveExternalCatalog.scala:161)
at org.apache.spark.sql.hive.HiveExternalCatalog.maybeSynchronized(HiveExternalCatalog.scala:112)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$withClient$1(HiveExternalCatalog.scala:150)
at com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:377)
at com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:363)
at com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:34)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:149)
at org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:268)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createDatabase(ExternalCatalogWithListener.scala:55)
at org.apache.spark.sql.catalyst.catalog.SessionCatalogImpl.createDatabase(SessionCatalog.scala:599)
at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.createDatabase(ManagedCatalogSessionCatalog.scala:265)
at org.apache.spark.sql.execution.command.CreateDatabaseCommand.run(ddl.scala:85)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:235)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3825)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:130)
u
Thanks for the update. Can you please try changing the configuration of `hive-site.xml` to:
<property>
    <name>fs.s3a.bucket.dbt-chaim.endpoint</name>
    <value>http://lakefs:8000</value>
</property>
(with the relevant lakeFS endpoint)
Also, I believe the docs weren't completely clear, so I opened an issue to fix it. Looking forward to hearing if it worked. 🙂
u
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://10.15.91.199:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>vladopen</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value></value>
    <description></description>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value></value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value></value>
  </property>
  <property>
    <name>hive.security.authorization.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.metastore.disallow.incompatible.col.type.changes</name>
    <value>false</value>
    <description>If true (default is false), ALTER TABLE operations which change the type of a column (say STRING) to an incompatible type (say MAP<STRING, STRING>) are disallowed. RCFile default SerDe (ColumnarSerDe) serializes the values in such a way that the datatypes can be converted from string to any type. The map is also serialized as a string, which can be read as a string as well. However, with any binary serialization, this is not true. Blocking the ALTER TABLE prevents ClassCastExceptions when subsequently trying to access old partitions. Primitive types like INT, STRING, BIGINT, etc are compatible with each other and are not blocked. See HIVE-4409 for more details.</description>
  </property>
  <property>
    <name>hive.async.log.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.thrift.max.worker.threads</name>
    <value>2000</value>
  </property>
  <pr…
u
Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: java.io.IOException Bucket dbt-chaim does not exist)
u
u
Thanks!
u
any news?
u
Working on it, I'll tell you when we are done 🙂