# help
u
Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: java.io.FileNotFoundException PUT 0-byte object on main/: com.amazonaws.services.s3.model.AmazonS3Exception: Not Found; request: PUT https://lakefs.k8s.yotpo.xyz dbt-chaim/main/ {} Hadoop 2.7.4, aws-sdk-java/1.11.655 Linux/5.4.0-1058-aws OpenJDK_64-Bit_Server_VM/25.275-b01 java/1.8.0_275 scala/2.12.10 kotlin/1.2.71 vendor/Azul_Systems,_Inc. com.amazonaws.services.s3.model.PutObjectRequest; Request ID: null, Extended Request ID: null, Cloud Provider: AWS, Instance ID: i-02fde599ab86b480c (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: null; S3 Extended Request ID: null), S3 Extended Request ID: null:404 Not Found);
u
Hey @Chaim Turkel, from the message I'm guessing the following is true:
1. you're attempting to either create a table or write to an existing one
2. https://lakefs.k8s.yotpo.xyz is your lakeFS endpoint
3. `dbt-chaim` is an existing repo in the lakeFS installation
4. `main` is an existing branch in the `dbt-chaim` repository
Is this true? Could you maybe share the code that resulted in this error?
u
also, if you could share the spark configuration you've set for the Spark cluster in Databricks, that would be really helpful
u
your assumptions are correct
u
CREATE DATABASE IF NOT EXISTS dbt_test1 LOCATION 's3a://dbt-chaim/main';
u
spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem
hoodie.compaction.payload.class org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
spark.hadoop.fs.s3a.connection.maximum 100
spark.ui.prometheus.enabled true
spark.hadoop.fs.s3a.acl.default BucketOwnerFullControl
spark.hadoop.fs.s3n.impl com.databricks.s3a.S3AFileSystem
spark.sql.legacy.parquet.datetimeRebaseModeInRead LEGACY
spark.hadoop.fs.s3a.canned.acl BucketOwnerFullControl
spark.hadoop.mapreduce.input.pathFilter.class org.apache.hudi.hadoop.HoodieROTablePathFilter
spark.databricks.conda.condaMagic.enabled true
spark.hadoop.fs.s3a.impl com.databricks.s3a.S3AFileSystem
spark.sql.hive.metastore.jars builtin
spark.sql.hive.metastore.version 2.3.7
spark.hadoop.hive.metastore.uris thrift://10.101.125.5:25701
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.bucket.dbt-chaim.endpoint https://lakefs.k8s.yotpo.xyz
spark.hadoop.fs.s3a.bucket.dbt-chaim.access.key ---
spark.hadoop.fs.s3a.bucket.dbt-chaim.aws.credentials.provider shaded.databricks.org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
spark.hadoop.fs.s3a.bucket.dbt-chaim.session.token lakefs
spark.hadoop.fs.s3a.bucket.dbt-chaim.secret.key ---
u
Thanks - let me look into it and try to reproduce
u
can you help me understand what the error might be, and how i can check it? is there a cli i can use to check, similar to what databricks runs?
u
From the looks of it, it seems that `CREATE DATABASE` tries to ensure the path used is writable: it attempts to write (see `PUT 0-byte object on main/` in the error you sent) to the destination you've passed. For some reason it gets back a 404 error, which usually happens when trying to write to a path or repo that doesn't exist. The s3a driver implementation you're using is the proprietary Databricks one (see `spark.hadoop.fs.s3a.impl com.databricks.s3a.S3AFileSystem` in the Spark config), so I'm not sure whether there are any good tools to debug it from the CLI. At the moment, I'm trying to recreate this setup and turn on debug logging on a lakeFS server: with `logging.level: DEBUG`, lakeFS should print a summary of every HTTP request it receives, so hopefully this will give me a clue about what their s3a implementation is attempting to do.
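Just to illustrate, a minimal sketch of the server-side setting I mean (an excerpt of a lakeFS configuration file; everything else stays as in your deployment):

# lakeFS server configuration (excerpt) - turn on request-level debug logging
logging:
  level: DEBUG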
u
ok
u
could you also share the `hive-site.xml` file you're using?
u
@Chaim Turkel can you please try replacing the `LOCATION` field with `s3a://dbt-chaim/main/test1`? There's perhaps an edge case related to creating tables or databases at the root of a branch (opening an issue for it regardless).
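i.e. something roughly like this (a sketch that keeps the database name from your earlier statement):

-- same CREATE DATABASE as before, with LOCATION pointing at a path under the branch rather than the branch root
CREATE DATABASE IF NOT EXISTS dbt_test1 LOCATION 's3a://dbt-chaim/main/test1';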
u
i had an issue with the hive-site.xml file. fixed that
u
now i get
u
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: java.io.IOException: Bucket dbt-chaim does not exist
at com.databricks.s3a.S3AFileSystem.initialize(S3AFileSystem.java:480)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:206)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:244)
at org.apache.spark.sql.execution.command.CreateDatabaseCommand.run(ddl.scala:84)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:235)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3728)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:203)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3726)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:235)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:105)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:102)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:676)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:671)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:672)
at com.databricks.backend.daemon.driver.SQLDriverLocal.$anonfun$executeSql$1(SQLDriverLocal.scala:91)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.immutable.List.map(List.scala:298)
at com.databricks.backend.daemon.driver.SQLDriverLocal.executeSql(SQLDriverLocal.scala:37)
at com.databricks.backend.daemon.driver.SQLDriverLocal.repl(SQLDriverLocal.scala:144)
at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$11(DriverLocal.scala:451)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:240)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:235)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:232)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:49)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:277)
u
how does it know to go to lakefs or aws?
u
Hive metastore uses the `hive-site.xml` to get the location (Hive metastore only validates access to the path). Databricks uses the s3a configurations.
u
It seems that when using `spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem`, the per-bucket configuration is not supported. Can you please try removing that configuration, just for validation?
u
Another option would be to not use per-bucket configuration:
spark.hadoop.fs.s3a.bucket.dbt-chaim.endpoint --> spark.hadoop.fs.s3a.endpoint
spark.hadoop.fs.s3a.bucket.dbt-chaim.access.key --> spark.hadoop.fs.s3a.access.key
spark.hadoop.fs.s3a.bucket.dbt-chaim.secret.key --> spark.hadoop.fs.s3a.secret.key
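i.e. roughly the following (a sketch reusing the values from the per-bucket settings you shared, secrets still redacted; the rest of your fs.s3a.* settings stay as they are):

spark.hadoop.fs.s3a.endpoint https://lakefs.k8s.yotpo.xyz
spark.hadoop.fs.s3a.access.key ---
spark.hadoop.fs.s3a.secret.key ---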
u
Another option for allowing access to both S3 and lakeFS (without the need for per-bucket configuration) would be to use our lakeFS Hadoop FileSystem.
u
where can i read about the different options?
u
this is probably a good starting point: https://docs.lakefs.io/integrations/spark.html
u
i decided to go with the Hadoop FileSystem solution. i added the lakeFS jar to Databricks. my config is:
spark.ui.prometheus.enabled true
spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl com.databricks.s3a.S3AFileSystem
spark.hadoop.fs.s3a.acl.default BucketOwnerFullControl
spark.hadoop.fs.s3a.canned.acl BucketOwnerFullControl
spark.hadoop.fs.s3a.connection.maximum 100
spark.hadoop.fs.s3a.impl com.databricks.s3a.S3AFileSystem
spark.hadoop.fs.s3a.path.style.access true
spark.sql.legacy.parquet.datetimeRebaseModeInRead LEGACY
spark.hadoop.mapreduce.input.pathFilter.class org.apache.hudi.hadoop.HoodieROTablePathFilter
spark.databricks.conda.condaMagic.enabled true
spark.sql.hive.metastore.jars builtin
spark.sql.hive.metastore.version 2.3.7
spark.hadoop.hive.metastore.uris thrift://---
spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
fs.lakefs.access.key ---
fs.lakefs.secret.key ---
fs.lakefs.endpoint ---
i am running:
CREATE DATABASE IF NOT EXISTS dbt_test1 LOCATION 'lakefs://dbt-chaim/main/test1';
the error is:
Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: java.io.IOException No FileSystem for scheme: lakefs);
u
Hi @Chaim Turkel, I believe you need to register the lakeFS jar. Let me look for an example and I'll get back to you :)
u
On a second look I think my first suggestion wasn't right, so I'll try to reproduce the error.
u
ok, waiting for your answer
u
i added the jar, and now i get the error:
u
Error in SQL statement: NoSuchMethodError: kotlin.collections.ArraysKt.copyInto([B[BIII)[B
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: java.lang.NoSuchMethodError: kotlin.collections.ArraysKt.copyInto([B[BIII)[B
at io.lakefs.shaded.okio.Segment.writeTo(Segment.kt:169)
at io.lakefs.shaded.okio.Segment.compact(Segment.kt:152)
at io.lakefs.shaded.okio.Buffer.write(Buffer.kt:1842)
at io.lakefs.shaded.okio.Buffer.read(Buffer.kt:1854)
at io.lakefs.shaded.okio.Buffer.writeAll(Buffer.kt:1642)
at io.lakefs.shaded.okio.Options$Companion.buildTrieRecursive(Options.kt:187)
at io.lakefs.shaded.okio.Options$Companion.buildTrieRecursive(Options.kt:174)
at io.lakefs.shaded.okio.Options$Companion.buildTrieRecursive$default(Options.kt:113)
at io.lakefs.shaded.okio.Options$Companion.of(Options.kt:72)
at io.lakefs.shaded.okhttp3.internal.Util.<clinit>(Util.kt:71)
at io.lakefs.shaded.okhttp3.internal.concurrent.TaskRunner.<clinit>(TaskRunner.kt:309)
at io.lakefs.shaded.okhttp3.ConnectionPool.<init>(ConnectionPool.kt:41)
at io.lakefs.shaded.okhttp3.ConnectionPool.<init>(ConnectionPool.kt:47)
at io.lakefs.shaded.okhttp3.OkHttpClient$Builder.<init>(OkHttpClient.kt:471)
at io.lakefs.clients.api.ApiClient.initHttpClient(ApiClient.java:117)
at io.lakefs.clients.api.ApiClient.initHttpClient(ApiClient.java:113)
at io.lakefs.clients.api.ApiClient.<init>(ApiClient.java:86)
at io.lakefs.clients.api.Configuration.<clinit>(Configuration.java:18)
at io.lakefs.LakeFSClient.<init>(LakeFSClient.java:32)
at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:67)
u
Which Databricks version do you have?
u
custom:custom-local__7.x-snapshot-scala2.12__unknown__head__8437e42__e60bf1b__toddgreenstein__2822c5f__format-2.lz4'
u
Great, thanks! We'll look at it.
u
thanks
u
is there a version you suggest i use?
u
new error
u
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: java.io.IOException Bucket dbt-chaim does not exist)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$withClient$2(HiveExternalCatalog.scala:161)
at org.apache.spark.sql.hive.HiveExternalCatalog.maybeSynchronized(HiveExternalCatalog.scala:112)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$withClient$1(HiveExternalCatalog.scala:150)
at com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:377)
at com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:363)
at com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:34)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:149)
at org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:268)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createDatabase(ExternalCatalogWithListener.scala:55)
at org.apache.spark.sql.catalyst.catalog.SessionCatalogImpl.createDatabase(SessionCatalog.scala:599)
at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.createDatabase(ManagedCatalogSessionCatalog.scala:265)
at org.apache.spark.sql.execution.command.CreateDatabaseCommand.run(ddl.scala:85)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:235)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3825)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:130)
u
Thanks for the update. Can you please try changing the configuration of `hive-site.xml` to:
<property>
    <name>fs.s3a.bucket.dbt-chaim.endpoint</name>
    <value>http://lakefs:8000</value>
</property>
(with the relevant lakeFS endpoint)
Also, I believe the docs weren't completely clear, so I opened an issue to fix it. Looking forward to hearing if it worked. 🙂
u
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://10.15.91.199:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>vladopen</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value></value>
    <description></description>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value></value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value></value>
  </property>
  <property>
    <name>hive.security.authorization.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.metastore.disallow.incompatible.col.type.changes</name>
    <value>false</value>
    <description>If true (default is false), ALTER TABLE operations which change the type of a column (say STRING) to an incompatible type (say MAP<STRING, STRING>) are disallowed. RCFile default SerDe (ColumnarSerDe) serializes the values in such a way that the datatypes can be converted from string to any type. The map is also serialized as a string, which can be read as a string as well. However, with any binary serialization, this is not true. Blocking the ALTER TABLE prevents ClassCastExceptions when subsequently trying to access old partitions. Primitive types like INT, STRING, BIGINT, etc are compatible with each other and are not blocked. See HIVE-4409 for more details.</description>
  </property>
  <property>
    <name>hive.async.log.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.thrift.max.worker.threads</name>
    <value>2000</value>
  </property>
  <pr…
u
Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: java.io.IOException Bucket dbt-chaim does not exist)
u
u
Thanks!
u
any news?
u
Working on it, I'll tell you when we are done 🙂