# help
b
Can you share your full stack trace as text here? I would like to see the initial error reported.
s
When I am running this command it is just creating a folder with the delta log, not with parquet files.
b
I see that it fails to write the data using the s3a filesystem
s
Yes, I guess
b
checking
Did you add the s3a or any other jar to your environment?
s
Do I need to add anything else other than this?
b
I think you are fine, as you are setting the right properties.
s
Yeah, I can't understand
why it is failing
b
From the stack it looks like the version of the s3a implementation that comes from the AWS Hadoop package is not compatible with the Spark we are running.
This is one of the main reasons it fails to find a method it is trying to execute,
like in this issue.
s
But when I am doing operations directly with a parquet file or CSV file, it is working fine.
b
The write operation you perform uses
org.apache.hadoop.util.SemaphoredDelegatingExecutor
Does the jars.packages setting in the screenshot above include jars for delta?
It may explain why, when you write delta, it goes through this specific implementation - and this is the integration point where we have two incompatible versions running. This is just a thought.
Can you list which jar/version you import?
Which Delta Lake version do you import in "spark.jars.packages"?
Can you try to write the same information using the s3a endpoint to a bucket in AWS?
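A minimal PySpark sketch of what I mean by checking the versions (assuming a live spark session; the class names are the ones from your stack trace) - it prints which jar each class was loaded from, so a hadoop-aws / hadoop-common mismatch shows up directly:

jvm = spark.sparkContext._jvm
for cls in ["org.apache.hadoop.fs.s3a.S3AFileSystem",
            "org.apache.hadoop.util.SemaphoredDelegatingExecutor"]:
    # getCodeSource() may be None for bootstrap classes; good enough for a quick check
    src = jvm.java.lang.Class.forName(cls).getProtectionDomain().getCodeSource()
    print(cls, "->", src.getLocation() if src is not None else "bootstrap classpath")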
s
spark.jars.packages io.delta:delta-core_2.12:2.0.0rc1
Like how?
b
Instead of giving an address to the lakeFS bucket, write the same data to a bucket on AWS S3 (a regular bucket).
It will verify that we can write a delta file using s3a in this environment.
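A minimal sketch of that test, assuming a DataFrame df already exists and using a placeholder bucket/path - write the same data as Delta through the s3a:// scheme to a plain AWS bucket:

df.write.format("delta") \
    .mode("overwrite") \
    .save("s3a://some-regular-aws-bucket/delta-write-test/")  # placeholder bucket and path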
s
Yeah, it is working
Did you mean on my S3, without pointing to lakeFS?
b
Yes, but use the 's3a://...' scheme
It should go through the same implementation to write the data.
It's not the same as 's3://'
s
There I don't need to give s3 or s3a, just the db_name.table
b
It is just for testing the way we write using s3a.
I would like to know if we are getting the same error, to understand whether it is related to the lakeFS S3 gateway.
s
It is working with s3
using it as s3 works, but it gives an error with s3a
same error
b
Good - so it leaves lakeFS out of the loop while we try to fix the s3a compatibility.
One option we can test before going down this path - can you configure the s3 properties like you did for s3a?
Not sure it supports bucket-level settings - so just for the test, apply it globally and try to write to lakeFS using 's3://...'
s
I guess the same in the conf file - instead of s3a just use s3?
b
spark.hadoop.fs.s3.access...
without the 'bucket' parts
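Roughly something like this in PySpark, just for the test (the endpoint and keys are placeholders, and it is an assumption that the s3:// filesystem on your cluster honors these property names at all):

hconf = spark.sparkContext._jsc.hadoopConfiguration()
for scheme in ("s3", "s3a"):  # mirror the s3a settings under fs.s3 as well
    hconf.set(f"fs.{scheme}.endpoint", "https://<your-lakefs-endpoint>")    # placeholder
    hconf.set(f"fs.{scheme}.access.key", "<lakefs-access-key-id>")          # placeholder
    hconf.set(f"fs.{scheme}.secret.key", "<lakefs-secret-access-key>")      # placeholder
    hconf.set(f"fs.{scheme}.path.style.access", "true")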
Going over the Delta docs - https://docs.delta.io/2.0.0/delta-storage.html#quickstart-s3-single-cluster - which use s3a in the examples, they also specify the required hadoop-aws package version. Example from the doc:
io.delta:delta-core_2.12:2.0.0,org.apache.hadoop:hadoop-aws:3.3.1
Delta Lake needs the org.apache.hadoop.fs.s3a.S3AFileSystem class from the hadoop-aws package, which implements Hadoop's FileSystem API for S3. Make sure the version of this package matches the Hadoop version with which Spark was built.
This is the part where I think our execution breaks when we use s3a.
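A quick way to see what to match against, assuming a PySpark session - print the Spark version and the Hadoop version the runtime actually carries, then pin hadoop-aws to that Hadoop version:

print("Spark :", spark.version)
print("Hadoop:", spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())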
s
It's still not working
b
Can you paste the error when using s3:// ?
s
Py4JJavaError: An error occurred while calling o115.parquet. : java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: TGXZRY0RB5RDFZPC; S3 Extended Request ID: IxDB0qz11fORj0XVzq6BuCUeDZsShfF9743De9geAa/hHhJocizW4t10x6Gp55u7FrWKSMv2aSY=; Proxy: null), S3 Extended Request ID: IxDB0qz11fORj0XVzq6BuCUeDZsShfF9743De9geAa/hHhJocizW4t10x6Gp55u7FrWKSMv2aSY= at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:423) at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.isFolderUsingFolderObject(Jets3tNativeFileSystemStore.java:249) at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.isFolder(Jets3tNativeFileSystemStore.java:212) at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:515) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1690) at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:399) at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:779) at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:777) at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372) at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659) at scala.util.Success.$anonfun$map$1(Try.scala:255) at scala.util.Success.map(Try.scala:213) at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402) at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: TGXZRY0RB5RDFZPC; S3 Extended Request ID: IxDB0qz11fORj0XVzq6BuCUeDZsShfF9743De9geAa/hHhJocizW4t10x6Gp55u7FrWKSMv2aSY=; Proxy: null), S3 Extended Request ID: IxDB0qz11fORj0XVzq6BuCUeDZsShfF9743De9geAa/hHhJocizW4t10x6Gp55u7FrWKSMv2aSY= at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1862) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1415) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1384) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779) at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5453) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5400) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5394) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.listObjectsV2(AmazonS3Client.java:971) at com.amazon.ws.emr.hadoop.fs.s3.lite.call.ListObjectsV2Call.perform(ListObjectsV2Call.java:26) at com.amazon.ws.emr.hadoop.fs.s3.lite.call.ListObjectsV2Call.perform(ListObjectsV2Call.java:12) at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor$CallPerformer.call(GlobalS3Executor.java:111) at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:138) at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:191) at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:186) at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.listObjectsV2(AmazonS3LiteClient.java:75) at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:414) ... 20 more (<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError('An error occurred while calling o115.parquet.\n', JavaObject id=o118), <traceback object at 0xffffae8b9d70>)
When doing it with s3a in the conf, then it is working fine.
b
From the stack trace it looks like it goes to AWS.
Will it be possible to load hadoop-aws 3.3.1 as specified in the docs and write using s3a to lakeFS or AWS?
s
Still not working. I tried changing the package and also with s3 and s3a, but it still only creates the folder, not writing the df as parquet files.
b
It fails with the same (original) exception - missing method signature?
I can try to set up a similar environment - if I follow this documentation https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-EC2-notebook.html will I have an environment like yours? Or is it a different AWS service I need to look for?
s
Yes, same error
Py4JJavaError: An error occurred while calling o259.save. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:252) at org.apache.spark.sql.delta.files.TransactionalWrite.$anonfun$writeFiles$3(TransactionalWrite.scala:342) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232) at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68) at org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles(TransactionalWrite.scala:297) at org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles$(TransactionalWrite.scala:245) at org.apache.spark.sql.delta.OptimisticTransaction.writeFiles(OptimisticTransaction.scala:100) at org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles(TransactionalWrite.scala:212) at org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles$(TransactionalWrite.scala:209) at org.apache.spark.sql.delta.OptimisticTransaction.writeFiles(OptimisticTransaction.scala:100) at org.apache.spark.sql.delta.commands.WriteIntoDelta.write(WriteIntoDelta.scala:249) at org.apache.spark.sql.delta.commands.WriteIntoDelta.$anonfun$run$1(WriteIntoDelta.scala:97) at org.apache.spark.sql.delta.commands.WriteIntoDelta.$anonfun$run$1$adapted(WriteIntoDelta.scala:90) at org.apache.spark.sql.delta.DeltaLog.withNewTransaction(DeltaLog.scala:224) at org.apache.spark.sql.delta.commands.WriteIntoDelta.run(WriteIntoDelta.scala:90) at org.apache.spark.sql.delta.sources.DeltaDataSource.createRelation(DeltaDataSource.scala:161) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:115) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232) at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135) at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:112) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:108) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:519) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:519) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:495) at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:108) at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:95) at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:93) at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:136) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:303) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 27.0 failed 4 times, most recent failure: Lost task 0.3 in stage 27.0 (TID 174) (ip-10-11-241-141.ec2.internal executor 3): java.lang.NoSuchMethodError: 
org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:813) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1125) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1105) at org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74) at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:329) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:482) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:36) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:160) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:146) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:291) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$16(FileFormatWriter.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:133) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1474) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2559) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2508) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2507) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2507) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1149) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1149) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1149) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2747) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2689) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2678) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2215) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:219) ... 
70 more Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:813) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1125) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1105) at org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74) at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:329) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:482) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:36) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:160) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
Yes, you can try - use it through an EMR cluster.
b
OK, I'll try to set it up and try the above - I'll get back to you by tomorrow (around 24h) - hope you have good news before then.
s
Yes please, thank you, it will be a great help for me.
Thanks for your help, but in my EMR 6.8.0 it's showing Hadoop 3.2.1 only.
j
Hi @setu suyagya, can you tell which hadoop-aws jar version you're using?
s
Hadoop 3.2.1
j
Can you try using these configs:
%spark.conf
spark.jars.packages io.delta:delta-core_2.12:2.1.0,org.apache.hadoop:hadoop-aws:3.2.1
spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.fs.s3a.endpoint https://relaxed-crow.lakefs-demo.io
spark.hadoop.fs.s3a.secret.key <secret>
spark.hadoop.fs.s3a.access.key <key>
spark.hadoop.fs.s3a.path.style.access true
?
s
Already tried,
it's not working
j
This is not the same as the configurations above. Mind that in the last configs the version of hadoop-aws is org.apache.hadoop:hadoop-aws:3.2.1 and not org.apache.hadoop:hadoop-aws:3.3.1. Did you try it with the 3.2.1 version?
s
Yes
I tried it again with the same configs as you sent
It's doing the same thing - it just creates the folder, not writing anything