#help

Barak Amar

09/29/2022, 3:06 PM
Can you share your full stack trace as text here? I would like to see the initial error reported.

setu suyagya

09/29/2022, 3:11 PM
when I am running this command it is just creating a folder with the delta log, not with parquet files

Barak Amar

09/29/2022, 3:11 PM
I see that it fails to write the data using the s3a filesystem

setu suyagya

09/29/2022, 3:12 PM
yes, i guess

Barak Amar

09/29/2022, 3:12 PM
checking
3:13 PM
Did you add s3a or any other jar to your environment?

setu suyagya

09/29/2022, 3:16 PM
do I need to add anything else other than this?

Barak Amar

09/29/2022, 3:17 PM
I think you are fine as far as setting the right properties goes

setu suyagya

09/29/2022, 3:19 PM
Ya, i can't understand
3:19 PM
why it is failing

Barak Amar

09/29/2022, 3:20 PM
from the stack it looks like the version of the s3a implementation that comes from the aws hadoop package is not compatible with the Spark we are running
3:20 PM
this is one of the main reasons it fails to find a method it is trying to execute
3:21 PM
like in this issue
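A quick way to compare the versions involved, assuming spark is the live SparkSession in the notebook, is to print the Spark version and the Hadoop version that Spark was built with, and check them against the hadoop-aws jar being loaded - a diagnostic sketch:

# print the Spark version of the running session
print(spark.version)
# print the Hadoop version bundled with this Spark build (via the py4j gateway)
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())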

setu suyagya

09/29/2022, 3:21 PM
but when I am doing operations directly with a parquet file or csv file it is working fine

Barak Amar

09/29/2022, 3:22 PM
The write operation you perform uses
org.apache.hadoop.util.SemaphoredDelegatingExecutor
3:23 PM
Does the jars.packages in the screenshot above include jars for delta?
3:25 PM
it may explain why, when you write delta, it goes through this specific implementation - and this is the integration point where the two incompatible versions meet. This is just a thought.
3:25 PM
Can you list which jar/version you import?
3:39 PM
which delta-lake version do you import in "spark.jars.packages"?
3:41 PM
Can you try to write the same information using the s3a endpoint to a bucket in aws?

setu suyagya

09/29/2022, 3:41 PM
spark.jars.packages io.delta:delta-core_2.12:2.0.0rc1
3:42 PM
like how?

Barak Amar

09/29/2022, 3:43 PM
Instead of giving an address to lakefs bucket, write the same data to a bucket at aws s3 (regular bucket).
3:44 PM
it will verify that we can write a delta file using s3a in this env.
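As a rough sketch of that check (the bucket name and path are placeholders, and this assumes the delta package is already loaded in the session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
# write the same data as delta directly to a regular AWS bucket over s3a,
# bypassing lakeFS, to confirm the s3a + delta stack works in this environment
df.write.format("delta").mode("overwrite").save("s3a://some-regular-aws-bucket/delta-write-test/")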

setu suyagya

09/29/2022, 3:45 PM
ya it is working
3:45 PM
did you mean on my s3 without pointing to lakefs?

Barak Amar

09/29/2022, 3:46 PM
Yes, but use the 's3a://...' scheme
3:46 PM
it should go through the same implementation to write the data.
3:46 PM
it's not the same as 's3://'

setu suyagya

09/29/2022, 3:48 PM
there I don't need to give s3 or s3a, just the db name.table

Barak Amar

09/29/2022, 3:49 PM
It is just for testing the way we write using s3a
3:52 PM
I would like to know if we are getting the same error, to understand if it is related to the lakeFS s3 gateway.

setu suyagya

09/29/2022, 3:55 PM
it is working with s3
3:56 PM
it works when using s3 and gives an error with s3a
3:56 PM
same error

Barak Amar

09/29/2022, 3:57 PM
good - so it leaves lakefs out of the loop as we try to fix the s3a compatibility.
3:57 PM
one option we can test before trying this path - can you configure the s3 properties like you did for s3a?
3:57 PM
not sure it supports bucket level settings - so just for the test - apply it globally and try to write to lakefs using 's3://...'

setu suyagya

09/29/2022, 3:58 PM
I guess the same in the conf file, instead of s3a just use s3

Barak Amar

09/29/2022, 3:58 PM
spark.hadoop.fs.s3.access...
without the 'bucket' parts
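As a sketch of that global test, mirroring the s3a property names (the endpoint and keys are placeholders, and it is not certain the s3 filesystem on EMR reads these keys at all):

spark.hadoop.fs.s3.endpoint https://<your-lakefs-endpoint>
spark.hadoop.fs.s3.access.key <lakefs-access-key>
spark.hadoop.fs.s3.secret.key <lakefs-secret-key>
spark.hadoop.fs.s3.path.style.access true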
4:18 PM
Going over the delta docs - https://docs.delta.io/2.0.0/delta-storage.html#quickstart-s3-single-cluster - which use s3a in the examples, they also specify the hadoop-aws package version required. Example from the doc:
io.delta:delta-core_2.12:2.0.0,org.apache.hadoop:hadoop-aws:3.3.1
4:20 PM
Delta Lake needs the org.apache.hadoop.fs.s3a.S3AFileSystem class from the hadoop-aws package, which implements Hadoop's FileSystem API for S3. Make sure the version of this package matches the Hadoop version with which Spark was built.
This is the part where I think our execution breaks when we use s3a
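Put together, a session configured along the lines of that doc might look like the sketch below (the endpoint and keys are placeholders, and the hadoop-aws version has to match the Hadoop build your Spark actually ships with):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # delta plus hadoop-aws, with hadoop-aws matching Spark's Hadoop build
    .config("spark.jars.packages",
            "io.delta:delta-core_2.12:2.0.0,org.apache.hadoop:hadoop-aws:3.3.1")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # s3a pointed at the lakeFS S3 gateway
    .config("spark.hadoop.fs.s3a.endpoint", "https://<your-lakefs-endpoint>")
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs-access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs-secret-key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)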

setu suyagya

09/29/2022, 4:33 PM
it's still not working

Barak Amar

09/29/2022, 4:34 PM
can you paste the error using s3:// ?

setu suyagya

09/29/2022, 4:37 PM
Py4JJavaError: An error occurred while calling o115.parquet. : java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: TGXZRY0RB5RDFZPC; S3 Extended Request ID: IxDB0qz11fORj0XVzq6BuCUeDZsShfF9743De9geAa/hHhJocizW4t10x6Gp55u7FrWKSMv2aSY=; Proxy: null), S3 Extended Request ID: IxDB0qz11fORj0XVzq6BuCUeDZsShfF9743De9geAa/hHhJocizW4t10x6Gp55u7FrWKSMv2aSY= at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:423) at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.isFolderUsingFolderObject(Jets3tNativeFileSystemStore.java:249) at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.isFolder(Jets3tNativeFileSystemStore.java:212) at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:515) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1690) at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:399) at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:779) at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:777) at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372) at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659) at scala.util.Success.$anonfun$map$1(Try.scala:255) at scala.util.Success.map(Try.scala:213) at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402) at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: TGXZRY0RB5RDFZPC; S3 Extended Request ID: IxDB0qz11fORj0XVzq6BuCUeDZsShfF9743De9geAa/hHhJocizW4t10x6Gp55u7FrWKSMv2aSY=; Proxy: null), S3 Extended Request ID: IxDB0qz11fORj0XVzq6BuCUeDZsShfF9743De9geAa/hHhJocizW4t10x6Gp55u7FrWKSMv2aSY= at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1862) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1415) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1384) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779) at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5453) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5400) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5394) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.listObjectsV2(AmazonS3Client.java:971) at com.amazon.ws.emr.hadoop.fs.s3.lite.call.ListObjectsV2Call.perform(ListObjectsV2Call.java:26) at com.amazon.ws.emr.hadoop.fs.s3.lite.call.ListObjectsV2Call.perform(ListObjectsV2Call.java:12) at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor$CallPerformer.call(GlobalS3Executor.java:111) at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:138) at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:191) at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:186) at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.listObjectsV2(AmazonS3LiteClient.java:75) at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:414) ... 20 more (<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError('An error occurred while calling o115.parquet.\n', JavaObject id=o118), <traceback object at 0xffffae8b9d70>)
4:37 PM
when doing it with s3a in the conf, then it is working fine

Barak Amar

09/29/2022, 4:46 PM
From the stack trace it looks like it goes to AWS.
4:47 PM
Will it be possible to load hadoop-aws 3.3.1 as specified in the docs and write using s3a to lakefs or AWS?

setu suyagya

09/29/2022, 5:30 PM
still not working, tried changing the package and also with s3, s3a, but it still only creates the folder, not writing the df as a parquet file

Barak Amar

09/29/2022, 5:31 PM
It fails with the same (original) exception - missing method signature?
5:33 PM
I can try to set up a similar environment - if I follow this documentation https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-EC2-notebook.html will I have an environment like yours? Or is it a different aws service I need to look for?

setu suyagya

09/29/2022, 5:37 PM
yes same error
5:37 PM
Py4JJavaError: An error occurred while calling o259.save. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:252) at org.apache.spark.sql.delta.files.TransactionalWrite.$anonfun$writeFiles$3(TransactionalWrite.scala:342) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232) at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68) at org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles(TransactionalWrite.scala:297) at org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles$(TransactionalWrite.scala:245) at org.apache.spark.sql.delta.OptimisticTransaction.writeFiles(OptimisticTransaction.scala:100) at org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles(TransactionalWrite.scala:212) at org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles$(TransactionalWrite.scala:209) at org.apache.spark.sql.delta.OptimisticTransaction.writeFiles(OptimisticTransaction.scala:100) at org.apache.spark.sql.delta.commands.WriteIntoDelta.write(WriteIntoDelta.scala:249) at org.apache.spark.sql.delta.commands.WriteIntoDelta.$anonfun$run$1(WriteIntoDelta.scala:97) at org.apache.spark.sql.delta.commands.WriteIntoDelta.$anonfun$run$1$adapted(WriteIntoDelta.scala:90) at org.apache.spark.sql.delta.DeltaLog.withNewTransaction(DeltaLog.scala:224) at org.apache.spark.sql.delta.commands.WriteIntoDelta.run(WriteIntoDelta.scala:90) at org.apache.spark.sql.delta.sources.DeltaDataSource.createRelation(DeltaDataSource.scala:161) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:115) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232) at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135) at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:112) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:108) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:519) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:519) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:495) at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:108) at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:95) at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:93) at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:136) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:303) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 27.0 failed 4 times, most recent failure: Lost task 0.3 in stage 27.0 (TID 174) (ip-10-11-241-141.ec2.internal executor 3): java.lang.NoSuchMethodError: 
org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:813) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1125) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1105) at org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74) at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:329) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:482) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:36) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:160) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:146) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:291) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$16(FileFormatWriter.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:133) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1474) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2559) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2508) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2507) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2507) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1149) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1149) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1149) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2747) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2689) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2678) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2215) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:219) ... 
70 more Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:813) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1125) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1105) at org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74) at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:329) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:482) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:36) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:160) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
5:38 PM
yes, you can try to use it through an EMR cluster

Barak Amar

09/29/2022, 5:41 PM
ok, I'll try to set up and try the above - I'll get back to you by tomorrow (around 24h) - hope you have good news before then

setu suyagya

09/30/2022, 4:52 AM
Yes please, thank you, it will be a great help for me
10:29 AM
Thanks for your help, but in my emr 6.8.0 it's showing hadoop 3.2.1 only

Jonathan Rosenberg

10/03/2022, 10:37 AM
Hi @setu suyagya, can you tell which hadoop-aws jar version you're using?

setu suyagya

10/03/2022, 11:44 AM
Hadoop 3.2.1

Jonathan Rosenberg

10/03/2022, 11:59 AM
Can you try using these configs:
%spark.conf
spark.jars.packages io.delta:delta-core_2.12:2.1.0,org.apache.hadoop:hadoop-aws:3.2.1
spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.fs.s3a.endpoint https://relaxed-crow.lakefs-demo.io
spark.hadoop.fs.s3a.secret.key <secret>
spark.hadoop.fs.s3a.access.key <key>
spark.hadoop.fs.s3a.path.style.access true
?
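A minimal way to exercise those configs would be a small round trip against a lakeFS path (the repository and branch names are placeholders):

# write and read back a tiny delta table through the lakeFS s3a endpoint
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("delta").mode("overwrite").save("s3a://example-repo/main/delta-test/")
spark.read.format("delta").load("s3a://example-repo/main/delta-test/").show()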

setu suyagya

10/03/2022, 1:04 PM
already tried
1:04 PM
it's not working

Jonathan Rosenberg

10/03/2022, 1:14 PM
This is not the same as the configurations above. Mind that in the last configs the version of hadoop-aws is org.apache.hadoop:hadoop-aws:3.2.1 and not org.apache.hadoop:hadoop-aws:3.3.1. Did you try it with the 3.2.1 version?

setu suyagya

10/03/2022, 2:52 PM
yes
2:52 PM
I tried it again with the same as you sent
2:56 PM
it's just doing the same, it just creates the folder, not writing anything