setu suyagya (09/29/2022, 3:04 PM)

Barak Amar:
org.apache.hadoop.util.SemaphoredDelegatingExecutor
Barak Amar:
spark.hadoop.fs.s3.access...
without the 'bucket' parts

Barak Amar:
io.delta:delta-core_2.12:2.0.0,org.apache.hadoop:hadoop-aws:3.3.1

Barak Amar:
Delta Lake needs the org.apache.hadoop.fs.s3a.S3AFileSystem class from the hadoop-aws package, which implements Hadoop's FileSystem API for S3. Make sure the version of this package matches the Hadoop version with which Spark was built.
This is the part where I think our execution breaks while we use s3a.
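The version-matching point above can be sketched as a small pure-Python helper (the function name is mine, not from the thread): given the Hadoop version the cluster actually runs, it produces the coordinate to put in spark.jars.packages.

```python
def hadoop_aws_coordinate(hadoop_version: str) -> str:
    """Maven coordinate for hadoop-aws pinned to the cluster's Hadoop version.

    hadoop-aws must match the Hadoop libraries Spark was built with;
    a mismatch typically surfaces as NoSuchMethodError or
    ClassNotFoundException when S3AFileSystem is first used.
    """
    return f"org.apache.hadoop:hadoop-aws:{hadoop_version}"

# On a live PySpark session the cluster's Hadoop version can be read via
# spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
print(hadoop_aws_coordinate("3.2.1"))
# org.apache.hadoop:hadoop-aws:3.2.1
```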

setu suyagya (09/30/2022, 4:52 AM)

Barak Amar:
Release label: emr-6.8.0
Hadoop distribution: Amazon
Applications: Spark 3.3.0, Zeppelin 0.10.1

A new Zeppelin notebook with the following steps:

%spark.conf
spark.jars.packages io.delta:delta-core_2.12:2.1.0,org.apache.hadoop:hadoop-aws:3.3.1
spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.fs.s3a.endpoint https://relaxed-crow.lakefs-demo.io
spark.hadoop.fs.s3a.secret.key <secret>
spark.hadoop.fs.s3a.access.key <key>
spark.hadoop.fs.s3a.path.style.access true

Code that writes some data to lakeFS (but it can be any bucket):

%spark.pyspark
data = [("23",), ("3",), ("1977",)]
df = sc.parallelize(data).toDF()
df.write.format("delta").save("s3a://my-repo/main/tests/delta7")

And a screenshot from lakeFS after the write is completed.
Hope the above is helpful in getting you working with EMR/Delta/lakeFS.
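In the save path above, the lakeFS repository takes the position of the S3 bucket and the branch is the first path segment. A minimal sketch of that addressing scheme (the helper name is mine, not from the thread):

```python
def lakefs_s3a_uri(repo: str, branch: str, key: str) -> str:
    """Build an s3a:// URI for an object in lakeFS: the repository
    stands in for the bucket and the branch is the first path segment."""
    return f"s3a://{repo}/{branch}/{key.lstrip('/')}"

print(lakefs_s3a_uri("my-repo", "main", "tests/delta7"))
# s3a://my-repo/main/tests/delta7
```

Because the branch is just a path prefix, the same code can target any branch by changing one argument rather than the whole path.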
setu suyagya (10/03/2022, 10:29 AM)

Jonathan Rosenberg (10/03/2022, 10:37 AM):
Which hadoop-aws jar version are you using?

setu suyagya (10/03/2022, 11:44 AM)

Jonathan Rosenberg (10/03/2022, 11:59 AM):
%spark.conf
spark.jars.packages io.delta:delta-core_2.12:2.1.0,org.apache.hadoop:hadoop-aws:3.2.1
spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.fs.s3a.endpoint https://relaxed-crow.lakefs-demo.io
spark.hadoop.fs.s3a.secret.key <secret>
spark.hadoop.fs.s3a.access.key <key>
spark.hadoop.fs.s3a.path.style.access true
?

setu suyagya (10/03/2022, 1:04 PM)

Jonathan Rosenberg (10/03/2022, 1:14 PM):
hadoop-aws is org.apache.hadoop:hadoop-aws:3.2.1 and not org.apache.hadoop:hadoop-aws:3.3.1. Did you try it with the 3.2.1 version?
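The only change between Jonathan's config and the earlier one is the hadoop-aws version inside spark.jars.packages. A small sketch (the helper name is mine) of patching that one coordinate in a comma-separated package list:

```python
def set_hadoop_aws_version(packages: str, version: str) -> str:
    """Rewrite the hadoop-aws entry in a comma-separated
    group:artifact:version list, leaving other coordinates untouched."""
    out = []
    for coord in packages.split(","):
        group, artifact, ver = coord.split(":")
        if artifact == "hadoop-aws":
            ver = version
        out.append(f"{group}:{artifact}:{ver}")
    return ",".join(out)

pkgs = "io.delta:delta-core_2.12:2.1.0,org.apache.hadoop:hadoop-aws:3.3.1"
print(set_hadoop_aws_version(pkgs, "3.2.1"))
# io.delta:delta-core_2.12:2.1.0,org.apache.hadoop:hadoop-aws:3.2.1
```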
setu suyagya (10/03/2022, 2:52 PM)
setu suyagya (10/03/2022, 2:56 PM)