Cristian Caloian
10/28/2022, 10:10 AM
https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.12.178/aws-java-sdk-1.12.178.jar
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
https://repo1.maven.org/maven2/io/lakefs/hadoop-lakefs-assembly/0.1.8/hadoop-lakefs-assembly-0.1.8.jar
https://repo1.maven.org/maven2/io/lakefs/lakefs-spark-client-301_2.12/0.5.1/lakefs-spark-client-301_2.12-0.5.1.jar
I am running the following command
spark-submit --class io.treeverse.clients.GarbageCollector \
--packages org.apache.hadoop:hadoop-aws:3.3.1 \
-c spark.hadoop.lakefs.api.url="https://<my-lakefs-api-url>/api/v1" \
-c spark.hadoop.lakefs.api.access_key="<my-lfs-access-key>" \
-c spark.hadoop.lakefs.api.secret_key="<my-lfs-secret-key>" \
-c spark.hadoop.fs.s3a.access.key="<my-s3-access-key>" \
-c spark.hadoop.fs.s3a.secret.key="<my-s3-secret-key>" \
$SPARK_HOME/jars/lakefs-spark-client-301_2.12-0.5.1.jar \
<my-repo> eu-west-1
and I get the following error:
Exception in thread "main" com.google.common.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: dev/failsafe/function/CheckedSupplier
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
at io.treeverse.clients.ApiClient$.get(ApiClient.scala:43)
at io.treeverse.clients.GarbageCollector$.main(GarbageCollector.scala:273)
at io.treeverse.clients.GarbageCollector.main(GarbageCollector.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Idan Novogroder
10/28/2022, 10:18 AM

Yoni Augarten
10/28/2022, 11:45 AM
s3://treeverse-clients-us-east/lakefs-spark-client-301/0.5.1/lakefs-spark-client-301-assembly-0.5.1.jar
Jonathan Rosenberg
10/28/2022, 11:45 AM
What do you have in your $SPARK_HOME/jars directory? We can start with validating that the hadoop-aws version matches the Hadoop versions in your jars dir…
Cristian Caloian
10/28/2022, 1:01 PM
These are the Hadoop-related jars in $SPARK_HOME/jars:
avro-mapred-1.8.2-hadoop2.jar
dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar
hadoop-annotations-3.2.0.jar
hadoop-auth-3.2.0.jar
hadoop-aws-3.3.1.jar
hadoop-client-3.2.0.jar
hadoop-common-3.2.0.jar
hadoop-hdfs-client-3.2.0.jar
hadoop-lakefs-assembly-0.1.8.jar
hadoop-mapreduce-client-common-3.2.0.jar
hadoop-mapreduce-client-core-3.2.0.jar
hadoop-mapreduce-client-jobclient-3.2.0.jar
hadoop-yarn-api-3.2.0.jar
hadoop-yarn-client-3.2.0.jar
hadoop-yarn-common-3.2.0.jar
hadoop-yarn-registry-3.2.0.jar
hadoop-yarn-server-common-3.2.0.jar
hadoop-yarn-server-web-proxy-3.2.0.jar
parquet-hadoop-1.10.1.jar
Also, all the lakeFS-related jars:
hadoop-lakefs-assembly-0.1.8.jar
lakefs-spark-client-301_2.12-0.5.1.jar
I also tried running the spark-submit command above, but using the uber-jar instead. I am getting the same error.
This might be complete nonsense, but when I looked at the GarbageCollector.scala file and tried running all of its imports, I noticed that import org.json4s.native.JsonMethods._ fails:
scala> import org.json4s.native.JsonMethods._
<console>:23: error: object native is not a member of package org.json4s
import org.json4s.native.JsonMethods._
Tal Sofer
10/28/2022, 4:52 PM
Can you try changing --packages org.apache.hadoop:hadoop-aws:3.3.1 into --packages org.apache.hadoop:hadoop-aws:3.2.0?
Your spark-submit command will look as follows:
spark-submit --class io.treeverse.clients.GarbageCollector \
--packages org.apache.hadoop:hadoop-aws:3.2.0 \
-c spark.hadoop.lakefs.api.url="https://<my-lakefs-api-url>/api/v1" \
-c spark.hadoop.lakefs.api.access_key="<my-lfs-access-key>" \
-c spark.hadoop.lakefs.api.secret_key="<my-lfs-secret-key>" \
-c spark.hadoop.fs.s3a.access.key="<my-s3-access-key>" \
-c spark.hadoop.fs.s3a.secret.key="<my-s3-secret-key>" \
PATH_TO_UBER_JAR \
<my-repo> eu-west-1
And when you say that you are getting the same error, are you getting the exact same stack trace or just the same exception? If you can share the stack trace, that would be great 🙂
Cristian Caloian
10/31/2022, 9:57 AM
(base) jovyan@lfs-client-test-0:~$ spark-submit --class io.treeverse.clients.GarbageCollector \
> --packages org.apache.hadoop:hadoop-aws:3.2.0 \
> -c spark.hadoop.lakefs.api.url="https://<my-lakefs-api-url>/api/v1" \
> -c spark.hadoop.lakefs.api.access_key="<my-lfs-access-key>" \
> -c spark.hadoop.lakefs.api.secret_key="<my-lfs-secret-key>" \
> -c spark.hadoop.fs.s3a.access.key="<my-s3-access-key>" \
> -c spark.hadoop.fs.s3a.secret.key="<my-s3-secret-key>" \
> -c spark.executor.extraLibraryPath=/home/jovyan/jars/lakefs-spark-client-301-assembly-0.5.1.jar \
> -c spark.executor.extraLibraryPath=/home/jovyan/jars/hadoop-aws-3.2.0.jar \
> /home/jovyan/jars/lakefs-spark-client-301-assembly-0.5.1.jar \
> <my-repo> eu-west-1
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark-3.0.1-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Ivy Default Cache set to: /home/jovyan/.ivy2/cache
The jars for the packages stored in: /home/jovyan/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/spark-3.0.1-bin-hadoop3.2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-d996c5e0-6b56-43e8-8eff-3c80a1970dd0;1.0
confs: [default]
found org.apache.hadoop#hadoop-aws;3.2.0 in central
found com.amazonaws#aws-java-sdk-bundle;1.11.375 in central
:: resolution report :: resolve 139ms :: artifacts dl 3ms
:: modules in use:
com.amazonaws#aws-java-sdk-bundle;1.11.375 from central in [default]
org.apache.hadoop#hadoop-aws;3.2.0 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 2 | 0 | 0 | 0 || 2 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-d996c5e0-6b56-43e8-8eff-3c80a1970dd0
confs: [default]
0 artifacts copied, 2 already retrieved (0kB/4ms)
22/10/31 08:53:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/10/31 08:53:13 INFO SparkContext: Running Spark version 3.0.1
22/10/31 08:53:13 INFO ResourceUtils: ==============================================================
22/10/31 08:53:13 INFO ResourceUtils: Resources for spark.driver:
22/10/31 08:53:13 INFO ResourceUtils: ==============================================================
22/10/31 08:53:13 INFO SparkContext: Submitted application: GarbageCollector
22/10/31 08:53:13 INFO SecurityManager: Changing view acls to: jovyan
22/10/31 08:53:13 INFO SecurityManager: Changing modify acls to: jovyan
22/10/31 08:53:13 INFO SecurityManager: Changing view acls groups to:
22/10/31 08:53:13 INFO SecurityManager: Changing modify acls groups to:
22/10/31 08:53:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jovyan); groups with view permissions: Set(); users with modify permissions: Set(jovyan); groups with modify permissions: Set()
22/10/31 08:53:14 INFO Utils: Successfully started service 'sparkDriver' on port 40379.
22/10/31 08:53:14 INFO SparkEnv: Registering MapOutputTracker
22/10/31 08:53:14 INFO SparkEnv: Registering BlockManagerMaster
22/10/31 08:53:14 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/10/31 08:53:14 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/10/31 08:53:14 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/10/31 08:53:14 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-54820e8e-476c-420d-873a-00c02cfac502
22/10/31 08:53:14 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB
22/10/31 08:53:14 INFO SparkEnv: Registering OutputCommitCoordinator
22/10/31 08:53:14 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/10/31 08:53:14 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://lfs-client-test-0:4040
22/10/31 08:53:14 INFO SparkContext: Added JAR file:///home/jovyan/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.2.0.jar at spark://lfs-client-test-0:40379/jars/org.apache.hadoop_hadoop-aws-3.2.0.jar with timestamp 1667206394368
22/10/31 08:53:14 INFO SparkContext: Added JAR file:///home/jovyan/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.375.jar at spark://lfs-client-test-0:40379/jars/com.amazonaws_aws-java-sdk-bundle-1.11.375.jar with timestamp 1667206394368
22/10/31 08:53:14 INFO SparkContext: Added JAR file:/home/jovyan/jars/lakefs-spark-client-301-assembly-0.5.1.jar at spark://lfs-client-test-0:40379/jars/lakefs-spark-client-301-assembly-0.5.1.jar with timestamp 1667206394369
22/10/31 08:53:14 INFO Executor: Starting executor ID driver on host lfs-client-test-0
22/10/31 08:53:14 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34953.
22/10/31 08:53:14 INFO NettyBlockTransferService: Server created on lfs-client-test-0:34953
22/10/31 08:53:14 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/10/31 08:53:14 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, lfs-client-test-0, 34953, None)
22/10/31 08:53:14 INFO BlockManagerMasterEndpoint: Registering block manager lfs-client-test-0:34953 with 434.4 MiB RAM, BlockManagerId(driver, lfs-client-test-0, 34953, None)
22/10/31 08:53:14 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, lfs-client-test-0, 34953, None)
22/10/31 08:53:14 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, lfs-client-test-0, 34953, None)
Exception in thread "main" com.google.common.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: dev/failsafe/function/CheckedSupplier
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
at io.treeverse.clients.ApiClient$.get(ApiClient.scala:43)
at io.treeverse.clients.GarbageCollector$.main(GarbageCollector.scala:273)
at io.treeverse.clients.GarbageCollector.main(GarbageCollector.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NoClassDefFoundError: dev/failsafe/function/CheckedSupplier
at io.treeverse.clients.ApiClient$$anon$1.call(ApiClient.scala:44)
at io.treeverse.clients.ApiClient$$anon$1.call(ApiClient.scala:43)
at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
... 17 more
Caused by: java.lang.ClassNotFoundException: dev.failsafe.function.CheckedSupplier
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 24 more
22/10/31 08:53:14 INFO SparkContext: Invoking stop() from shutdown hook
22/10/31 08:53:14 INFO SparkUI: Stopped Spark web UI at http://lfs-client-test-0:4040
22/10/31 08:53:14 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/10/31 08:53:14 INFO MemoryStore: MemoryStore cleared
22/10/31 08:53:14 INFO BlockManager: BlockManager stopped
22/10/31 08:53:14 INFO BlockManagerMaster: BlockManagerMaster stopped
22/10/31 08:53:14 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/10/31 08:53:14 INFO SparkContext: Successfully stopped SparkContext
22/10/31 08:53:14 INFO ShutdownHookManager: Shutdown hook called
22/10/31 08:53:14 INFO ShutdownHookManager: Deleting directory /tmp/spark-68224eda-9da1-4aa4-b316-dec0b0003d9d
22/10/31 08:53:14 INFO ShutdownHookManager: Deleting directory /tmp/spark-fc8a6690-aa83-4347-b3da-7cc9d048b591
Or Tzabary
10/31/2022, 10:52 AM

Jonathan Rosenberg
10/31/2022, 12:05 PM
Can you remove lakefs-spark-client-301_2.12-0.5.1.jar from the $SPARK_HOME/jars directory, and run it again only with the assembly jar? (The assembly jar contains a shaded version of com.google.common.*, which means that you probably ran the non-assembly jar.)
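For reference, a quick way to check which of the two jars actually bundles the class the JVM cannot find (a sketch; the jar names are the ones already in the thread, and in an assembly jar the class may also appear under a shaded, relocated package name):

# Look for the missing failsafe class inside each candidate jar.
for j in lakefs-spark-client-301-assembly-0.5.1.jar lakefs-spark-client-301_2.12-0.5.1.jar; do
  echo "== $j"
  jar tf "$SPARK_HOME/jars/$j" | grep -i 'CheckedSupplier' || echo "   (not found)"
done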
Cristian Caloian
11/01/2022, 8:45 AM
In $SPARK_HOME/jars I have both:
(base) jovyan@lfs-client-test-1-0:~$ ls $SPARK_HOME/jars | grep lakefs
hadoop-lakefs-assembly-0.1.8.jar
lakefs-spark-client-301-assembly-0.5.1.jar
Is that the correct setup? Or should only lakefs-spark-client-301-assembly-0.5.1.jar be enough?
Jonathan Rosenberg
11/01/2022, 9:25 AM
You can remove hadoop-lakefs-assembly-0.1.8.jar. It’s not necessary for the GC.
A question: did you create a lakeFS user other than the admin, and are you using its credentials as the lakeFS key and secret input for the GC process?
Also, it would be very helpful if you could attach the logs from the lakeFS server after running GC.
Thanks!
Cristian Caloian
11/01/2022, 9:37 AMYou can removeGot it! Would I need it if I want to use the Spark client to work with files on LakeFS?. It’s not necessary for the GC.hadoop-lakefs-assembly-0.1.8.jar
did you create a lakeFS user other than the admin, and are using its credentials as the lakeFS key and secret input for the GC process?I am using my user credentials, which also has admin rights. It also has policies attached for the repo I am trying to run the GC.
it would be very helpful if you’ll attach the logs from the lakeFS server after running GC.Working on it. Let me get back to you later.
Jonathan Rosenberg
11/01/2022, 9:45 AM
> Got it! Would I need it if I want to use the Spark client to work with files on lakeFS?
You can work with Spark and lakeFS without the lakeFS Hadoop Filesystem jar by using lakeFS’s S3 gateway, but from a performance point of view it would be better to use the lakeFS Hadoop Filesystem.
> I am using my user credentials, which also have admin rights. My user also has policies attached for the repo I am trying to run the GC on.
> Working on it. Let me get back to you later.
Great!
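For context, a rough sketch of the two access routes being compared here; repo, branch, endpoint and key values are placeholders, and the exact property names should be double-checked against the lakeFS docs:

# Route 1: lakeFS S3 gateway - plain s3a:// pointed at the lakeFS server,
# authenticated with lakeFS credentials; needs only hadoop-aws on the classpath.
spark-shell \
  --conf spark.hadoop.fs.s3a.endpoint="https://<my-lakefs-url>" \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.access.key="<my-lfs-access-key>" \
  --conf spark.hadoop.fs.s3a.secret.key="<my-lfs-secret-key>"
# then: spark.read.csv("s3a://<repo>/<branch>/path/to/file.csv")

# Route 2: lakeFS Hadoop Filesystem - metadata goes through the lakeFS API while
# object data is read from S3 directly, which is why it tends to perform better.
spark-shell \
  --packages io.lakefs:hadoop-lakefs-assembly:0.1.8 \
  --conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem \
  --conf spark.hadoop.fs.lakefs.endpoint="https://<my-lakefs-url>/api/v1" \
  --conf spark.hadoop.fs.lakefs.access.key="<my-lfs-access-key>" \
  --conf spark.hadoop.fs.lakefs.secret.key="<my-lfs-secret-key>" \
  --conf spark.hadoop.fs.s3a.access.key="<my-s3-access-key>" \
  --conf spark.hadoop.fs.s3a.secret.key="<my-s3-secret-key>"
# then: spark.read.csv("lakefs://<repo>/<branch>/path/to/file.csv")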
Cristian Caloian
11/04/2022, 10:38 AM
I am using this Spark distribution: https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
Regarding the dependencies, do you have an explanation why both of these are needed? It actually doesn’t work otherwise. Not that it bothers me, but it looks a bit strange, especially the versions 301 vs 312. I have attached the full list.
hadoop-lakefs-assembly-0.1.8.jar
lakefs-spark-client-301-assembly-0.5.1.jar
lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar
The command to run the GC
spark-submit --class io.treeverse.clients.GarbageCollector \
-c spark.hadoop.lakefs.api.url="******" \
-c spark.hadoop.lakefs.api.access_key="******" \
-c spark.hadoop.lakefs.api.secret_key="******" \
-c spark.hadoop.fs.s3a.access.key="******" \
-c spark.hadoop.fs.s3a.secret.key="******" \
$SPARK_HOME/jars/lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar <repo-name> eu-west-1
The output (stdout) - not much info here, because I think the GC policy wasn’t set up properly, but I will test different configurations over the next few days.
gcCommitsLocation: s3a://<bucket-name>/data//_lakefs/retention/gc/commits/run_id=981b2004-d427-4038-8d2f-20f100f73eab/commits.csv
gcAddressesLocation: s3a://<bucket-name>/data//_lakefs/retention/gc/addresses/
apiURL: https://<lakefs-url>/api/v1
getAddressesToDelete: use 50 partitions for ranges
getAddressesToDelete: use 200 partitions for addresses
Expired addresses:
+-------+-------+
|address|mark_id|
+-------+-------+
+-------+-------+
addressDFLocation: s3a://<bucket-name>/data//_lakefs/retention/gc/addresses/
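As a hedged aside on the GC policy: garbage-collection rules are what make objects eligible for deletion, so with no rules configured an empty “Expired addresses” table like the one above is expected. A minimal sketch of a rules file (field names as I understand the lakeFS rules format; branch names and retention values are placeholders, and the rules are applied per repository via lakectl, the UI, or the API):

# Objects only expire once they are older than the retention window
# for their branch (or the default).
cat > gc-rules.json <<'EOF'
{
  "default_retention_days": 7,
  "branches": [
    { "branch_id": "main", "retention_days": 30 }
  ]
}
EOF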
And the logs (stderr) - a very long file with about 13k lines. There are no errors, but I want to share the warnings just to check with you if there is anything I need to take care of:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark-3.0.1-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/11/03 17:05:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/11/03 17:06:00 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
22/11/03 17:06:14 WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
22/11/03 17:06:15 WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
> You can work with Spark and lakeFS without the lakeFS Hadoop Filesystem jar by using lakeFS’s S3 gateway, but from a performance point of view it would be better to use the lakeFS Hadoop Filesystem.
I have already set up the S3 gateway, but I thought it’s better to have the Hadoop Filesystem set up and create some boilerplate for our users.
Or Tzabary
11/04/2022, 10:46 AM

Cristian Caloian
11/04/2022, 11:09 AM
I can read the data and df.show() works. However, when I start the spark-shell I get the following error message:
:: problems summary ::
:::: ERRORS
Server access error at url https://dl.bintray.com/spark-packages/maven/io/lakefs/lakefs-parent/0/lakefs-parent-0.jar (javax.net.ssl.SSLHandshakeException: Remote host terminated the handshake)
And when I try to read the file I get:
scala> val df = spark.read.option("delimiter", ",").option("header", "true").csv(dataPath)
22/11/04 10:48:19 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
df: org.apache.spark.sql.DataFrame = [sepal.length: string, sepal.width: string ... 3 more fields]
Again, it seems that the data is successfully read
scala> df.show()
+------------+-----------+------------+-----------+-------+
|sepal.length|sepal.width|petal.length|petal.width|variety|
+------------+-----------+------------+-----------+-------+
| 5.1| 3.5| 1.4| .2| Setosa|
| 4.9| 3| 1.4| .2| Setosa|
...
Do you have a suggestion on how to fix this? The error and the warning, that is.
Or Tzabary
11/04/2022, 11:28 AM

Cristian Caloian
11/04/2022, 12:12 PM
spark-shell --conf spark.hadoop.fs.s3a.access.key='******' \
--conf spark.hadoop.fs.s3a.secret.key='******' \
--conf spark.hadoop.fs.s3a.endpoint='https://s3.eu-west-1.amazonaws.com' \
--conf spark.hadoop.fs.lakefs.impl='io.lakefs.LakeFSFileSystem' \
--conf spark.hadoop.fs.lakefs.access.key='******' \
--conf spark.hadoop.fs.lakefs.secret.key='******' \
--conf spark.hadoop.fs.lakefs.endpoint='******' \
--packages io.lakefs:hadoop-lakefs-assembly:0.1.8
For the image, I am using an existing jupyter image, just so I can experiment quickly and because it had Spark installed, but I will create separate images, one for running the GC and one for the LakeFS Spark client.
Do you have a suggestion on how I can fix the bintray issue? Do I need to install some dependency?
I sincerely apologize for all these questions, but I am new to the Spark/Scala world.
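One hedged way around the dl.bintray.com failure during --packages resolution: download the hadoop-lakefs assembly jar once from Maven Central (the URL appears at the top of this thread) and pass it with --jars, so Ivy never needs to contact the defunct spark-packages repository. The local path below is only an example:

# Reference a locally downloaded jar instead of resolving it with --packages.
spark-shell \
  --jars /home/jovyan/jars/hadoop-lakefs-assembly-0.1.8.jar \
  --conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem \
  --conf spark.hadoop.fs.lakefs.endpoint='<my-lakefs-endpoint>' \
  --conf spark.hadoop.fs.lakefs.access.key='******' \
  --conf spark.hadoop.fs.lakefs.secret.key='******' \
  --conf spark.hadoop.fs.s3a.access.key='******' \
  --conf spark.hadoop.fs.s3a.secret.key='******'
# then read as before, e.g. spark.read.option("header", "true").csv("lakefs://<repo>/<branch>/path/to/file.csv")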
Or Tzabary
11/04/2022, 12:15 PM

Cristian Caloian
11/04/2022, 12:46 PM

Or Tzabary
11/04/2022, 1:32 PM
I just tried the spark-shell command you sent without any errors.
Cristian Caloian
11/04/2022, 2:47 PM

Tal Sofer
11/06/2022, 6:17 AM
> Regarding the dependencies, do you have an explanation why both of these are needed? It actually doesn’t work otherwise. Not that it bothers me, but it looks a bit strange, especially the versions 301 vs 312. I have attached the full list.
> hadoop-lakefs-assembly-0.1.8.jar
> lakefs-spark-client-301-assembly-0.5.1.jar
> lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar
What do you mean? Do you mean that you must have both lakefs-spark-client-301-assembly-0.5.1.jar and lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar in your jars folder for GC to work?
Cristian Caloian
11/07/2022, 7:35 AM
I tried removing lakefs-spark-client-301-assembly-0.5.1.jar, as I felt like it must be redundant, but it didn’t work afterwards. I then thought it must have to do with the fact that I am using Spark 3.0.1. But then I tried with Spark 3.1.2 and only lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar, and it is still not working. Let me double check just to be 💯. Do you have a recommendation for what versions (Spark and jar) I should be using?
Tal Sofer
11/08/2022, 7:55 AM
lakefs-spark-client-301-assembly-0.5.1.jar indeed should be redundant. If you share with us the error you are getting as a result of removing it, we will be happy to help troubleshoot!
> Do you have a recommendation for what versions (Spark and jar) I should be using?
Since you are using Hadoop 3, I would recommend using this version of the jar: lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar. It is validated with Spark 3.1.2, but it may work with other Spark versions, as I think it did for you. Running it with Spark 3.1.2 would be best if that works for you 🙂
Cristian Caloian
11/09/2022, 7:10 AM
It works now with only these two jars:
hadoop-lakefs-assembly-0.1.8.jar
lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar
I must’ve gotten confused while trying various version combinations. Sorry for the false alarm, and thank you for the support 🙏
With the dependencies nailed down, I now need to figure out how to create an image that I can use for running a Kubernetes job of kind SparkApplication. I might need to ask more questions here if I run into trouble.
Tal Sofer
11/09/2022, 7:16 AM

Cristian Caloian
11/09/2022, 7:45 AM

Tal Sofer
11/09/2022, 8:45 AM