# help
c
Hi, I am trying to run the garbage collector following the instructions in the docs and unfortunately I cannot make progress. Please note that I am not a Scala developer, so it is very likely that I made some stupid mistakes in my setup. I'd kindly appreciate any help 🙌 I am using Spark 3.0.1 and I installed the following dependencies
Copy code
https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.12.178/aws-java-sdk-1.12.178.jar
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
https://repo1.maven.org/maven2/io/lakefs/hadoop-lakefs-assembly/0.1.8/hadoop-lakefs-assembly-0.1.8.jar
https://repo1.maven.org/maven2/io/lakefs/lakefs-spark-client-301_2.12/0.5.1/lakefs-spark-client-301_2.12-0.5.1.jar
I am running the following command
Copy code
spark-submit --class io.treeverse.clients.GarbageCollector \
  --packages org.apache.hadoop:hadoop-aws:3.3.1 \
  -c spark.hadoop.lakefs.api.url="https://<my-lakefs-api-url>/api/v1"  \
  -c spark.hadoop.lakefs.api.access_key="<my-lfs-access-key>" \
  -c spark.hadoop.lakefs.api.secret_key="<my-lfs-secret-key>" \
  -c spark.hadoop.fs.s3a.access.key="<my-s3-access-key>" \
  -c spark.hadoop.fs.s3a.secret.key="<my-s3-secret-key>" \
  $SPARK_HOME/jars/lakefs-spark-client-301_2.12-0.5.1.jar \
  <my-repo> eu-west-1
and I get the following error
Copy code
Exception in thread "main" com.google.common.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: dev/failsafe/function/CheckedSupplier
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
	at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
	at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
	at io.treeverse.clients.ApiClient$.get(ApiClient.scala:43)
	at io.treeverse.clients.GarbageCollector$.main(GarbageCollector.scala:273)
	at io.treeverse.clients.GarbageCollector.main(GarbageCollector.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Does anyone have a Docker image with all the setup for running GC?
i
Hi Cristian, I'm trying to look at the logs and find the issue. Hope to come up with a good answer by the end of the day.
y
Hey @Cristian Caloian, transitive dependencies in the world of Hadoop can prove tricky. I suggest using our uber-jar which contains all the dependencies required for garbage collection. You can find it in our public S3 bucket, here:
Copy code
s3://treeverse-clients-us-east/lakefs-spark-client-301/0.5.1/lakefs-spark-client-301-assembly-0.5.1.jar
👍 2
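(For reference, a minimal sketch of pulling that uber-jar down locally with the AWS CLI - assuming the CLI is installed; since the bucket is public, an unsigned request should work:)
```
# Sketch: fetch the GC uber-jar from the public treeverse bucket (unsigned request,
# since the bucket allows anonymous reads); adjust the destination path as needed.
aws s3 cp --no-sign-request \
  s3://treeverse-clients-us-east/lakefs-spark-client-301/0.5.1/lakefs-spark-client-301-assembly-0.5.1.jar \
  ./lakefs-spark-client-301-assembly-0.5.1.jar
```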
j
Hi @Cristian Caloian 🙂 Can you list the Hadoop dependencies you have in your Spark jars directory ($SPARK_HOME/jars)? We can start with validating that the hadoop-aws version matches the Hadoop versions in your jars dir…
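(For anyone following along, one quick way to produce that list - a trivial sketch, assuming $SPARK_HOME is set in the shell:)
```
# List the Hadoop-related jars bundled with the Spark distribution
ls "$SPARK_HOME/jars" | grep -i hadoop
```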
c
Thank you all for your replies 🙌 Below are all the Hadoop-related jars in my $SPARK_HOME/jars
Copy code
avro-mapred-1.8.2-hadoop2.jar
dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar
hadoop-annotations-3.2.0.jar
hadoop-auth-3.2.0.jar
hadoop-aws-3.3.1.jar
hadoop-client-3.2.0.jar
hadoop-common-3.2.0.jar
hadoop-hdfs-client-3.2.0.jar
hadoop-lakefs-assembly-0.1.8.jar
hadoop-mapreduce-client-common-3.2.0.jar
hadoop-mapreduce-client-core-3.2.0.jar
hadoop-mapreduce-client-jobclient-3.2.0.jar
hadoop-yarn-api-3.2.0.jar
hadoop-yarn-client-3.2.0.jar
hadoop-yarn-common-3.2.0.jar
hadoop-yarn-registry-3.2.0.jar
hadoop-yarn-server-common-3.2.0.jar
hadoop-yarn-server-web-proxy-3.2.0.jar
parquet-hadoop-1.10.1.jar
and all the lakeFS-related jars
Copy code
hadoop-lakefs-assembly-0.1.8.jar
lakefs-spark-client-301_2.12-0.5.1.jar
I will try now with the uber-jar suggested by @Yoni Augarten
I tried the spark-submit command above but using the uber-jar instead. I am getting the same error. This might be complete nonsense, but when I look at the GarbageCollector.scala file and try to run all its imports, I noticed that import org.json4s.native.JsonMethods._ fails:
Copy code
scala> import org.json4s.native.JsonMethods._
<console>:23: error: object native is not a member of package org.json4s
       import org.json4s.native.JsonMethods._
t
Thanks @Cristian Caloian for using the uber-jar and sharing the dependency list! Can you please try to also change this line --packages org.apache.hadoop:hadoop-aws:3.3.1 into --packages org.apache.hadoop:hadoop-aws:3.2.0? Your spark-submit command will look as follows:
Copy code
spark-submit --class io.treeverse.clients.GarbageCollector \
  --packages org.apache.hadoop:hadoop-aws:3.2.0 \
  -c spark.hadoop.lakefs.api.url="https://<my-lakefs-api-url>/api/v1"  \
  -c spark.hadoop.lakefs.api.access_key="<my-lfs-access-key>" \
  -c spark.hadoop.lakefs.api.secret_key="<my-lfs-secret-key>" \
  -c spark.hadoop.fs.s3a.access.key="<my-s3-access-key>" \
  -c spark.hadoop.fs.s3a.secret.key="<my-s3-secret-key>" \
 PATH_TO_UBER_JAR \
  <my-repo> eu-west-1
And when you say that you are getting the same error, are you getting the exact same stack trace or just the same exception? If you can share the stack trace, that would be great 🙂
c
Hi @Tal Sofer thank you for the help! I tried your suggestion, but I get the same exception and stack trace. See the long output below:
Copy code
(base) jovyan@lfs-client-test-0:~$ spark-submit --class io.treeverse.clients.GarbageCollector \
>   --packages org.apache.hadoop:hadoop-aws:3.2.0 \
>   -c spark.hadoop.lakefs.api.url="https://<my-lakefs-api-url>/api/v1"  \
>   -c spark.hadoop.lakefs.api.access_key="<my-lfs-access-key>" \
>   -c spark.hadoop.lakefs.api.secret_key="<my-lfs-secret-key>" \
>   -c spark.hadoop.fs.s3a.access.key="<my-s3-access-key>" \
>   -c spark.hadoop.fs.s3a.secret.key="<my-s3-secret-key>" \
>   -c spark.executor.extraLibraryPath=/home/jovyan/jars/lakefs-spark-client-301-assembly-0.5.1.jar \
>   -c spark.executor.extraLibraryPath=/home/jovyan/jars/hadoop-aws-3.2.0.jar \
>   /home/jovyan/jars/lakefs-spark-client-301-assembly-0.5.1.jar \
>   <my-repo> eu-west-1
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark-3.0.1-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Ivy Default Cache set to: /home/jovyan/.ivy2/cache
The jars for the packages stored in: /home/jovyan/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/spark-3.0.1-bin-hadoop3.2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-d996c5e0-6b56-43e8-8eff-3c80a1970dd0;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.2.0 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.375 in central
:: resolution report :: resolve 139ms :: artifacts dl 3ms
	:: modules in use:
	com.amazonaws#aws-java-sdk-bundle;1.11.375 from central in [default]
	org.apache.hadoop#hadoop-aws;3.2.0 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-d996c5e0-6b56-43e8-8eff-3c80a1970dd0
	confs: [default]
	0 artifacts copied, 2 already retrieved (0kB/4ms)
22/10/31 08:53:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/10/31 08:53:13 INFO SparkContext: Running Spark version 3.0.1
22/10/31 08:53:13 INFO ResourceUtils: ==============================================================
22/10/31 08:53:13 INFO ResourceUtils: Resources for spark.driver:

22/10/31 08:53:13 INFO ResourceUtils: ==============================================================
22/10/31 08:53:13 INFO SparkContext: Submitted application: GarbageCollector
22/10/31 08:53:13 INFO SecurityManager: Changing view acls to: jovyan
22/10/31 08:53:13 INFO SecurityManager: Changing modify acls to: jovyan
22/10/31 08:53:13 INFO SecurityManager: Changing view acls groups to:
22/10/31 08:53:13 INFO SecurityManager: Changing modify acls groups to:
22/10/31 08:53:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(jovyan); groups with view permissions: Set(); users  with modify permissions: Set(jovyan); groups with modify permissions: Set()
22/10/31 08:53:14 INFO Utils: Successfully started service 'sparkDriver' on port 40379.
22/10/31 08:53:14 INFO SparkEnv: Registering MapOutputTracker
22/10/31 08:53:14 INFO SparkEnv: Registering BlockManagerMaster
22/10/31 08:53:14 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/10/31 08:53:14 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/10/31 08:53:14 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/10/31 08:53:14 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-54820e8e-476c-420d-873a-00c02cfac502
22/10/31 08:53:14 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB
22/10/31 08:53:14 INFO SparkEnv: Registering OutputCommitCoordinator
22/10/31 08:53:14 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/10/31 08:53:14 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at <http://lfs-client-test-0:4040>
22/10/31 08:53:14 INFO SparkContext: Added JAR file:///home/jovyan/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.2.0.jar at <spark://lfs-client-test-0:40379/jars/org.apache.hadoop_hadoop-aws-3.2.0.jar> with timestamp 1667206394368
22/10/31 08:53:14 INFO SparkContext: Added JAR file:///home/jovyan/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.375.jar at <spark://lfs-client-test-0:40379/jars/com.amazonaws_aws-java-sdk-bundle-1.11.375.jar> with timestamp 1667206394368
22/10/31 08:53:14 INFO SparkContext: Added JAR file:/home/jovyan/jars/lakefs-spark-client-301-assembly-0.5.1.jar at <spark://lfs-client-test-0:40379/jars/lakefs-spark-client-301-assembly-0.5.1.jar> with timestamp 1667206394369
22/10/31 08:53:14 INFO Executor: Starting executor ID driver on host lfs-client-test-0
22/10/31 08:53:14 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34953.
22/10/31 08:53:14 INFO NettyBlockTransferService: Server created on lfs-client-test-0:34953
22/10/31 08:53:14 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/10/31 08:53:14 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, lfs-client-test-0, 34953, None)
22/10/31 08:53:14 INFO BlockManagerMasterEndpoint: Registering block manager lfs-client-test-0:34953 with 434.4 MiB RAM, BlockManagerId(driver, lfs-client-test-0, 34953, None)
22/10/31 08:53:14 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, lfs-client-test-0, 34953, None)
22/10/31 08:53:14 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, lfs-client-test-0, 34953, None)
Exception in thread "main" com.google.common.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: dev/failsafe/function/CheckedSupplier
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
	at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
	at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
	at io.treeverse.clients.ApiClient$.get(ApiClient.scala:43)
	at io.treeverse.clients.GarbageCollector$.main(GarbageCollector.scala:273)
	at io.treeverse.clients.GarbageCollector.main(GarbageCollector.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NoClassDefFoundError: dev/failsafe/function/CheckedSupplier
	at io.treeverse.clients.ApiClient$$anon$1.call(ApiClient.scala:44)
	at io.treeverse.clients.ApiClient$$anon$1.call(ApiClient.scala:43)
	at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
	... 17 more
Caused by: java.lang.ClassNotFoundException: dev.failsafe.function.CheckedSupplier
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
	... 24 more
22/10/31 08:53:14 INFO SparkContext: Invoking stop() from shutdown hook
22/10/31 08:53:14 INFO SparkUI: Stopped Spark web UI at <http://lfs-client-test-0:4040>
22/10/31 08:53:14 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/10/31 08:53:14 INFO MemoryStore: MemoryStore cleared
22/10/31 08:53:14 INFO BlockManager: BlockManager stopped
22/10/31 08:53:14 INFO BlockManagerMaster: BlockManagerMaster stopped
22/10/31 08:53:14 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/10/31 08:53:14 INFO SparkContext: Successfully stopped SparkContext
22/10/31 08:53:14 INFO ShutdownHookManager: Shutdown hook called
22/10/31 08:53:14 INFO ShutdownHookManager: Deleting directory /tmp/spark-68224eda-9da1-4aa4-b316-dec0b0003d9d
22/10/31 08:53:14 INFO ShutdownHookManager: Deleting directory /tmp/spark-fc8a6690-aa83-4347-b3da-7cc9d048b591
o
hey @Cristian Caloian, thanks for the information! we'll take a look and update
🙌 1
j
Hi @Cristian Caloian This looks like some good ol’ dependency fun… To make things a bit cleaner, can you delete lakefs-spark-client-301_2.12-0.5.1.jar from the $SPARK_HOME/jars directory and run it again with only the assembly jar? (The assembly jar contains a shaded version of com.google.common.*, which means that you probably ran the non-assembly jar.)
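(As a concrete sketch of that cleanup step, assuming the jar was copied into the default Spark jars directory:)
```
# Remove the non-assembly client so only the shaded assembly jar stays on the classpath
rm "$SPARK_HOME/jars/lakefs-spark-client-301_2.12-0.5.1.jar"
```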
c
Thanks for the suggestion @Jonathan Rosenberg I get a different error now, which I take to mean we made some progress. Please see the output below (this time as a file so it can be collapsed)
I now have both of these in my $SPARK_HOME/jars
Copy code
(base) jovyan@lfs-client-test-1-0:~$ ls $SPARK_HOME/jars | grep lakefs
hadoop-lakefs-assembly-0.1.8.jar
lakefs-spark-client-301-assembly-0.5.1.jar
Is that the correct setup? Or should lakefs-spark-client-301-assembly-0.5.1.jar alone be enough?
j
You can remove hadoop-lakefs-assembly-0.1.8.jar. It’s not necessary for the GC. A question: did you create a lakeFS user other than the admin, and are you using its credentials as the lakeFS key and secret input for the GC process? Also, it would be very helpful if you could attach the logs from the lakeFS server after running GC. Thanks!
c
You can remove hadoop-lakefs-assembly-0.1.8.jar. It’s not necessary for the GC.
Got it! Would I need it if I want to use the Spark client to work with files on LakeFS?
did you create a lakeFS user other than the admin, and are using its credentials as the lakeFS key and secret input for the GC process?
I am using my user credentials, which also has admin rights. It also has policies attached for the repo I am trying to run the GC.
it would be very helpful if you could attach the logs from the lakeFS server after running GC.
Working on it. Let me get back to you later.
j
Got it! Would I need it if I want to use the Spark client to work with files on LakeFS?
You can work with Spark and lakeFS without the lakeFS Hadoop Filesystem jar by using lakeFS’s S3 gateway, but from a performance POV it would be better to use the lakeFS Hadoop Filesystem.
I am using my user credentials, which also has admin rights. It also has policies attached for the repo I am trying to run the GC.
Working on it. Let me get back to you later.
Great!
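(For readers wondering what the S3 gateway route looks like in practice: it roughly amounts to pointing the S3A connector at the lakeFS server instead of AWS. A minimal sketch, with the endpoint and credentials as illustrative placeholders:)
```
# Sketch: spark-shell going through the lakeFS S3 gateway instead of the lakeFS Hadoop Filesystem
spark-shell \
  --conf spark.hadoop.fs.s3a.endpoint='https://<my-lakefs-url>' \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.access.key='<my-lakefs-access-key>' \
  --conf spark.hadoop.fs.s3a.secret.key='<my-lakefs-secret-key>'
# Inside the shell, data is then addressed as s3a://<repo>/<branch>/path/to/file
```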
c
Hi! Sorry for the late reply… Through trial and error I think I managed to get a working version. At least there are no errors in the output. There are a couple of things I would like to run by you, just to double check that everything looks good. I am pasting/attaching below the dependencies, the command, the output, and the warning log messages. I used Spark 3.0.1 with Hadoop 3.2, downloaded from here:
Copy code
https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
Regarding the dependencies, do you have an explanation why both of these are needed? It actually doesn’t work otherwise. Not that it bothers me, but it looks a bit strange, especially the versions 301 vs 312. I have attached the full list.
Copy code
hadoop-lakefs-assembly-0.1.8.jar
lakefs-spark-client-301-assembly-0.5.1.jar
lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar
The command to run the GC
Copy code
spark-submit --class io.treeverse.clients.GarbageCollector \
    -c spark.hadoop.lakefs.api.url="******"  \
    -c spark.hadoop.lakefs.api.access_key="******" \
    -c spark.hadoop.lakefs.api.secret_key="******" \
    -c spark.hadoop.fs.s3a.access.key="******" \
    -c spark.hadoop.fs.s3a.secret.key="******" \
    $SPARK_HOME/jars/lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar <repo-name> eu-west-1
The output (stdout) - not much info here because I think the GC policy wasn’t set up properly, but I will test different configurations in the coming days.
Copy code
gcCommitsLocation: s3a://<bucket-name>/data//_lakefs/retention/gc/commits/run_id=981b2004-d427-4038-8d2f-20f100f73eab/commits.csv
gcAddressesLocation: s3a://<bucket-name>/data//_lakefs/retention/gc/addresses/
apiURL: https://<lakefs-url>/api/v1
getAddressesToDelete: use 50 partitions for ranges
getAddressesToDelete: use 200 partitions for addresses
Expired addresses:
+-------+-------+
|address|mark_id|
+-------+-------+
+-------+-------+

addressDFLocation: s3a://<bucket-name>/data//_lakefs/retention/gc/addresses/
And the logs (stderr), a very long file with about 13k lines - there are no errors, but I want to share the warnings just to check with you if there is anything I need to take care of
Copy code
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark-3.0.1-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/11/03 17:05:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/11/03 17:06:00 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
22/11/03 17:06:14 WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
22/11/03 17:06:15 WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
You can work with Spark and lakeFS without the lakeFS Hadoop Filesystem jar by using lakeFS’s S3 gateway, but from a performance POV it would be better to use the lakeFS Hadoop Filesystem.
I have already set up the S3 gateway, but I thought it would be better to have the Hadoop Filesystem set up and create some boilerplate for our users.
o
@Cristian Caloian thanks for the information, we'll have a look and get back to you
🙌 1
c
Regarding the Hadoop Filesystem Spark lakeFS client, I’m trying to use the same Docker image to load some data, and it seems that it works, i.e. I can read a CSV file into a dataframe and run df.show(). However, when I start the spark-shell I get the following error message
Copy code
:: problems summary ::
:::: ERRORS
        Server access error at url https://dl.bintray.com/spark-packages/maven/io/lakefs/lakefs-parent/0/lakefs-parent-0.jar (javax.net.ssl.SSLHandshakeException: Remote host terminated the handshake)
And when I’m trying to read the file I get
Copy code
scala> val df = spark.read.option("delimiter", ",").option("header", "true").csv(dataPath)
22/11/04 10:48:19 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
df: org.apache.spark.sql.DataFrame = [sepal.length: string, sepal.width: string ... 3 more fields]
Again, it seems that the data is successfully read
Copy code
scala> df.show()
+------------+-----------+------------+-----------+-------+
|sepal.length|sepal.width|petal.length|petal.width|variety|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|         .2| Setosa|
|         4.9|          3|         1.4|         .2| Setosa|
...
Do you have a suggestion on how to fix this? The error and the warning, that is.
👀 1
o
@Cristian Caloian it seems like an issue with bintray serving the file
(dl.bintray.com unexpectedly closed the connection)
which docker image are you using?
and can you share the spark-shell command you're running?
c
This is the command I use to start the shell
Copy code
spark-shell --conf spark.hadoop.fs.s3a.access.key='******' \
            --conf spark.hadoop.fs.s3a.secret.key='******' \
            --conf spark.hadoop.fs.s3a.endpoint='https://s3.eu-west-1.amazonaws.com' \
            --conf spark.hadoop.fs.lakefs.impl='io.lakefs.LakeFSFileSystem' \
            --conf spark.hadoop.fs.lakefs.access.key='******' \
            --conf spark.hadoop.fs.lakefs.secret.key='******' \
            --conf spark.hadoop.fs.lakefs.endpoint='******' \
            --packages io.lakefs:hadoop-lakefs-assembly:0.1.8
For the image, I am using an existing jupyter image, just so I can experiment quickly and because it had Spark installed, but I will create separate images, one for running the GC and one for the LakeFS Spark client. Do you have a suggestion on how I can fix the bintray issue? Do I need to install some dependency? I sincerely apologize for all these questions, but I am new to the Spark/Scala world.
o
Nothing to be sorry about, we're happy to have you here. I'll try to reproduce it soon and see if I can provide a quick mitigation.
c
Thank you @Or Tzabary I just wanted to clarify something about the image - I created it with the dependencies I mentioned above.
👍 1
o
fyi, bintray closed their services a year and a half ago. Maybe the jupyter image you're using is configured to use bintray as its maven repository?
try using the latest jupyter version to see if it helps
I managed to run the spark-shell command you sent without any errors
please let me know if it helps
c
Then it is probably on my end. I will try to create a slimmer image, re-run the steps, and let you know how it goes.
👍 1
t
Hi @Cristian Caloian, happy that things are working for you! I have one clarification question - when you say:
Regarding the dependencies, do you have an explanation why both of these are needed? It actually doesn’t work otherwise. Not that it bothers me, but it looks a bit strange, especially the versions 301 vs 312. I have attached the full list.
```
hadoop-lakefs-assembly-0.1.8.jar
lakefs-spark-client-301-assembly-0.5.1.jar
lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar
```
What do you mean? Do you mean that you must have both
Copy code
lakefs-spark-client-301-assembly-0.5.1.jar
lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar
in your jars folder for GC to work?
c
Hi @Tal Sofer exactly! I tried removing lakefs-spark-client-301-assembly-0.5.1.jar as I felt it must be redundant, but it didn’t work afterwards. I then thought it must have to do with the fact that I am using Spark 3.0.1. But then I tried with Spark 3.1.2 and only lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar, and it still wasn't working. Let me double check just to be 💯 Do you have a recommendation for what versions (Spark and jar) I should be using?
t
Thanks @Cristian Caloian!
lakefs-spark-client-301-assembly-0.5.1.jar indeed should be redundant. If you share with us the error you are getting as a result of removing it, we will be happy to help troubleshoot!
Do you have a recommendation for what versions (Spark and jar) I should be using?
Since you are using Hadoop 3, I would recommend using this version of the jar: lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar. This version is validated with Spark 3.1.2, but it may work with other Spark versions, as I think it did for you. Running it with Spark 3.1.2 would be best if that works for you 🙂
🙌 1
c
@Tal Sofer it is actually working as expected - Spark 3.1.2, Hadoop 3.2, and the following lakefs jars
Copy code
hadoop-lakefs-assembly-0.1.8.jar
lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar
I must’ve gotten confused while trying various version combinations. Sorry for the false alarm, and thank you for the support 🙏 With the dependencies nailed down, I now need to figure out how to create an image that I can use for running a Kubernetes job of kind SparkApplication. I might need to ask more questions here if I run into trouble.
❤️ 1
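(In case it helps while you put that image together: besides the SparkApplication CRD via the Spark operator, plain spark-submit can also target Kubernetes directly. A rough sketch - the API server address, image name and service account are placeholders, and the lakeFS/S3 credential confs are omitted for brevity:)
```
# Sketch: run the GC with spark-submit's native Kubernetes support (all values are placeholders)
spark-submit --class io.treeverse.clients.GarbageCollector \
  --master k8s://https://<k8s-api-server>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<registry>/<spark-image-with-lakefs-client>:latest \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.hadoop.lakefs.api.url="https://<my-lakefs-api-url>/api/v1" \
  local:///opt/spark/jars/lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar \
  <my-repo> eu-west-1
```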
t
Great news! Our pleasure, we are here to support you.
Mind sharing your use case and infra? Also, are you locked into k8s for running your Spark apps?
c
Yes, pretty much all our services run on Kubernetes, and I think it would be good to have a recurring Spark job running the garbage collection on LakeFS for our users. Do you have a suggestion or an alternative solution?
t
Thanks for sharing the details. Where is your infra running? k8s is a valid option, although you will have to make sure your cluster has sufficient resources to run against your env. If this is an option for you, the easiest may be to use a managed Spark env like Databricks or EMR.
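(And a rough sketch of what the EMR route could look like, adding a Spark step to an existing cluster with the AWS CLI; the cluster ID, jar location and the omitted credential confs are all placeholders:)
```
# Sketch: submit the GC as a Spark step on an existing EMR cluster (placeholders throughout,
# credential --conf flags omitted for brevity)
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name=lakeFS-GC,ActionOnFailure=CONTINUE,Args=[--class,io.treeverse.clients.GarbageCollector,--conf,spark.hadoop.lakefs.api.url=https://<my-lakefs-api-url>/api/v1,s3://<my-bucket>/jars/lakefs-spark-client-312-hadoop3-assembly-0.5.1.jar,<my-repo>,eu-west-1]'
```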