Dieu M. Nguyen
09/05/2023, 8:37 PM
I ran the spark-submit command as directed by the documentation. As far as I can tell, the command ran and finished without errors. But in S3, I don't see the list of removed objects in _lakefs/retention/gc/unified/<RUN_ID>/deleted/, and my storage didn't go down, so I assume objects from old versions haven't been deleted. Do you have any ideas about this? Note: I only set the GC policy after already writing all my versions - is this why?

Iddo Avneri
09/05/2023, 8:43 PM

Dieu M. Nguyen
09/05/2023, 8:57 PM

Yoni Augarten
09/05/2023, 9:11 PM
Can you check what objects exist under _lakefs/retention/gc/unified and let me know what they are?

Dieu M. Nguyen
09/05/2023, 9:13 PM
Under lakefs/retention/gc/ I actually don't see unified, only rules.

Iddo Avneri
09/05/2023, 9:16 PM

Yoni Augarten
09/05/2023, 9:17 PM

Dieu M. Nguyen
09/05/2023, 9:29 PM
23/09/05 21:23:51 INFO SharedState: Warehouse path is 'file:/home/ssm-user/spark-warehouse'.
Exception in thread "main" io.lakefs.clients.api.ApiException: Content type "text/html; charset=utf-8" is not supported for type: class io.lakefs.clients.api.model.StorageConfig
Yoni Augarten
09/05/2023, 9:31 PM
What is the spark.hadoop.lakefs.api.url you're passing?

Dieu M. Nguyen
09/05/2023, 9:34 PM
I'm passing http://127.0.0.1:8000, which is my endpoint URL. Would that be correct?

Yoni Augarten
09/05/2023, 9:34 PM
You need to add /api/v1 to that 🙂

Dieu M. Nguyen
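[Editor's note] Without the /api/v1 suffix, the client hits the lakeFS UI and receives HTML, which explains the "Content type text/html ... is not supported" ApiException above. The corrected flag would look roughly like this, a sketch using the endpoint from this thread rather than the complete GC command:

```shell
# The lakeFS API URL must include the /api/v1 path prefix;
# otherwise the client receives the UI's HTML instead of JSON.
# (Remaining GC flags and arguments per the documentation are omitted.)
spark-submit --class io.treeverse.gc.GarbageCollection \
  -c spark.hadoop.lakefs.api.url=http://127.0.0.1:8000/api/v1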
09/05/2023, 9:34 PM

Yoni Augarten
09/05/2023, 9:35 PM

Dieu M. Nguyen
09/05/2023, 9:39 PM
23/09/05 21:35:34 WARN FileSystem: Failed to initialize fileystem s3a://dieumy-test-update-zarr-store-lakefs/data: java.lang.NumberFormatException: For input string: "64M"
And probably the same error:
Exception in thread "main" java.lang.NumberFormatException: For input string: "64M"
Yoni Augarten
09/05/2023, 9:40 PM

Dieu M. Nguyen
09/05/2023, 9:41 PM

Yoni Augarten
09/05/2023, 9:43 PM

Dieu M. Nguyen
09/05/2023, 9:46 PM

Yoni Augarten
09/05/2023, 9:47 PM
Are you using the --packages org.apache.hadoop:hadoop-aws:2.7.7 flag suggested by our docs?

Dieu M. Nguyen
09/05/2023, 9:47 PM

Yoni Augarten
09/05/2023, 9:49 PM

Dieu M. Nguyen
09/05/2023, 9:49 PM
The java.lang.NumberFormatException went away, but we are blessed with a new one!
23/09/05 21:50:07 WARN FileSystem: Failed to initialize fileystem s3a://dieumy-test-update-zarr-store-lakefs/data: java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
Yoni Augarten
09/05/2023, 9:55 PM
Can you run (in a spark-shell) org.apache.hadoop.util.VersionInfo.getVersion()? (Or maybe you know the Hadoop version already?)

Dieu M. Nguyen
09/05/2023, 9:58 PM
It returns res0: String = 3.3.4.

Yoni Augarten
09/05/2023, 9:59 PM
Try changing the hadoop-aws version in the command to the same version as Hadoop.

Dieu M. Nguyen
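[Editor's note] Since spark-shell reported Hadoop 3.3.4, the --packages coordinate should match it. A sketch, with the remaining GC flags and arguments omitted; the version is taken from this thread, not a general recommendation:

```shell
# hadoop-aws must match the Hadoop version bundled with Spark (3.3.4 here).
# Mixing the 2.7.7 jar with Hadoop 3.3.x produces symptoms like
# NumberFormatException: For input string: "64M" and
# ClassNotFoundException for IAMInstanceCredentialsProvider.
spark-submit --class io.treeverse.gc.GarbageCollection \
  --packages org.apache.hadoop:hadoop-aws:3.3.4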
09/05/2023, 10:02 PM
23/09/05 22:00:33 ERROR FileFormatWriter: Aborting job db8e4543-d572-4931-86c1-91590bcd8098.
java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
Yoni Augarten
09/05/2023, 10:03 PM
Can you look under _lakefs/retention/gc/unified and see if something is already there?

Dieu M. Nguyen
09/05/2023, 10:05 PM
I see _lakefs/retention/gc/unified/glbls1cs3mqm2j9qvc60/deleted/. Inside is empty, but at least it's making these folders.
Thanks very much for your patient help! GC is one of the features that would be amazing for our use case with lakeFS, so we are hoping it will work 🙂 Because of the way we write data, we have terabytes in our S3 bucket and only need the latest version at the end.

Yoni Augarten
09/05/2023, 10:06 PM

Dieu M. Nguyen
09/05/2023, 10:07 PM
glbls1cs3mqm2j9qvc60
Yoni Augarten
09/05/2023, 10:08 PM

Dieu M. Nguyen
09/05/2023, 10:11 PM

Yoni Augarten
09/05/2023, 10:11 PM

Dieu M. Nguyen
09/05/2023, 10:14 PM
The spark-submit command ran super quickly, so I assume it didn't begin to delete anything.

Yoni Augarten
09/05/2023, 10:15 PM

Dieu M. Nguyen
09/05/2023, 10:16 PM

Yoni Augarten
09/05/2023, 10:23 PM

Dieu M. Nguyen
09/05/2023, 10:23 PM

Yoni Augarten
09/06/2023, 7:16 AM
Try setting spark.sql.parquet.compression.codec to none.

Dieu M. Nguyen
09/06/2023, 6:04 PM
Do I pass that in the spark-submit command? E.g.,
spark-submit --class io.treeverse.gc.GarbageCollection \
  --packages org.apache.hadoop:hadoop-aws:3.3.4 \
  -c spark.sql.parquet.compression.codec=none \
  # Followed by the rest of the command in your GC documentation
Yoni Augarten
09/06/2023, 6:14 PM

Dieu M. Nguyen
09/06/2023, 6:20 PM

Yoni Augarten
09/07/2023, 5:01 PM

Dieu M. Nguyen
09/07/2023, 6:24 PM

Yoni Augarten
09/07/2023, 9:08 PM

Dieu M. Nguyen
09/08/2023, 4:25 PM

Yoni Augarten
09/08/2023, 10:17 PM

Dieu M. Nguyen
09/11/2023, 8:07 PM
23/09/08 20:59:38 WARN DependencyUtils: Local jar /mnt/var/lib/hadoop/steps/s-01012223TUT7XNB9PPPQ/\ does not exist, skipping.
23/09/08 20:59:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/09/08 20:59:39 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at ip-10-5-165-163.us-east-1.compute.internal/10.5.165.163:8032
23/09/08 20:59:40 INFO Configuration: resource-types.xml not found
23/09/08 20:59:40 INFO ResourceUtils: Unable to find 'resource-types.xml'.
23/09/08 20:59:40 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container)
23/09/08 20:59:40 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
23/09/08 20:59:40 INFO Client: Setting up container launch context for our AM
23/09/08 20:59:40 INFO Client: Setting up the launch environment for our AM container
23/09/08 20:59:40 INFO Client: Preparing resources for our AM container
23/09/08 20:59:40 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
23/09/08 20:59:47 INFO Client: Uploading resource file:/mnt/tmp/spark-48f22d00-2810-4f9a-9274-7a0af0bf4ac7/__spark_libs__4873711776678516465.zip -> hdfs://ip-10-5-165-163.us-east-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1694204830468_0001/__spark_libs__4873711776678516465.zip
23/09/08 20:59:48 INFO Client: Uploading resource file:/mnt/var/lib/hadoop/steps/s-01012223TUT7XNB9PPPQ/\ -> hdfs://ip-10-5-165-163.us-east-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1694204830468_0001/\
23/09/08 20:59:48 INFO Client: Deleted staging directory hdfs://ip-10-5-165-163.us-east-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1694204830468_0001
Exception in thread "main" java.io.FileNotFoundException: File file:/mnt/var/lib/hadoop/steps/s-01012223TUT7XNB9PPPQ/\ does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:832)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1153)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:822)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:472)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:390)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341)
at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:461)
at org.apache.spark.deploy.yarn.Client.distribute$1(Client.scala:557)
at org.apache.spark.deploy.yarn.Client.$anonfun$prepareLocalResources$23(Client.scala:686)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:685)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:984)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:221)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1322)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1770)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1066)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1158)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1167)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
23/09/08 20:59:48 INFO ShutdownHookManager: Shutdown hook called
23/09/08 20:59:48 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-48f22d00-2810-4f9a-9274-7a0af0bf4ac7
23/09/08 20:59:48 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-f40895bb-ef48-4521-9a9f-d8d3fb92e2d7
Command exiting with ret '1'
Barak Amar
Dieu M. Nguyen
09/12/2023, 7:16 PM
I used the --jars option to pass this custom jar path. However, I'm still getting the same error. Do you have any idea what might be incorrect in my spark submission?
spark-submit --deploy-mode cluster \
--class io.treeverse.gc.GarbageCollection \
--packages org.apache.hadoop:hadoop-aws:3.3.4 \
-c spark.hadoop.lakefs.api.url=[url]/api/v1 \
-c spark.hadoop.lakefs.api.access_key= \
-c spark.hadoop.lakefs.api.secret_key= \
-c spark.hadoop.fs.s3a.access.key= \
-c spark.hadoop.fs.s3a.secret.key= \
--jars s3://my-bucket/lakefs-spark-client-assembly-0.10.0.jar \
small-test us-east-1
Yoni Augarten
09/13/2023, 6:51 AM

Dieu M. Nguyen
09/13/2023, 2:31 PM

Yoni Augarten
09/13/2023, 2:34 PM

Dieu M. Nguyen
09/13/2023, 2:34 PM

Yoni Augarten
09/13/2023, 3:28 PM
I managed to reproduce this error when there were stray \ characters in my command. Try to remove all backslashes and line breaks from your command.

Dieu M. Nguyen
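[Editor's note] Concretely, the command from earlier in the thread collapsed onto a single line with no backslashes or line breaks; a sketch based on that command, with the [url] placeholder and blank keys kept as in the original:

```shell
# Same submission as before, but as one line so no stray "\" is
# interpreted as a (nonexistent) local jar path by the EMR step.
spark-submit --deploy-mode cluster --class io.treeverse.gc.GarbageCollection --packages org.apache.hadoop:hadoop-aws:3.3.4 -c spark.hadoop.lakefs.api.url=[url]/api/v1 -c spark.hadoop.lakefs.api.access_key= -c spark.hadoop.lakefs.api.secret_key= -c spark.hadoop.fs.s3a.access.key= -c spark.hadoop.fs.s3a.secret.key= --jars s3://my-bucket/lakefs-spark-client-assembly-0.10.0.jar small-test us-east-1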
09/13/2023, 5:55 PM
I did include the /api/v1 part in the URL in the command. Maybe I'm missing something simple here. May I ask where you are running lakeFS?

Yoni Augarten
09/13/2023, 6:06 PM

Dieu M. Nguyen
09/13/2023, 6:20 PM
Mine is http://127.0.0.1:8000/api/v1.

Yoni Augarten
09/13/2023, 6:26 PM

Dieu M. Nguyen
09/13/2023, 6:53 PM
Should I change it to http://[IP of EC2 instance]/api/v1?

Yoni Augarten
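[Editor's note] A likely reason for the question: from an EMR node, 127.0.0.1 refers to that node itself, not to the machine where lakeFS runs, so the flag needs an address the Spark cluster can actually reach. A sketch; the host placeholder and port 8000 are assumptions based on this thread:

```shell
# Use an address reachable from the Spark cluster, not 127.0.0.1.
-c spark.hadoop.lakefs.api.url=http://<lakefs-host-or-ip>:8000/api/v1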
09/13/2023, 6:54 PM

Dieu M. Nguyen
09/13/2023, 7:03 PM

Yoni Augarten
09/14/2023, 8:14 AM