Dieu M. Nguyen 09/05/2023, 8:37 PM
command as directed by the documentation. As far as I can tell, the command ran and finished without errors. But in S3, I don’t see the list of objects removed in
and my storage didn't go down, so I assume objects from old versions haven't been deleted. Do you have any ideas about this? Note: I only set the GC policy after already writing all my versions. Could that be why?
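For context, a minimal sketch of attaching a GC policy to a repository. The repository name and retention values here are made up, and the exact `lakectl gc set-config` flags may differ by version; check the lakeFS GC documentation:

```sh
# Hypothetical repo and retention values; verify flag names against your lakectl version
cat > gc-policy.json <<'EOF'
{
  "default_retention_days": 7,
  "branches": [
    {"branch_id": "main", "retention_days": 14}
  ]
}
EOF
lakectl gc set-config lakefs://my-repo --policy-file gc-policy.json
```

Note that the policy only marks objects as eligible; nothing is deleted until the GC Spark job actually runs.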
Iddo Avneri 09/05/2023, 8:43 PM
Dieu M. Nguyen 09/05/2023, 8:57 PM
Yoni Augarten 09/05/2023, 9:11 PM
and let me know what they are?
Dieu M. Nguyen 09/05/2023, 9:13 PM
I actually don't see
Iddo Avneri 09/05/2023, 9:16 PM
Yoni Augarten 09/05/2023, 9:17 PM
Dieu M. Nguyen 09/05/2023, 9:29 PM
23/09/05 21:23:51 INFO SharedState: Warehouse path is 'file:/home/ssm-user/spark-warehouse'.
Exception in thread "main" io.lakefs.clients.api.ApiException: Content type "text/html; charset=utf-8" is not supported for type: class io.lakefs.clients.api.model.StorageConfig
Yoni Augarten 09/05/2023, 9:31 PM
Dieu M. Nguyen 09/05/2023, 9:34 PM
which is my endpoint URL. Would that be correct?
Yoni Augarten 09/05/2023, 9:34 PM
to that 🙂
Dieu M. Nguyen 09/05/2023, 9:34 PM
Yoni Augarten 09/05/2023, 9:35 PM
Dieu M. Nguyen 09/05/2023, 9:39 PM
23/09/05 21:35:34 WARN FileSystem: Failed to initialize fileystem s3a://dieumy-test-update-zarr-store-lakefs/data: java.lang.NumberFormatException: For input string: "64M"
and probably the same error:
Exception in thread "main" java.lang.NumberFormatException: For input string: "64M"
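For reference, this "64M" failure usually means the hadoop-aws jar on the classpath is older than the cluster's Hadoop, so it cannot parse suffixed size values. A hedged workaround (property value assumed; 64M = 67108864 bytes) is to pass the size numerically, though the real fix of matching versions comes up later in this thread:

```sh
# Sketch of a workaround: pass the multipart size in plain bytes (64 MiB)
# so an older hadoop-aws parser does not choke on the "64M" suffix.
spark-submit ... -c spark.hadoop.fs.s3a.multipart.size=67108864 ...
```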
Yoni Augarten 09/05/2023, 9:40 PM
Dieu M. Nguyen 09/05/2023, 9:41 PM
Yoni Augarten 09/05/2023, 9:43 PM
Dieu M. Nguyen 09/05/2023, 9:46 PM
Yoni Augarten 09/05/2023, 9:47 PM
flag suggested by our docs?
Dieu M. Nguyen 09/05/2023, 9:47 PM
Yoni Augarten 09/05/2023, 9:49 PM
Dieu M. Nguyen 09/05/2023, 9:49 PM
went away but we are blessed with a new one!
23/09/05 21:50:07 WARN FileSystem: Failed to initialize fileystem s3a://dieumy-test-update-zarr-store-lakefs/data: java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
Yoni Augarten 09/05/2023, 9:55 PM
(or maybe you know the Hadoop version already?)
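One way to check, assuming shell access to a cluster node, so the `--packages` hadoop-aws version can be matched to the cluster's Hadoop:

```sh
# Print the cluster's Hadoop version (first line of output),
# e.g. "Hadoop 3.3.4" -> use org.apache.hadoop:hadoop-aws:3.3.4
hadoop version | head -1
```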
Dieu M. Nguyen 09/05/2023, 9:58 PM
res0: String = 3.3.4
Yoni Augarten 09/05/2023, 9:59 PM
version in the command to the same version as Hadoop.
Dieu M. Nguyen 09/05/2023, 10:02 PM
23/09/05 22:00:33 ERROR FileFormatWriter: Aborting job db8e4543-d572-4931-86c1-91590bcd8098.
java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
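For reference, this Snappy error comes from Parquet's default compression codec needing the native snappy library, which may be missing from the driver's environment. A workaround that appears later in this thread is to disable Parquet compression for the GC run:

```sh
# Sketch: disable Parquet compression so the GC job does not
# depend on the native Snappy library (trades output size for portability)
spark-submit ... -c spark.sql.parquet.compression.codec=none ...
```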
Yoni Augarten 09/05/2023, 10:03 PM
and see if something is already there?
Dieu M. Nguyen 09/05/2023, 10:05 PM
. Inside is empty, but at least it's making these folders. Thanks very much for your patient help! GC is one of the features that would be amazing for our use case with lakeFS, so we're hoping it will work 🙂 With the way we write data, we have terabytes in our S3 bucket and only need the latest version at the end.
Yoni Augarten 09/05/2023, 10:06 PM
Dieu M. Nguyen 09/05/2023, 10:07 PM
Yoni Augarten 09/05/2023, 10:08 PM
Dieu M. Nguyen 09/05/2023, 10:11 PM
Yoni Augarten 09/05/2023, 10:11 PM
Dieu M. Nguyen 09/05/2023, 10:14 PM
command ran super quickly, so I assume it didn't begin to delete anything.
Yoni Augarten 09/05/2023, 10:15 PM
Dieu M. Nguyen 09/05/2023, 10:16 PM
Yoni Augarten 09/05/2023, 10:23 PM
Dieu M. Nguyen 09/05/2023, 10:23 PM
Yoni Augarten 09/06/2023, 7:16 AM
Dieu M. Nguyen 09/06/2023, 6:04 PM
spark-submit --class io.treeverse.gc.GarbageCollection \
  --packages org.apache.hadoop:hadoop-aws:3.3.4 \
  -c spark.sql.parquet.compression.codec=none \
  # Followed by the rest of the command in your GC documentation
Yoni Augarten 09/06/2023, 6:14 PM
Dieu M. Nguyen 09/06/2023, 6:20 PM
Yoni Augarten 09/07/2023, 5:01 PM
Dieu M. Nguyen 09/07/2023, 6:24 PM
Yoni Augarten 09/07/2023, 9:08 PM
Dieu M. Nguyen 09/08/2023, 4:25 PM
Yoni Augarten 09/08/2023, 10:17 PM
Dieu M. Nguyen 09/11/2023, 8:07 PM
23/09/08 20:59:38 WARN DependencyUtils: Local jar /mnt/var/lib/hadoop/steps/s-01012223TUT7XNB9PPPQ/\ does not exist, skipping.
23/09/08 20:59:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/09/08 20:59:39 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at ip-10-5-165-163.us-east-1.compute.internal/10.5.165.163:8032
23/09/08 20:59:40 INFO Configuration: resource-types.xml not found
23/09/08 20:59:40 INFO ResourceUtils: Unable to find 'resource-types.xml'.
23/09/08 20:59:40 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container)
23/09/08 20:59:40 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
23/09/08 20:59:40 INFO Client: Setting up container launch context for our AM
23/09/08 20:59:40 INFO Client: Setting up the launch environment for our AM container
23/09/08 20:59:40 INFO Client: Preparing resources for our AM container
23/09/08 20:59:40 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
23/09/08 20:59:47 INFO Client: Uploading resource file:/mnt/tmp/spark-48f22d00-2810-4f9a-9274-7a0af0bf4ac7/__spark_libs__4873711776678516465.zip -> hdfs://ip-10-5-165-163.us-east-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1694204830468_0001/__spark_libs__4873711776678516465.zip
23/09/08 20:59:48 INFO Client: Uploading resource file:/mnt/var/lib/hadoop/steps/s-01012223TUT7XNB9PPPQ/\ -> hdfs://ip-10-5-165-163.us-east-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1694204830468_0001/\
23/09/08 20:59:48 INFO Client: Deleted staging directory hdfs://ip-10-5-165-163.us-east-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1694204830468_0001
Exception in thread "main" java.io.FileNotFoundException: File file:/mnt/var/lib/hadoop/steps/s-01012223TUT7XNB9PPPQ/\ does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:832)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1153)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:822)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:472)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:390)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341)
    at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:461)
    at org.apache.spark.deploy.yarn.Client.distribute$1(Client.scala:557)
    at org.apache.spark.deploy.yarn.Client.$anonfun$prepareLocalResources$23(Client.scala:686)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:685)
    at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:984)
    at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:221)
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1322)
    at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1770)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1066)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1158)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1167)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
23/09/08 20:59:48 INFO ShutdownHookManager: Shutdown hook called
23/09/08 20:59:48 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-48f22d00-2810-4f9a-9274-7a0af0bf4ac7
23/09/08 20:59:48 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-f40895bb-ef48-4521-9a9f-d8d3fb92e2d7
Command exiting with ret '1'
Dieu M. Nguyen 09/12/2023, 7:16 PM
option to pass this custom jar path. However, I'm still getting the same error. Do you have any idea what might be incorrect in my spark submission?
spark-submit --deploy-mode cluster \
  --class io.treeverse.gc.GarbageCollection \
  --packages org.apache.hadoop:hadoop-aws:3.3.4 \
  -c spark.hadoop.lakefs.api.url=[url]/api/v1 \
  -c spark.hadoop.lakefs.api.access_key= \
  -c spark.hadoop.lakefs.api.secret_key= \
  -c spark.hadoop.fs.s3a.access.key= \
  -c spark.hadoop.fs.s3a.secret.key= \
  --jars s3://my-bucket/lakefs-spark-client-assembly-0.10.0.jar \
  small-test us-east-1
Yoni Augarten 09/13/2023, 6:51 AM
Dieu M. Nguyen 09/13/2023, 2:31 PM
Yoni Augarten 09/13/2023, 2:34 PM
Dieu M. Nguyen 09/13/2023, 2:34 PM
Yoni Augarten 09/13/2023, 3:28 PM
characters in my command. Try to remove all backslashes and line breaks from your command.
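For example, the earlier submission collapsed onto one line with no backslashes (credentials and URL elided exactly as in the original; an EMR step can misread backslash line continuations as literal arguments):

```sh
spark-submit --deploy-mode cluster --class io.treeverse.gc.GarbageCollection --packages org.apache.hadoop:hadoop-aws:3.3.4 -c spark.hadoop.lakefs.api.url=[url]/api/v1 -c spark.hadoop.lakefs.api.access_key= -c spark.hadoop.lakefs.api.secret_key= -c spark.hadoop.fs.s3a.access.key= -c spark.hadoop.fs.s3a.secret.key= --jars s3://my-bucket/lakefs-spark-client-assembly-0.10.0.jar small-test us-east-1
```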
Dieu M. Nguyen 09/13/2023, 5:55 PM
part in the url in the command. Maybe I'm missing something simple here. May I ask where you are running lakeFS?
Yoni Augarten 09/13/2023, 6:06 PM
Dieu M. Nguyen 09/13/2023, 6:20 PM
Yoni Augarten 09/13/2023, 6:26 PM
Dieu M. Nguyen 09/13/2023, 6:53 PM
http://[IP of EC2 instance]/api/v1
Yoni Augarten 09/13/2023, 6:54 PM
Dieu M. Nguyen 09/13/2023, 7:03 PM
Yoni Augarten 09/14/2023, 8:14 AM