# help
d
Hi, lakeFS! I'd like to try out the garbage collection feature with your help. I set the GC policy to 1 retention day, both as the default and for all my branches. A few days ago, I wrote many versions of a dataset to a branch, and I now see high storage use in my S3 bucket, as intended in my test. My questions:
1. Do I understand correctly that S3 data from the latest version/commit will be retained, and data from all older versions will be deleted per the GC policy?
2. I ran the `spark-submit` command as directed by the documentation. As far as I can tell, the command ran and finished without errors. But in S3, I don't see the list of removed objects in `_lakefs/retention/gc/unified/<RUN_ID>/deleted/`, and my storage didn't go down, so I assume objects from old versions haven't been deleted. Do you have any ideas about this? Note: I only set the GC policy after already writing all my versions - is this why?
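For reference, I believe my GC rules look roughly like this (branch name anonymized) - I'm assuming this matches the rules format from your docs:
{
  "default_retention_days": 1,
  "branches": [
    { "branch_id": "my-branch", "retention_days": 1 }
  ]
}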
i
Hi @Dieu M. Nguyen. Data that is in a “head” of any branch will not be deleted. Could it be that you have branches pointing to these files?
+ please make sure that versioning is not enabled on your object store (otherwise, storage will not go down)
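For example, something like this should show whether versioning was ever turned on (bucket name is a placeholder):
aws s3api get-bucket-versioning --bucket <your-bucket>
An empty response means versioning has never been enabled.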
d
Hi @Iddo Avneri - versioning has never been enabled on my bucket. Hm, I only wrote data to one branch and haven't merged to main (I only have these two branches).
y
Hey @Dieu M. Nguyen, can you recursively list all the files under `_lakefs/retention/gc/unified` and let me know what they are?
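For example, something like this should do it (adjust the bucket and prefix to your repository's storage namespace):
aws s3 ls s3://<your-bucket>/<storage-namespace>/_lakefs/retention/gc/unified/ --recursive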
d
Hi @Yoni Augarten, under `_lakefs/retention/gc/` I actually don't see `unified`, only `rules`.
i
Is the head of the branch pointing to any of the data?
Or all the data?
y
If this path is empty, it means something went wrong very early in the GC process. Even if no objects were to be deleted, this path would still have new objects in it. Do you see any interesting logs printed by the Spark job?
d
@Iddo Avneri Can you please let me know how I can check? @Yoni Augarten Ah ok. The logs are INFO and I can't identify anything wrong, but there is one Exception that I can see:
23/09/05 21:23:51 INFO SharedState: Warehouse path is 'file:/home/ssm-user/spark-warehouse'.
Exception in thread "main" io.lakefs.clients.api.ApiException: Content type "text/html; charset=utf-8" is not supported for type: class io.lakefs.clients.api.model.StorageConfig
y
This error may indicate that the lakeFS server URI given to the GC job is malformed. Can you share the `spark.hadoop.lakefs.api.url` you're passing?
d
I have it as `http://127.0.0.1:8000`, which is my endpoint URL. Would that be correct?
y
You would have to add `/api/v1` to that 🙂
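For example, with the endpoint above, the setting would look something like this (assuming you pass it the same way as your other lakeFS settings):
spark.hadoop.lakefs.api.url=http://127.0.0.1:8000/api/v1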
d
Ok let me try that!
y
Keep us posted!
👍 1
d
Ok, that exception seems to have gone away and I get a new exception which maybe is good or bad 😅:
23/09/05 21:35:34 WARN FileSystem: Failed to initialize fileystem s3a://dieumy-test-update-zarr-store-lakefs/data: java.lang.NumberFormatException: For input string: "64M"
and probably the same error:
Exception in thread "main" java.lang.NumberFormatException: For input string: "64M"
y
Looks like we're making progress. This one I really hate, and it has to do with Spark/Hadoop compatibility issues. Do you know which Spark version you're running?
d
Sounds like you know this one. I'm on Spark 3.4.1.
y
It's the issue described here. I'm checking what our options are.
👍 1
I understand you are running this locally on your machine?
d
I'm running this on an AWS EC2 instance in us-east-1.
y
Oh, gotcha. And are you adding the `--packages org.apache.hadoop:hadoop-aws:2.7.7` flag suggested by our docs?
I suspect this suggestion may be outdated.
d
Yes, I have that!
y
Maybe we can try bumping this version to something more recent, say 3.2.4
I wouldn't get too optimistic but let's see where that brings us
d
Alright, let me try that!
🙏🏻 1
Well, the `java.lang.NumberFormatException` went away but we are blessed with a new one!
23/09/05 21:50:07 WARN FileSystem: Failed to initialize fileystem s3a://dieumy-test-update-zarr-store-lakefs/data: java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
Something about my S3 key and secret?
y
I'm wondering if we went backwards here. Complex Spark programs can be challenging because you need to figure out which version of each dependency to use. One moment please
👍 1
Can you access a Spark shell (`spark-shell`)?
If so, could you try to find the Hadoop version using the following command in the shell:
org.apache.hadoop.util.VersionInfo.getVersion()
(or maybe you know the Hadoop version already?)
d
I got into a Spark shell and this is what I got: `res0: String = 3.3.4`.
y
Ok so first of all we should change the `hadoop-aws` version in the command to the same version as Hadoop.
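I.e., since your Hadoop version is 3.3.4, the flag would become something like:
--packages org.apache.hadoop:hadoop-aws:3.3.4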
d
Ok, I changed it to 3.3.4 but I'm getting yet another error:
23/09/05 22:00:33 ERROR FileFormatWriter: Aborting job db8e4543-d572-4931-86c1-91590bcd8098.
java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
y
Now I think we're making progress. Can you peek in `_lakefs/retention/gc/unified` and see if something is already there?
Sorry for the long process, by the way. GC is an external process that adds some complexity, but we will come up with a solution.
d
Yes! Now I see `_lakefs/retention/gc/unified/glbls1cs3mqm2j9qvc60/deleted/`. It's empty inside, but at least it's making these folders. Thanks very much for your patient help! GC is one of the features that would be amazing for our use case with lakeFS, so we are hoping it will work 🙂 - the way we write data, we end up with terabytes in our S3 bucket and only need the latest version at the end.
😍 1
y
Are there other files there? (not under deleted maybe?)
I think what happens here is that we were able to calculate which objects to delete, but for some reason couldn't write the list to S3.
d
Under gc/ I see these.
Under unified/ I just have `glbls1cs3mqm2j9qvc60`.
y
And under this one only deleted?
Can you search for the words "Report summary" in the logs?
d
Yes, only deleted/ under glbls1cs3mqm2j9qvc60/.
y
I think we've reached a reporting error. It may be the case that objects have already been deleted by the job. Do you have an object that you know should have been deleted?
If so, trying to download it from the lakeFS UI should result in an error (it would still appear in the list of objects when looking at older commits where it existed).
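For example, if you have `lakectl` configured, reading such an object through an older commit where it existed should now fail (repo, commit ID and path here are placeholders):
lakectl fs cat lakefs://<repo>/<old-commit-id>/<path-to-object>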
d
I'm trying to delete all previous versions of the data except the latest version. I can't tell if something has been deleted or not and the bucket size hasn't gone down. The `spark-submit` command ran super quick so I assume it didn't begin to delete anything.
y
That makes sense. Can you share the Spark logs here? I'll try to pin down where we are stuck.
d
Ok one second!
I'm sharing it as a txt file since it's large. Now, looking at the full logs, I do see that it finished some tasks, if that indicates it deleted some objects.
And under retention/gc/commits/ I see folders for each run and these commits.csv files. I'm attaching one as an example.
I gotta sign off for some errands and will resume tomorrow. Please let me know if you have any insights!
y
Thanks for the info! I'll keep digging and will update here tomorrow
d
Thank you so much!
y
Hey again @Dieu M. Nguyen, to overcome this Snappy problem, let's try to set `spark.sql.parquet.compression.codec` to `none`.
d
@Yoni Augarten Ok! I haven't used Spark before so can you please let me know how to set it? Can I do this as part of the `spark-submit` command? E.g.,
spark-submit --class io.treeverse.gc.GarbageCollection \
  --packages org.apache.hadoop:hadoop-aws:3.3.4 \
  -c spark.sql.parquet.compression.codec=none \
  # Followed by the rest of the command in your GC documentation
y
Yes, exactly like that
d
Ok, I have run it but I'm still getting the exception: "Exception in thread "main" java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy".
y
@Dieu M. Nguyen Sorry for the late response. Let me try to explain what's happening here: while deleting the files, we also write a report of what we deleted. This report is saved in Parquet format, which by default uses Snappy for compression. So what we're experiencing here is definitely not part of the critical path of GC. The configuration you just added was supposed to tell Spark not to use Snappy for compression anymore, so I'm surprised we're still getting the same error. Can you share the full stacktrace?
d
Hi @Yoni Augarten, thank you for the explanation - that makes sense. I'm also surprised it's still complaining about Snappy. I double checked that I'm running the correct command. Attached is the stack trace. Thanks so much for your effort!
y
Ok so apparently I was wrong before. I forgot that the lakeFS metadata is also Snappy-compressed.
I think at this point I would recommend using a managed Spark service like AWS EMR to run the job. Is that an option for you? This way, you will have many dependencies installed OOTB.
d
Thanks @Yoni Augarten! I might be able to give AWS EMR a try. I see that I can get the Spark application with dependencies. Please let me know if there's anything specific to configure on EMR for this.
y
I suggest creating an EMR serverless app, and submitting to it the same Spark job you just did.
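Roughly, submitting the same job to an EMR Serverless application could look something like the sketch below - the application ID, role ARN, jar location and credentials are placeholders, and you may need to tweak it:
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <job-role-arn> \
  --job-driver '{"sparkSubmit": {
    "entryPoint": "s3://<your-bucket>/lakefs-spark-client-assembly-0.10.0.jar",
    "entryPointArguments": ["<repo-name>", "us-east-1"],
    "sparkSubmitParameters": "--class io.treeverse.gc.GarbageCollection --conf spark.hadoop.lakefs.api.url=http://<lakefs-host>:8000/api/v1 --conf spark.hadoop.lakefs.api.access_key=<lakefs-key> --conf spark.hadoop.lakefs.api.secret_key=<lakefs-secret> --conf spark.hadoop.fs.s3a.access.key=<aws-key> --conf spark.hadoop.fs.s3a.secret.key=<aws-secret>"
  }}'
Many dependencies come preinstalled with EMR's runtime, so you may not need the --packages flag there.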
d
Hi @Yoni Augarten, I'm waiting for my agency's permission to use EMR Serverless. In the meantime, I created an EMR cluster with Spark 3.4.1 and tried running the Spark job as a step. Is this Spark version compatible with your code? I'm getting an exception that might relate to versions (not entirely sure): Exception in thread "main" java.io.FileNotFoundException: File file:/mnt/var/lib/hadoop/steps/s-01012223TUT7XNB9PPPQ/\ does not exist. The full stack trace is below. Thank you for any insights!
23/09/08 20:59:38 WARN DependencyUtils: Local jar /mnt/var/lib/hadoop/steps/s-01012223TUT7XNB9PPPQ/\ does not exist, skipping.
23/09/08 20:59:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/09/08 20:59:39 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at ip-10-5-165-163.us-east-1.compute.internal/10.5.165.163:8032
23/09/08 20:59:40 INFO Configuration: resource-types.xml not found
23/09/08 20:59:40 INFO ResourceUtils: Unable to find 'resource-types.xml'.
23/09/08 20:59:40 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container)
23/09/08 20:59:40 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
23/09/08 20:59:40 INFO Client: Setting up container launch context for our AM
23/09/08 20:59:40 INFO Client: Setting up the launch environment for our AM container
23/09/08 20:59:40 INFO Client: Preparing resources for our AM container
23/09/08 20:59:40 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
23/09/08 20:59:47 INFO Client: Uploading resource file:/mnt/tmp/spark-48f22d00-2810-4f9a-9274-7a0af0bf4ac7/__spark_libs__4873711776678516465.zip -> hdfs://ip-10-5-165-163.us-east-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1694204830468_0001/__spark_libs__4873711776678516465.zip
23/09/08 20:59:48 INFO Client: Uploading resource file:/mnt/var/lib/hadoop/steps/s-01012223TUT7XNB9PPPQ/\ -> hdfs://ip-10-5-165-163.us-east-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1694204830468_0001/\
23/09/08 20:59:48 INFO Client: Deleted staging directory hdfs://ip-10-5-165-163.us-east-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1694204830468_0001
Exception in thread "main" java.io.FileNotFoundException: File file:/mnt/var/lib/hadoop/steps/s-01012223TUT7XNB9PPPQ/\ does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:832)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1153)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:822)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:472)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:390)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341)
	at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:461)
	at org.apache.spark.deploy.yarn.Client.distribute$1(Client.scala:557)
	at org.apache.spark.deploy.yarn.Client.$anonfun$prepareLocalResources$23(Client.scala:686)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:685)
	at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:984)
	at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:221)
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1322)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1770)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1066)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1158)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1167)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
23/09/08 20:59:48 INFO ShutdownHookManager: Shutdown hook called
23/09/08 20:59:48 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-48f22d00-2810-4f9a-9274-7a0af0bf4ac7
23/09/08 20:59:48 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-f40895bb-ef48-4521-9a9f-d8d3fb92e2d7
Command exiting with ret '1'
b
Hi @Dieu M. Nguyen, from the stack trace it looks like the Spark step didn't find the application jar. The exception comes from spark-submit getting a "file not found" for it.
d
Hi @Barak Amar, looks like you are correct. I tried both the link to the jar file in your bucket and also uploading your jar file to my S3 bucket. I used the `--jars` option to pass this custom jar path. However, I'm still getting the same error. Do you have any idea what might be incorrect in my Spark submission?
spark-submit --deploy-mode cluster \
--class io.treeverse.gc.GarbageCollection \ 
--packages org.apache.hadoop:hadoop-aws:3.3.4 \ 
-c spark.hadoop.lakefs.api.url=[url]/api/v1 \ 
-c spark.hadoop.lakefs.api.access_key= \ 
-c spark.hadoop.lakefs.api.secret_key= \ 
-c spark.hadoop.fs.s3a.access.key= \ 
-c spark.hadoop.fs.s3a.secret.key= \ 
--jars s3://my-bucket/lakefs-spark-client-assembly-0.10.0.jar \
small-test us-east-1
y
Hey @Dieu M. Nguyen, are you using the EMR UI to submit the job, or simply the command line?
d
@Yoni Augarten Yes, I am using the EMR UI.
y
Hey @Dieu M. Nguyen, I'm creating an EMR cluster and I will try to come up with the right configuration
d
Thank you very much, Yoni!
y
@Dieu M. Nguyen I also just experienced this error, and it was due to redundant `\` characters in my command. Try to remove all backslashes and line breaks from your command.
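For example, the whole step argument would end up as one single line, roughly like this - keys are placeholders, and note that in this sketch I'm passing the assembly jar as the application jar (right before the repository name and region) rather than via --jars, which is the pattern I'd expect from the GC docs:
spark-submit --deploy-mode cluster --class io.treeverse.gc.GarbageCollection --packages org.apache.hadoop:hadoop-aws:3.3.4 -c spark.hadoop.lakefs.api.url=http://<lakefs-host>:8000/api/v1 -c spark.hadoop.lakefs.api.access_key=<lakefs-access-key> -c spark.hadoop.lakefs.api.secret_key=<lakefs-secret-key> -c spark.hadoop.fs.s3a.access.key=<aws-access-key> -c spark.hadoop.fs.s3a.secret.key=<aws-secret-key> s3://my-bucket/lakefs-spark-client-assembly-0.10.0.jar small-test us-east-1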
d
@Yoni Augarten Thank you - I might be making progress now! It looks like it is finding the jar now. However, it is failing to connect to the URL. I'm running the lakeFS server on an EC2 instance (in the same VPC as this EMR cluster). I do have the `/api/v1` part in the URL in the command. Maybe I'm missing something simple here. May I ask where you are running lakeFS?
y
@Dieu M. Nguyen For this experiment I used a free trial cluster of lakeFS cloud (confusing, because if you use lakeFS cloud you don't need to run GC - it happens automatically. But that was just for the experiment)
It seems like you are using 127.0.0.1 as your lakeFS server? That's probably not the correct address if you're on EMR.
d
Ah ok, I know of the cloud version, which makes life easier, but we're trying to see if the open source version would fully work. So far, it's been great - just stuck on this final garbage collection step! Correct, my URL is set to `http://127.0.0.1:8000/api/v1`.
y
You need the URL to be accessible from the EMR cluster
If you're running lakeFS on an ec2 instance, it should be the IP of that instance
d
Ok, I have a private IP on my EC2 instance that's running lakeFS. This instance is in the same VPC as my EMR cluster, so I assume the private IP would be fine. Would the format for the URL be like this: `http://[IP of EC2 instance]/api/v1`?
y
Yes, exactly. Also, don't forget the port, and make sure the security group allows access to it.
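E.g., assuming lakeFS is listening on its default port 8000, something like:
spark.hadoop.lakefs.api.url=http://<EC2-private-IP>:8000/api/v1
and the EC2 instance's security group should allow inbound traffic on port 8000 from the EMR cluster.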
👍 1
d
@Yoni Augarten Great news! It finally succeeded 😄 😄 I tried it on a small repo so I know the objects that should be deleted have been. Thank you so much for your patient help over the previous days!
y
Great news indeed!
👍 1