# help
d
Hi, lakeFS! I'd like to try out the garbage collection feature with your help. I set the GC policy to 1 retention day, both as the default and for all my branches. A few days ago, I wrote many versions of a dataset to a branch, and I now see high storage use in my S3 bucket, as intended in my test. My questions:
1. Do I understand correctly that S3 data from the latest version/commit will be retained, and data from all older versions will be deleted per the GC policy?
2. I ran the `spark-submit` command as directed by the documentation. As far as I can tell, the command ran and finished without errors. But in S3, I don't see the list of removed objects in `_lakefs/retention/gc/unified/<RUN_ID>/deleted/`, and my storage didn't go down, so I assume objects from old versions haven't been deleted. Do you have any ideas about this? Note: I only set the GC policy after already writing all my versions - is this why?
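For reference, I believe my GC rules look roughly like this (branch name anonymized) - I'm assuming this matches the rules format from your docs:
{
  "default_retention_days": 1,
  "branches": [
    { "branch_id": "my-branch", "retention_days": 1 }
  ]
}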
i
Hi @Dieu M. Nguyen. Data that is in a “head” of any branch will not be deleted. Could it be that you have branches pointing to these files?
+ please make sure that versioning is not enabled on your object store (otherwise, storage will not go down)
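For example, something like this should show whether versioning was ever turned on (bucket name is a placeholder):
aws s3api get-bucket-versioning --bucket <your-bucket>
An empty response means versioning has never been enabled.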
d
Hi @Iddo Avneri - versioning has never been enabled on my bucket. Hm, I only wrote data to one branch and haven't merged to main (I only have these two branches).
y
Hey @Dieu M. Nguyen, can you recursively list all the files under `_lakefs/retention/gc/unified` and let me know what they are?
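For example, something like this should do it (adjust the bucket and prefix to your repository's storage namespace):
aws s3 ls s3://<your-bucket>/<storage-namespace>/_lakefs/retention/gc/unified/ --recursive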
d
Hi @Yoni Augarten, under `_lakefs/retention/gc/` I actually don't see `unified`, only `rules`.
i
Is the head of the branch pointing to any of the data?
Or all the data?
y
If this path is empty, it means something went wrong very early in the GC process. Even if no objects were to be deleted, this path would still have new objects in it. Do you see any interesting logs printed by the Spark job?
d
@Iddo Avneri Can you please let me know how I can check? @Yoni Augarten Ah ok. The logs are INFO and I can't identify anything wrong, but there is one Exception that I can see:
23/09/05 21:23:51 INFO SharedState: Warehouse path is 'file:/home/ssm-user/spark-warehouse'.
Exception in thread "main" io.lakefs.clients.api.ApiException: Content type "text/html; charset=utf-8" is not supported for type: class io.lakefs.clients.api.model.StorageConfig
y
This error may indicate that the lakeFS server URI given to the GC job is malformed. Can you share the `spark.hadoop.lakefs.api.url` you're passing?
d
I have it as `http://127.0.0.1:8000`, which is my endpoint URL. Would that be correct?
y
You would have to add `/api/v1` to that 🙂
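For example, with the endpoint above, the setting would look something like this (assuming you pass it the same way as your other lakeFS settings):
spark.hadoop.lakefs.api.url=http://127.0.0.1:8000/api/v1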
d
Ok let me try that!
y
Keep us posted!
👍 1
d
Ok, that exception seems to have gone away and I get a new exception which maybe is good or bad 😅:
23/09/05 21:35:34 WARN FileSystem: Failed to initialize fileystem s3a://dieumy-test-update-zarr-store-lakefs/data: java.lang.NumberFormatException: For input string: "64M"
and probably the same error:
Exception in thread "main" java.lang.NumberFormatException: For input string: "64M"
y
Looks like we're making progress. This one I really hate, and it has to do with Spark/Hadoop compatibility issues. Do you know which Spark version you're running?
d
Sounds like you know this one. I'm on Spark 3.4.1.
y
It's the issue described here. I'm checking what our options are.
👍 1
I understand you are running this locally on your machine?
d
I'm running this on an AWS EC2 instance in us-east-1.
y
Oh, gotcha. And are you adding the `--packages org.apache.hadoop:hadoop-aws:2.7.7` flag suggested by our docs?
I suspect this suggestion may be outdated.
d
Yes, I have that!
y
Maybe we can try bumping this version to something more recent, say 3.2.4
I wouldn't get too optimistic but let's see where that brings us
d
Alright, let me try that!
🙏🏻 1
Well, the `java.lang.NumberFormatException` went away but we are blessed with a new one!
23/09/05 21:50:07 WARN FileSystem: Failed to initialize fileystem s3a://dieumy-test-update-zarr-store-lakefs/data: java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
Something about my S3 key and secret?
y
I'm wondering if we went backwards here. Complex Spark programs can be challenging because you need to figure out which version of each dependency to use. One moment please
👍 1
Can you access a Spark shell (`spark-shell`)?
If so, could you try to find the Hadoop version using the following command in the shell:
org.apache.hadoop.util.VersionInfo.getVersion()
(or maybe you know the Hadoop version already?)
d
I got into a Spark shell and this is what I got: `res0: String = 3.3.4`.
y
Ok so first of all we should change the `hadoop-aws` version in the command to the same version as Hadoop.
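I.e., since your Hadoop version is 3.3.4, the flag would become something like:
--packages org.apache.hadoop:hadoop-aws:3.3.4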
d
Ok, I changed it to 3.3.4 but I'm getting yet another error:
23/09/05 22:00:33 ERROR FileFormatWriter: Aborting job db8e4543-d572-4931-86c1-91590bcd8098.
java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
y
Now I think we're making progress. Can you peek in `_lakefs/retention/gc/unified` and see if something is already there?
Sorry for the long process, by the way. GC is an external process that adds some complexity, but we will come up with a solution.
d
Yes! Now I see `_lakefs/retention/gc/unified/glbls1cs3mqm2j9qvc60/deleted/`. It's empty inside, but at least it's making these folders. Thanks very much for your patient help! GC is one of the features that would be amazing for our use case with lakeFS, so we are hoping it will work 🙂 - the way we write data, we end up with terabytes in our S3 bucket and only need the latest version at the end.
😍 1
y
Are there other files there? (not under deleted maybe?)
I think what happens here is that we were able to calculate which objects to delete, but for some reason couldn't write the list to S3.
d
Under gc/ I see these.
Under unified/ I just have `glbls1cs3mqm2j9qvc60`.
y
And under this one only deleted?
Can you search for the words "Report summary" in the logs?
d
Yes, only deleted/ under glbls1cs3mqm2j9qvc60/.
y
I think we've reached a reporting error. It may be the case that objects have already been deleted by the job. Do you have an object that you know should have been deleted?
If so, trying to download it from the lakeFS UI should result in an error (it would still appear in the list of objects when looking at older commits where it existed).
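For example, if you have `lakectl` configured, reading such an object through an older commit where it existed should now fail (repo, commit ID and path here are placeholders):
lakectl fs cat lakefs://<repo>/<old-commit-id>/<path-to-object>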
d
I'm trying to delete all previous versions of the data except the latest version. I can't tell if something has been deleted or not and the bucket size hasn't gone down. The `spark-submit` command ran super quick so I assume it didn't begin to delete anything.
y
That makes sense. Can you share the Spark logs here? I'll try to pin down where we are stuck.
d
Ok one second!
I'm sharing it as a txt file since it's large. Now, looking at the full logs, I do see that it finished some tasks, if that indicates it deleted some objects.
And under retention/gc/commits/ I see folders for each run and these commits.csv files. I'm attaching one as an example.
I gotta sign off for some errands and will resume tomorrow. Please let me know if you have any insights!
y
Thanks for the info! I'll keep digging and will update here tomorrow
d
Thank you so much!
y
Hey again @Dieu M. Nguyen, to overcome this Snappy problem, let's try to set `spark.sql.parquet.compression.codec` to `none`.
d
@Yoni Augarten Ok! I haven't used Spark before so can you please let me know how to set it? Can I do this as part of the `spark-submit` command? E.g.,
spark-submit --class io.treeverse.gc.GarbageCollection \
  --packages org.apache.hadoop:hadoop-aws:3.3.4 \
  -c spark.sql.parquet.compression.codec=none \
  # Followed by the rest of the command in your GC documentation
y
Yes, exactly like that
d
Ok, I have run it but I'm still getting the exception: "Exception in thread "main" java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy".
y
@Dieu M. Nguyen Sorry for the late response. Let me try to explain what's happening here: while deleting the files, we also write a report of what we deleted. This report is saved in Parquet format, which by default uses Snappy for compression. So what we're experiencing here is definitely not part of the critical path of GC. The configuration you just added was supposed to tell Spark not to use Snappy for compression anymore, so I'm surprised we're still getting the same error. Can you share the full stacktrace?
d
Hi @Yoni Augarten, thank you for the explanation - that makes sense. I'm also surprised it's still complaining about Snappy. I double checked that I'm running the correct command. Attached is the stack trace. Thanks so much for your effort!
y
Ok so apparently I was wrong before. I forgot that the lakeFS metadata is also Snappy-compressed.
I think at this point I would recommend using a managed Spark service like AWS EMR to run the job. Is that an option for you? This way, you will have many dependencies installed OOTB.
d
Thanks @Yoni Augarten! I might be able to give AWS EMR a try. I see that I can get the Spark application with dependencies. Please let me know if there's anything specific to configure on EMR for this.
y
I suggest creating an EMR serverless app, and submitting to it the same Spark job you just did.
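Roughly, submitting the same job to an EMR Serverless application could look something like the sketch below - the application ID, role ARN, jar location and credentials are placeholders, and you may need to tweak it:
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <job-role-arn> \
  --job-driver '{"sparkSubmit": {
    "entryPoint": "s3://<your-bucket>/lakefs-spark-client-assembly-0.10.0.jar",
    "entryPointArguments": ["<repo-name>", "us-east-1"],
    "sparkSubmitParameters": "--class io.treeverse.gc.GarbageCollection --conf spark.hadoop.lakefs.api.url=http://<lakefs-host>:8000/api/v1 --conf spark.hadoop.lakefs.api.access_key=<lakefs-key> --conf spark.hadoop.lakefs.api.secret_key=<lakefs-secret> --conf spark.hadoop.fs.s3a.access.key=<aws-key> --conf spark.hadoop.fs.s3a.secret.key=<aws-secret>"
  }}'
Many dependencies come preinstalled with EMR's runtime, so you may not need the --packages flag there.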
d
Hi @Yoni Augarten, I'm waiting for my agency's permission to use EMR Serverless. In the meantime, I created an EMR cluster with Spark 3.4.1 and tried running the Spark job as a step. Is this Spark version compatible with your code? I'm getting an exception that might relate to versions (not entirely sure): Exception in thread "main" java.io.FileNotFoundException: File file:/mnt/var/lib/hadoop/steps/s-01012223TUT7XNB9PPPQ/\ does not exist. The full stack trace is below. Thank you for any insights!
23/09/08 20:59:38 WARN DependencyUtils: Local jar /mnt/var/lib/hadoop/steps/s-01012223TUT7XNB9PPPQ/\ does not exist, skipping.
23/09/08 20:59:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/09/08 20:59:39 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at ip-10-5-165-163.us-east-1.compute.internal/10.5.165.163:8032
23/09/08 20:59:40 INFO Configuration: resource-types.xml not found
23/09/08 20:59:40 INFO ResourceUtils: Unable to find 'resource-types.xml'.
23/09/08 20:59:40 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container)
23/09/08 20:59:40 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
23/09/08 20:59:40 INFO Client: Setting up container launch context for our AM
23/09/08 20:59:40 INFO Client: Setting up the launch environment for our AM container
23/09/08 20:59:40 INFO Client: Preparing resources for our AM container
23/09/08 20:59:40 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
23/09/08 20:59:47 INFO Client: Uploading resource file:/mnt/tmp/spark-48f22d00-2810-4f9a-9274-7a0af0bf4ac7/__spark_libs__4873711776678516465.zip -> hdfs://ip-10-5-165-163.us-east-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1694204830468_0001/__spark_libs__4873711776678516465.zip
23/09/08 20:59:48 INFO Client: Uploading resource file:/mnt/var/lib/hadoop/steps/s-01012223TUT7XNB9PPPQ/\ -> hdfs://ip-10-5-165-163.us-east-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1694204830468_0001/\
23/09/08 20:59:48 INFO Client: Deleted staging directory hdfs://ip-10-5-165-163.us-east-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1694204830468_0001
Exception in thread "main" java.io.FileNotFoundException: File file:/mnt/var/lib/hadoop/steps/s-01012223TUT7XNB9PPPQ/\ does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:832)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1153)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:822)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:472)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:390)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341)
	at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:461)
	at org.apache.spark.deploy.yarn.Client.distribute$1(Client.scala:557)
	at org.apache.spark.deploy.yarn.Client.$anonfun$prepareLocalResources$23(Client.scala:686)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:685)
	at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:984)
	at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:221)
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1322)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1770)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1066)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1158)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1167)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
23/09/08 20:59:48 INFO ShutdownHookManager: Shutdown hook called
23/09/08 20:59:48 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-48f22d00-2810-4f9a-9274-7a0af0bf4ac7
23/09/08 20:59:48 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-f40895bb-ef48-4521-9a9f-d8d3fb92e2d7
Command exiting with ret '1'
b
Hi @Dieu M. Nguyen, from the stack trace it looks like the Spark step didn't find the application jar. The exception comes from spark-submit getting a "file not found" for it.
d
Hi @Barak Amar, looks like you are correct. I tried both the link to the jar file in your bucket and also uploading your jar file to my S3 bucket. I used the `--jars` option to pass this custom jar path. However, I'm still getting the same error. Do you have any idea what might be incorrect in my Spark submission?
spark-submit --deploy-mode cluster \
--class io.treeverse.gc.GarbageCollection \ 
--packages org.apache.hadoop:hadoop-aws:3.3.4 \ 
-c spark.hadoop.lakefs.api.url=[url]/api/v1 \ 
-c spark.hadoop.lakefs.api.access_key= \ 
-c spark.hadoop.lakefs.api.secret_key= \ 
-c spark.hadoop.fs.s3a.access.key= \ 
-c spark.hadoop.fs.s3a.secret.key= \ 
--jars s3://my-bucket/lakefs-spark-client-assembly-0.10.0.jar \
small-test us-east-1
y
Hey @Dieu M. Nguyen, are you using the EMR UI to submit the job, or simply the command line?
d
@Yoni Augarten Yes, I am using the EMR UI.
y
Hey @Dieu M. Nguyen, I'm creating an EMR cluster and I will try to come up with the right configuration
d
Thank you very much, Yoni!
y
@Dieu M. Nguyen I also just experienced this error, and it was due to redundant `\` characters in my command. Try to remove all backslashes and line breaks from your command.
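For example, the whole step argument would end up as one single line, roughly like this - keys are placeholders, and note that in this sketch I'm passing the assembly jar as the application jar (right before the repository name and region) rather than via --jars, which is the pattern I'd expect from the GC docs:
spark-submit --deploy-mode cluster --class io.treeverse.gc.GarbageCollection --packages org.apache.hadoop:hadoop-aws:3.3.4 -c spark.hadoop.lakefs.api.url=http://<lakefs-host>:8000/api/v1 -c spark.hadoop.lakefs.api.access_key=<lakefs-access-key> -c spark.hadoop.lakefs.api.secret_key=<lakefs-secret-key> -c spark.hadoop.fs.s3a.access.key=<aws-access-key> -c spark.hadoop.fs.s3a.secret.key=<aws-secret-key> s3://my-bucket/lakefs-spark-client-assembly-0.10.0.jar small-test us-east-1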
d
@Yoni Augarten Thank you - I might be making progress now! It looks like it is finding the jar now. However, it is failing to connect to the URL. I'm running the lakeFS server on an EC2 instance (in the same VPC as this EMR cluster). I do have the `/api/v1` part in the URL in the command. Maybe I'm missing something simple here. May I ask where you are running lakeFS?
y
@Dieu M. Nguyen For this experiment I used a free trial cluster of lakeFS cloud (confusing, because if you use lakeFS cloud you don't need to run GC - it happens automatically. But that was just for the experiment)
It seems like you are using 127.0.0.1 as your lakeFS server? That's probably not the correct address if you're on EMR.
d
Ah ok, I know of the cloud version, which makes life easier, but we're trying to see if the open source version would fully work. So far, it's been great - just stuck on this final garbage collection step! Correct, my URL is set to `http://127.0.0.1:8000/api/v1`.
y
You need the URL to be accessible from the EMR cluster
If you're running lakeFS on an ec2 instance, it should be the IP of that instance
d
Ok, I have a private IP on my EC2 instance that's running lakeFS. This instance is in the same VPC as my EMR cluster, so I assume the private IP would be fine. Would the format for the URL be like this: `http://[IP of EC2 instance]/api/v1`?
y
Yes, exactly. Also, don't forget the port, and make sure the security group allows access to it.
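E.g., assuming lakeFS is listening on its default port 8000, something like:
spark.hadoop.lakefs.api.url=http://<EC2-private-IP>:8000/api/v1
and the EC2 instance's security group should allow inbound traffic on port 8000 from the EMR cluster.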
👍 1
d
@Yoni Augarten Great news! It finally succeeded 😄 😄 I tried it on a small repo so I know the objects that should be deleted have been. Thank you so much for your patient help over the previous days!
y
Great news indeed!
👍 1