
Aris Kalgreadis

04/18/2023, 11:07 AM
Hey all, just started using lakeFS on our GKE cluster in GCP. I am really enjoying it so far, but my only concern is garbage collection. I see in the roadmap that it is scheduled for Q1 2024. Until then, is there a workaround we could implement on our side for garbage collection on GCP?
Or Tzabary

04/18/2023, 11:11 AM
hey @Aris Kalgreadis, happy to hear that you’re enjoying lakeFS 🙂 let me check that for you
Aris Kalgreadis

04/18/2023, 11:39 AM
Thank you @Or Tzabary. In a previous response someone suggested I run the GC in mark-only mode, but then I still get this:
23/04/18 13:35:09 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: gs://<MY_BUCKET>/_lakefs/retention/gc/commits/run_id=73c64576-6970-48e1-b013-83ba69f98fe6/commits.csv.
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
Or Tzabary

04/18/2023, 11:42 AM
as far as I know, the “mark” phase (which is part of the GC component) is not ready yet for GCP, but it should be released pretty soon. I’m verifying with the team to see when you should expect it to be released. note that this is only the first phase: once the files are marked, you’ll be able to use a script or something similar to iterate over the marked files and delete them yourself. the full GC operation for GCP is expected to be released in Q1, as you mentioned.
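A minimal sketch of that second step (delete the marked files yourself), assuming the mark run produces a plain list of marked object addresses relative to the repository's storage namespace. The exact report path and format depend on the lakeFS Spark client version, so treat both, and the example values below, as assumptions:

```python
# Sketch: turn marked lakeFS object addresses into full GCS URIs so they
# can be handed to a deletion job (e.g. piped to `gsutil rm -I`).
# The address-list format is an assumption; check your GC run's output.

def marked_addresses_to_uris(storage_namespace: str, addresses: list[str]) -> list[str]:
    """Join each relative marked address onto the repo's storage namespace."""
    base = storage_namespace.rstrip("/")
    return [f"{base}/{addr.lstrip('/')}" for addr in addresses if addr.strip()]


if __name__ == "__main__":
    # Hypothetical example values, not taken from this thread.
    namespace = "gs://my-bucket/my-repo"
    marked = ["data/gc/abc123", "data/gc/def456"]
    for uri in marked_addresses_to_uris(namespace, marked):
        print(uri)  # feed these URIs to your deletion script
```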
Aris Kalgreadis

04/18/2023, 11:43 AM
Thanks for your response! Yes, as long as we have the mark file, we have our own Spark scripts that can delete files from the filesystem, so that should be no problem.
Or Tzabary

04/18/2023, 12:47 PM
checked internally: we don’t have a concrete date yet for when the mark functionality will be released, but it will probably be much sooner than full GC support for GCS. we’ll try to get a concrete time estimate by end of day tomorrow and update you.
Aris Kalgreadis

04/18/2023, 1:21 PM
Actually, I managed to run the mark phase successfully on GCP with the following command:
spark-submit --class io.treeverse.clients.GarbageCollector \
  --packages org.apache.hadoop:hadoop-aws:3.3.2,com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.11,com.google.guava:guava:31.1-jre \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  --conf spark.hadoop.fs.gs.project.id=<PROJECT_ID> \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=<CREDENTIALS_FILES> \
  --conf spark.hadoop.lakefs.api.url= \
  --conf spark.hadoop.lakefs.api.access_key= \
  --conf spark.hadoop.lakefs.api.secret_key= \
  --conf spark.hadoop.lakefs.gc.do_sweep=false \
  --conf spark.hadoop.lakefs.gc.mark_id=mark_id \
  http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-312-hadoop3/0.7.0/lakefs-spark-client-312-hadoop3-assembly-0.7.0.jar \
  repo us-east-1
Or Tzabary

04/18/2023, 1:23 PM
:jumping-lakefs: happy to hear that! let me verify for you that it’s fully supported and working so you won’t hit any data issues
Yoni Augarten

04/18/2023, 1:26 PM
Awesome!
Or Tzabary

04/18/2023, 1:27 PM
@Aris Kalgreadis you should be OK and good to go 🙂
Aris Kalgreadis

04/18/2023, 1:34 PM
Thanks for the support! Hopefully we are a bit closer to adopting lakeFS! 😄
Yoni Augarten

04/18/2023, 2:39 PM
Thank you @Aris Kalgreadis for bringing this up. I've opened a PR to document what you did. https://github.com/treeverse/lakeFS/pull/5695
Aris Kalgreadis

04/18/2023, 3:24 PM
Awesome! @Yoni Augarten, glad I could help