# help
c
A few questions: I want every commit on the main branch of every repository to be backed up in Glacier storage. Can lakeFS handle this? And if lakeFS lets me access metadata tables, does it create a reliable per-file checksum I can use for some kind of deduplication across repositories?
a
lakeFS Mirroring might address your 1st question: https://docs.lakefs.io/howto/mirroring.html
c
Hi, great! I may have asked this in the past, but is it possible to export all file metadata in lakeFS, for all versions, as an Iceberg or Delta Lake table? Is there a notebook available that demonstrates this? I would like a Delta Lake table containing all the files in lakeFS across all versions, so I can join my own reference tables that add metadata to those files, or compare checksums of files across repositories. Is that possible?
a
We don’t have a demo notebook for this, but you can access lakeFS metadata using the lakeFS Spark metadata client and save it to a Delta table.
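For illustration, a minimal PySpark sketch of that idea: it lists object metadata at a ref through the lakeFS Python SDK (rather than the Spark metadata client, just to keep the sketch self-contained) and appends it to a Delta table. The repository, branch, and Delta path are placeholders, not anything from the thread.
```python
# Hedged sketch: list object metadata for one lakeFS ref via the Python SDK
# (lakefs package) and append it to a Delta table with PySpark.
import lakefs
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured on this session

repo_name, ref_name = "my-repo", "main"           # placeholders
ref = lakefs.repository(repo_name).ref(ref_name)  # credentials come from lakectl config / env vars
commit_id = ref.get_commit().id                   # pin the listing to a specific commit

rows = [
    Row(
        repository=repo_name,
        commit_id=commit_id,
        path=obj.path,
        physical_address=obj.physical_address,
        checksum=obj.checksum,        # etag-style checksum reported by lakeFS
        size_bytes=obj.size_bytes,
        mtime=obj.mtime,
    )
    for obj in ref.objects()          # paginated listing of every object at this ref
]

(spark.createDataFrame(rows)
      .write.format("delta")
      .mode("append")
      .partitionBy("repository")
      .save("s3://my-metadata-bucket/lakefs_object_index"))  # placeholder Delta location
```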
c
Cool! I can't find a schema for what the metadata client offers. Is that documented anywhere?
Or is it:
```
key | address | etag | last_modified | size
```
Not sure what an etag is or what the key entails.
a
AFAIK, etag and key are provided by object storage e.g. for S3: https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html
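For example, the same key/etag pair is visible over the S3 protocol. A hedged boto3 sketch that calls head_object through the lakeFS S3 gateway; the endpoint, credentials, repository, and object path are all placeholders.
```python
# Hedged sketch: inspect the etag/key the S3 protocol exposes, here through the
# lakeFS S3 gateway (bucket = repository, key = "<ref>/<object path>").
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",   # lakeFS server acting as an S3 gateway (placeholder)
    aws_access_key_id="AKIA...",                 # lakeFS access key (placeholder)
    aws_secret_access_key="...",
)

head = s3.head_object(Bucket="my-repo", Key="main/datasets/file.parquet")
print(head["ETag"], head["ContentLength"], head["LastModified"])
```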
c
I can see ChecksumAlgorithm is a feature of S3.
Amit, any chance of a schema dump? It's not clear to me whether the object metadata is returned by the SparkMetadataClient.
a
The checksum is part of the object metadata, as you can see in the object info in the lakeFS UI.
c
So I get all of that through the metadata client?
And I'd be able to sync updates outbound to Delta based on it?
a
You can get object metadata via the lakeFS API, e.g. in Python: https://pydocs-lakefs.lakefs.io/lakefs.models.html#lakefs.models.ObjectInfo
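A minimal sketch of that call with the lakefs Python package: stat a single object and read its ObjectInfo fields, including the checksum. Repository, branch, and object path are placeholders.
```python
# Hedged sketch of the API call linked above: stat one object and read its
# ObjectInfo fields via the lakeFS Python SDK.
import lakefs

obj = lakefs.repository("my-repo").ref("main").object("datasets/file.parquet")
info = obj.stat()  # returns lakefs.models.ObjectInfo

print(info.path, info.checksum, info.size_bytes, info.physical_address, info.mtime)
```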
c
So I have about 20 PB of objects and I'd be doing this at that kind of scale, which means that for every new commit to every repository I'd basically need to capture the metadata and update the Delta tables. Does that sound tenable?
This would enable me to do deduplication scanning across all the files in the lake as we update commits and versions.
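A rough sketch of that incremental flow, assuming the commit IDs arrive from something like a post-commit hook: diff the new commit against the previously indexed one with the lakeFS Python SDK and append metadata for only the added or changed objects. All names, commit IDs, and paths are placeholders, and this uses the Python SDK's diff and stat calls rather than the Spark metadata client.
```python
# Hedged sketch: after each new commit, index only the objects that changed
# since the previously indexed commit.
import lakefs
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured

repo = lakefs.repository("my-repo")
prev_commit, new_commit = "abc123...", "def456..."   # placeholder commit IDs, e.g. from a post-commit hook
new_ref = repo.ref(new_commit)

changed_paths = [
    change.path
    for change in repo.ref(prev_commit).diff(new_ref)   # changes from prev -> new
    if change.type in ("added", "changed")
]

rows = []
for path in changed_paths:
    info = new_ref.object(path).stat()
    rows.append(Row(
        repository="my-repo",
        commit_id=new_commit,
        path=info.path,
        checksum=info.checksum,
        size_bytes=info.size_bytes,
    ))

if rows:
    (spark.createDataFrame(rows)
          .write.format("delta")
          .mode("append")
          .save("s3://my-metadata-bucket/lakefs_object_index"))  # placeholder Delta location
```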
a
lakeFS already handles deduplication. What are you trying to achieve?
c
Does it do deduplication across repositories?
Or just deduplication across files on the same path
a
It does deduplication across files on the same path.
c
Yes, so we need deduplication across files not on the same path, or at least a detection mechanism.
@Amit Kesarwani Can I do an everything-by-everything comparison using lakeFS's API or in Spark?
a
I think the lakeFS API will fulfill your requirements.
c
So is that a yes or a no? If I have 50 billion files across distinct repositories in lakeFS and I want to compare all of the checksums, is that possible?
It's really unclear what query patterns are supported.
a
Yes, the lakeFS API can provide the data you need, but lakeFS will not compare all of the checksums. You would use another system/program for that, and I don't know whether it can scale to 50B files or not.
c
Spark can do it with correct partitioning
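For instance, once the per-object metadata sits in a Delta table like the one sketched earlier, the duplicate scan is a straightforward aggregation on the checksum column; the table location below is a placeholder.
```python
# Hedged sketch: find checksums that appear under more than one repository or
# path in the metadata index table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured

index = spark.read.format("delta").load("s3://my-metadata-bucket/lakefs_object_index")

duplicates = (
    index
    .groupBy("checksum")
    .agg(
        F.countDistinct("repository").alias("repos"),
        F.countDistinct("path").alias("paths"),
        F.collect_set(F.concat_ws("/", "repository", "path")).alias("locations"),
    )
    .where("repos > 1 OR paths > 1")   # same content stored under more than one repo/path
)

duplicates.show(truncate=False)
```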
a
If you would like to discuss this further then you can book a Zoom call with me: https://calendly.com/amit-kesarwani/speak-with-lakefs-solutions-architect I am based in San Francisco.