# help
c
A few questions: I want every commit on the main branch of every repository to be backed up in Glacier storage. Can lakeFS handle this? And if lakeFS lets me access metadata tables, does it create a reliable per-file checksum I can use for some kind of deduplication across repositories?
a
lakeFS Mirroring might address your 1st question: https://docs.lakefs.io/howto/mirroring.html
c
Hi, great! I may have asked this in the past, but is it possible to export all file metadata in lakeFS, for all versions, as an Iceberg or Delta Lake table? Is there a notebook available that demonstrates this? I would like a Delta Lake table containing all the files in lakeFS across all versions, so I can join my own reference tables that add metadata to those files, or compare checksums of files across repositories. Is that possible?
a
We don’t have a demo notebook for this, but you can access lakeFS metadata using the lakeFS Spark metadata client and save it to a Delta table.
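For illustration, a minimal PySpark sketch of that idea: it lists object metadata at a ref through the lakeFS Python SDK (rather than the Spark metadata client, just to keep the sketch self-contained) and appends it to a Delta table. The repository, branch, and Delta path are placeholders, not anything from the thread.
```python
# Hedged sketch: list object metadata for one lakeFS ref via the Python SDK
# (lakefs package) and append it to a Delta table with PySpark.
import lakefs
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured on this session

repo_name, ref_name = "my-repo", "main"           # placeholders
ref = lakefs.repository(repo_name).ref(ref_name)  # credentials come from lakectl config / env vars
commit_id = ref.get_commit().id                   # pin the listing to a specific commit

rows = [
    Row(
        repository=repo_name,
        commit_id=commit_id,
        path=obj.path,
        physical_address=obj.physical_address,
        checksum=obj.checksum,        # etag-style checksum reported by lakeFS
        size_bytes=obj.size_bytes,
        mtime=obj.mtime,
    )
    for obj in ref.objects()          # paginated listing of every object at this ref
]

(spark.createDataFrame(rows)
      .write.format("delta")
      .mode("append")
      .partitionBy("repository")
      .save("s3://my-metadata-bucket/lakefs_object_index"))  # placeholder Delta location
```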
c
Cool! I can't find a schema for what the metadata client offers. Is that documented anywhere?
Or is it:
```
key | address | etag | last_modified | size
```
Not sure what an etag is or what the key entails.
a
AFAIK, etag and key are provided by object storage e.g. for S3: https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html
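For example, the same key/etag pair is visible over the S3 protocol. A hedged boto3 sketch that calls head_object through the lakeFS S3 gateway; the endpoint, credentials, repository, and object path are all placeholders.
```python
# Hedged sketch: inspect the etag/key the S3 protocol exposes, here through the
# lakeFS S3 gateway (bucket = repository, key = "<ref>/<object path>").
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",   # lakeFS server acting as an S3 gateway (placeholder)
    aws_access_key_id="AKIA...",                 # lakeFS access key (placeholder)
    aws_secret_access_key="...",
)

head = s3.head_object(Bucket="my-repo", Key="main/datasets/file.parquet")
print(head["ETag"], head["ContentLength"], head["LastModified"])
```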
c
I can see ChecksumAlgorithm is a feature of S3.
Amit, any chance of a schema dump? It's not clear to me whether the object metadata is returned by the SparkMetadataClient.
a
The checksum is part of the object metadata, as you can see in the object info in the lakeFS UI.
c
So I get all of that through the metadata client?
And I'd be able to sync updates outbound to Delta based on it?
a
You can get object metadata via the lakeFS API, e.g. in Python: https://pydocs-lakefs.lakefs.io/lakefs.models.html#lakefs.models.ObjectInfo
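A minimal sketch of that call with the lakefs Python package: stat a single object and read its ObjectInfo fields, including the checksum. Repository, branch, and object path are placeholders.
```python
# Hedged sketch of the API call linked above: stat one object and read its
# ObjectInfo fields via the lakeFS Python SDK.
import lakefs

obj = lakefs.repository("my-repo").ref("main").object("datasets/file.parquet")
info = obj.stat()  # returns lakefs.models.ObjectInfo

print(info.path, info.checksum, info.size_bytes, info.physical_address, info.mtime)
```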
c
So I have about 20 PB of objects and I'd be doing this at that kind of scale, which means that for every new commit to every repository I'd basically need to capture the metadata and update the Delta tables. Does that sound tenable?
This would enable me to do deduplication scanning across all the files in the lake as we update commits and versions.
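A rough sketch of that incremental flow, assuming the commit IDs arrive from something like a post-commit hook: diff the new commit against the previously indexed one with the lakeFS Python SDK and append metadata for only the added or changed objects. All names, commit IDs, and paths are placeholders, and this uses the Python SDK's diff and stat calls rather than the Spark metadata client.
```python
# Hedged sketch: after each new commit, index only the objects that changed
# since the previously indexed commit.
import lakefs
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured

repo = lakefs.repository("my-repo")
prev_commit, new_commit = "abc123...", "def456..."   # placeholder commit IDs, e.g. from a post-commit hook
new_ref = repo.ref(new_commit)

changed_paths = [
    change.path
    for change in repo.ref(prev_commit).diff(new_ref)   # changes from prev -> new
    if change.type in ("added", "changed")
]

rows = []
for path in changed_paths:
    info = new_ref.object(path).stat()
    rows.append(Row(
        repository="my-repo",
        commit_id=new_commit,
        path=info.path,
        checksum=info.checksum,
        size_bytes=info.size_bytes,
    ))

if rows:
    (spark.createDataFrame(rows)
          .write.format("delta")
          .mode("append")
          .save("s3://my-metadata-bucket/lakefs_object_index"))  # placeholder Delta location
```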
a
lakeFS already handles deduplication. What are you trying to achieve?
c
Does it do deduplication across repositories?
Or just deduplication across files on the same path
a
It does deduplication across files on the same path.
c
Yes, so we need deduplication across files not on the same path, or at least a detection mechanism.
@Amit Kesarwani Can I do an everything-by-everything comparison using lakeFS's API or in Spark?
a
I think the lakeFS API will fulfill your requirements.
c
So is that a yes or a no? If I have 50 billion files across distinct repositories in lakeFS and I want to compare all of the checksums, is that possible?
It's really unclear what query patterns are supported.
a
Yes, the lakeFS API can provide the data you need, but lakeFS will not compare all of the checksums. You would use another system/program for that, and I don't know whether it can scale to 50B files or not.
c
Spark can do it with correct partitioning
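For instance, once the per-object metadata sits in a Delta table like the one sketched earlier, the duplicate scan is a straightforward aggregation on the checksum column; the table location below is a placeholder.
```python
# Hedged sketch: find checksums that appear under more than one repository or
# path in the metadata index table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured

index = spark.read.format("delta").load("s3://my-metadata-bucket/lakefs_object_index")

duplicates = (
    index
    .groupBy("checksum")
    .agg(
        F.countDistinct("repository").alias("repos"),
        F.countDistinct("path").alias("paths"),
        F.collect_set(F.concat_ws("/", "repository", "path")).alias("locations"),
    )
    .where("repos > 1 OR paths > 1")   # same content stored under more than one repo/path
)

duplicates.show(truncate=False)
```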
a
If you would like to discuss this further then you can book a Zoom call with me: https://calendly.com/amit-kesarwani/speak-with-lakefs-solutions-architect I am based in San Francisco.