# help
Blake
Hello, I am looking to store large binary files, like ROS bags, in lakeFS, but would like to keep storage efficient through de-duplication as we make changes to the data over time. Is lakeFS a good tool for this use case, or not really for data of this nature?
Iddo Avneri
Hi Blake, welcome to the lake! lakeFS is format agnostic, so it can support any format. Having said that, if you are also looking to benefit from data deduplication, that will be hard to achieve if the entire file changes regularly.
👍 1
Ariel Shaqed (Scolnicov)
Hi Blake, welcome to the lake! For questions like these I like to think of lakeFS as an object store that supports Git-like semantics. The most important question is whether bags are stored as single or multiple objects ("files"). In your case I expect you to get deduplication of the same object across all revisions and branches on which it appears. This also means no deduplication of merely similar objects. I am not familiar with the format of ROS bags; if a short summary is available I will gladly read it. After that I might be able to be more precise. Thanks!
👍 1
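To make the Git-like semantics described above concrete, here is a minimal sketch, assuming the high-level lakefs Python SDK is installed and configured; the repository, branch, and object names are illustrative, not taken from the thread.

```python
import lakefs  # high-level lakeFS Python SDK; assumed installed and configured

# Illustrative repository: each bag lives as a whole object under main.
repo = lakefs.repository("ros-data")

# Creating a branch is a metadata-only operation: the new branch points at
# the exact same objects as main, so no bag data is copied or duplicated.
experiment = repo.branch("experiment-1").create(source_reference="main")

# Only objects that are modified or replaced on the branch take up new
# storage; every unchanged bag keeps sharing storage across branches and
# commits (this is the object-level deduplication described above).
```

This is why the question of whether a bag maps to one object or to many objects matters so much for how well deduplication works.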
Blake
Thanks for your replies @Iddo Avneri and @Ariel Shaqed (Scolnicov)! 🙂 My ideal solution would be efficient delta encoding / de-duplication to optimize storage and performance. Here is a summary of the ROS bag format 2.0, which is what I use. In short, a ROS bag file consists of a sequence of records, preceded by an initial line that indicates the format version number:
```
#ROSBAG V2.0
<record 1><record 2>....<record N>
```
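For reference, a minimal Python sketch of checking that version line at the start of a bag file (the file path is illustrative; everything after the header line is a sequence of binary records):

```python
# Read and validate the version line at the start of a ROS bag 2.0 file.
def read_bag_version(path: str) -> str:
    with open(path, "rb") as f:
        header = f.readline().strip()          # e.g. b"#ROSBAG V2.0"
    if not header.startswith(b"#ROSBAG"):
        raise ValueError(f"{path} does not look like a ROS bag")
    return header.decode("ascii")

print(read_bag_version("run_2023_10_01.bag"))  # illustrative path
```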
Ariel Shaqed (Scolnicov)
Thanks! I had a quick look at the bag format. It seems optimized for writing. tl;dr: from what I see, lakeFS will not be able to dedupe here.

I'd like to zoom out a bit. lakeFS works at the object level, where an "object" will probably map to a single bag file. At its core it never looks inside what it stores, which means you will not get deduplication inside a single bag file.

I'd like to compare this to the situation with Parquet files. lakeFS cannot dedupe two Parquet files that share a common segment. Neither can most (all?) object stores, and doing so is computationally quite expensive. So when people work with Parquet they create multiple Parquet "files" in the same "directory"; of course it's an object store, so more precisely they create multiple Parquet objects whose names share a common prefix. They then use formats such as Apache Iceberg and Delta Lake to bind many Parquet objects together into a single logical "file". That gives an easy path to deduplication: typically an individual Parquet object doesn't change. lakeFS handles this case incredibly well.

I am a bit unsure where to go from here: how much data do you have? Could you split it up between multiple objects? Perhaps naively, I expect terabyte-scale objects to be unusable on an object store. On the other hand, if you do split bags into multiple objects, each holding say 10-100 MiB of data, then perhaps the format doesn't matter and lakeFS will just do a great job of providing branches. lakeFS itself does not duplicate any data in order to create a branch or a new revision.
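One way to act on that suggestion is sketched below in Python with boto3 against lakeFS's S3-compatible gateway; the endpoint, credentials, repository, branch, and chunk size are illustrative assumptions, not a definitive implementation. The idea is to split a large bag into fixed-size chunk objects under a common prefix, so that lakeFS can deduplicate the chunks that do not change between revisions.

```python
import os
import boto3

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB per chunk object, within the 10-100 MiB ballpark

# lakeFS exposes an S3-compatible endpoint; all values below are illustrative.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="AKIA...",
    aws_secret_access_key="...",
)

def upload_bag_in_chunks(bag_path: str, repo: str, branch: str) -> None:
    """Split a large bag into chunk objects under a common prefix.

    With the S3 gateway, the repository is the bucket and the object key
    starts with the branch name. Chunks that stay byte-identical between
    commits are deduplicated by lakeFS across revisions and branches.
    """
    prefix = f"{branch}/bags/{os.path.basename(bag_path)}"
    with open(bag_path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            s3.put_object(Bucket=repo, Key=f"{prefix}/part-{index:05d}", Body=chunk)
            index += 1

upload_bag_in_chunks("run_2023_10_01.bag", repo="ros-data", branch="main")
```

Note that this naive fixed-size split only helps when changes are appended or localized, since an edit in the middle of a bag shifts the byte offsets of every later chunk. Splitting at record time instead (for example, rosbag's record splitting options) would keep chunk boundaries stable across recordings.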