Thanks! I had a quick look at the bag format. It seems optimized for writing. tl;dr: From what I see, lakeFS will not be able to dedupe here.
I'd like to zoom out a bit. lakeFS works at the object level, where an "object" will probably map to a single bag-format file. At its core it never looks inside what it stores, which means you will not get deduplication inside a single bag file. Compare this to the situation with Parquet files: lakeFS cannot dedupe 2 Parquet files that share a common segment. Neither can most (all?) object stores, and doing so is computationally expensive. So when people work with Parquet they create multiple Parquet "files" in the same "directory". Of course it's an object store, so really they create multiple Parquet objects whose names share a common prefix. Then they use formats such as Apache Iceberg and Delta Lake that bind many Parquet objects together into a single logical "file". That gives an easy path to deduplication: individual Parquet objects typically never change once written, so unchanged objects are shared across versions. lakeFS handles this case incredibly well.
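To make the whole-object dedup idea concrete, here is a toy sketch in plain Python (no lakeFS involved, all names made up): two "commits" reference their objects by content hash, so objects that don't change between commits are stored only once.

```python
import hashlib

def store(objects, backing_store):
    """Put each object's content into a content-addressed store and
    return the list of content hashes (the "commit")."""
    commit = []
    for content in objects:
        digest = hashlib.sha256(content).hexdigest()
        backing_store[digest] = content  # no-op if already present
        commit.append(digest)
    return commit

backing = {}
# First "commit": three immutable data objects (think small Parquet parts).
v1 = store([b"part-0", b"part-1", b"part-2"], backing)
# Second "commit": only one object changed; the other two dedupe for free.
v2 = store([b"part-0", b"part-1 (rewritten)", b"part-2"], backing)

# Six logical objects across the two commits, four distinct contents stored.
print(len(v1) + len(v2), len(backing))  # -> 6 4
```

This only works because whole objects are the unit of comparison; two bag files that differ by one record in the middle hash differently and share nothing.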
I am a bit unsure where to go from here: how much data do you have, and could you split it up across multiple objects? Perhaps naively, I expect terabyte-scale objects to be unusable on an object store. OTOH, if you do split up bags into multiple objects, each holding say 10-100 MiB of data, then perhaps the format doesn't matter and lakeFS will just do a great job of providing branches. lakeFS itself never duplicates data in order to create a branch or a new revision.
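If splitting turns out to be feasible, the simplest version is fixed-size chunking. A rough sketch, with the chunk size and object naming scheme entirely hypothetical:

```python
import io

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB per object; tune to taste

def split_into_objects(stream, chunk_size=CHUNK_SIZE):
    """Yield (key, bytes) pairs for uploading one large stream as
    many smaller objects sharing a common prefix."""
    index = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Hypothetical naming scheme: data/part-00000, data/part-00001, ...
        yield f"data/part-{index:05d}", chunk
        index += 1

# Demo with a tiny chunk size: 150 bytes split at 64-byte boundaries.
parts = list(split_into_objects(io.BytesIO(b"x" * 150), chunk_size=64))
print([k for k, _ in parts], [len(c) for _, c in parts])
# -> ['data/part-00000', 'data/part-00001', 'data/part-00002'] [64, 64, 22]
```

One caveat worth noting: naive byte-offset splitting means an edit early in the stream shifts every later chunk boundary, defeating dedup downstream. Splitting on bag record boundaries (or keeping written chunks immutable and only appending new ones) is friendlier to whole-object dedup.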