# help
y
Hey, great community. Nice to meet you all. 🙂 I finally arranged a time to quickly go over the lakeFS quickstart to see how it works. First of all, super useful. (And the quickstart is well done as well - kudos to the team who wrote it and the rest of the documentation.) 😃 I'm trying to compare it with other tools, and for now I can't find the real edge (for tabular data) over simpler solutions like Iceberg. Don't get me wrong, I do see the differences, but for most use cases both provide Git-like capabilities. Whereas Iceberg is easier to adopt since there is no deployment - AFAIK, only the metadata files sitting alongside the data itself and an added jar for Spark/Flink. On the other hand, lakeFS gives us more of a platform-like solution, with a UI and a CLI, and with little or no dependency on Spark. Anyway, maybe I am comparing them wrong. I hope you can take a few minutes to help me understand the use cases. I'm asking from a pure desire to check it out for real production purposes. 🙂 (BTW, for non-Databricks users, I feel that open-source Delta Lake can lose ground if they don't implement better features like Git-like capabilities and more, but that's another conversation.)
lakefs 2
a
Hi @Yerachmiel Feltzman, welcome to the lake :axolotl:, and thanks for the kind words! Looking around, I see that we've never really had a head-to-head comparison with Iceberg, so I'll give my thoughts; others will probably have more to say about this. A major difference between lakeFS and what we call Open Table Formats (OTFs) is one of scope. Because of the limitations of existing object stores, OTFs tend to become vertical solutions that need to solve multiple problems:
• Table metadata storage (including evolution)
• Table overwrites (really hard to do consistently)
• Table history
In contrast, lakeFS does one thing: it's a horizontal tool that lives alongside the other tools in your data stack. It provides Git-like branch history, including merges, and also cross-collection consistency. Without cross-collection consistency, even updating side tables can become challenging; with it, many things become natural. Which is better? As you say, it depends on your use case and even on individual preferences. Unsurprisingly, given my choice of where to work, I like horizontal solutions: two systems that cooperate to cover a matrix of possibilities will be simpler than a single system that needs to implement some subset of that matrix. Hope this illuminates some of our (actually my) thinking. Please let us know if you have any other questions, from the very specific to the very general.
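To make the cross-collection idea a bit more concrete, here's a rough sketch of reading two tables from the same lakeFS ref with Spark through the S3-compatible gateway. The endpoint URL, repository name, ref, and paths are all made up for illustration; the point is just that both reads see the same snapshot of the lake:

```python
from pyspark.sql import SparkSession

# Hypothetical values -- adjust to your own setup.
LAKEFS_ENDPOINT = "https://lakefs.example.com"   # lakeFS S3-compatible gateway (assumed URL)
REPO = "analytics"                               # lakeFS repository, exposed as the "bucket"
REF = "3f1a9c0"                                  # a branch name or commit ID; pinning a commit
                                                 # gives a consistent view across all tables

spark = (
    SparkSession.builder.appName("lakefs-cross-collection-read")
    # Point the S3A connector at lakeFS instead of S3 itself.
    .config("spark.hadoop.fs.s3a.endpoint", LAKEFS_ENDPOINT)
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs-access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs-secret-key>")
    .getOrCreate()
)

# Both tables are read at the same ref, so clicks and users are mutually consistent.
clicks = spark.read.parquet(f"s3a://{REPO}/{REF}/tables/clicks/")
users = spark.read.parquet(f"s3a://{REPO}/{REF}/tables/users/")

# e.g. at this ref, every click should join to a known user
joined = clicks.join(users, on="user_id", how="left")
```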
i
Hi @Yerachmiel Feltzman, joining a bit late to the party. I agree with Ariel that understanding the use case is key, because we are comparing apples and oranges (if not cucumbers 😉). While there are specific technicalities that differ (like commit management or long-lived branches), I think the key is that lakeFS is complementary to Open Table Formats and serves a different purpose. Iceberg (as an OTF example) helps balance data consistency and mutability by implementing a versioned table format that supports both mutable data and strong consistency guarantees. lakeFS, on the other hand, helps with:
1. Isolation. For example, developing ETLs against production data without risking production.
2. CI/CD for data. For example, atomically promoting (merging) data from stage to prod, including automatic data quality checks via hooks.
3. Entire-lake rollback. Roll back all your tables (or unstructured data) to a historical commit.
4. ML reproducibility at scale. Most data science and machine learning workflows are not linear. lakeFS lets you easily go back and forth between different components (code + data + model versions) - see the sketch below.
Worth mentioning: many users already use lakeFS together with open table formats today. Moreover, by implementing OTF support, we (lakeFS) can actually provide an even better comparison between commits or branches - one that is format-aware. One last thing - we are working on the Iceberg integration with lakeFS. Sounds like you might have good input to share; we are discussing it on the #iceberg-integration channel. HTH!
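To make point 4 a bit more concrete: because every lakeFS commit is an immutable, addressable snapshot, you can pin a training run to a commit ID and re-read exactly the same data later. Here's a rough sketch using boto3 against the lakeFS S3-compatible gateway; the endpoint, repository, commit ID, and paths are made up for illustration:

```python
import boto3

# Hypothetical values -- replace with your own lakeFS endpoint and credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",   # lakeFS S3-compatible gateway
    aws_access_key_id="<lakefs-access-key>",
    aws_secret_access_key="<lakefs-secret-key>",
)

REPO = "ml-datasets"          # lakeFS repository, exposed as an S3 "bucket"
COMMIT_ID = "8d2f4b1c"        # commit recorded alongside the model/experiment

# Keys take the form "<ref>/<path>", so reading under a commit ID always returns
# the exact bytes that existed at training time, no matter what changed since.
obj = s3.get_object(Bucket=REPO, Key=f"{COMMIT_ID}/features/train.parquet")
train_bytes = obj["Body"].read()

# Record the commit ID with the model artifact so the run is reproducible.
print(f"trained on {REPO}@{COMMIT_ID}, {len(train_bytes)} bytes of features")
```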
sunglasses lakefs 1
a
Hi again, @Yerachmiel Feltzman! Thanks for pointing out how important the library vs. platform distinction is to users! Whichever you pick... I'd be really happy to ask you some questions once you have some real-world experience with versioning on Iceberg or lakeFS. Thanks!
y
Hey, thank you all. Just passing by to say I was away for a while, so I'll read what you wrote and comment. 🙂
šŸ‘ 1
And @Iddo Avneri, thanks for pointing out the #iceberg-integration channel. I'll check it out.
a
I think you and I might be reading the same situation in opposite ways. Allow me to outline the way I read it, with a few technical comments. I do understand that your way is different, perhaps as a result of thinking of lakeFS as a component at a different level in the stack.
• "_Mixing up dimensions_": I don't see that at all. In my mind, every commit holds multiple tables for the different dimensions. So if you have a large table of clicks and a small table for the user dimension, I would expect you to have two Iceberg tables or Parquet "files", or even one Iceberg table for the clicks and one small JSON object for the users. How much you denormalize (or don't) is up to you. Here lakeFS is unopinionated!
• "_Cross-collection consistency_" is key: to my mind it means that every commit holds a consistent view of clicks and users. You can use this to enforce application-level consistency. For instance, you can require that all commits, or all commits on certain branches, guarantee that every user in the clicks table is found in the users table. lakeFS has more of an opinion here: it encourages you to define consistency on your long-lived branches.
• There is no object duplication: if you don't change a partition of your clicks table, it is not duplicated across versions. Here lakeFS is opinionated: the way to create a dev environment is to branch out, and the way to manage a process of multiple consecutive changes is to branch out, commit after each change, and merge back. In the example, each ingest branch is updated separately; a process reconciles them and eventually merges a consistent view to the trunk.
As a smaller example, say you wished to perform a consistent change across clicks and users: you need to ingest new clicks that can reference new users, and any new users must be generated or fetched from an external system. I would suggest (see the sketch below):
• Branch out of production to a work branch and work there:
  ◦ Ingest new clicks, commit.
  ◦ Find all users in the clicks table that are not in users.
  ◦ Fetch the missing users.
  ◦ Write the new users table.
  ◦ Merge back to production.
• Look at the history of production: it is always consistent - all users in clicks appear in users. And there are two commits: the first consistent before the new clicks and users, the second consistent after them.
@Oz Katz may be able to give more examples.
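Here is a minimal sketch of what that branch-commit-merge flow could look like if you drive the lakeFS REST API directly with `requests`. The endpoint paths, repository and branch names, and credentials are my assumptions for illustration (in practice you'd probably use `lakectl` or one of the SDKs), and the ingest/reconcile steps are only placeholders:

```python
import requests

# Assumed lakeFS server and credentials -- adjust to your deployment.
LAKEFS = "https://lakefs.example.com/api/v1"
AUTH = ("<lakefs-access-key>", "<lakefs-secret-key>")
REPO = "analytics"

def create_branch(name, source):
    # Assumed endpoint: POST /repositories/{repo}/branches
    r = requests.post(f"{LAKEFS}/repositories/{REPO}/branches",
                      json={"name": name, "source": source}, auth=AUTH)
    r.raise_for_status()

def commit(branch, message):
    # Assumed endpoint: POST /repositories/{repo}/branches/{branch}/commits
    r = requests.post(f"{LAKEFS}/repositories/{REPO}/branches/{branch}/commits",
                      json={"message": message}, auth=AUTH)
    r.raise_for_status()

def merge(source, destination):
    # Assumed endpoint: POST /repositories/{repo}/refs/{source}/merge/{destination}
    r = requests.post(f"{LAKEFS}/repositories/{REPO}/refs/{source}/merge/{destination}",
                      json={}, auth=AUTH)
    r.raise_for_status()

# 1. Branch out of production to a work branch.
create_branch("ingest-new-clicks", source="production")

# 2. Ingest new clicks into the work branch (e.g. a Spark job writing to
#    s3a://analytics/ingest-new-clicks/tables/clicks/), then commit.
commit("ingest-new-clicks", "ingest new clicks")

# 3. Reconcile: find users referenced by the new clicks that are missing from
#    the users table, fetch them from the external system, write them, commit.
commit("ingest-new-clicks", "add users referenced by new clicks")

# 4. Merge back: production moves atomically from one consistent state
#    (old clicks + old users) to the next (new clicks + new users).
merge("ingest-new-clicks", "production")
```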
y
Hmm... what you wrote did shed even more light. Now I understand how lakeFS (and you 😄) suggest solving the problem, which I find a really interesting approach. 🙂 I'll be waiting for @Oz Katz's additional examples to make sure I understood the one you gave correctly.