So the question is maybe at a very high level... I understand that once you are in the data, it becomes very obvious whether a data lake will be easy and a database will be hard.
In our case, we have a distributed organization of researchers with varying levels of data sanitization and sundry formats, including proprietary and technical formats (such as pcaps).
It is obvious (to me) that we require a data lake that will allow us to enforce a gradient of structure and build machinery to enhance data from less structure to more structure...
...and it is further obvious (to me) that we require something like lakeFS to distribute control of datasets to our various user groups in a common culture (everyone understands git); there's a rough sketch of the branch layout I have in mind at the end of this message.
But is it possible for us to work hard and use MongoDB to hold all our data, and then try to sell external user groups on the workflow of structuring data and getting access to the DB?
• I think the answer is yes, but it will not be a success
What I am looking for is some support for the idea that we should be using lakeFS. Perhaps a good article that delineates when you need lakeFS rather than MongoDB.
• For me, right now, I've got to reject the notion we are going to use MongoDB as a catalog
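To make the "gradient of structure" idea concrete, here's a rough sketch of the kind of branch layout I mean. The repo name, tier names, and the assumption that a `main` branch already exists are all placeholders, and it just shells out to lakectl, so the flags should be checked against whatever lakectl version we'd actually run:

```python
import subprocess

# Hypothetical layout: one lakeFS branch per structure tier. Raw data lands on
# "raw-ingest"; enrichment jobs promote it toward "curated" via merges, the
# same way code moves through git branches. Repo and tier names are made up.
REPO = "lakefs://research-data"
TIERS = ["raw-ingest", "normalized", "curated"]

source = "main"
for tier in TIERS:
    # lakectl branch create <new branch URI> --source <source branch URI>
    subprocess.run(
        ["lakectl", "branch", "create", f"{REPO}/{tier}",
         "--source", f"{REPO}/{source}"],
        check=True,
    )
    source = tier
```

The point being: "how structured is this data?" becomes a question of which branch it has been merged into, and each user group only needs git-style permissions on the branches it owns.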
---
Getting back to your zone of interest (working well with MongoDB), what I was thinking was some kind of structure threshold where, once data meets MongoDB's minimum requirements, you could, say, spin up a MongoDB instance holding the data at a specific lakeFS commit...
...and later update the DB to a commit of your choice, and so on
(as one possible "friendly" way to work with Mongo)
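Very rough sketch of that direction, assuming the data on the commit is already newline-delimited JSON and reading it through lakeFS's S3 gateway with boto3; the endpoint, repo, ref, path, and db/collection names are all placeholders:

```python
import json

import boto3
from pymongo import MongoClient

LAKEFS_ENDPOINT = "https://lakefs.example.org"   # placeholder lakeFS S3 gateway URL
REPO = "research-data"                           # lakeFS repository (the "bucket" in S3 terms)
REF = "a1b2c3d4"                                 # branch name or commit ID to load
PREFIX = "normalized/captures/"                  # hypothetical path holding NDJSON files

# The lakeFS S3 gateway addresses objects as s3://<repo>/<ref>/<path>
s3 = boto3.client(
    "s3",
    endpoint_url=LAKEFS_ENDPOINT,
    aws_access_key_id="<lakefs-access-key>",
    aws_secret_access_key="<lakefs-secret-key>",
)

mongo = MongoClient("mongodb://localhost:27017")
coll = mongo["research"]["captures"]             # placeholder db/collection names
coll.drop()                                      # fresh copy of the data at this commit

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=REPO, Prefix=f"{REF}/{PREFIX}"):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=REPO, Key=obj["Key"])["Body"].read()
        docs = [json.loads(line) for line in body.decode().splitlines() if line.strip()]
        if docs:
            coll.insert_many(docs)
```

Loading a different version would then just be changing REF and re-running, which is the "update the DB to a commit of your choice" part.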
Then, as the MongoDB data changes, you could write the changes back to lakeFS as a commit, effectively snapshotting MongoDB and giving us version tracking of a sort (rough sketch at the end of this message).
• This kind of integration with a DB would be useful, because I can see some scenarios where it would be nice to spin up a specialized DB with lakeFS data...
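And a sketch of the reverse direction, snapshotting the current Mongo state back into lakeFS as a commit. Again, the repo, branch, collection names, and dump path are made up, and it leans on lakectl for the upload and commit, so treat the exact flags as something to verify:

```python
import json
import subprocess

from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
coll = mongo["research"]["captures"]             # placeholder db/collection names

# Dump the collection to newline-delimited JSON on local disk.
dump_path = "/tmp/captures.ndjson"
with open(dump_path, "w") as f:
    for doc in coll.find({}):
        doc["_id"] = str(doc["_id"])             # ObjectId isn't JSON-serializable
        f.write(json.dumps(doc, default=str) + "\n")

# Put the dump onto a dedicated lakeFS branch and commit it, so the Mongo state
# becomes a versioned snapshot. The "mongo-snapshots" branch is hypothetical.
subprocess.run(
    ["lakectl", "fs", "upload",
     "lakefs://research-data/mongo-snapshots/captures.ndjson",
     "--source", dump_path],
    check=True,
)
subprocess.run(
    ["lakectl", "commit", "lakefs://research-data/mongo-snapshots",
     "-m", "snapshot of MongoDB collection 'captures'"],
    check=True,
)
```

That gives you the version tracking "of a sort": every snapshot is an ordinary lakeFS commit that you could later reload with the earlier sketch, or diff against another snapshot.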