So the question is maybe at a very high level... I understand that once you are in the data, it becomes very obvious whether a data lake will be easy and a database will be hard.
In our case, we have a distributed organization of researchers with varying levels of data sanitization and sundry formats, including proprietary and technical formats (such as pcaps).
It is obvious (to me) that we require a data lake that will allow us to enforce a gradient of structure and build machinery to enhance data from less structure to more structure...
...and it is further obvious (to me) that we require something like lakeFS to distribute control of datasets to our various user groups in a common culture (everyone understands git); there's a rough sketch of the branch layout I have in mind at the end of this message.
But is it possible for us to work hard and use MongoDB to hold all our data, and then try to sell external user groups on the workflow of structuring data and getting access to the DB?
• I think the answer is yes, but it will not be a success
What I am looking for is some support for the idea that we should be using lakeFS. Perhaps a good article that delineates when you need lakeFS rather than MongoDB.
• For me, right now, I've got to reject the notion we are going to use MongoDB as a catalog
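To make the "gradient of structure" idea concrete, here's a rough sketch of the kind of branch layout I mean. The repo name, tier names, and the assumption that a `main` branch already exists are all placeholders, and it just shells out to lakectl, so the flags should be checked against whatever lakectl version we'd actually run:

```python
import subprocess

# Hypothetical layout: one lakeFS branch per structure tier. Raw data lands on
# "raw-ingest"; enrichment jobs promote it toward "curated" via merges, the
# same way code moves through git branches. Repo and tier names are made up.
REPO = "lakefs://research-data"
TIERS = ["raw-ingest", "normalized", "curated"]

source = "main"
for tier in TIERS:
    # lakectl branch create <new branch URI> --source <source branch URI>
    subprocess.run(
        ["lakectl", "branch", "create", f"{REPO}/{tier}",
         "--source", f"{REPO}/{source}"],
        check=True,
    )
    source = tier
```

The point being: "how structured is this data?" becomes a question of which branch it has been merged into, and each user group only needs git-style permissions on the branches it owns.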
---
Getting back to your zone of interest (working well with MongoDB), what I was thinking was some kind of structure threshold where, once data meets MongoDB's minimum requirements, you could, say, spin up a MongoDB instance holding the data at a specific lakeFS commit...
...and later update the DB to a commit of your choice, and so on
(as one possible "friendly" way to work with Mongo)
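Very rough sketch of that direction, assuming the data on the commit is already newline-delimited JSON and reading it through lakeFS's S3 gateway with boto3; the endpoint, repo, ref, path, and db/collection names are all placeholders:

```python
import json

import boto3
from pymongo import MongoClient

LAKEFS_ENDPOINT = "https://lakefs.example.org"   # placeholder lakeFS S3 gateway URL
REPO = "research-data"                           # lakeFS repository (the "bucket" in S3 terms)
REF = "a1b2c3d4"                                 # branch name or commit ID to load
PREFIX = "normalized/captures/"                  # hypothetical path holding NDJSON files

# The lakeFS S3 gateway addresses objects as s3://<repo>/<ref>/<path>
s3 = boto3.client(
    "s3",
    endpoint_url=LAKEFS_ENDPOINT,
    aws_access_key_id="<lakefs-access-key>",
    aws_secret_access_key="<lakefs-secret-key>",
)

mongo = MongoClient("mongodb://localhost:27017")
coll = mongo["research"]["captures"]             # placeholder db/collection names
coll.drop()                                      # fresh copy of the data at this commit

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=REPO, Prefix=f"{REF}/{PREFIX}"):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=REPO, Key=obj["Key"])["Body"].read()
        docs = [json.loads(line) for line in body.decode().splitlines() if line.strip()]
        if docs:
            coll.insert_many(docs)
```

Loading a different version would then just be changing REF and re-running, which is the "update the DB to a commit of your choice" part.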
Then, as the MongoDB data changes, you could write the changes back to lakeFS as a commit, effectively snapshotting MongoDB and giving us version tracking of a sort (rough sketch at the end of this message).
• This kind of integration with a DB would be useful, because I can see some scenarios where it would be nice to spin up a specialized DB with lakeFS data...
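And a sketch of the reverse direction, snapshotting the current Mongo state back into lakeFS as a commit. Again, the repo, branch, collection names, and dump path are made up, and it leans on lakectl for the upload and commit, so treat the exact flags as something to verify:

```python
import json
import subprocess

from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
coll = mongo["research"]["captures"]             # placeholder db/collection names

# Dump the collection to newline-delimited JSON on local disk.
dump_path = "/tmp/captures.ndjson"
with open(dump_path, "w") as f:
    for doc in coll.find({}):
        doc["_id"] = str(doc["_id"])             # ObjectId isn't JSON-serializable
        f.write(json.dumps(doc, default=str) + "\n")

# Put the dump onto a dedicated lakeFS branch and commit it, so the Mongo state
# becomes a versioned snapshot. The "mongo-snapshots" branch is hypothetical.
subprocess.run(
    ["lakectl", "fs", "upload",
     "lakefs://research-data/mongo-snapshots/captures.ndjson",
     "--source", dump_path],
    check=True,
)
subprocess.run(
    ["lakectl", "commit", "lakefs://research-data/mongo-snapshots",
     "-m", "snapshot of MongoDB collection 'captures'"],
    check=True,
)
```

That gives you the version tracking "of a sort": every snapshot is an ordinary lakeFS commit that you could later reload with the earlier sketch, or diff against another snapshot.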