Hello, Are there any plans to integrate lakeFS wi...
# dev
c
Hello, are there any plans to integrate lakeFS with Elasticsearch?
o
Hey @Corey Zimmet! What kind of integration are you looking for?
c
The ability to version documents in Elasticsearch directly, without a different backing store.
I am relatively new to lakeFS, but I have been asked to version documents in Elasticsearch. We will use the S3 gateway for now, but it would be great if we could search the metadata fields that are added in lakeFS, since we don't want to clutter the actual documents.
o
If lakeFS was able to provide search/filtering based on object metadata tags (while still remaining on top of object storage) would that also be helpful?
There are no concrete plans to provide git-like versioning for data managed in Elasticsearch. We are currently focusing on object stores (and systems that can natively use an object store as their operational storage backend).
c
A full search through a repo would definitely be useful if performant. Looping through logs only gets you so far. We are looking for things like everything a user touched or misc. labels associated with documents in the metadata. All documents that have a field that says "expired", for example, even if it was deleted in a commit.
o
Hmm, full search on the contents of the object is probably out of scope, but filtering on tags/metadata fields attached to objects might be in scope. I wonder if that could be useful?
c
Sorry, I meant that "expired" field as metadata. Is full search on metadata out of scope?
o
What do you mean by full search? What I believe is in scope is filtering ("give me all objects that have this metadata field: label:foo"). Doing any sort of full-text search or fuzzy matching on these metadata fields, scoring, etc. will probably not be.
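To make the distinction concrete, here is a minimal sketch of what exact-match filtering on a metadata field could look like. It runs over a plain in-memory listing and does not use any lakeFS API; `ObjectEntry`, `filter_by_metadata`, and the sample paths are hypothetical names made up for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple


class ObjectEntry(NamedTuple):
    """Hypothetical stand-in for one object in a listing: a path plus user metadata."""
    path: str
    metadata: Dict[str, str]


def filter_by_metadata(objects: Iterable[ObjectEntry], key: str, value: str) -> List[str]:
    """Return paths whose metadata has `key` set to exactly `value`.

    This is plain equality filtering: no full-text search, fuzzy matching, or scoring.
    """
    return [obj.path for obj in objects if obj.metadata.get(key) == value]


# Made-up sample listing, for illustration only.
objects = [
    ObjectEntry("docs/a.json", {"label": "foo"}),
    ObjectEntry("docs/b.json", {"label": "foobar"}),  # not an exact match
    ObjectEntry("docs/c.json", {"owner": "corey"}),   # field missing entirely
    ObjectEntry("docs/d.json", {"label": "foo", "owner": "corey"}),
]

matches = filter_by_metadata(objects, "label", "foo")
print(matches)
```

Note that "foobar" is not returned: anything beyond exact equality would fall on the full-text/fuzzy side of the line.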
c
Both would be ideal but if filtering is performant over an entire repo that may be enough to get by with our users. Since the ultimate intent of versioning data is more for data analysts than developers, I am anticipating that reports and searching will become more and more important over time.
o
Thanks! That's helpful. Can you share a little bit about the types of objects you're looking to store? Data formats, existing tools used to query/update them, etc.
c
I can't get into specifics, but a basic example would be using sample Yelp JSON data, where there are documents such as restaurants, recreation centers, etc., and a different type of document, such as reviews, that ties back to them. If the restaurant closes, it would be labeled "closed". At present we do not delete it or the reviews, but if we use lakeFS we could, especially for years-old data. The search would allow users to find these "closed" documents and potentially restore them. Our search capability is in its infancy, just Kibana and Elasticsearch.
o
Got it, thanks! We do plan on adding metadata filtering to lakeFS Cloud later this year. If that's interesting to you, I would be happy to schedule a quick call to make sure it meets your requirements 🙂
c
I'll discuss with others on my team and get back to you
Most of my team is in a different time zone, though. What time zone are you in? I would definitely want a broader range of knowledge than just me. Thanks!
o
I'm in GMT+2, but pretty flexible. I'll DM you the details