Hi guys, I am running a POC of lakeFS, and I am wondering how lakeFS supports multi-site scenarios. Given Site A and Site B, connected by a limited line, we are considering two setups:
• lakeFS deployed on Site A, and two MinIO instances, one on each site, configured with active-active replication. Can users in Site B access data directly from MinIO on Site B (using pre-signed URLs)?
• lakeFS (each with its own Postgres) and MinIO deployed on both sites, with the MinIO instances configured for replication.
We can set DNS records pointing to the instances in the same site. The idea is to achieve data locality and seamless cross-site data replication.
Thanks in advance! Any thoughts are very much appreciated!
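To make the data-locality idea in the first setup concrete, here is a minimal sketch of mapping an object's `s3://bucket/key` address to the MinIO instance in the caller's own site. Everything in it is an assumption for illustration: the `SITE_ENDPOINTS` hostnames, the site names, and the `local_object_url` helper are hypothetical and not part of lakeFS or MinIO. It assumes both sites use the same bucket names (as with bucket replication), and it only builds a path-style URL; it does not sign the request or check that the object has already replicated to that site.

```python
from urllib.parse import quote

# Hypothetical per-site MinIO endpoints; hostnames are placeholders.
SITE_ENDPOINTS = {
    "site-a": "https://minio.site-a.example.internal",
    "site-b": "https://minio.site-b.example.internal",
}


def local_object_url(physical_address: str, site: str) -> str:
    """Map an s3://bucket/key address to a path-style URL on the given site's MinIO."""
    prefix = "s3://"
    if not physical_address.startswith(prefix):
        raise ValueError(f"expected s3://bucket/key, got {physical_address!r}")
    bucket, _, key = physical_address[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"expected s3://bucket/key, got {physical_address!r}")
    endpoint = SITE_ENDPOINTS[site]  # raises KeyError for an unknown site
    # Percent-encode the key but keep "/" so the object path stays readable.
    return f"{endpoint}/{bucket}/{quote(key, safe='/')}"
```

A client in Site B would call `local_object_url(address, "site-b")` and then still need a valid signature (for example a pre-signed URL issued against that endpoint) before MinIO would serve the object.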
02/07/2024, 7:47 AM
You are describing a very interesting setup for lakeFS. We are currently unaware of users who use active-active replication with lakeFS. Excuse me if I'm asking the wrong question but since you stated the line is limited, are you using sync or async replication?
02/07/2024, 8:01 AM
Hi, thanks for replying. My apologies for not making it clear enough. Our goal is to support a multi-site pipeline, and lakeFS is a candidate for the solution. So we need to understand the best practice for lakeFS in multi-site scenarios. Active-active replication is one possible setting. We expect bidirectional data flow: any modification on one side would (eventually) be reflected on the other side. I understand that there will be huge challenges regarding consistency, consensus, latency, etc. I'd like to know the capabilities and caveats of lakeFS in multi-site cases.
02/07/2024, 8:16 AM
I'm assuming, since you used "eventually", that this is indeed not synchronous replication. The issue with multi-site replication and lakeFS is not so much writing the data itself as managing the metadata store. If the purpose is to use lakeFS as part of the pipeline in both sites, then it makes sense to deploy lakeFS on both. However, eventual consistency will require that lakeFS's metastore be managed in a central location, and I don't know if that is a possibility for you given the limited connectivity.
I'll see if I can think of an alternative solution to replication that might satisfy your use case and get back to you during the day.
02/07/2024, 8:38 AM
Thank you so much! Yes, async is acceptable as long as we can handle any failure properly. I am asking here because I did not see anything covering multi-site in the lakeFS docs. As for the limited line: we already have a multi-site pipeline, and the bandwidth is enough for our current cross-site data flow. However, if we could have a unified view across sites by leveraging lakeFS, it would be excellent! Our main concern is fault tolerance: how can we handle a network partition, or one site being completely unavailable? The metadata should not be corrupted, or should at least be fixable once the network is back to normal.
02/07/2024, 9:30 AM
We are about to release a feature for cross-region mirroring in a couple of days, on the lakeFS paid products: Enterprise / Cloud.
DM @Iddo Avneri if this is something you wish to explore.
02/20/2024, 12:54 AM
Hi @einat.orr, hope you are doing well. Sorry to get back this late, we just had the Chinese New Year holiday. Is there any update regarding the cross-region mirroring feature? I found that two new versions have been released since then; however, neither of them seems related to mirroring. Thanks!
02/20/2024, 12:59 AM
It was released under preview for the paid offering.
I'll have someone ping you to provide you with more information.