# help
u
Hi! I'm planning on deploying lakeFS on our GCP storage so that we can version-control our data in GBQ. How would I go about doing this, and will I have to modify any existing code, or make changes every time I switch branches? Thanks!
u
Hi @Stéphane Burwash, here you can find information on how to set up lakeFS on GCP: https://docs.lakefs.io/deploy/gcp.html. lakeFS uses Google Cloud Storage as the underlying storage and provides UI, API, and S3-compatible interfaces; GBQ and your code will query one of these interfaces. How does the code work with the data today? Using the GBQ SDK?
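(For reference, here's a minimal sketch of talking to lakeFS through its S3-compatible gateway from Python. The endpoint URL, repository name `my-repo`, branch names, and credentials are placeholders for illustration, not values from this thread.)

```python
# Read/write objects through lakeFS's S3-compatible gateway using boto3.
# lakeFS maps s3://<repository>/<branch>/<path>, so "switching branches"
# is just a different key prefix - no other code changes are needed.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # your lakeFS server (assumption)
    aws_access_key_id="AKIA...",                # lakeFS access key, not an AWS key
    aws_secret_access_key="...",
)

# Write to the "main" branch of repository "my-repo" (hypothetical names).
s3.put_object(Bucket="my-repo", Key="main/raw/events.csv", Body=b"id,value\n1,42\n")

# Read the same object back.
obj = s3.get_object(Bucket="my-repo", Key="main/raw/events.csv")
print(obj["Body"].read().decode())
```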
u
(Sorry in advance if I don't answer all the questions, I'm new to the DE game.) Thanks @Barak Amar! Currently our main interaction with the data is through GBQ. It's connected to an ELT pipeline using Stitch or Meltano and loaded (with a loader) into our project. From there we mostly use the integrated GBQ SQL to query the data (via the UI). Does this make sense?
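(For context, a query run in the GBQ UI looks like this when issued from Python with the official google-cloud-bigquery client; the project, dataset, and table names below are hypothetical.)

```python
# Run a BigQuery SQL query from Python, equivalent to running it in the UI.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project id

query = """
    SELECT user_id, COUNT(*) AS events
    FROM `my-gcp-project.analytics.events`
    GROUP BY user_id
    LIMIT 10
"""

# query() submits the job; result() waits for completion and returns rows.
for row in client.query(query).result():
    print(row.user_id, row.events)
```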
u
Hope I got it right (and I may have it wrong): GBQ is a warehouse, loaded with Stitch, that your application queries. lakeFS is a layer above the object store that can be used by data processing/query/transformation tools; it provides versioning at the storage level, not inside the GBQ warehouse itself. There is an option to query external data from GBQ where the data source is object storage - I haven't explored this one yet - and in that case you would need to load the data into lakeFS for BQ to run federated queries over it. If there is a use case to manage data extracted from GBQ, lakeFS can be used to store the results, but you will probably need a different way (not GBQ) to read that data back from lakeFS.
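(A rough sketch of that last idea - storing query results extracted from GBQ in a lakeFS branch via the S3 gateway. Again, the endpoint, credentials, repo `my-repo`, branch `experiment`, and table names are all assumptions for illustration.)

```python
# Export a BigQuery result set and write it to a lakeFS branch as CSV.
import csv
import io

import boto3
from google.cloud import bigquery

bq = bigquery.Client(project="my-gcp-project")  # placeholder project id
rows = bq.query(
    "SELECT user_id, events FROM `my-gcp-project.analytics.daily`"
).result()

# Serialize the result set to CSV in memory.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([field.name for field in rows.schema])  # header row
for row in rows:
    writer.writerow(row.values())

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # your lakeFS server (assumption)
    aws_access_key_id="AKIA...",                # lakeFS credentials
    aws_secret_access_key="...",
)

# Writing under the "experiment/" prefix targets the "experiment" branch;
# the same code writes to any branch by changing the prefix.
s3.put_object(
    Bucket="my-repo",
    Key="experiment/exports/daily.csv",
    Body=buf.getvalue().encode("utf-8"),
)
```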
u
Ok awesome, thank you so much! I'll look into what's best for our architecture and get back to you guys soon 😉