# help
u
Hey! I’m looking to adopt a data versioning tool, and I was trying out LakeFS to see if it meets my needs. I’ve tried an introductory tutorial that uses a repository connected to a GCS bucket (let’s call it bucket x). I was able to follow the tutorial, which included branching, changing a file, committing to main, and reverting the changes. Everything was done from the web UI with SQL queries from the DuckDB prompt. The thing is, after importing a file from another GCS bucket (bucket y), the only way I see to work with the file is through the web UI… This is already really nice, but I was wondering if LakeFS could track the changes to the files in bucket y? This is because my data pipeline uploads/downloads files directly to/from bucket y, using the GCS API. For example, could I tell LakeFS to track folder ‘data’ in bucket y, commit to main, make a branch ‘fix’, modify the CSVs in the ‘data’ dir in bucket y (through the GCS API, not through LakeFS), and commit those changes in branch ‘fix’? Like I would do with git, the goal is to be able to choose which version of the data (main or fix) appears in GCS bucket y. Essentially, can my git-like operations in LakeFS change my files in bucket y (like checking out branch ‘fix’)?
u
lakeFS supports an "S3 gateway" that allows programs to access it as though it were, well, an S3-compatible store. Would that help? (I ask because we have no corresponding "GCS gateway", as GCS is not that common a backend storage protocol.)
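For illustration, here is a rough sketch of what talking to the S3 gateway from Python could look like, assuming boto3 is installed. The endpoint URL, credentials, repository name, and file paths are placeholders, and the helper names (`lakefs_key`, `make_client`, `upload_csv`) are made up for this example. The idea is that the repository plays the role of the bucket and the branch name is the first component of the object key.

```python
def lakefs_key(ref: str, path: str) -> str:
    """Build a gateway object key of the form '<branch-or-ref>/<path>'."""
    return f"{ref.strip('/')}/{path.lstrip('/')}"


def make_client(endpoint_url: str, access_key: str, secret_key: str):
    """Return a boto3 S3 client pointed at a lakeFS server instead of AWS."""
    import boto3  # imported lazily so lakefs_key() is usable without boto3

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,          # placeholder: your lakeFS server URL
        aws_access_key_id=access_key,       # lakeFS access key, not a cloud key
        aws_secret_access_key=secret_key,
    )


def upload_csv(client, repo: str, branch: str, local_path: str, dest_path: str) -> None:
    """Upload a local file to a branch; the repository acts as the bucket."""
    client.upload_file(local_path, repo, lakefs_key(branch, dest_path))


# Hypothetical usage (all values below are placeholders):
# client = make_client("https://lakefs.example.com", "<access-key>", "<secret-key>")
# upload_csv(client, "my-repo", "fix", "local/file.csv", "data/file.csv")
# client.download_file("my-repo", lakefs_key("main", "data/file.csv"), "main_copy.csv")
```

Objects written this way show up as uncommitted changes on the branch; committing them is still a separate lakeFS operation (web UI, `lakectl`, or the lakeFS API).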
u
Yeah, I think I'm starting to understand better what my workflow would look like with LakeFS. Right now my code interfaces with GCS for data manipulations. I think I understand that with LakeFS, my code would interface with the LakeFS S3 gateway (not GCS, although this should not change much code), and I could occasionally back up the most up-to-date dataset to GCS by using the export process described in Exporting Data | lakeFS. Does that sound right?
u
Absolutely! And AFAIK there are several users who have implemented just that, even exporting back to S3. We try for lakeFS to be a tool that fits into your stack of other tools, rather than some framework. It should never be "all or nothing".
u
Good to know, thank you very much!!