Hey!
I’m looking to adopt a data versioning tool, and I’m trying out LakeFS to see if it meets my needs. I followed an introductory tutorial that uses a repository connected to a GCS bucket (let’s call it bucket x). The tutorial covered branching, changing a file, committing to main, and reverting the changes. Everything was done from the web UI, with SQL queries run from the DuckDB prompt.
The thing is, after importing a file from another GCS bucket (bucket y), the only way I see to work with the file is through the web UI. That is already really nice, but can LakeFS track changes to the files in bucket y itself? I ask because my data pipeline uploads and downloads files directly to and from bucket y using the GCS API.
For example, could I tell LakeFS to track the folder ‘data’ in bucket y, commit to main, create a branch ‘fix’, modify the CSVs in ‘data’ in bucket y (through the GCS API, not through LakeFS), and then commit those changes on branch ‘fix’? As with git, the goal is to choose which version of the data (main or fix) appears in GCS bucket y. Essentially, can git-like operations in LakeFS (like checking out branch ‘fix’) change the files in bucket y?
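To make the ask concrete, here is a toy in-memory sketch of the behaviour I’m hoping for. To be clear, none of this is a LakeFS or GCS API: `ToyTracker`, its methods, and the `bucket_y` dict are all hypothetical names I made up, with a plain dict standing in for the bucket.

```python
# Toy in-memory model of the workflow I'm after. Nothing here is a LakeFS or
# GCS API; every name is hypothetical and only illustrates the semantics.
class ToyTracker:
    def __init__(self, bucket):
        self.bucket = bucket            # dict standing in for bucket y: path -> content
        self.branches = {"main": {}}    # branch name -> last committed snapshot
        self.current = "main"

    def commit(self, prefix="data/"):
        # Snapshot whatever currently sits under the tracked prefix.
        self.branches[self.current] = {
            p: c for p, c in self.bucket.items() if p.startswith(prefix)
        }

    def branch(self, name):
        # A new branch starts from the current branch's last commit.
        self.branches[name] = dict(self.branches[self.current])
        self.current = name

    def checkout(self, name, prefix="data/"):
        # Replace the tracked files in the bucket with the branch's snapshot.
        for p in [p for p in self.bucket if p.startswith(prefix)]:
            del self.bucket[p]
        self.bucket.update(self.branches[name])
        self.current = name


bucket_y = {"data/a.csv": "id,value\n1,10\n"}
tracker = ToyTracker(bucket_y)
tracker.commit()                               # commit to main
tracker.branch("fix")                          # create and switch to 'fix'
bucket_y["data/a.csv"] = "id,value\n1,99\n"    # pipeline writes straight to the bucket
tracker.commit()                               # commit that change on 'fix'

tracker.checkout("main")
main_view = bucket_y["data/a.csv"]             # bucket now shows the main version
tracker.checkout("fix")
fix_view = bucket_y["data/a.csv"]              # bucket now shows the fix version
```

The key part for me is `checkout`: switching branches rewrites what is physically in the bucket, so my pipeline keeps reading and writing bucket y as usual and sees whichever version is checked out.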