How do you delete files in a repo?
Like in git, where you delete a file, then "add" the change and commit.
I have a dataset folder, and I delete some files inside it based on some complex script/logic. How do I sync the changes to lakeFS?
05/05/2023, 5:08 AM
Is your dataset folder already in lakeFS? If so, I think you would use the deleteObject API.
What's your script/logic written in?
also - welcome to lakeFS 🙂
05/05/2023, 5:10 AM
We are massaging our data: testing this, modifying that, deleting something else ... After a while, once we are happy with the state of the dataset, we want to commit and push the changes ... Sorry, I am using lakeFS like git. Maybe this is not the right approach?
05/05/2023, 5:23 AM
@HT lakeFS shares many concepts with git, but there are some differences. One of them is that there is no concept of local and remote: all operations are performed on the data lake (the remote). Therefore, for lakeFS to work properly, data management should go through the lakeFS API. This means adding, updating, and deleting objects should be done using the appropriate API.
Failing to do so might result in inconsistencies in lakeFS and missing versioned data.
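To make "deleting through the API" concrete, here is a minimal sketch that only builds (and does not send) the HTTP request behind the deleteObject operation. The endpoint URL, repository, branch, and path are hypothetical placeholders, and the route shape is an assumption to verify against the lakeFS API reference. Authentication is left out.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_delete_request(endpoint: str, repository: str, branch: str, path: str) -> Request:
    """Build (but do not send) a DELETE request for one object on a branch.

    Assumed route: /api/v1/repositories/{repository}/branches/{branch}/objects?path=...
    """
    url = (
        f"{endpoint.rstrip('/')}/api/v1/repositories/{quote(repository, safe='')}"
        f"/branches/{quote(branch, safe='')}/objects?{urlencode({'path': path})}"
    )
    return Request(url, method="DELETE")


# Hypothetical values, for illustration only.
req = build_delete_request(
    "https://lakefs.example.com", "my-repo", "main", "dataset/bad file.csv"
)
print(req.get_method(), req.full_url)
```

The delete lands on the branch as an uncommitted change, so a commit is still needed afterwards to record it.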
05/05/2023, 5:27 AM
With the lakectl upload operation, lakeFS can track changed and newly added files, right?
So it's just missing a "sync" operation that detects deleted files?
05/05/2023, 6:37 AM
Not exactly. Since lakeFS is a versioning engine, a delete operation does not mean the object is physically deleted from the storage. The reason is that you can upload and commit an object in a certain "commitA" and then later decide to delete the object (or modify it) in "commitB", but you'll still want to have access to the original object from "commitA".
Deleting the object from the storage would in fact break the versioning capabilities, even if you had a "sync" that detects deleted files.
Here's an overview of the lakeFS model, and on this page you can learn more about how the versioning mechanism actually works.
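The commitA / commitB point can be illustrated with a toy model. This is not how lakeFS is implemented; it is just a sketch in which commits are snapshots of path-to-object references, so deleting a path in a later commit drops the reference without touching the stored object.

```python
# Toy model only: commits are snapshots mapping path -> object id.
object_store = {}   # object id -> bytes, stands in for the underlying storage
commits = {}        # commit name -> {path: object id}


def commit(name, parent, changes):
    """Create a snapshot from `parent`, applying `changes` (None means delete)."""
    snapshot = dict(commits.get(parent, {}))
    for path, data in changes.items():
        if data is None:
            snapshot.pop(path, None)  # drop the reference, keep the stored object
        else:
            object_id = f"obj-{len(object_store)}"
            object_store[object_id] = data
            snapshot[path] = object_id
    commits[name] = snapshot


def read(commit_name, path):
    """Read a path as it existed in the given commit."""
    return object_store[commits[commit_name][path]]


commit("commitA", None, {"data/a.csv": b"v1"})
commit("commitB", "commitA", {"data/a.csv": None})  # delete in commitB

print("data/a.csv" in commits["commitB"])  # the path is gone from commitB
print(read("commitA", "data/a.csv"))       # still readable from commitA
```

If the bytes were removed from `object_store` directly, `read("commitA", ...)` would fail, which is the inconsistency described above.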
05/05/2023, 6:47 AM
Thank you. I will have a good look
OK. So, after a bunch of testing, I now have a working "sync".
To sync changes (deleted or renamed files) between a local copy and what's inside lakeFS, you can use the S3-compatible interface. I use rclone to sync between the local copy and lakeFS. Deleted files then appear automatically under the uncommitted changes.
I guess rclone is one way to use the API, as @Robin Moffatt mentioned above?
Syncing your data (download, upload, sync, checkout, ...): better to use rclone.
Dealing with the repo (branch, merge, commit, ...): lakectl.
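A rough sketch of the rclone half of that split, for anyone scripting it. It assumes an rclone remote named `lakefs` already points at the lakeFS S3 endpoint, and that the destination is addressed as repository/branch/path. The remote, repository, branch, and folder names are hypothetical. The function only builds the command, with `--dry-run` on by default so nothing is deleted by accident.

```python
import shlex


def rclone_sync_command(local_dir, remote, repository, branch, prefix="", dry_run=True):
    """Build an `rclone sync` command that mirrors local_dir onto a lakeFS branch.

    `rclone sync` makes the destination match the source, so files removed
    locally are also removed on the branch (as uncommitted changes).
    """
    dest = f"{remote}:{repository}/{branch}/{prefix}".rstrip("/")
    cmd = ["rclone", "sync", local_dir, dest]
    if dry_run:
        cmd.append("--dry-run")  # preview only; pass dry_run=False to apply
    return cmd


# Hypothetical names, for illustration only.
cmd = rclone_sync_command("./dataset", "lakefs", "my-repo", "main", "dataset")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Once the sync has been applied, the commit itself would still go through lakectl, for example `lakectl commit lakefs://my-repo/main -m "prune dataset"`.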
05/05/2023, 10:45 AM
It's basically whatever fits your needs best, and it really depends on your use case. As long as you're using the lakeFS endpoint for your rclone operations with the S3 configuration, it should be fine.
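For reference, a sketch of what "the lakeFS endpoint with the S3 configuration" might look like as an rclone remote, generated from Python. The remote name, endpoint, and credentials are placeholders, and the option values (in particular `provider` and `no_check_bucket`) are assumptions to check against the lakeFS and rclone documentation.

```python
import configparser
import io


def lakefs_rclone_config(remote, endpoint, access_key_id, secret_access_key):
    """Render an rclone config section for an S3 remote that targets lakeFS."""
    # interpolation=None so credentials containing '%' are written verbatim
    config = configparser.ConfigParser(interpolation=None)
    config[remote] = {
        "type": "s3",
        "provider": "AWS",            # assumption; verify for your setup
        "env_auth": "false",
        "access_key_id": access_key_id,
        "secret_access_key": secret_access_key,
        "endpoint": endpoint,         # the lakeFS server, not the backing bucket
        "no_check_bucket": "true",    # assumption; verify for your setup
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Placeholder values, for illustration only.
text = lakefs_rclone_config(
    "lakefs", "https://lakefs.example.com", "ACCESS_KEY_ID", "SECRET_ACCESS_KEY"
)
print(text)
```

The important part is that `endpoint` is the lakeFS server itself, so every upload and delete is recorded by lakeFS rather than going straight to the underlying storage.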