I'm trying to use LakeFS for data version control ...
# help
f
I'm trying to use lakeFS for data version control in machine learning. I have PDF files that get copied to a lakeFS local dir and can be committed from there. However, even when the contents of the files do not change, the copy operation causes lakeFS to mark these files as modified. This forces me to commit the changes, which makes a data-driven pipeline tedious. Is there a way to deal with this?
o
Hi @Farhan Ahmad, did you try working with `lakectl local` to clone a repository from lakeFS, so that it recognizes your existing PDF files?
f
@Offir Cohen yes. I did a clone, then ran my pipeline, which `touch`ed all local files without actually changing their contents. When I run
lakectl local status
I see all files as changed. If I do
lakectl local commit
and check my lakeFS server, I see something like this:
It says "Identical size", and I'm not modifying the contents either, so I'd expect the hashes to remain unchanged except for files that actually changed.
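One way to confirm the claim above locally is to hash a file before and after a `touch`. This is a minimal sketch and not part of lakeFS or lakectl. The helper names `sha256_of` and `touch_and_compare` are made up for illustration, and `os.utime` stands in for the pipeline's `touch`.

```python
import hashlib
import os
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def touch_and_compare(path: Path) -> tuple[bool, bool]:
    """Emulate `touch` and report (content_changed, mtime_changed)."""
    before_hash = sha256_of(path)
    before_mtime = path.stat().st_mtime_ns

    # Move atime/mtime one hour forward so the change is visible even on
    # filesystems with coarse timestamp resolution.
    later = before_mtime + 3600 * 10**9
    os.utime(path, ns=(later, later))

    content_changed = sha256_of(path) != before_hash
    mtime_changed = path.stat().st_mtime_ns != before_mtime
    return content_changed, mtime_changed
```

If this reports unchanged content but a changed modification time, the file bytes are identical and only the metadata differs.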
o
Once you have touched the file, you have changed its metadata, so it is no longer treated as the same file even though the content is identical. As you can see in the comment in your screenshot, we do not support content diff for PDF files. Please feel free to open a feature request for your needs.
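Until such a feature exists, one possible workaround on the pipeline side is to avoid rewriting files whose bytes have not changed, so their metadata stays as it was. The following is only a sketch under assumptions: `copy_if_changed` and `sync_pdfs` are hypothetical helper names, not lakectl commands, and it assumes the pipeline's copy step can be replaced. Whether `lakectl local status` then reports no changes depends on how it detects modifications, which this sketch does not verify.

```python
import filecmp
import shutil
from pathlib import Path


def copy_if_changed(src: Path, dst: Path) -> bool:
    """Copy src to dst only when dst is missing or its bytes differ.

    Returns True if a copy happened, False if dst was left untouched.
    """
    # shallow=False forces a byte-by-byte comparison instead of trusting
    # size and modification time alone.
    if dst.exists() and filecmp.cmp(src, dst, shallow=False):
        return False  # identical content: leave dst and its mtime alone
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    return True


def sync_pdfs(src_dir: Path, dst_dir: Path) -> list[Path]:
    """Mirror *.pdf files into dst_dir; return the files actually copied."""
    copied = []
    for src in sorted(src_dir.rglob("*.pdf")):
        dst = dst_dir / src.relative_to(src_dir)
        if copy_if_changed(src, dst):
            copied.append(dst)
    return copied
```

Calling `sync_pdfs(Path("pipeline_output"), Path("my_local_clone"))` (both paths are placeholders) would then rewrite only the PDFs whose contents differ from what is already in the local directory.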
f