What is the equivalent of `git clone` for lakeFS to clone and manipulate the data locally?
# help
GK Palem:
What is the equivalent of `git clone` for lakeFS to clone and manipulate the data locally? I have set up lakeFS in Docker with a MinIO S3 endpoint as the object store and uploaded some files into lakeFS through its UI. Now I do not see any option in the lakeFS UI to organize the uploaded files (such as creating folders), nor any option to clone the data to my local machine so I can manipulate its hierarchy and commit it back. What am I missing? What is the recommended toolset or workflow for organizing data hierarchies in lakeFS? I have some files in the lakeFS UI and now want to re-organize them into different folders and commit the result as a new branch.
Guy Hardonag:
Hi @GK Palem. Because lakeFS is built to work at scale, unlike Git, you work directly on the “origin” in lakeFS. That said, lakectl has the `lakectl local` functionality; you can look at our documentation, or check out this blog post where it was first mentioned. Regarding the recommended toolset/workflow for organizing data hierarchies in lakeFS, it largely depends on your use case. If you're dealing with a small amount of data, then `lakectl local` should work for you. Generally, I would recommend using the same tool you would use with MinIO. For instance, if you use `boto` with MinIO, you can create a branch in lakeFS, make your changes using `boto` on your branch, and then merge your changes into the main branch.
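A minimal sketch of that branch-then-merge workflow, assuming boto3, a lakeFS S3 gateway at http://localhost:8000, and placeholder repository/branch names (`my-repo`, `my-branch`); the branch itself would be created beforehand via the lakeFS UI, API, or lakectl:

```python
import boto3

# Point boto3 at the lakeFS S3 gateway instead of MinIO directly.
# Endpoint URL and credentials are placeholders for illustration.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",   # assumed lakeFS S3 gateway
    aws_access_key_id="LAKEFS_ACCESS_KEY",
    aws_secret_access_key="LAKEFS_SECRET_KEY",
)

# Through the gateway, the "bucket" is the repository and the first
# path segment of the key is the branch: s3://<repo>/<branch>/<path>.
repo, branch = "my-repo", "my-branch"

# Re-organize a file into a folder on the branch: copy to the new key,
# then delete the old one (S3 has no native "move" operation).
s3.copy_object(
    Bucket=repo,
    CopySource={"Bucket": repo, "Key": f"{branch}/data.csv"},
    Key=f"{branch}/organized/2024/data.csv",
)
s3.delete_object(Bucket=repo, Key=f"{branch}/data.csv")

# Once the branch is committed in lakeFS, merge it into main via the
# lakeFS UI, API, or lakectl.
```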
GK Palem:
Thank you @Guy Hardonag. The `lakectl local` command and the blog post are helpful pointers; I will look into them. Regarding using MinIO directly to manipulate the files, unfortunately it does not seem to work. This is what I have done:

1. In lakeFS, uploaded a few files and could see them listed fine in the main branch (screenshot attached).
2. The MinIO UI, though, does not display the file names as-is. Since the files were uploaded through the lakeFS UI, the MinIO UI displays file hashes or random IDs instead of their original names (screenshot attached).
3. Since the MinIO UI does not show the original filenames, it is very difficult to organize the files into proper hierarchies (such as folders) directly in the MinIO UI.
4. Next, instead of the UI, I tried accessing the files using the MinIO client API. In Python I used the `minio` package to list the objects, and it too lists the files under random names/hashes instead of their original names as uploaded in lakeFS (screenshot attached).

In other words, only lakeFS seems to know and display the original filenames correctly; the MinIO UI and API do not seem to allow manipulating the files under their original names. (I guess the MinIO UI/API is not aware of the lakeFS format and hence treats the files as opaque blobs.)

• Is there any MinIO client (UI or API) that is aware of the lakeFS format and allows working on the blobs as if working on the files directly? For working on the files locally in a shell, `lakectl local` can be useful; how do I achieve something similar through a UI or API? Any toolsets / open-source packages?
Guy Hardonag:
Oh, sorry, I didn't think of the pure MinIO options like the MinIO UI or CLI. You should manage files via lakeFS; MinIO functions as the underlying storage. I gave the MinIO example because lakeFS is S3-compatible exactly like MinIO is. That is, many data tools that work with S3 can work with MinIO and lakeFS. Some examples are the AWS CLI, Boto, and Spark. With the AWS CLI, for example, MinIO can be accessed by setting `--endpoint-url` to the MinIO URL and setting the MinIO credentials as documented. In lakeFS you can do the same, but set the lakeFS variables as documented here. You can check out the available integrations here. Hope I managed to clear things up a bit.
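As a sketch of what that means in practice (not an official recipe): the same `minio` package from step 4 above can be pointed at the lakeFS S3 gateway instead of the MinIO server, since lakeFS speaks the S3 protocol. The endpoint, credentials, and repository/branch names below are placeholders; through the gateway the repository acts as the bucket and the branch is the first key segment, so objects list under their original names:

```python
from minio import Minio

# Connect to the lakeFS S3 gateway (placeholder endpoint/credentials),
# not to the MinIO server that backs it.
client = Minio(
    "localhost:8000",
    access_key="LAKEFS_ACCESS_KEY",
    secret_key="LAKEFS_SECRET_KEY",
    secure=False,  # assuming a local, non-TLS lakeFS instance
)

# List objects on the "main" branch of repository "my-repo"; these come
# back with their original lakeFS paths, not internal content hashes.
for obj in client.list_objects("my-repo", prefix="main/", recursive=True):
    print(obj.object_name)
```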
GK Palem:
Thank you @Guy Hardonag. I was able to use the `lakectl local commit` command to make the local data appear in the lakeFS web UI. One question, though: what is the relation between lakeFS and my local folder now? (My lakeFS is backed by MinIO S3 storage.)

• Does lakeFS have a copy of all of my local data in its storage (MinIO), or did it just import the checksums/hashes during the `local commit`?
• In other words, if I remove my data from the local folder (from where the `local commit` was run), would lakeFS lose the actual content, or will the files still be intact in lakeFS? I guess this is similar to `git commit`, where my local data can be deleted safely after the `commit` and, if required, can be cloned or pulled back from lakeFS anytime without losing data, correct? Or should I keep the local data (on my local disk) because lakeFS only imported the checksums?
Iddo Avneri:
Hi @GK Palem. Using `lakectl local` indeed makes a local copy of the files from the specific commit. If you modify files locally and then commit, lakeFS will sync those with the object store (MinIO in your case). In other words, if you remove the local folder, nothing will happen on the storage (MinIO) side. Of course, if you choose to commit after deleting the files, those files will no longer be part of that commit; they will, however, not be removed from the object store until a relevant garbage collection cycle runs.
GK Palem:
Thanks much for the details, @Iddo Avneri.