# help
j
hi again all, I have a potentially naive question! Our use case is to have a large database accessible to our users on S3. We want our data to be versioned because we will be changing some of it, e.g. changing a variable's unit or something like that. lakeFS seems perfect for this, but we are hesitant to have lakeFS as the access layer to the data. Is it possible to have a situation where we version our data with lakeFS, but the user does not need to install/use lakeFS to access any of the versions? I guess the most basic example would be to have the user access the first ever version of some data file directly from S3, using its version number (or something like that). I understand that the limitations of how lakeFS fundamentally works might prevent this, but wanted to ask anyway 🙂
a
@James Hodson There are 2 options that I know of:
1. Use Hadoop Router FS if you are using Spark
2. Export data from lakeFS
Let me know if you would like to discuss these options on a call.
a
@James Hodson another option, depending on what "not use lakeFS" means: can your users use an S3 client? Most S3 clients will let you set an endpoint, and then you could point them at the S3 endpoint of the lakeFS server. This way there is a lakeFS server in the path, but users need not run any lakeFS client code.
👍 1
👍🏽 1
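A minimal sketch of the S3-client approach above, using boto3 with hypothetical endpoint, credential, repository, and path names (the lakeFS S3 gateway treats the repository as the bucket and a ref as the first element of the key):

```python
import boto3

# Point a standard S3 client at the lakeFS S3 gateway instead of AWS S3.
# The endpoint, keys, repository and object path below are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # lakeFS server (S3 gateway)
    aws_access_key_id="AKIA...",                # lakeFS access key
    aws_secret_access_key="...",                # lakeFS secret key
)

# The repository acts as the bucket; the ref (branch, tag, or commit) is the
# first element of the key, so this reads the file as it is on the main branch.
obj = s3.get_object(Bucket="my-repo", Key="main/data/file.csv")
print(obj["Body"].read()[:200])
```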
j
thanks both @Amit Kesarwani @Ariel Shaqed (Scolnicov), those are both good options for me to explore. Before I dive in, would any of these solutions allow the user to access older versions of the data without using lakeFS, and without the unique names that older versions are given in the /data/ folder? I understand the latter might be impossible.
a
One advantage of using the S3 gateway is that it has a live lakeFS server behind it, so you can use any version reference as the first path element to access any version:
• s3://repo/main/... to access the main branch
• s3://repo/main~3/... to access the 3rd-from-last version of the main branch
• s3://repo/tag/... to access the version with that tag
• s3://repo/1acef355/... to access the version whose digest begins with these hex digits
• ... and any other mixture of these expressions.
Solutions based on exporting require exporting each version separately, as you rightly point out. (Edited: previous version messed up some URLs, sorry)
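Continuing the hedged boto3 sketch from earlier (same hypothetical endpoint and repository; the tag "v1.0.0" and the abbreviated digest are placeholders too), each ref expression simply becomes the first element of the object key:

```python
import boto3

# Hypothetical lakeFS S3-gateway endpoint, credentials and repository, as in
# the earlier sketch.
s3 = boto3.client("s3", endpoint_url="https://lakefs.example.com",
                  aws_access_key_id="AKIA...", aws_secret_access_key="...")

for ref in ["main", "main~3", "v1.0.0", "1acef355"]:
    # The ref expression is just the leading path element of the key prefix.
    resp = s3.list_objects_v2(Bucket="my-repo", Prefix=f"{ref}/data/")
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    print(f"{ref}: {len(keys)} objects")
```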
j
that seems great, thank you for your help! I'll try to get it set up
sunglasses lakefs 1
n
Hi all, thank you for your guidance so far on this. I guess it might be useful to clarify what our dataset currently looks like. We have our data stored as zarr files (like HDF5/netCDF, but optimised for object storage, for those unfamiliar), and one of the typical access patterns is for users to load the data into xarray Datasets, taking advantage of xarray's use of dask under the hood for lazy loading and out-of-memory considerations. It might be that the export option is the best fit, but I need to get a better understanding of how that works.
o
Hi @Nathan Cummings and thanks for the details. Do you require any additional information at this stage, or are you going to try the above suggestions and report on findings?
a
Thanks for the info, @Nathan Cummings! Unfortunately I cannot claim much familiarity with many of the systems you mentioned, but "dask" stands out! AFAIK, dask uses fsspec for I/O, and the appliedAI people created lakeFS-spec to let fsspec use lakeFS natively. So using lakeFS from dask might be as easy as loading lakeFS-spec into it. Would this work for you?
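A rough illustration of that idea, assuming lakefs-spec is installed and configured (it registers a lakefs:// protocol with fsspec); the repository, branch, and dataset names below are hypothetical:

```python
import xarray as xr
from lakefs_spec import LakeFSFileSystem

# Instantiating the filesystem sets up the lakefs:// protocol for fsspec;
# host and credentials are typically read from ~/.lakectl.yaml or environment
# variables rather than hard-coded here.
fs = LakeFSFileSystem()

# Repository ("my-repo"), ref ("main") and path are placeholder names.
# xarray/dask open the zarr store lazily, much like with a plain s3:// URL.
ds = xr.open_zarr("lakefs://my-repo/main/data/dataset.zarr")
print(ds)
```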
n
Thanks all! Still trying to piece together what these suggestions would look like in concert. What I think this would look like is:
• We have a lakeFS server that ingests data and manages versioning.
• We have a publicly accessible (read only) bucket that we export tagged versions to.
• Users access that bucket and read data, seeing that there are different versions of the data, but with no knowledge that lakeFS is being used by us at all.
My question is whether that means we'll have redundancy in the data, where we hold another full version even though only a small part of the dataset changed? If so, this inefficiency isn't the end of the world, and is probably a fair trade-off if it means the data are more accessible (users don't need to install lakeFS to read the data) and they can access versions of the data with a clear versioning scheme that we define and document ourselves. The bit I don't understand is the S3 gateway. It feels like it might be a solution to some (or all) of the above, but I'm not clear on how it would affect users' access to the data.
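For what the user-facing side of that export setup could look like, a minimal sketch with hypothetical bucket and version-prefix names, assuming anonymous read access to the exported bucket (requires s3fs):

```python
import xarray as xr

# Users read a plain, public S3 bucket; the "version" is just a documented path
# prefix chosen at export time. Bucket and prefix names are placeholders.
ds = xr.open_zarr(
    "s3://our-public-datasets/v2024.1/dataset.zarr",
    storage_options={"anon": True},  # anonymous, read-only access
)
print(ds)
```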
o
@Nathan Cummings and company - thanks for joining the call today. As discussed, please feel free to contact us for further help as needed
❤️ 2
j
hey again guys, thank you for the call again 🙂 we will explore the options you gave us, it was really helpful to speak with you.
We think the export feature would be best for our use case, but we were unsure about the following: if we exported all of our data to some object store for a customer to use, and then a few days later made a small change to only one of the files in our lakeFS repo, would exporting that new commit overwrite ALL of the files in that object store? Or would it note that only one of the files has actually changed and skip over the ones that don't need to be exported again, hence saving time? I think you mentioned it uses rclone/rsync(?) under the hood
👀 1
i
Hi @James Hodson, We will verify that and will let you know what the answer is.
j
cool, thanks! 🙂
jumping lakefs 1
o
Hi @James Hodson, sorry for the late reply here. Please refer to https://docs.lakefs.io/howto/export.html#exporting-data-with-docker: any further export from another branch or commit will export only the modified files and will not overwrite all the other files.
j
sounds perfect, thank you @Offir Cohen
jiggling lakefs 1