Is there any way to restrict a user permissions to only view lakeFS #help

Alex Buck

03/11/2024, 2:23 PM

Is there any way to restrict a user's permissions to only view the “HEAD” of a specific branch and no other branches? I’m using the open-source deployment and have been reading up on the ACLs, but with the 4 groups (Reader, Writer, Super, Admin) I don’t see a way to restrict access that granularly. The use case is that I wish to version a dataset used in machine learning model development. I want to have lineage of the data transformations from the raw ingest all the way through the data splits into Train/Validation/Test. However, for policy reasons, the developers actually building models cannot have access to anything other than the Train and Validation sets. I did a quick test of how I thought this might work, and I was able to make a “Training” branch that only has the training data. But even as a user with Reader permissions, I could view the commit history, which traces back to the full dataset before the data splits were performed. That means the test set is visible to a “misbehaving” developer, which is not acceptable for my use case. My other thought is to give developers access to the underlying S3 objects for the train/validation data, which wouldn’t include the metadata about the versioned history. I’m about to run a test to see what this looks like. I’m still learning lakeFS, so I’m not yet sure what the “underlying S3 objects” look like or whether they’re accessible in this way.

Jonathan Rosenberg

03/11/2024, 2:42 PM

Hi @Alex Buck! Regarding your first point, the granularity you’re looking for doesn’t currently exist in lakeFS, but it sounds like an interesting use case. Would you mind creating an issue describing this specific use case so that the product managers can address it? About the second point, the way to reach the data is by passing through the metadata (which describes the location of the data, among other things). I’m not sure what your architecture looks like, but you can provide presigned URLs to validated requests and access the data directly that way (that could also be a solution for the first point, but again, I’m not sure what your architecture looks like).
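To make the "presigned URLs to validated requests" idea concrete, here is a minimal sketch of the validation step only. Everything in it is an assumption for illustration: the `ALLOWED` policy, the `is_allowed` and `presigned_url_for` helpers, and the injected `presign` callable, which stands in for whatever component actually issues the presigned URL. It is not a lakeFS API.

```python
import posixpath
from typing import Callable, Dict, Tuple

# Hypothetical policy: branch name -> path prefixes developers may read.
ALLOWED: Dict[str, Tuple[str, ...]] = {"training": ("train/", "validation/")}


def is_allowed(branch: str, path: str,
               policy: Dict[str, Tuple[str, ...]] = ALLOWED) -> bool:
    """Return True only if the normalized path is under an allowed prefix."""
    # Normalize first so "train/../test/x" cannot sneak past the prefix check.
    normalized = posixpath.normpath(path.lstrip("/"))
    return any(normalized.startswith(prefix) for prefix in policy.get(branch, ()))


def presigned_url_for(branch: str, path: str,
                      presign: Callable[[str, str], str],
                      policy: Dict[str, Tuple[str, ...]] = ALLOWED) -> str:
    """Validate the request, then delegate to the injected presign callable."""
    if not is_allowed(branch, path, policy):
        raise PermissionError(f"access to {branch}/{path} is not permitted")
    return presign(branch, posixpath.normpath(path.lstrip("/")))
```

With this shape, developers would only ever talk to the validating service and never hold lakeFS credentials themselves, so the commit history stays out of their reach.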

Alex Buck

03/11/2024, 4:18 PM

Happy to create an issue for this! Unfortunately, my governance policy doesn’t allow me to rely on the “good behavior” of developers to not look at the test data split, so I need to ensure this via permissions boundaries. Thanks @Jonathan Rosenberg. I’m in the very early stages, so the architecture is not yet fully designed. I was really interested in lakeFS, but the analogy to git made me concerned about developers being able to roll back through the commit history as described above, which doesn’t work for my use case. There may be other approaches still.

03/11/2024, 9:07 PM

A hacky way would be to have separate storage that only mirrors the content of the HEAD. You then have full permission control over that mirrored copy of the HEAD. Maybe you can automate updating the mirror via a lakeFS hook? I’m not sure how big your data is: the mirror operation may cost time and money.
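As a rough sketch of what each mirror update would need to compute, assuming you can list object paths and checksums for both the branch HEAD and the mirror (the `plan_mirror_sync` helper and the path-to-checksum dict shape are hypothetical, not part of lakeFS):

```python
from typing import Dict, List, Tuple


def plan_mirror_sync(head: Dict[str, str],
                     mirror: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare two {path: checksum} listings and return (to_copy, to_delete)."""
    # Copy anything that is new in HEAD or whose checksum changed.
    to_copy = sorted(p for p, checksum in head.items() if mirror.get(p) != checksum)
    # Delete anything in the mirror that HEAD no longer contains.
    to_delete = sorted(p for p in mirror if p not in head)
    return to_copy, to_delete


# Example listings (made-up paths and checksums).
head = {"train/a.csv": "c1", "train/b.csv": "c2"}
mirror = {"train/a.csv": "c1", "train/b.csv": "old", "train/stale.csv": "c9"}
to_copy, to_delete = plan_mirror_sync(head, mirror)
print(to_copy, to_delete)  # ['train/b.csv'] ['train/stale.csv']
```

Copying only the changed objects keeps each update proportional to the size of the commit rather than the whole dataset, which helps with the time/cost concern.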

Alex Buck

03/11/2024, 9:13 PM

That’s an option I’ve thought about. It might fit in naturally. We were considering using EFS to expose the data via a shared drive instead of via S3, to simplify consuming it, and I think the approach you described would lend itself well to that architecture.

03/11/2024, 9:18 PM

Ah yes, mounting can also be another option: with admin rights you mount s3://server/branch/ to the local filesystem, so only that branch head is exposed. No need to mirror/copy. Again, it will depend on the architecture behind it ...
