Hello everybody...!
I have a question, and I'm sorry in advance if it's a stupid one. I work for a company that has a lot of PII in our data lake. If we create a branch, we'd be able to see that info, and that would be a problem. Has anybody had this issue and solved it somehow?
Thanks in advance
u
user
03/27/2022, 3:18 PM
Hi Matias,
Great to have you in our lake! lakeFS supports IAM that looks a lot like AWS IAM as applied to S3. If you have something that works with S3 IAM policy statements, you can probably translate it into similar lakeFS policies, and apply those to users. So objects could still appear on branches... they just would not be visible to users.
Will something like that work for you?
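To make that a bit more concrete, here is a rough sketch of what such a policy could look like, built as a plain Python dict. It assumes the lakeFS policy layout of `statement` entries with `action`, `effect`, and `resource`, and the `fs:ReadObject` action; the repository name `example-repo`, the `pii/` prefix, and the policy id are made up for illustration, so please check the field names against the lakeFS authorization docs for your version.

```python
import json


def deny_read_policy(policy_id, repo, prefix):
    """Build a policy document that denies object reads under one key prefix.

    The field names follow the lakeFS policy layout as I understand it;
    treat them as an assumption and verify against your lakeFS version.
    """
    return {
        "id": policy_id,
        "statement": [
            {
                "action": ["fs:ReadObject"],
                "effect": "deny",
                # Hypothetical repository and prefix, for illustration only.
                "resource": f"arn:lakefs:fs:::repository/{repo}/object/{prefix}*",
            }
        ],
    }


policy = deny_read_policy("DenyPIIRead", "example-repo", "pii/")
print(json.dumps(policy, indent=2))
```

A policy like this would be attached to the users or groups that should not read the PII prefix. The objects would still exist on every branch, which is why this only helps if the PII sits under a well-known path.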
u
user
03/27/2022, 3:33 PM
Not quite. From a developer's point of view, I want to develop and check the results of my transformations.
u
user
03/27/2022, 3:57 PM
@Matias Stanislavsky can you tell us more about your use case and environment? It sounds like you are trying to create an environment for safely experimenting with your transformations. What is your current process (without lakeFS)?
PII is indeed a problem our users face. I'll look for a useful reference and get back to you.
u
user
03/27/2022, 4:01 PM
@Matias Stanislavsky, I wonder whether you already have some sort of process that runs on top of the lake to remove PII data. If so, could we execute it against a branch?
u
user
03/27/2022, 4:04 PM
Basically, we have mocked data on our dev data lake, and we develop our transformations there.
@Matias Stanislavsky If I understand you correctly, you'd want the developers to work with the mocked data? If so, you can keep it in its own lakeFS repository, track experiments, and ensure reproducibility using lakeFS commits and tags.
lakeFS itself doesn't provide any mocking, obfuscation, or sampling capabilities, but you can define CI/CD hooks to make sure that repo does not contain any PII.
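As a rough idea of what such a hook could run, here is a minimal sketch of a PII check over text records. It is not a lakeFS API: the function name `find_pii`, the pattern set, and the sample rows are all hypothetical, and the two regexes (email addresses and US-style SSNs) are only illustrative, nowhere near a complete PII detector. Wiring it into a pre-commit or pre-merge hook is left out.

```python
import re

# Illustrative patterns only; a real check would need a far broader set.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(lines):
    """Return (line_number, pattern_name) for every line that matches a pattern."""
    findings = []
    for lineno, line in enumerate(lines, start=1):
        for name, pattern in PII_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


# Hypothetical sample rows standing in for an object read from the repo.
sample = [
    "id,name,contact",
    "1,Jane,jane@example.com",
    "2,Bob,123-45-6789",
    "3,Ann,n/a",
]

findings = find_pii(sample)
for lineno, name in findings:
    print(f"line {lineno}: possible {name}")
```

The hook would then fail the commit or merge whenever `find_pii` returns anything, so data that looks like PII never lands in the mocked-data repository.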