# help
u
Hi! We are using S3 and want to restrict write access so that, once data is committed to LakeFS, it cannot be updated or removed. Is this possible? If so, what setup do you recommend?

We are planning to use the Hadoop Filesystem so that our Spark jobs can write directly to S3. For this to work, the Spark job cluster must have write access to the entire LakeFS S3 bucket, which means a Spark job could overwrite existing committed files. One idea is to add logic around the staging operations (i.e. between LakeFS and the client) to update the bucket's policy so the staged path becomes writable, with similar logic on commits to remove write permission for committed paths. That doesn't seem scalable, though, if a large number of files are being concurrently staged and committed.

We are considering the S3 Gateway instead: LakeFS would be the only one with write access to the bucket, and all data write requests would go through the gateway. It is unclear, though, whether the S3 Gateway has its own permissions that can be configured, or whether requests simply pass through to S3. For example, could we create a role that has write permission in LakeFS' S3 Gateway but not write access to the underlying S3 bucket?
u
Hey @Clinton Monk! The S3 gateway is simply an S3-compatible API exposed by lakeFS. To authenticate with it, you use your lakeFS credentials on the client side; therefore, your client will not be able to access the storage directly, as these credentials are useless for AWS. In turn, as you mentioned, the lakeFS server will write to your storage. To do so it needs permissions, and there are various ways to provide them, depending on how lakeFS is installed on your cloud.
u
So, to the point, if you only provide lakeFS (and no one else) with these permissions, you will achieve your goal of protecting your data from being overwritten by someone else.
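For what it's worth, one way this could look on the S3 side is a bucket policy that denies object writes and deletes to every principal except the role lakeFS runs under. A minimal sketch with boto3, assuming a hypothetical bucket name and IAM role ARN:

```python
import json

import boto3

# Placeholders: substitute your bucket and the IAM role the lakeFS server uses.
BUCKET = "my-lakefs-bucket"
LAKEFS_ROLE_ARN = "arn:aws:iam::123456789012:role/lakefs-server"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Deny object writes/deletes to everyone who is NOT the lakeFS role.
            "Sid": "DenyWritesExceptLakeFS",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:PutObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {"ArnNotEquals": {"aws:PrincipalArn": LAKEFS_ROLE_ARN}},
        }
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```

With a policy along these lines, clients talking to the bucket directly can only read, while the lakeFS server keeps full write access and handles all writes on their behalf.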
u
I must mention that using the S3-compatible API is less scalable than using the Hadoop Filesystem, so this may or may not be a good fit depending on the number of objects you interact with.
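For reference, using the gateway from Spark is mostly a matter of S3A configuration; a sketch in PySpark, with a hypothetical gateway endpoint and lakeFS (not AWS) credentials:

```python
from pyspark.sql import SparkSession

# Placeholders: your lakeFS server URL and lakeFS access/secret keys.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "<lakeFS access key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakeFS secret key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Through the gateway, paths are addressed as s3a://<repository>/<branch>/<path>.
df = spark.read.parquet("s3a://example-repo/main/path/to/table/")
df.write.parquet("s3a://example-repo/my-branch/path/to/output/")
```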
u
Finally, for the sake of completeness, I will mention that there is an open issue to make our Hadoop Filesystem credential-free. Of course, that issue was opened thanks to your ideas and research.
u
Thanks! I'll give the S3 gateway a try to see if it is a good fit for our data. You're right, it becomes a tradeoff between scalability and security (well, we have to trust folks at some point).
u
Hi @Clinton Monk, I'm not entirely clear on the requirement not to allow the client to "overwrite files". If the concern is that clients may accidentally overwrite objects, then I would suggest committing objects once you're done writing to them. If the concern is that a client may maliciously overwrite a file, that is harder, since we would need to enforce it on S3, which does not provide much to help (e.g. I just checked, existing tags are still not usable in PutObject). It's a pretty tough requirement on the object store; I'm not even sure the Spark "magic" output committers would work, since they like to create many empty files and S3 has no atomic "create object" operation. As you can probably guess, we are very interested in increasing Hadoop FS use-cases, and I would like to understand yours. If it's impossible with S3 then it becomes harder, but it would also be a real plus for our users if we can pull it off... Thanks!
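On the "commit once you're done writing" suggestion, a rough sketch of what that could look like, assuming the commits endpoint of the lakeFS REST API (server URL, repository, branch, and credentials below are placeholders):

```python
import requests

LAKEFS_URL = "https://lakefs.example.com/api/v1"
AUTH = ("<lakeFS access key>", "<lakeFS secret key>")  # HTTP basic auth

# Commit everything currently staged on the branch; once committed, the data
# is addressed by an immutable commit ID.
resp = requests.post(
    f"{LAKEFS_URL}/repositories/example-repo/branches/my-branch/commits",
    json={"message": "Spark job output"},
    auth=AUTH,
)
resp.raise_for_status()
print(resp.json()["id"])  # the new commit ID
```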
u
Sure thing! It is mostly to prevent accidents; I assume a truly malicious user in our environment could find their way around whatever security measures we set up. With Hadoop FS, the Spark job has permission to overwrite the objects in S3, but I suppose a user would have a difficult time finding the S3 path of an object to overwrite. The parameters to the Spark job would be LakeFS URLs (lakefs://some/path/to/spark/dataframe/). A user would need to list the LakeFS files there, retrieve the physical path for one or more of those, and then overwrite or delete each individual file. That shows a certain intent if the user is to go down that path. Maybe that is enough for us? To only provide LakeFS URLs to our Spark jobs and instruct users not to attempt to retrieve the physical paths. Doing so should prevent accidental overwrites and deletions.

FWIW, we use Spark both in interactive notebooks and automated jobs. Automated jobs go through code review, so the real concern for accidents is in the interactive notebooks. If we instruct users to only use the LakeFS paths, then, again, we may be able to avoid accidental deletions.
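To give a sense of how deliberate those steps are, resolving a single object's physical location would take something like this (hypothetical repository, branch, and path, assuming the object-stat endpoint of the lakeFS API):

```python
import requests

LAKEFS_URL = "https://lakefs.example.com/api/v1"
AUTH = ("<lakeFS access key>", "<lakeFS secret key>")

# Ask lakeFS for the object's metadata, which includes the backing S3 address.
resp = requests.get(
    f"{LAKEFS_URL}/repositories/example-repo/refs/my-branch/objects/stat",
    params={"path": "path/to/spark/dataframe/part-00000.parquet"},
    auth=AUTH,
)
resp.raise_for_status()
print(resp.json()["physical_address"])  # e.g. an s3://... key in the bucket
```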
u
You're right that S3 doesn't provide much to prevent overwrites. 😓 From the linked issue, I like the idea of the executor assuming a role right before performing the write. It at least adds another step to prevent accidental overwrites.
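For reference, a rough sketch of that "assume a role right before the write" idea with boto3 (the role ARN, bucket, and key are placeholders): the executor normally runs without write permissions and only picks up a short-lived, write-capable session for the upload itself.

```python
import boto3

# Hypothetical write-capable role; the executor's default credentials cannot write.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/lakefs-writer",
    RoleSessionName="spark-executor-write",
)["Credentials"]

# Short-lived S3 client that can write, used only for the actual upload.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
s3.put_object(
    Bucket="my-lakefs-bucket",
    Key="data/part-00000.parquet",
    Body=b"...",
)
```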
u
@Clinton Monk, thanks for the details. I agree that it's less likely for a non-malicious user to accidentally delete the physical paths.
u
We could enforce this, sort of, by generating presigned URLs on the client side (including support for generating presigned URLs for multipart upload, which is apparently a thing...). But then we no longer go through S3A, so we lose all the optimizations it provides. However, I think the most interesting optimizations are on the read side. 🤔 So maybe we could give the clients read-only access so that they can use s3a://... URLs for reading, like they do today, and for writing we could use presigned URLs; about the only optimization we might need is to switch to multipart upload. Let's figure it out on the issue?
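For concreteness, a minimal sketch of the presigned-URL flow (bucket and key are placeholders): whichever side holds the s3:PutObject permission generates a short-lived URL, and the client then uploads with a plain HTTP PUT, never holding S3 write credentials itself.

```python
import boto3
import requests

# Generated by the side that has write permission on the bucket (hypothetical names).
s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-lakefs-bucket", "Key": "staged/abc123"},
    ExpiresIn=300,  # URL expires after 5 minutes
)

# Client side: a plain HTTP PUT against the presigned URL, no AWS credentials needed.
requests.put(url, data=b"file contents")
```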