Yoni Augarten

11/21/2022, 5:06 PM
Hey @John Zielke, you are right in that objects saved to the backend are immutable. Note, however, that when you're using an S3 client (e.g. Boto or the AWS CLI) as the client for lakeFS, you use lakeFS logical paths. Hence, from the client's point of view, these paths are mutable. For example, consider the following command uploading a file to my repository's main branch:
aws s3 cp --endpoint-url https://yoni.lakefscloud.io myfile.txt s3://example-repo/main/dest.txt
And suppose I'm writing another file, to the same path:
aws s3 cp --endpoint-url https://yoni.lakefscloud.io myfile2.txt s3://example-repo/main/dest.txt
In the backend, both saved objects are immutable. However, to the client, the file s3://example-repo/main/dest.txt has changed.

einat.orr

11/21/2022, 5:14 PM
@Itay Ariel

John Zielke

11/21/2022, 5:18 PM
Thank you for the reply! I see what you mean and was aware of that. But as we are considering using a cache between lakeFS and AWS S3 for on-prem deployments (to save egress costs and reduce latency), having this metadata added (or at least configurable) for the backend objects would be very useful, so that the cache knows it does not need to send a HEAD request to check the ETag before responding with the cached object. As the objects are immutable once written, changing them after the fact just to add this info (by copying in place in a Lambda function or similar) is very slow and inefficient.
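For illustration, here is a minimal sketch (with hypothetical names, not lakeFS or any real cache's code) of the revalidation decision such a cache would make: if the response stored alongside an object already carries an immutable Cache-Control directive, the cache can answer directly instead of issuing a HEAD/ETag check against S3.
```
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// canSkipRevalidation is a hypothetical helper: it returns true when the
// headers cached alongside an object mark it immutable, meaning the cache
// may serve it without a HEAD round trip to the backing store.
func canSkipRevalidation(cached http.Header) bool {
	return strings.Contains(cached.Get("Cache-Control"), "immutable")
}

func main() {
	h := http.Header{}
	h.Set("Cache-Control", "public, max-age=31536000, immutable")
	fmt.Println(canSkipRevalidation(h)) // true: no ETag check needed
}
```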

Yoni Augarten

11/21/2022, 5:20 PM
I see. So you mean adding these headers between lakeFS and S3, rather than between the client and lakeFS.
I want to take a look into that tomorrow, and I will write here again after I've done some research. Meanwhile, I've opened an issue to track this. Thank you for this suggestion and feel free to comment on the issue.

John Zielke

11/21/2022, 5:27 PM
Exactly, that's what I mean. Thank you very much!

Ariel Shaqed (Scolnicov)

11/21/2022, 10:10 PM
That's a really interesting idea! Note that lakeFS anyway uses a local cache for immutable metadata objects (ranges and metaranges). An additional external cache would probably help mostly when used with a larger lakeFS cluster.

John Zielke

11/22/2022, 9:17 AM
Thank you for your reply! Yes, I'm using this functionality for the metaobjects, but mainly for the actual data objects, to have an LRU cache of those, as we will have only a small percentage of the data as "hot" data at any point in time, and that data will be regularly accessed.
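(As a toy illustration of that access pattern, not lakeFS code: an LRU keyed by the immutable physical address, here using the hashicorp/golang-lru library.)
```
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
)

func main() {
	// Keep only the "hot" fraction of data objects resident.
	cache, err := lru.New(1024) // capacity in number of objects
	if err != nil {
		panic(err)
	}
	// Because physical objects never change, the address alone is a
	// safe cache key; no ETag revalidation is needed on a hit.
	cache.Add("s3://backing-bucket/prefix/object-uuid", []byte("object bytes"))
	if v, ok := cache.Get("s3://backing-bucket/prefix/object-uuid"); ok {
		fmt.Printf("cache hit: %d bytes, no S3 round trip\n", len(v.([]byte)))
	}
}
```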

Ariel Shaqed (Scolnicov)

11/22/2022, 3:35 PM
Not sure we've proposed this workaround: you might perform direct access like lakeFSFS and the `--direct` flag of `lakectl fs {upload,cat}` do: ask lakeFS to statObject and then read the object from the physical path field of the result - it's immutable. Now you can use whatever S3 client features you like for the data, because lakeFS never sees the actual data! As a bonus, it scales really nicely because lakeFS only manages metadata.
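A rough sketch of that read path, assuming the lakeFS stat endpoint (GET /api/v1/repositories/{repo}/refs/{ref}/objects/stat, which returns a physical_address field) and placeholder endpoint/credentials:
```
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// statPhysicalAddress asks lakeFS where an object physically lives.
func statPhysicalAddress(endpoint, repo, ref, path, key, secret string) (string, error) {
	u := fmt.Sprintf("%s/api/v1/repositories/%s/refs/%s/objects/stat?path=%s",
		endpoint, repo, ref, url.QueryEscape(path))
	req, err := http.NewRequest(http.MethodGet, u, nil)
	if err != nil {
		return "", err
	}
	req.SetBasicAuth(key, secret)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var stats struct {
		PhysicalAddress string `json:"physical_address"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		return "", err
	}
	return stats.PhysicalAddress, nil
}

func main() {
	// Placeholder endpoint and credentials.
	addr, err := statPhysicalAddress("https://lakefs.example.com",
		"example-repo", "main", "dest.txt", "AKIA...", "secret")
	if err != nil {
		panic(err)
	}
	// physical_address looks like s3://backing-bucket/prefix/uuid and is
	// immutable, so it can be read directly with any S3 client features.
	parts := strings.SplitN(strings.TrimPrefix(addr, "s3://"), "/", 2)
	svc := s3.New(session.Must(session.NewSession(&aws.Config{
		Region: aws.String("us-east-1"),
	})))
	out, err := svc.GetObjectWithContext(context.Background(), &s3.GetObjectInput{
		Bucket: aws.String(parts[0]),
		Key:    aws.String(parts[1]),
	})
	if err != nil {
		panic(err)
	}
	defer out.Body.Close()
	fmt.Println("read immutable object directly from the backing store")
}
```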

John Zielke

11/22/2022, 4:14 PM
Thank you for the suggestion! How would this work on upload? If I use lakectl upload --direct, I still don't have control over the metadata (i.e. the Cache-Control header) that is being set. I guess I would somehow have to use this function, right? This could also be imitated by setting the right headers on read if they are propagated correctly, but that would then need to be configured for each client reading. I would also lose support for any access control that lakeFS does, right? I see how this approach could work, but it is more involved in the client implementation, as I need to add extra code for the path translation, maybe add access credentials, expose the endpoints for the underlying storage, etc.

Ariel Shaqed (Scolnicov)

11/24/2022, 4:02 PM
Yeah. You'd have to set these headers on the HTTP client used by the AWS client; we don't do that currently. You could edit the call to `UploadWithContext` in pkg/api/adapters.go; `s3manager.UploadInput` takes a `CacheControl` field. This is admittedly icky.
👍 1
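Roughly, that edit could look like the following simplified sketch (using aws-sdk-go's s3manager; bucket, key, and header value are placeholders, and this is not the actual lakeFS source):
```
package main

import (
	"context"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	f, err := os.Open("myfile.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	uploader := s3manager.NewUploader(session.Must(session.NewSession(&aws.Config{
		Region: aws.String("us-east-1"),
	})))
	_, err = uploader.UploadWithContext(context.Background(), &s3manager.UploadInput{
		Bucket: aws.String("backing-bucket"),
		Key:    aws.String("prefix/object-uuid"),
		Body:   f,
		// Physical lakeFS objects are written once and never change,
		// so an aggressive immutable policy would be safe here.
		CacheControl: aws.String("public, max-age=31536000, immutable"),
	})
	if err != nil {
		panic(err)
	}
}
```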

John Zielke

11/24/2022, 4:44 PM
I agree. I would prefer not having to fork/build my own version of lakeFS, as that will probably lead to me not upgrading to newer versions regularly. But thanks for the hint; if all else fails I might do that. As all of these objects are immutable by design, it would be great if this could be integrated. Is there a downside to setting this field? Adding this information would also make it obvious to users looking at the underlying S3 store that these objects will never change, and might help with services like CloudFront (although a CDN might admittedly not be a common use case for lakeFS).

Ariel Shaqed (Scolnicov)

11/26/2022, 11:24 AM
I agree, it sounds like a useful addition for lakeFS with a remote object store, or for serving objects off of lakeFS to a remote client that uses direct access! But I want to be cautious here. @John Zielke, I completely understand your frustration about our caution here, and I share it to some extent. I believe this belongs on the issue, so I'm copying it over there.
So let me try to explain the forces that act against this. The basic issue is that it's a very broad front. On the server side it's "just a small matter of programming :stuck_out_tongue_lakefs:": we would need to add support to lakeFS, of course -- and possibly for the GCS and Azure adapters too. But S3 is not a single well-defined implementation, it's a de facto standard that has no definition. And we want to support it wherever possible... I am not even sure which implementations to pick to verify that this works, or how to do so: not even if I relax the requirement from "sends back the same Cache-Control header" to "doesn't do something unexpectedly bad when it sees a Cache-Control header". As a cautious developer, that translates to adding and documenting another option, which translates to increased cognitive load for our users.
:lakefs: 1

John Zielke

11/28/2022, 9:21 AM
Thanks for your reply! Don't get me wrong, I am not at all frustrated, just trying to provide my perspective on this (although I'm obviously biased) 🙂 I greatly appreciate the help & effort you put into possibly accommodating such a need.
For reference, it seems that the only API to test against is the AWS S3 reference, as even the most common alternative, MinIO, does not seem to publish an API of its own (or at least I couldn't quickly find it). Therefore I would argue it makes sense to test against this API. IMHO, if other S3 services don't return/support the header (as long as they don't fail outright), that should be an issue of the other service, not a reason for lakeFS to provide a reduced feature set.
A question in relation to this: what does "Support for caching headers, ETag" mean? The fact that this is stated as supported in the Get API is great, but it makes less sense if there is no way to "use" it (apart from the implicitly placed ETag).
I can see why you might be hesitant to apply this for all objects or make a specific option for this. Instead, I have a different proposal that could help with this issue:
1. Pass through the system-defined headers (https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html) in the API, as sketched below. The fact that this does not work is unintuitive in my opinion, since the other metadata works.
2. Add a generic option to add metadata (user- and system-defined) to all PUTs through the S3 gateway. This would make setting this value for all clients a matter of lakeFS configuration, and it would not have to be enforced on every client.
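To make the first proposal concrete, here is a hedged sketch of the client side: a PutObject sent through the lakeFS S3 gateway with a Cache-Control header that, under the proposal, the gateway would forward to the physical object (today, per this thread, it does not). Endpoint, repository, and header value are illustrative:
```
package main

import (
	"bytes"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Point the S3 client at the lakeFS S3 gateway (placeholder URL).
	sess := session.Must(session.NewSession(&aws.Config{
		Endpoint:         aws.String("https://lakefs.example.com"),
		Region:           aws.String("us-east-1"),
		S3ForcePathStyle: aws.Bool(true),
	}))
	svc := s3.New(sess)
	_, err := svc.PutObject(&s3.PutObjectInput{
		Bucket: aws.String("example-repo"),  // repository
		Key:    aws.String("main/dest.txt"), // branch/path
		Body:   bytes.NewReader([]byte("hello")),
		// Proposal 1: the gateway would pass this system-defined
		// header through to the backing object on PUT.
		CacheControl: aws.String("public, max-age=31536000, immutable"),
	})
	if err != nil {
		panic(err)
	}
}
```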