lakeFS #dev

Tal Sofer

10/26/2021, 6:49 AM

I’m using the getObject endpoint to get the contents of two objects (of size <= 100MB) that I would like to compare to calculate their diff. The getObject operation returns the object contents as an application/octet-stream, and I’m looking into using a React library that can calculate the diff for me; this library takes file contents as strings.
• Should I first read the contents into memory and then compare them?
• What is the right way to read from a stream in JavaScript?
@Barak Amar @Ariel Shaqed (Scolnicov) do you have useful tips to share?

Itai Admi

10/26/2021, 7:02 AM

Why the 100 MB limit? I think that for readable formats it should be much less.

Tal Sofer

10/26/2021, 7:05 AM

That’s the max file size supported by Git. I agree that readable files will most probably be smaller.

Barak Amar

10/26/2021, 7:16 AM

I think the bottom line is the API you use with the diff library. I assume it will require having the complete content in memory (verify that first). The reading part in this case will look like an HTTP GET request; api.js should include code like that. If the library supports working with a URI, you can just point it to the lakeFS endpoint and it will read the content.
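A minimal sketch of the in-memory approach described above, assuming the diff library takes two strings. The helper names `fetchObjectText` and `fetchBothForDiff`, the `fetchImpl` parameter, and whatever URLs are passed in are hypothetical; this is not the code in api.js, and authentication is left out.

```javascript
// Sketch only: read one object fully into memory as a string.
// `fetchImpl` is a hypothetical injection point (defaults to the global fetch)
// so the helper can be exercised without a running server.
async function fetchObjectText(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`GET ${url} failed with status ${response.status}`);
  }
  // Response.text() buffers the whole body and decodes it as UTF-8.
  return response.text();
}

// Fetch both sides of the comparison concurrently and return them as strings,
// ready to hand to a string-based diff library.
async function fetchBothForDiff(leftUrl, rightUrl, fetchImpl = fetch) {
  const [left, right] = await Promise.all([
    fetchObjectText(leftUrl, fetchImpl),
    fetchObjectText(rightUrl, fetchImpl),
  ]);
  return { left, right };
}
```

This buffers both bodies in full, so it only suits objects small enough to hold comfortably in browser memory.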

Ariel Shaqed (Scolnicov)

10/26/2021, 8:26 AM

1. I don't want to hold even 16 MiB of data in memory on lakeFS or on the client for a user-facing app. For that matter, I cannot imagine a UI for examining such large diffs, unless the diff itself is very small. I expect users who need to diff huge files on an object store to use a specialized tool that does this. Also note that any tool will need to be very well-written in order to work on 100 MiB objects. I don't understand how a JavaScript library running in a browser will get anywhere close to doing this: the standard algorithms have O(n^2) runtime (although TBH there is the horribly-named Four Russians Algorithm to provide a theoretical speedup).

2. I don't understand why either object would ever need to touch memory or even disk. diff should work for streaming input, or at most seekable streaming input. E.g. diff has a --speed-large-files option that might help here.

3. Please separate detecting file type from file diff. File type should ideally come from the Content-Type header rather than a heuristic. And if we separate, we can switch the one without switching the other. Since users are very likely to have an opinion about the desired behaviour, we should make our lives easier to support the right one on each side.

4. You need to figure out how to read from a network connection in JavaScript. You will need to use XMLHttpRequest ("XHR") or the newer Fetch (see the page on MDN for basic expectations of browser support for this). Fetch returns a Response object; you could probably ask for fewer bytes than the whole thing and then use Response.blob if you need everything in memory.
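A rough sketch of the "ask for fewer bytes" idea from point 4, using Fetch and the Response body stream. The helper names `readTextPrefix` and `fetchTextPrefix` and the `fetchImpl` parameter are hypothetical, and it is an assumption here that the server honours an HTTP `Range` header; the reader loop caps the size either way.

```javascript
// Sketch only: decode at most `maxBytes` of a Fetch Response body as UTF-8
// text, without buffering the rest of the object.
async function readTextPrefix(response, maxBytes) {
  const reader = response.body.getReader();
  const decoder = new TextDecoder("utf-8");
  let received = 0;
  let text = "";
  while (received < maxBytes) {
    const { done, value } = await reader.read();
    if (done) break;
    // Trim the chunk so the running total never exceeds maxBytes.
    const chunk = value.subarray(0, maxBytes - received);
    received += chunk.byteLength;
    text += decoder.decode(chunk, { stream: true });
  }
  // Tell the browser we are finished so it can stop downloading.
  await reader.cancel();
  // Flush the decoder; a multi-byte character cut at the limit becomes U+FFFD.
  text += decoder.decode();
  return text;
}

// Request only the first `maxBytes` bytes (assumption: Range is supported),
// then enforce the cap client-side in case the full body is sent anyway.
async function fetchTextPrefix(url, maxBytes, fetchImpl = fetch) {
  const response = await fetchImpl(url, {
    headers: { Range: `bytes=0-${maxBytes - 1}` },
  });
  if (!response.ok) {
    throw new Error(`GET ${url} failed with status ${response.status}`);
  }
  return readTextPrefix(response, maxBytes);
}
```

If everything really is needed in memory, `response.blob()` or `response.text()` replaces the reader loop, at the cost of buffering the whole object.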
