Thread
#data-architecture-discussion

    Comte Frédéric

    1 month ago
    Using Spark with S3, there is a triple (or double, depending on the output file committer algorithm version) write for each piece of output data, which is painful on S3. I guess lakeFS could solve this kind of problem. Is that already the case?
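[For reference, the "committer algorithm version" mentioned above is a standard Hadoop setting. A minimal sketch of toggling it in a Spark job (job name is a placeholder):]

```shell
# Algorithm version 1: tasks commit into a job attempt directory, then the
# job commit renames everything again -- an extra pass that is costly on S3,
# where "rename" is really copy + delete.
# Algorithm version 2: task commit renames output directly to the final
# destination, saving one round of copies (with weaker failure semantics).
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  my_job.py
```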
    Oz Katz

    1 month ago
    Hey @Comte Frédéric! Excellent question 🙂 We have an open proposal to provide such functionality using a custom Spark OutputCommitter.
    If you have opinions or feedback we’d LOVE to hear them via an issue or pull request to the proposal.

    Comte Frédéric

    1 month ago
    Oz Katz

    1 month ago
    Yes! We'll probably support that using the lakeFS Hadoop Filesystem that does direct data access, as you mentioned above 🙂
    Are you already using lakeFS?

    Comte Frédéric

    1 month ago
    not really, I've just offered this new service on my data platform
    and I'm thinking about how my users could use it
    Ariel Shaqed (Scolnicov)

    1 month ago
    As @Oz Katz says, it will be offered as part of lakeFSFS. But you're right that it might work separately... In a sense it will be easier to use with lakeFSFS than without! In both cases you need credentials for lakeFS and for S3; the only difference will be in wiring up the proposed LakeFSOutputCommitter. It turns out that this is easiest to configure per FileSystem. I plan for us to start by offering it just on top of lakeFSFS; if people make a case for using it with S3AFileSystem then we'll see how to add it.
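[The "credentials for lakeFS and for S3" wiring described above typically looks like this when running Spark with the lakeFS Hadoop FileSystem. Endpoints, keys, and the assembly version are placeholders; the proposed committer itself has no config key yet, so only the existing credential setup is shown:]

```shell
# Sketch: lakeFSFS needs lakeFS API credentials AND underlying S3 credentials.
# All values below are placeholders -- substitute your own.
spark-submit \
  --packages io.lakefs:hadoop-lakefs-assembly:0.2.4 \
  --conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem \
  --conf spark.hadoop.fs.lakefs.endpoint=https://lakefs.example.com/api/v1 \
  --conf spark.hadoop.fs.lakefs.access.key=LAKEFS_ACCESS_KEY \
  --conf spark.hadoop.fs.lakefs.secret.key=LAKEFS_SECRET_KEY \
  --conf spark.hadoop.fs.s3a.access.key=S3_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=S3_SECRET_KEY \
  my_job.py
# Jobs then address data as lakefs://<repo>/<branch>/<path>,
# while lakeFSFS reads and writes the object data directly against S3.
```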
    Forgot to add: For S3 itself, new Spark versions provide the "magic" output committer, which is supposed to be quite good. It depends on the new consistent behaviour of S3. I sometimes find myself bewildered by the Spark/Hadoop support matrix for enabling this feature safely. (Also it might not work for other "S3-compatible" object stores, if those are not consistent.)
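[For completeness, enabling the magic committer on a recent Spark/Hadoop 3.x build (with the spark-hadoop-cloud module on the classpath) is usually a set of configs along these lines -- a sketch only; check your exact versions against the support matrix mentioned above:]

```shell
spark-submit \
  --conf spark.hadoop.fs.s3a.committer.name=magic \
  --conf spark.hadoop.fs.s3a.committer.magic.enabled=true \
  --conf spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol \
  --conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter \
  my_job.py
# The magic committer stages output as S3 multipart uploads that are only
# completed at job commit, so data becomes visible exactly once --
# no rename-based double/triple writes, but it requires consistent S3.
```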

    Comte Frédéric

    2 weeks ago
    Yes, we enabled the magic committer by default in our datalab
    but it writes files to S3; with the LakeFSOutputCommitter it will not
    Ariel Shaqed (Scolnicov)

    2 weeks ago
    Indeed -- I do not expect the magic committer to work with LakeFSFS (and indeed LakeFSFS does not report it is magic-capable). I am building the case for adding the above OutputCommitter to LakeFSFS on the potential wins PR. Could you help me add your use-case (ideally with its expected scale) to this PR? It will help me boost its priority. (Applies to any potential users, of course!)