#data-discussion

Comte Frédéric

08/12/2022, 10:01 AM
When using Spark with S3, each piece of output data is written three times (or twice, depending on the file output committer algorithm version), which is painful on S3. I guess lakeFS could solve this kind of problem. Is that already the case?
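
For context, the "algorithm version" mentioned here is the Hadoop `FileOutputCommitter` setting `mapreduce.fileoutputcommitter.algorithm.version`, which Spark accepts with a `spark.hadoop.` prefix. A minimal sketch of selecting it follows; the helper names `committer_conf` and `as_submit_flags` are hypothetical, and the settings are only rendered as `spark-submit` flags rather than applied to a live session.

```python
def committer_conf(version):
    """Return the Spark setting that selects the FileOutputCommitter algorithm."""
    if version not in (1, 2):
        raise ValueError("algorithm version must be 1 or 2")
    return {
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": str(version)
    }


def as_submit_flags(conf):
    """Render a conf dict as a flat list of spark-submit arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


# Hypothetical usage: build the flags for algorithm version 2.
print(" ".join(as_submit_flags(committer_conf(2))))
```

Both versions rely on renames, which S3 implements as copy plus delete, hence the repeated writes described above.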
Oz Katz

08/12/2022, 10:08 AM
Hey @Comte Frédéric! Excellent question 🙂 We have an open proposal to provide such functionality using a custom Spark OutputCommitter.
10:09 AM
If you have opinions or feedback we’d LOVE to hear them via an issue or pull request to the proposal.

Comte Frédéric

08/12/2022, 10:59 AM
Oz Katz

08/12/2022, 11:02 AM
Yes! We'll probably support that using the lakeFS Hadoop Filesystem that does direct data access, as you mentioned above 🙂
11:02 AM
Are you already using lakeFS?

Comte Frédéric

08/12/2022, 11:03 AM
Not really, I just offer this new service on my data platform
11:04 AM
and I'm thinking about how to use it for my users
Ariel Shaqed (Scolnicov)

08/12/2022, 2:56 PM
As @Oz Katz says, it will be offered as part of lakeFSFS. But you're right that it might work separately... In a sense it will be easier to use with lakeFSFS than without! In both cases you need credentials for both lakeFS and S3; the only difference will be in wiring up the proposed LakeFSOutputCommitter. It turns out that this is easiest to configure by FileSystem. My plan is to start by offering it only on top of lakeFSFS; if people make a case for using it with S3AFileSystem, then we'll see how to add it.
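
To illustrate the point about needing both sets of credentials, here is a rough sketch of the Spark settings commonly used with the lakeFS Hadoop FileSystem. The helper `lakefs_conf` and all argument values are hypothetical placeholders; the proposed LakeFSOutputCommitter is deliberately not wired in, since its configuration is still only a proposal.

```python
def lakefs_conf(lakefs_endpoint, lakefs_access_key, lakefs_secret_key,
                s3_access_key, s3_secret_key):
    """Return Spark settings for lakeFSFS: lakeFS for metadata, S3A for data."""
    return {
        # Register the lakeFS Hadoop FileSystem for lakefs:// URIs.
        "spark.hadoop.fs.lakefs.impl": "io.lakefs.LakeFSFileSystem",
        # Credentials for the lakeFS API (metadata operations).
        "spark.hadoop.fs.lakefs.endpoint": lakefs_endpoint,
        "spark.hadoop.fs.lakefs.access.key": lakefs_access_key,
        "spark.hadoop.fs.lakefs.secret.key": lakefs_secret_key,
        # Credentials for direct data access to the underlying S3 bucket.
        "spark.hadoop.fs.s3a.access.key": s3_access_key,
        "spark.hadoop.fs.s3a.secret.key": s3_secret_key,
    }


# Hypothetical values, for illustration only.
conf = lakefs_conf(
    "https://lakefs.example.com/api/v1",
    "LAKEFS_KEY_PLACEHOLDER", "LAKEFS_SECRET_PLACEHOLDER",
    "S3_KEY_PLACEHOLDER", "S3_SECRET_PLACEHOLDER",
)
for key in sorted(conf):
    print(key)
```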
2:04 PM
Forgot to add: for S3 itself, newer Spark versions provide the "magic" output committer, which is supposed to be quite good. It depends on the newer, strongly consistent behaviour of S3. I sometimes find myself bewildered by the Spark/Hadoop support matrix for enabling this feature safely. (Also, it might not work for other "S3-compatible" object stores if those are not consistent.)
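
As a sketch of what enabling the magic committer usually involves, the settings below follow the Hadoop S3A committer and Spark cloud-integration documentation. Treat them as assumptions to check against your own Spark/Hadoop versions: the two `org.apache.spark.internal.io.cloud` classes require the `spark-hadoop-cloud` module on the classpath. The helper `as_submit_flags` is hypothetical.

```python
# Settings for the S3A "magic" committer; verify them against your versions.
MAGIC_COMMITTER_CONF = {
    "spark.hadoop.fs.s3a.committer.name": "magic",
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
    # Route Spark SQL writes through the Hadoop PathOutputCommitter machinery.
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def as_submit_flags(conf):
    """Render a conf dict as a flat list of spark-submit arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


print(" \\\n  ".join(as_submit_flags(MAGIC_COMMITTER_CONF)))
```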

Comte Frédéric

09/08/2022, 10:54 AM
Yes, we enabled the magic committer by default in our datalab
10:57 AM
but it is writing files in S3; with the LakeFSOutputCommitter it will not
Ariel Shaqed (Scolnicov)

09/08/2022, 12:13 PM
Indeed -- I do not expect the magic committer to work with LakeFSFS (and indeed LakeFSFS does not report it is magic-capable). I am building the case for adding the above OutputCommitter to LakeFSFS on the potential wins PR. Could you help me add your use-case (ideally with its expected scale) to this PR? It will help me boost its priority. (Applies to any potential users, of course!)