Comte Frédéric

08/12/2022, 10:01 AM
Using Spark with S3, each output file is written three times (or twice, depending on the output file committer algorithm version), which is painful on S3. I guess lakeFS could solve this kind of problem. Is that already the case?
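[Editor's sketch] The double or triple write mentioned above comes from how Hadoop's classic FileOutputCommitter commits by renaming, combined with the fact that a rename on S3 is a copy followed by a delete. The Python below is only a back-of-the-envelope model of that count, not Spark code; the function name is hypothetical. The configuration key in the comment is the standard Hadoop setting that selects the algorithm version.

```python
# Hadoop setting that selects the FileOutputCommitter algorithm (1 or 2).
# From Spark it is usually passed with the "spark.hadoop." prefix.
ALGORITHM_VERSION_KEY = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"

# Renames each output file goes through before it reaches its final path:
#   v1: task commit (attempt dir -> job temp dir) + job commit (-> destination)
#   v2: task commit moves the file straight to the destination
RENAMES_PER_FILE = {1: 2, 2: 1}


def s3_writes_per_output_file(algorithm_version: int) -> int:
    """Illustrative count of how many times one file's bytes are written on S3.

    Hypothetical helper: one initial write by the task, plus one copy for
    every rename, because S3 has no native rename.
    """
    if algorithm_version not in RENAMES_PER_FILE:
        raise ValueError(f"unknown algorithm version: {algorithm_version}")
    return 1 + RENAMES_PER_FILE[algorithm_version]


if __name__ == "__main__":
    for version in (1, 2):
        print(version, s3_writes_per_output_file(version))
```

The model ignores retries and speculative task attempts, which can add further writes.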

Oz Katz

08/12/2022, 10:08 AM
Hey @Comte Frédéric ! Excellent question 🙂 we have an open proposal to provide such functionality using a custom Spark OutputCommitter
If you have opinions or feedback we’d LOVE to hear them via an issue or pull request to the proposal.

Comte Frédéric

08/12/2022, 10:59 AM

Oz Katz

08/12/2022, 11:02 AM
Yes! We'll probably support that using the lakeFS Hadoop Filesystem that does direct data access, as you mentioned above 🙂
Are you already using lakeFS?

Comte Frédéric

08/12/2022, 11:03 AM
Not really, I just offer this new service on my data platform
and am thinking about how my users could use it
👍 1

Ariel Shaqed (Scolnicov)

08/12/2022, 2:56 PM
As @Oz Katz says, it will be offered as part of lakeFSFS. But you're right that it might work separately... In a sense it will be easier to use with lakeFSFS than without! In both cases you need credentials for lakeFS and for S3; the only difference will be in wiring up the proposed LakeFSOutputCommitter. It turns out that this is easiest to configure per FileSystem. I plan for us to start by offering it only on top of lakeFSFS; if people make a case for using it with S3AFileSystem then we'll see how to add it.
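[Editor's sketch] To illustrate the "credentials for lakeFS and for S3" point: a minimal sketch of the Spark settings used with the lakeFS Hadoop FileSystem (lakeFSFS), assuming the property names from the lakeFS documentation and that the lakeFSFS and S3A jars are already on the classpath. The function name and all argument values are placeholders. Nothing here wires up the LakeFSOutputCommitter, which is only a proposal in this thread.

```python
def lakefs_fs_spark_conf(lakefs_endpoint, lakefs_access_key, lakefs_secret_key,
                         s3_access_key, s3_secret_key):
    """Build Spark settings for lakeFSFS (hypothetical helper name).

    lakeFSFS talks to the lakeFS server for metadata and reads or writes
    the data directly on S3 through S3A, hence the two sets of credentials.
    """
    return {
        # Route lakefs:// URIs to the lakeFS Hadoop FileSystem.
        "spark.hadoop.fs.lakefs.impl": "io.lakefs.LakeFSFileSystem",
        # Credentials and API endpoint for the lakeFS server (metadata).
        "spark.hadoop.fs.lakefs.endpoint": lakefs_endpoint,
        "spark.hadoop.fs.lakefs.access.key": lakefs_access_key,
        "spark.hadoop.fs.lakefs.secret.key": lakefs_secret_key,
        # Credentials for the underlying S3 bucket (direct data access).
        "spark.hadoop.fs.s3a.access.key": s3_access_key,
        "spark.hadoop.fs.s3a.secret.key": s3_secret_key,
    }


if __name__ == "__main__":
    conf = lakefs_fs_spark_conf(
        "https://lakefs.example.com/api/v1",  # placeholder endpoint
        "LAKEFS_KEY", "LAKEFS_SECRET",        # placeholder lakeFS credentials
        "S3_KEY", "S3_SECRET",                # placeholder S3 credentials
    )
    for key in sorted(conf):
        print(key)
```

Each pair can be passed to Spark as `--conf key=value` or through `SparkSession.builder.config(key, value)`.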
Forgot to add: For S3 itself, new Spark versions provide the "magic outputcommitter" that is supposed to be quite good. It depends on the new consistent behaviour of S3. I sometimes find myself bewildered by the Spark/Hadoop support matrix for enabling this feature safely. (Also it might not work for other "S3-compatible" object stores, if those are not consistent.)
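[Editor's sketch] For readers who want to try the magic committer on plain S3: a minimal sketch of the settings described in the Spark cloud-integration and Hadoop S3A committer documentation. It assumes the `spark-hadoop-cloud` module is on the classpath; the exact set of properties needed varies with the Spark and Hadoop versions, so treat this as a starting point. The helper `apply_conf` is hypothetical and works with any builder object exposing `config(key, value)`, such as `SparkSession.builder`.

```python
# Settings commonly used to switch Spark output on s3a:// to the S3A magic committer.
MAGIC_COMMITTER_CONF = {
    "spark.hadoop.fs.s3a.committer.name": "magic",
    # Enabled by default on recent Hadoop releases; set explicitly for older ones.
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
    # Use the S3A committer factory for every s3a:// destination.
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    # Spark-side bindings, provided by the spark-hadoop-cloud module.
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def apply_conf(builder, conf):
    """Apply each setting to a builder exposing .config(key, value).

    Hypothetical helper; returns the builder produced by the last call.
    """
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


if __name__ == "__main__":
    for key, value in MAGIC_COMMITTER_CONF.items():
        print(f"--conf {key}={value}")
```

With PySpark installed, `apply_conf(SparkSession.builder, MAGIC_COMMITTER_CONF).getOrCreate()` would start a session with these settings. The factory binding is keyed on the `s3a` scheme, which matches the remark above that a committer is easiest to configure per FileSystem.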

Comte Frédéric

09/08/2022, 10:54 AM
Yes, we enabled the magic committer by default in our datalab
but it is writing files in S3; with the LakeFSOutputCommitter it will not

Ariel Shaqed (Scolnicov)

09/08/2022, 12:13 PM
Indeed -- I do not expect the magic committer to work with LakeFSFS (and indeed LakeFSFS does not report it is magic-capable). I am building the case for adding the above OutputCommitter to LakeFSFS on the potential wins PR. Could you help me add your use-case (ideally with its expected scale) to this PR? It will help me boost its priority. (Applies to any potential users, of course!)