r
if I want to use lakeFS with Spark SQL, do I set `spark.sql.warehouse.dir`? I've tried that but getting
`org.apache.spark.SparkException: Unable to create database default as failed to create its directory s3://example/main`
and no obvious error on the lakeFS side that I can see.
o
is that the path you provided? I think Spark would not be able to create a "directory" right at the root of the branch, since that would mean an empty path.
Are you looking to use managed tables? I think this setting might not be necessary if you're using external tables.
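For context, a minimal PySpark sketch of the external-table route, which sidesteps `spark.sql.warehouse.dir` entirely. It assumes an existing `spark` session and the repo/branch from the thread; the table name and sub-path are made up:

```python
# External table: an explicit LOCATION means Spark never has to
# create a database directory under spark.sql.warehouse.dir.
spark.sql("""
    CREATE TABLE demo (id BIGINT)
    USING parquet
    LOCATION 's3a://example/main/tables/demo/'
""")
```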
r
thanks Amit, that's useful
I got it working - it needed to be `s3a`, not `s3`. Is there a useful reference on why the URI is sometimes one or the other?
actually I hit another issue. I'm not sure if it's lakeFS misbehaving, or my misunderstanding of How Stuff Works. If I just create the empty lakeFS repo, when I run `CREATE DATABASE` from Spark SQL it fails:

```
org.apache.spark.SparkException: Unable to create database default as failed to create its directory s3a://example/main

java.io.FileNotFoundException: PUT 0-byte object on main/: com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: null; S3 Extended Request ID: null; Proxy: null), S3 Extended Request ID: null:404 Not Found
```

However, if I create a dummy file (e.g. `spark.range(0, 1).write.save('s3a://example/main/dummy_file')`) then the `CREATE DATABASE` works fine. Is there a more elegant/proper way to do this than creating a dummy file first? Notebook: https://gist.github.com/rmoff/28211a28adf7c55607f7ed7e4c4efc8f
o
You can set your warehouse dir to be `s3a://example/main/warehouse/` or something of that sort. Otherwise, Spark will attempt to write a zero-length-named file at the root of that location, which is ambiguous with the branch URI.
I believe that will work
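In config terms, that suggestion might look something like the sketch below. Only the warehouse-dir line is the actual fix being proposed; the endpoint and credentials are placeholders for a lakeFS instance reached through its S3 gateway:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakefs-spark-sql")
    # The fix: point the warehouse at a sub-path of the branch,
    # never at the bare branch root s3a://example/main.
    .config("spark.sql.warehouse.dir", "s3a://example/main/warehouse/")
    # Placeholder S3A wiring for a lakeFS S3 gateway.
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs-access-key-id>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs-secret-key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# The database directory now lands under .../main/warehouse/,
# not at the branch root.
spark.sql("CREATE DATABASE IF NOT EXISTS demo")
```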
r
`s3a://example/main` worked just fine - it was the `s3` prefix that broke things
I still don't get why the existence of a dummy file would make a difference though.
is this by design?
a
1. "s3" might be s3n, or some other Hadoop FileSystem. Each one writes differently. Full logs of the exception can help figure out what's going on. s3a and lakeFSFS should work. Others might. 2. The issue itself is that you cannot create an empty object at the root of a lakeFS branch. This is supposed to mirror the fact that you cannot create an empty object at the root of an s3 bucket. Hadoop loves creating empty files all over the place... And AFAIR it takes social care but to create them at the root of the bucket by counting slashes in the key. That doesn't work for lakeFS roots that have an extra slash.