r
if I want to use lakeFS with Spark SQL, do I set `spark.sql.warehouse.dir`? I've tried that but getting
`org.apache.spark.SparkException: Unable to create database default as failed to create its directory s3://example/main`
and no obvious error on the lakeFS side that I can see.
o
is that the path you provided? I think Spark would not be able to create a "directory" right at the root of the branch, since that would mean an empty path.
Are you looking to use managed tables? I think this setting might not be necessary if you're using external tables.
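For context, a minimal PySpark sketch of the external-table route, which sidesteps `spark.sql.warehouse.dir` entirely. It assumes an existing `spark` session and the repo/branch from the thread; the table name and sub-path are made up:

```python
# External table: an explicit LOCATION means Spark never has to
# create a database directory under spark.sql.warehouse.dir.
spark.sql("""
    CREATE TABLE demo (id BIGINT)
    USING parquet
    LOCATION 's3a://example/main/tables/demo/'
""")
```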
r
thanks Amit, that's useful
I got it working - it needed to be `s3a`, not `s3`. Is there a useful reference on why the URI is sometimes one or the other?
actually I hit another issue. I'm not sure if it's lakeFS misbehaving, or my misunderstanding of How Stuff Works. If I just create the empty lakeFS repo, when I run `CREATE DATABASE` from Spark SQL it fails:

```
org.apache.spark.SparkException: Unable to create database default as failed to create its directory s3a://example/main

java.io.FileNotFoundException: PUT 0-byte object on main/: com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: null; S3 Extended Request ID: null; Proxy: null), S3 Extended Request ID: null:404 Not Found
```

However, if I create a dummy file (e.g. `spark.range(0, 1).write.save('s3a://example/main/dummy_file')`) then the `CREATE DATABASE` works fine. Is there a more elegant/proper way to do this than creating a dummy file first? Notebook: https://gist.github.com/rmoff/28211a28adf7c55607f7ed7e4c4efc8f
o
You can set your warehouse dir to be `s3a://example/main/warehouse/` or something of that sort. Otherwise, Spark will attempt to write a zero-length-named file at the root of that location, which is ambiguous with the branch URI.
I believe that will work
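In config terms, that suggestion might look something like the sketch below. Only the warehouse-dir line is the actual fix being proposed; the endpoint and credentials are placeholders for a lakeFS instance reached through its S3 gateway:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakefs-spark-sql")
    # The fix: point the warehouse at a sub-path of the branch,
    # never at the bare branch root s3a://example/main.
    .config("spark.sql.warehouse.dir", "s3a://example/main/warehouse/")
    # Placeholder S3A wiring for a lakeFS S3 gateway.
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs-access-key-id>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs-secret-key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# The database directory now lands under .../main/warehouse/,
# not at the branch root.
spark.sql("CREATE DATABASE IF NOT EXISTS demo")
```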
r
`s3a://example/main` worked just fine - it was the `s3` prefix that broke things
I still don't get why the existence of a dummy file would make a difference though.
is this by design?
a
1. "s3" might be s3n, or some other Hadoop FileSystem. Each one writes differently. Full logs of the exception can help figure out what's going on. s3a and lakeFSFS should work. Others might. 2. The issue itself is that you cannot create an empty object at the root of a lakeFS branch. This is supposed to mirror the fact that you cannot create an empty object at the root of an s3 bucket. Hadoop loves creating empty files all over the place... And AFAIR it takes social care but to create them at the root of the bucket by counting slashes in the key. That doesn't work for lakeFS roots that have an extra slash.