# help
a
Hi, could someone help me? When I try to read a CSV file from lakeFS using Apache Spark, I get the following exception:
Caused by: io.lakefs.clients.api.ApiException: Content type "text/html; charset=utf-8" is not supported for type: class io.lakefs.clients.api.model.ObjectStats
  at io.lakefs.clients.api.ApiClient.deserialize(ApiClient.java:820)
  at io.lakefs.clients.api.ApiClient.handleResponse(ApiClient.java:1018)
  at io.lakefs.clients.api.ApiClient.execute(ApiClient.java:942)
  at io.lakefs.clients.api.ObjectsApi.statObjectWithHttpInfo(ObjectsApi.java:951)
  at io.lakefs.clients.api.ObjectsApi.statObject(ObjectsApi.java:926)
  at io.lakefs.LakeFSFileSystem.getFileStatus(LakeFSFileSystem.java:534)
Has anybody faced this issue before?
b
Hi, can you share the Spark configuration you set? It looks like the lakeFS API endpoint is not set right.
a
lakeFS has been deployed in Docker.
(screenshot: Screenshot 2021-08-10 at 18.33.49.png)
b
Can you set it to http://lakefs:8000/api/v1 (i.e. with '/api/v1' added)?
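For reference, a minimal sketch of how that might look on the Spark side, assuming a spark-shell session and the fs.lakefs.impl / fs.lakefs.endpoint properties of the lakeFS Hadoop filesystem (hostname and port taken from this thread):

// Sketch only: point the lakeFS Hadoop filesystem at the lakeFS API.
// The trailing /api/v1 is the part that was missing above.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
hadoopConf.set("fs.lakefs.endpoint", "http://lakefs:8000/api/v1")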
a
Looks like it helped, but now I have another exception. Does it mean I have an incorrect access key and secret key?
b
Yes, you need to use the lakeFS key/secret: fs.lakefs.access.key and fs.lakefs.secret.key.
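Continuing the sketch above, those two keys are set the same way; the values here are placeholders for the credentials generated by your lakeFS installation, not the credentials of the underlying object store:

// Placeholder lakeFS credentials (not the S3 bucket's credentials).
hadoopConf.set("fs.lakefs.access.key", "<lakefs-access-key-id>")
hadoopConf.set("fs.lakefs.secret.key", "<lakefs-secret-access-key>")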
a
Yep, it was a mistake in the keys. I made the next step but still have an exception:
b
Are you using lakeFS with the 'local' adapter?
The hadoop-lakefs library was built to access data directly from the client: Spark reads the data directly from the object storage and the metadata from lakeFS. Currently we support only the s3a endpoint. When lakeFS is using the local adapter, objects are stored inside the container.
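To make that split concrete: the fs.lakefs.* settings above cover the metadata calls, while the data itself is fetched straight from the backing object store through the s3a connector, so that connector needs credentials of its own. A sketch, with bucket credentials, repository and branch names as placeholders:

// Data is read directly from the backing S3 bucket, so s3a needs its own credentials.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.access.key", "<s3-access-key-id>")
hc.set("fs.s3a.secret.key", "<s3-secret-access-key>")

// Paths use the lakefs:// scheme: lakefs://<repository>/<branch>/<path>
val df = spark.read.option("header", "true").csv("lakefs://example-repo/main/data/file.csv")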
o
Hey @Artsiom Yudovin (cc @Barak Amar) - indeed, the native Hadoop FS used with Spark only supports s3a. I've opened an issue to provide better error messages when using an unsupported storage backend. To access data stored in lakeFS when running with locally stored data, you can use S3A directly to talk to lakeFS through the lakeFS S3 Gateway.
Hope this helps - happy to help with further questions. Keep in mind that you can set up a local installation with Docker that stores lakeFS objects back to S3, making it possible to use the native Spark client with a local lakeFS instance as well; it's just that the underlying storage has to be accessible using s3a.
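A rough sketch of the first suggestion, talking to the lakeFS S3 Gateway with plain S3A and no hadoop-lakefs involved; the gateway hostname, credentials, repository and branch are placeholders, and the hostname must match gateways.s3.domain_name in the lakeFS configuration:

// Point s3a at lakeFS itself (the S3 Gateway), using lakeFS credentials.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.endpoint", "http://<gateway-hostname>:8000")   // must equal gateways.s3.domain_name
hc.set("fs.s3a.access.key", "<lakefs-access-key-id>")
hc.set("fs.s3a.secret.key", "<lakefs-secret-access-key>")
hc.set("fs.s3a.path.style.access", "true")

// The repository is addressed like a bucket, with the branch as the first path element.
val df = spark.read.csv("s3a://example-repo/main/data/file.csv")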
a
Thank you for your help! I started to use s3a but I still have an issue. I'm not sure this issue is connected to lakeFS; it looks like I need to configure the AmazonS3Client to work with the Docker service name.
I have the following configuration:
o
Hmm. The s3a.endpoint should be the same as the value configured for gateways.s3.domain_name in the lakeFS configuration.
From the examples you've sent, I'm assuming you're running both Spark and lakeFS in the same Docker network as linked containers?
a
yes
o
I'm not 100% sure how to set up the DNS for this to work between containers, but generally lakeFS differentiates S3 gateway requests from other API requests based on the hostname used, which is why the endpoint needs to match up with the config. If it's alright with you, I'd like to check with people on the team to see what's the recommended way of doing this with a Docker setup. Let me do my research and get back to you around noon tomorrow (I'm GMT+3, so it's currently night time here).
a
Yeah, thank you 🙏! FYI, the Spark version is 3.0.2 and hadoop-aws is 2.7.4.
o
Thanks! that's helpful. Will update here tomorrow then.
a
JFYI: I changed the Docker service name to an IP address and began to get the following exception, even though this bucket does exist. No need to answer now; it's OK to wait for your response tomorrow.
o
Hey @Artsiom Yudovin! After digging a bit into this, the best solution I've found is to use docker-compose's links to match up the value given in the lakeFS config for LAKEFS_GATEWAYS_S3_DOMAIN_NAME with the link name. I've uploaded a detailed working example here - it should be usable as is, but probably easy to adapt to your existing configuration. It's configured to run Spark 3.0, so I believe it should work for you. Let me know how this works!
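On the Spark side of that setup, the s3a endpoint then uses the linked hostname (a sketch, assuming the s3.docker.lakefs.io name that comes up later in this thread):

// The s3a endpoint must be exactly the hostname that lakeFS is told to expect
// via LAKEFS_GATEWAYS_S3_DOMAIN_NAME in the linked container.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.endpoint", "http://s3.docker.lakefs.io:8000")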
a
Thank you! Small question: do I need to have S3 under lakeFS if I want to deploy it locally without using any cloud service? Is that possible?
Also, yesterday I began using the IP address of the container and got a new exception like "bucket not exists". I guess it's another issue. Do you have any idea about the cause?
o
You have to use the same hostname as given in LAKEFS_GATEWAYS_S3_DOMAIN_NAME - the S3 protocol doesn't work well with IP addresses.
Sure - with this setup you don't have to use S3 as storage, you can use the local storage adapter. Let me send you a quick example
a
Got it, thank you! I will try the example and come back with feedback.
o
I've updated the example to use a local storage adapter - this means the data itself will be stored inside the Docker container, so it's not really suitable for production (but faster and easier for local testing).
Let me know how it goes 🙂
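Once the endpoint and credentials line up, reading from a lakeFS branch is just an ordinary Spark read over s3a; repository and branch names here are hypothetical:

// s3a://<repository>/<branch>/<path>, served by the lakeFS S3 Gateway.
val df = spark.read
  .option("header", "true")
  .csv("s3a://example-repo/main/data/input.csv")
df.show()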
a
So, I tried to use your example and had some trouble with the Spark execution: Spark doesn't execute any stage and seems to be waiting for something. Apart from that, I tried running Spark not from Docker but from my machine, and I get the following exception:
My configuration is:
o
Hmm, for this config to work, you'd need to set LAKEFS_GATEWAYS_S3_DOMAIN_NAME to s3.local.lakefs.io - is that properly configured?
a
Do you mean configured for my machine?
o
configured for the lakefs container
a
I used the container from this example.
o
a
Do I really need to use it if I deploy lakeFS as a Docker container?
o
You must tell lakeFS which hostname will be used to accept S3 requests. It has to match the hostname you're using as s3a.endpoint, otherwise lakeFS will not be able to parse Spark's requests.
If you set it to s3.local.lakefs.io, you will need to modify the example I shared to use that name instead of s3.docker.lakefs.io.
a
Ah, sorry, my mistake, I didn't realize that in the example we have s3.docker.lakefs.io.
I thought that we were using s3.local.lakefs.io.
Everything works now. Thank you so much! You really helped me 🙏
o
yay! happy it works!