# help
a
Hi, could someone help me? When I try to read a CSV file from lakeFS using Apache Spark, I get the following exception:
Caused by: io.lakefs.clients.api.ApiException: Content type "text/html; charset=utf-8" is not supported for type: class io.lakefs.clients.api.model.ObjectStats
  at io.lakefs.clients.api.ApiClient.deserialize(ApiClient.java:820)
  at io.lakefs.clients.api.ApiClient.handleResponse(ApiClient.java:1018)
  at io.lakefs.clients.api.ApiClient.execute(ApiClient.java:942)
  at io.lakefs.clients.api.ObjectsApi.statObjectWithHttpInfo(ObjectsApi.java:951)
  at io.lakefs.clients.api.ObjectsApi.statObject(ObjectsApi.java:926)
  at io.lakefs.LakeFSFileSystem.getFileStatus(LakeFSFileSystem.java:534)
Has anybody faced this issue before?
b
Hi, can you share the Spark configuration you set? It looks like the lakeFS API endpoint is not set right.
a
lakeFS has been deployed in Docker.
(screenshot: Screenshot 2021-08-10 at 18.33.49.png)
b
Can you set it to http://lakefs:8000/api/v1 (i.e. with '/api/v1' added)?
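For reference, a minimal sketch of how that might look on the Spark side, assuming a spark-shell session and the fs.lakefs.impl / fs.lakefs.endpoint properties of the lakeFS Hadoop filesystem (hostname and port taken from this thread):

// Sketch only: point the lakeFS Hadoop filesystem at the lakeFS API.
// The trailing /api/v1 is the part that was missing above.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
hadoopConf.set("fs.lakefs.endpoint", "http://lakefs:8000/api/v1")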
a
Looks like it helped, but now I have another exception. Does it mean I have an incorrect access key and secret key?
b
Yes, you need to use the lakeFS key/secret: fs.lakefs.access.key and fs.lakefs.secret.key.
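Continuing the sketch above, those two keys are set the same way; the values here are placeholders for the credentials generated by your lakeFS installation, not the credentials of the underlying object store:

// Placeholder lakeFS credentials (not the S3 bucket's credentials).
hadoopConf.set("fs.lakefs.access.key", "<lakefs-access-key-id>")
hadoopConf.set("fs.lakefs.secret.key", "<lakefs-secret-access-key>")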
a
Yep, it was a mistake in the keys. I made the next step but still have an exception:
b
Are you using lakeFS with the 'local' adapter?
The hadoop-lakefs library was built to access data directly from the client: Spark reads the data directly from the object storage and the metadata from lakeFS. Currently we support only the s3a endpoint. When lakeFS is using the local adapter, objects are stored inside the container.
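To make that split concrete: the fs.lakefs.* settings above cover the metadata calls, while the data itself is fetched straight from the backing object store through the s3a connector, so that connector needs credentials of its own. A sketch, with bucket credentials, repository and branch names as placeholders:

// Data is read directly from the backing S3 bucket, so s3a needs its own credentials.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.access.key", "<s3-access-key-id>")
hc.set("fs.s3a.secret.key", "<s3-secret-access-key>")

// Paths use the lakefs:// scheme: lakefs://<repository>/<branch>/<path>
val df = spark.read.option("header", "true").csv("lakefs://example-repo/main/data/file.csv")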
o
Hey @Artsiom Yudovin (cc @Barak Amar) - indeed, the native Hadoop FS used with Spark only supports s3a. I've opened an issue to provide better error messages when using an unsupported storage backend. To access data stored in lakeFS when running with locally stored data, you can use S3A directly to talk to lakeFS through the lakeFS S3 Gateway.
Hope this helps - happy to help with further questions. Keep in mind that you can set up a local installation with Docker that stores lakeFS objects back to S3, making it possible to use the native Spark client with a local lakeFS instance as well; it's just that the underlying storage has to be accessible using s3a.
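A rough sketch of the first suggestion, talking to the lakeFS S3 Gateway with plain S3A and no hadoop-lakefs involved; the gateway hostname, credentials, repository and branch are placeholders, and the hostname must match gateways.s3.domain_name in the lakeFS configuration:

// Point s3a at lakeFS itself (the S3 Gateway), using lakeFS credentials.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.endpoint", "http://<gateway-hostname>:8000")   // must equal gateways.s3.domain_name
hc.set("fs.s3a.access.key", "<lakefs-access-key-id>")
hc.set("fs.s3a.secret.key", "<lakefs-secret-access-key>")
hc.set("fs.s3a.path.style.access", "true")

// The repository is addressed like a bucket, with the branch as the first path element.
val df = spark.read.csv("s3a://example-repo/main/data/file.csv")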
a
Thank you for your help! I started to use s3a but I still have an issue. I'm not sure this issue is connected to lakeFS; it looks like I need to configure the AmazonS3Client to work with the Docker service name.
I have the following configuration:
o
Hmm. The s3a.endpoint should be the same as the value configured for gateways.s3.domain_name in the lakeFS configuration.
From the examples you've sent, I'm assuming you're running both Spark and lakeFS in the same Docker network as linked containers?
a
yes
o
I'm not 100% sure how to set up the DNS for this to work between containers, but generally lakeFS differentiates S3 gateway requests from other API requests based on the hostname used, which is why the endpoint needs to match up with the config. If it's alright with you, I'd like to check with people on the team to see what's the recommended way of doing this with a Docker setup. Let me do my research and get back to you around noon tomorrow (I'm GMT+3, so it's currently night time here).
a
Yeah, thank you 🙏! FYI, the Spark version is 3.0.2 and hadoop-aws is 2.7.4.
o
Thanks! that's helpful. Will update here tomorrow then.
a
JFYI: I changed the Docker service name to an IP address and began to get the following exception, even though this bucket does exist. No need to answer now; it's OK to wait for your response tomorrow.
o
Hey @Artsiom Yudovin! After digging a bit into this, the best solution I've found is to use docker-compose's links to match up the value given in the lakeFS config for LAKEFS_GATEWAYS_S3_DOMAIN_NAME with the link name. I've uploaded a detailed working example here - it should be usable as is, but probably easy to adapt to your existing configuration. It's configured to run Spark 3.0, so I believe it should work for you. Let me know how this works!
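On the Spark side of that setup, the s3a endpoint then uses the linked hostname (a sketch, assuming the s3.docker.lakefs.io name that comes up later in this thread):

// The s3a endpoint must be exactly the hostname that lakeFS is told to expect
// via LAKEFS_GATEWAYS_S3_DOMAIN_NAME in the linked container.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.endpoint", "http://s3.docker.lakefs.io:8000")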
a
Thank you! Small question: do I need to have S3 under lakeFS if I want to deploy it locally without using any cloud service? Is that possible?
Also, yesterday I began using the IP address of the container and got a new exception like "bucket not exists". I guess it's another issue. Do you have any idea about the cause?
o
You have to use the same hostname as given in LAKEFS_GATEWAYS_S3_DOMAIN_NAME - the S3 protocol doesn't work well with IP addresses.
Sure - with this setup you don't have to use S3 as storage, you can use the local storage adapter. Let me send you a quick example
a
Got it, thank you! I will try the example and come back with feedback.
o
I've updated the example to use a local storage adapter - this means the data itself will be stored inside the Docker container, so it's not really suitable for production (but faster and easier for local testing).
Let me know how it goes 🙂
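Once the endpoint and credentials line up, reading from a lakeFS branch is just an ordinary Spark read over s3a; repository and branch names here are hypothetical:

// s3a://<repository>/<branch>/<path>, served by the lakeFS S3 Gateway.
val df = spark.read
  .option("header", "true")
  .csv("s3a://example-repo/main/data/input.csv")
df.show()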
a
So, I tried to use your example and had some trouble with the Spark execution: Spark doesn't execute any stage and seems to be waiting for something. Apart from that, I tried running Spark not from Docker but from my machine, and I get the following exception:
My configuration is:
o
Hmm, for this config to work, you'd need to set LAKEFS_GATEWAYS_S3_DOMAIN_NAME to s3.local.lakefs.io - is that properly configured?
a
Do you mean configured for my machine?
o
configured for the lakefs container
a
I used the container from this example.
o
a
Do I really need to use it if I deploy lakeFS as a Docker container?
o
You must tell lakeFS which hostname will be used to accept S3 requests. It has to match the hostname you're using as s3a.endpoint, otherwise lakeFS will not be able to parse Spark's requests.
If you set it to s3.local.lakefs.io, you will need to modify the example I shared to use that name instead of s3.docker.lakefs.io.
a
Ah, sorry, my mistake, I didn't realize that in the example we have s3.docker.lakefs.io.
I thought that we were using s3.local.lakefs.io.
Everything works now. Thank you so much! You really helped me 🙏
o
yay! happy it works!