c
Hi, I am trying to configure Spark to access data from LakeFS following this tutorial using the S3A gateway. I set up my LakeFS credentials as env vars and start a Spark shell with the following command:
spark-shell --conf spark.hadoop.fs.s3a.access.key=${LAKECTL_CREDENTIALS_ACCESS_KEY_ID} \
            --conf spark.hadoop.fs.s3a.secret.key=${LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY} \
            --conf spark.hadoop.fs.s3a.endpoint=${LAKECTL_SERVER_ENDPOINT_URL} \
            --conf spark.hadoop.fs.s3a.path.style.access=true
When I try to read a lakeFS file from the Spark shell I get the following error (replacing actual repo and branch):
java.nio.file.AccessDeniedException: s3a://<my-repo>/<my-branch>/<path-to-file>: getFileStatus on s3a://<my-repo>/<my-branch>/<path-to-file>: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 4442587FB7D0A2F9; S3 Extended Request ID: null), S3 Extended Request ID: null:403 Forbidden
Do I need another key/secret in addition to the lakeFS key/secret, e.g. AWS credentials? Thank you!
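(For context, the failing read is roughly of this form, typed into the spark-shell started above. This is only an illustrative sketch: the Parquet format and the placeholder path are examples, not the actual file.)

// Read directly from lakeFS through the S3A gateway; the <...> parts are placeholders.
val df = spark.read.parquet("s3a://<my-repo>/<my-branch>/<path-to-file>")
// The AccessDeniedException above is raised while the path is first resolved (getFileStatus).
df.show(5)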
i
Hi @Cristian Caloian, your configuration seems to be correct. If the credentials are set properly, there are no other keys/secrets to pass to Spark.
Can you try using lakectl with the same access_key, secret_key, endpoint and see if you’re able to reach your installation using the same values?
c
The credentials are correct. I can lakectl fs cat the file locally, for example, and read the local file into a Spark df successfully.
i
Just making sure I understand, you are now able to read the file using Spark, i.e. your issue was resolved?
c
Sorry for the confusion. What I could do was access the file with lakectl, and for a test I just downloaded the file from lakeFS locally. Just as a sanity check, I read the local copy of the file into a Spark df. What I would like to be able to do is read the data in Spark directly from lakeFS, using a path of the form s3a://<repo>/<branch>/<filepath>. This last step is where I get the 403 Forbidden error above.
i
Got it. Let’s first try to understand if it reaches the lakeFS server. Do you see it in the lakeFS logs? (Try setting the log level to TRACE first - see our docs)
The reason I’m asking is that lakeFS’s S3 gateway returns the same errors as S3 itself, so from the client error we can’t tell whether the request reached lakeFS or not.
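(One way to exercise the same code path from the spark-shell and generate traffic that should show up in the lakeFS logs; a minimal sketch that assumes the shell was started with the configuration above and uses the placeholder repo/branch names.)

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration the shell was launched with
// (fs.s3a.endpoint, credentials, path-style access).
val hadoopConf = spark.sparkContext.hadoopConfiguration

// listStatus goes through the same S3A client as the failing getFileStatus call,
// so each request should appear in the lakeFS S3 gateway logs if it reaches the server.
val fs = FileSystem.get(new URI("s3a://<my-repo>/"), hadoopConf)
fs.listStatus(new Path("s3a://<my-repo>/<my-branch>/")).foreach(s => println(s.getPath))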
l
I will take a look
👍 1
Ok so it seems like there is an issue on the lakeFS side. I think lakeFS fails to get the policies etc.
SQL query returned no results
{"args":["rdda-concept-team-pilot-1","%",""],"duration":9578071,"file":"build/pkg/db/logged_rows.go:33","func":"pkg/db.(*LoggedRows).logDuration","level":"debug","msg":"rows done","query":"\n\t    WITH resolved_policies_view AS (\n                SELECT auth_policies.id, auth_policies.created_at, auth_policies.display_name, auth_policies.statement, auth_users.display_name AS user_display_name\n                FROM auth_policies INNER JOIN\n                     auth_user_policies ON (auth_policies.id = auth_user_policies.policy_id) INNER JOIN\n\t\t     auth_users ON (auth_users.id = auth_user_policies.user_id)\n                UNION\n\t\tSELECT auth_policies.id, auth_policies.created_at, auth_policies.display_name, auth_policies.statement, auth_users.display_name AS user_display_name\n\t\tFROM auth_policies INNER JOIN\n\t\t     auth_group_policies ON (auth_policies.id = auth_group_policies.policy_id) INNER JOIN\n\t\t     auth_groups ON (auth_groups.id = auth_group_policies.group_id) INNER JOIN\n\t\t     auth_user_groups ON (auth_user_groups.group_id = auth_groups.id) INNER JOIN\n\t\t     auth_users ON (auth_users.id = auth_user_groups.user_id)\n\t    ) SELECT id, created_at, display_name, statement FROM resolved_policies_view WHERE (user_display_name = $1 AND display_name LIKE $2) AND display_name \u003e $3 ORDER BY display_name","time":"2022-03-17T13:53:39Z","type":"start query","user":"rdda-concept-team-pilot-1"}
{"args":["api"],"file":"build/pkg/db/tx.go:87","func":"pkg/db.(*dbTx).Get","level":"trace","msg":"SQL query returned no results","query":"SELECT storage_namespace, creation_date, default_branch FROM graveler_repositories WHERE id = $1","time":"2022-03-17T13:53:39Z","took":331396,"type":"get","user":"rdda-concept-team-pilot-1"}
This is not the case when you use lakectl directly. @Itai Admi
👀 1
i
Are you sure the repo name is correct? The second log line suggests that the targeted repository doesn’t exist.
l
100% sure it exists
👍 1
so I think maybe there is some bug when s3a:// is used
i
ok taking a deeper look into the code, will update here
👀 1
c
I think we figured out what was going wrong. It works when we set the URL to https://<my-url>. Initially I was setting it to https://<my-url>/api/v1, as we do for lakectl.
👍 1
👌 1
@Itai Admi could you please explain, or point me to relevant docs, what the difference is between the two URLs?
i
Sure. lakeFS has two endpoints that receive data: the OpenAPI endpoint and the S3 gateway. The S3 gateway mimics S3 behaviour. In your Spark settings you want Spark to communicate with lakeFS the same way it does with S3, so you use the gateway under https://<my-url>. The OpenAPI endpoint offers a wider set of versioning capabilities, like committing and creating branches, and accepts traffic under https://<my-url>/api/v1/.
🙌 1
You can read more about it on our architecture page
👍 1
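(To make the difference concrete on the Spark side, here is a minimal sketch of the corrected S3A settings; <my-url> is a placeholder, and since S3A caches FileSystem instances per bucket, in practice these values are usually passed at launch with --conf, as in the original spark-shell command.)

// S3A must point at the lakeFS S3 gateway, i.e. the bare server URL without /api/v1.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.endpoint", "https://<my-url>")   // S3 gateway, used for s3a:// paths
hc.set("fs.s3a.path.style.access", "true")
// lakectl, by contrast, talks to the OpenAPI endpoint at https://<my-url>/api/v1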
c
@Itai Admi Thanks for the support and for the explanation!
🙌 1