c
Hi, I am trying to configure Spark to access data from LakeFS following this tutorial using the S3A gateway. I set up my LakeFS credentials as env vars and start a Spark shell with the following command:
spark-shell --conf spark.hadoop.fs.s3a.access.key=${LAKECTL_CREDENTIALS_ACCESS_KEY_ID} \
            --conf spark.hadoop.fs.s3a.secret.key=${LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY} \
            --conf spark.hadoop.fs.s3a.endpoint=${LAKECTL_SERVER_ENDPOINT_URL} \
            --conf spark.hadoop.fs.s3a.path.style.access=true
When I try to read a lakeFS file from the Spark shell I get the following error (replacing actual repo and branch):
java.nio.file.AccessDeniedException: s3a://<my-repo>/<my-branch>/<path-to-file>: getFileStatus on s3a://<my-repo>/<my-branch>/<path-to-file>: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 4442587FB7D0A2F9; S3 Extended Request ID: null), S3 Extended Request ID: null:403 Forbidden
Do I need another key/secret in addition to the lakeFS key/secret, e.g. AWS credentials? Thank you!
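(For context, the failing read is roughly of this form, typed into the spark-shell started above. This is only an illustrative sketch: the Parquet format and the placeholder path are examples, not the actual file.)

// Read directly from lakeFS through the S3A gateway; the <...> parts are placeholders.
val df = spark.read.parquet("s3a://<my-repo>/<my-branch>/<path-to-file>")
// The AccessDeniedException above is raised while the path is first resolved (getFileStatus).
df.show(5)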
i
Hi @Cristian Caloian, your configuration seems to be correct. If the credentials are set properly, there are no other keys/secrets to pass to Spark.
Can you try using lakectl with the same access_key, secret_key, endpoint and see if you’re able to reach your installation using the same values?
c
The credentials are correct. I can lakectl fs cat the file locally, for example, and read the local file into a Spark df successfully.
i
Just making sure I understand, you are now able to read the file using Spark, i.e. your issue was resolved?
c
Sorry for the confusion. What I could do was access the file with lakectl, and for a test I just downloaded the file from lakeFS locally. Just as a sanity check, I read the local copy of the file into a Spark df. What I would like to be able to do is read the data in Spark directly from lakeFS, using a path of the form s3a://<repo>/<branch>/<filepath>. This last step is where I get the 403 Forbidden error above.
i
Got it. Let’s first try to understand if it reaches the lakeFS server. Do you see it in the lakeFS logs? (Try setting the log level to TRACE first - see our docs)
The reason I’m asking is that lakeFS’s S3 gateway returns the same errors as S3 itself, so from the client error we can’t tell whether the request reached lakeFS or not.
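(One way to exercise the same code path from the spark-shell and generate traffic that should show up in the lakeFS logs; a minimal sketch that assumes the shell was started with the configuration above and uses the placeholder repo/branch names.)

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration the shell was launched with
// (fs.s3a.endpoint, credentials, path-style access).
val hadoopConf = spark.sparkContext.hadoopConfiguration

// listStatus goes through the same S3A client as the failing getFileStatus call,
// so each request should appear in the lakeFS S3 gateway logs if it reaches the server.
val fs = FileSystem.get(new URI("s3a://<my-repo>/"), hadoopConf)
fs.listStatus(new Path("s3a://<my-repo>/<my-branch>/")).foreach(s => println(s.getPath))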
l
I will take a look
👍 1
Ok so it seems like there is an issue on the lakeFS side. I think lakeFS fails to get the policies etc.
SQL query returned no results
{"args":["rdda-concept-team-pilot-1","%",""],"duration":9578071,"file":"build/pkg/db/logged_rows.go:33","func":"pkg/db.(*LoggedRows).logDuration","level":"debug","msg":"rows done","query":"\n\t    WITH resolved_policies_view AS (\n                SELECT auth_policies.id, auth_policies.created_at, auth_policies.display_name, auth_policies.statement, auth_users.display_name AS user_display_name\n                FROM auth_policies INNER JOIN\n                     auth_user_policies ON (auth_policies.id = auth_user_policies.policy_id) INNER JOIN\n\t\t     auth_users ON (auth_users.id = auth_user_policies.user_id)\n                UNION\n\t\tSELECT auth_policies.id, auth_policies.created_at, auth_policies.display_name, auth_policies.statement, auth_users.display_name AS user_display_name\n\t\tFROM auth_policies INNER JOIN\n\t\t     auth_group_policies ON (auth_policies.id = auth_group_policies.policy_id) INNER JOIN\n\t\t     auth_groups ON (auth_groups.id = auth_group_policies.group_id) INNER JOIN\n\t\t     auth_user_groups ON (auth_user_groups.group_id = auth_groups.id) INNER JOIN\n\t\t     auth_users ON (auth_users.id = auth_user_groups.user_id)\n\t    ) SELECT id, created_at, display_name, statement FROM resolved_policies_view WHERE (user_display_name = $1 AND display_name LIKE $2) AND display_name \u003e $3 ORDER BY display_name","time":"2022-03-17T13:53:39Z","type":"start query","user":"rdda-concept-team-pilot-1"}
{"args":["api"],"file":"build/pkg/db/tx.go:87","func":"pkg/db.(*dbTx).Get","level":"trace","msg":"SQL query returned no results","query":"SELECT storage_namespace, creation_date, default_branch FROM graveler_repositories WHERE id = $1","time":"2022-03-17T13:53:39Z","took":331396,"type":"get","user":"rdda-concept-team-pilot-1"}
This is not the case when you use lakectl directly. @Itai Admi
👀 1
i
Are you sure the repo name is correct? The second log line suggests that the targeted repository doesn’t exist.
l
100% sure it exists
👍 1
so I think maybe there is some bug when s3a:// is used
i
ok taking a deeper look into the code, will update here
👀 1
c
I think we figured out what was going wrong. It works when we set the URL to https://<my-url>. Initially I was setting it to https://<my-url>/api/v1, as we do for lakectl.
👍 1
👌 1
@Itai Admi could you please explain, or point me to relevant docs, what the difference is between the two URLs?
i
Sure. lakeFS has two endpoints that receive data: the OpenAPI endpoint and the S3 gateway. The S3 gateway mimics S3 behaviour. In your Spark settings you want Spark to communicate with lakeFS the same way it does with S3, so you use the gateway under https://<my-url>. The OpenAPI endpoint offers a wider set of versioning capabilities, like committing and creating branches, and accepts traffic under https://<my-url>/api/v1/.
🙌 1
You can read more about it on our architecture page
👍 1
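(To make the difference concrete on the Spark side, here is a minimal sketch of the corrected S3A settings; <my-url> is a placeholder, and since S3A caches FileSystem instances per bucket, in practice these values are usually passed at launch with --conf, as in the original spark-shell command.)

// S3A must point at the lakeFS S3 gateway, i.e. the bare server URL without /api/v1.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.endpoint", "https://<my-url>")   // S3 gateway, used for s3a:// paths
hc.set("fs.s3a.path.style.access", "true")
// lakectl, by contrast, talks to the OpenAPI endpoint at https://<my-url>/api/v1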
c
@Itai Admi Thanks for the support and for the explanation!
🙌 1