# help
a
What are my next steps? Maybe using the Docker version?
a
I am not sure that Docker will help if it's a browser/GUI issue. But obviously I do not know. If you are also planning to access data on lakeFS with tools other than the GUI, this might be a good time to skip ahead and use those. On the other thread I posted the link to how to use Spark to access data on lakeFS.
a
In fact, I have my local version of LakeFS running on Docker Desktop and the UI is not breaking with it. Once I use the EC2-deployed version, it does fail with the same browser. That's why I was suspecting the LakeFS versions. (I deployed the binary on EC2.)
a
Interesting! Do you know if these are the same software versions of lakeFS? It's easiest to see on the dropdown from your username on the top RHS of the lakeFS GUI.
a
Yes, I just checked: I have LakeFS 1.2.0 in Docker and 1.3.0 on EC2
I can definitely try to run the Docker version on EC2 and see the diff
a
Thanks! I guess I'll add it to our issue.
a
Or, I could try to deploy 1.2.0 on my EC2 (I don't know if there is a way to get a previous binary)
a
GitHub releases have it all -- one of the things I love about releasing through GitHub.
a
@Ariel Shaqed (Scolnicov) I re-deployed LakeFS v1.2.0 and the behaviour is the same. I don't believe deploying the Docker version would change anything. May I ask you about permissions: I am not specifying any access keys; I have defined an IAM role that has all rights attached, and my EC2 is assigned that role. Is that sufficient? I guess so, because I can see that the server can create the DynamoDB table, etc.
a
I believe you are correct, and you've probably done everything correctly. I hope we'll be able to resolve this one in short order, or provide a workaround. Anything further is unfortunately speculation on my part.
b
Can you verify if it is related to the browser cache? In the Network tab, you can check 'Disable cache' and run the query again.
a
@Barak Amar I am using Incognito window in Chrome and am expecting that it has no cache, at least when I start it.
b
Thanks @Alexey Shchedrin, is there a way you can use the DuckDB CLI to run the same query, just using the S3 gateway?
See https://docs.lakefs.io/integrations/duckdb.html for setting the right configuration to query the same data from the CLI
a
Yes @Barak Amar. I will try that
b
thank you
a
I tried to run it using the DuckDB CLI and here are the errors I get:
I suspect that I have access-related problems
b
you need to use the s3:// scheme + specify the name of the repository and branch
also need to turn off ssl verification
as you use http
SET s3_use_ssl=false;
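For example, a minimal sketch of what the query path would look like against the S3 gateway (the repository name example-repo, branch main, and object path are placeholders):
-- if the httpfs extension is not already loaded:
INSTALL httpfs;
LOAD httpfs;
-- the repository and branch are the first two path segments:
SELECT count(*) FROM read_parquet('s3://example-repo/main/path/to/data.parquet');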
a
Can you please confirm again, are these things set correctly:
SET s3_region='us-west-2';
SET s3_endpoint='http://10.208.46.44/';
SET s3_url_style='path';
10.208.46.44 is where LakeFS is running
b
add the port, if needed, as part of the endpoint
if it is not 80
you will need to set
s3_access_key_id
and
s3_secret_access_key
with lakeFS access creds
the same ones you use for login
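Putting these points together, a hedged sketch of the full DuckDB CLI setup (the host, port, and key values are placeholders; the port suffix is only needed when lakeFS is not listening on port 80):
SET s3_region='us-west-2';
SET s3_endpoint='10.208.46.44:8000';            -- lakeFS host, plus port if not 80
SET s3_url_style='path';
SET s3_use_ssl=false;                           -- the gateway is plain http here
SET s3_access_key_id='<lakeFS access key>';     -- the credentials you log in to lakeFS with
SET s3_secret_access_key='<lakeFS secret key>';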
a
Well, this is the thing: I asked Ariel a question about access rights, and here is what he responded:
Alexey: May I ask you about permissions: I am not specifying any access keys; I have defined an IAM role that has all rights attached, and my EC2 is assigned that role. Is that sufficient? I guess so, because I can see that the server can create the DynamoDB table, etc.
Ariel Shaqed (Scolnicov) [4:18 PM]: I believe you are correct, and you've probably done everything correctly. I hope we'll be able to resolve this one in short order, or provide a workaround. Anything further is unfortunately speculation on my part.
I do not have any access keys, I am using EC2 with a role where everything is set up to help with access
maybe it is not required for the LakeFS server but only for DuckDB.
a
Sorry for the confusion. lakeFS authenticates itself to AWS using any IAM mechanism. When you access lakeFS using its API (here, the S3 gateway), you need to authenticate yourself to lakeFS. You do this using an access key and a secret key that you got from lakeFS. They will probably be the same ones you used to connect to lakeFS in the first place. It's a bit confusing because the same protocol, "s3", may be used to communicate both with AWS S3 and with the lakeFS S3 gateway. Which credentials you need to use depends on which endpoint you're connecting to.
a
Well, I think I am stuck. I still don't understand (sorry) whether an IAM role assigned to the EC2 instance with all permissions enabled would let the LakeFS server work, provided that I am NOT putting any additional access key information into the config.yaml file. Can you please clarify this?
a
You need two sets of credentials.
1. lakeFS needs credentials to AWS. I think you're doing that part correctly - you do have a lakeFS that has some objects on it, right?
2. DuckDB needs credentials to lakeFS. You can generate these credentials on the Admin tab of the lakeFS GUI. We suspect you may have an issue in that bit.
a
OK, #1: I do have objects in S3, yes, but I cannot access them from the UI (I understand that this could be an issue on your side). #2: I did not know that the credentials used by DuckDB are the ones that LakeFS generates. I will try to re-test that. Thanks
Maybe I got confused because your documentation says:
SET s3_region='us-east-1';
SET s3_endpoint='lakefs.example.com';
SET s3_access_key_id='AKIAIOSFODNN7EXAMPLE';
SET s3_secret_access_key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY';
SET s3_url_style='path';
it is calling them "s3_access_key_id"
Am I confusing something again? What parameter needs to be set to enter the LakeFS credentials that I generated in the LakeFS UI? Is it "s3_access_key_id" and "s3_secret_access_key"?
a
Yes indeed.
a
So, what do I do? Do I use "s3_access_key_id" and put the LakeFS-generated credentials there?
b
yes
lakeFS holds an S3 gateway that communicates using the S3 protocol.
a
OK, with this fix, it is working with the DuckDB CLI
thanks!
b
can you try one more thing on the web, if possible? Go to the Application tab and select Service Workers on the left; if there is one while you are on the lakeFS page, delete it and run the query again. I want to know if the issue is related to an old version of the service worker.
a
I am not sure I understand where to go and how to find "service workers" in the LakeFS UI/Web. I don't see anything that is called "applications" and/or "service workers"... Here is my screen; where do you see the Application tab here?
Maybe you are talking about the Web version of LakeFS that you offer in "Try without installing"?
b
sorry, I wasn't clear. The Service Workers section is found in your Google Chrome DevTools, under the Application tab. I am still trying to find the root cause of why DuckDB in the browser fails.
a
oh, I see
I just did it and I don't have any workers listed on that page
b
can you delete the one you see? They can be reloaded if needed
I just need to know if they are the source of this issue
a
As I said, I DON'T see any workers there
One more quick question: can I run my local (installed on a local machine) pyspark environment against a locally running LakeFS (Docker)? What parameters do I need to provide?
a
Yes, this too will work! You'll need to connect to your machine. Export the port from Docker with e.g. the flag
-p 8000:8000
and then the S3 endpoint to give pyspark will be http://localhost:8000/ . The GUI will be on the same URL.
Let us know how it works!
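For reference, a hedged sketch of starting lakeFS in Docker with the port exposed (this assumes the official treeverse/lakefs image and its quickstart mode; your existing Docker Desktop setup may differ -- the key part is the -p 8000:8000 mapping):
docker run --name lakefs --pull always -p 8000:8000 treeverse/lakefs:latest run --quickstart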
a
So, here is what I tried and how, and here is the result:
(please check my parameters of pyspark first)
pyspark --conf spark.hadoop.fs.s3a.bucket.example-repo.access.key='AKIAJN2IF3P57GKXFCOQ' --conf spark.hadoop.fs.s3a.bucket.example-repo.secret.key='npt58bvqSMwlHK1Nk/X0mkokNNINLGXW0pymvExf' --conf spark.hadoop.fs.s3a.bucket.example-repo.endpoint='http://localhost:8000' --conf spark.hadoop.fs.s3a.path.style.access=true --conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem --conf spark.hadoop.fs.lakefs.access.key=AKIAJN2IF3P57GKXFCOQ --conf spark.hadoop.fs.lakefs.secret.key=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 --conf spark.hadoop.fs.lakefs.endpoint=http://localhost:8000 --packages io.lakefs:hadoop-lakefs-assembly:0.2.1
>> df = spark.read.parquet('lakefs://starlight/v1/credits.parquet')
23/12/08 17:42:53 WARN FileSystem: Failed to initialize fileystem lakefs://starlight/v1/credits.parquet: java.io.IOException: Failed to get lakeFS blockstore type
23/12/08 17:42:53 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: lakefs://starlight/v1/credits.parquet.
java.io.IOException: Failed to get lakeFS blockstore type
    at io.lakefs.LakeFSFileSystem.initializeWithClientFactory(LakeFSFileSystem.java:143)
    at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:113)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.lakefs.hadoop.shade.sdk.ApiException: Message: Content type "text/html; charset=utf-8" is not supported for type: class io.lakefs.hadoop.shade.sdk.model.StorageConfig
HTTP response code: 200
HTTP response body: <!DOCTYPE html> <html lang="en"> <head> <!-- Generated with Vite--> <meta charset="UTF-8" /> <link rel="icon" href="/favicon.ico" /> <meta name="viewport" content="width=device-width, initial-scale=1.0" /> <title>lakeFS</title> <!--Snippets--> <script type="module" crossorigin src="/assets/index-1490b7ac.js"></script> <link rel="stylesheet" href="/assets/index-0308a6b6.css"> </head> <body> <div id="root"></div> </body> </html>
a
Sorry, you seem to be configuring both lakeFSFS and S3A access. That will be hard. To use just S3A:
pyspark --conf spark.hadoop.fs.s3a.bucket.example-repo.access.key='AKIAJN2IF3P57GKXFCOQ' --conf spark.hadoop.fs.s3a.bucket.example-repo.secret.key='npt58bvqSMwlHK1Nk/X0mkokNNINLGXW0pymvExf' --conf spark.hadoop.fs.s3a.bucket.example-repo.endpoint='http://localhost:8000' --conf spark.hadoop.fs.s3a.path.style.access=true
and then access objects as
s3a://sample-repo/main/path/to/object
(note the "s3a://" scheme!) To use lakeFSFS, probably the easiest will be:
pyspark --conf spark.hadoop.fs.lakefs.access.mode=presigned --conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem --conf spark.hadoop.fs.lakefs.access.key=AKIAJN2IF3P57GKXFCOQ --conf spark.hadoop.fs.lakefs.secret.key=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 --conf spark.hadoop.fs.lakefs.endpoint=http://localhost:8000/api/v1 --packages io.lakefs:hadoop-lakefs-assembly:0.2.1
which uses pre-signed mode. And then access objects as
lakefs://sample-repo/main/path/to/object
with the scheme "lakefs".
See also here for a detailed reference about lakeFS and Spark.
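As a quick sanity check after launching pyspark with either configuration, a hedged sketch (repository, branch, and object path are placeholders; pick the scheme matching the configuration you used):
# S3A configuration: note the s3a:// scheme
df = spark.read.parquet("s3a://sample-repo/main/path/to/credits.parquet")
# lakeFSFS (presigned) configuration: note the lakefs:// scheme
# df = spark.read.parquet("lakefs://sample-repo/main/path/to/credits.parquet")
df.show(5)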
b
I think you need to add
/api/v1
to the lakeFS endpoint property (after the port number)
a
thanks, let me try these
it is not working for me. here is the result:
b
The underlying storage of the repository is not an S3 bucket. If you wish to work with lakeFS configured with the local block adapter, you can configure your s3a to point at your lakeFS address (the S3 gateway), as if working with S3 with a custom endpoint. The Hadoop lakeFS library will not work with local.
a
Sorry, @Barak Amar is correct. Use the s3a option that I offered above. lakeFSFS doesn't work with a local backing store - it would have no advantages over s3a.
a
The s3a option is not working, as it seems to lack the s3a library. Do I need to add something in the command line to specify the library?
a
For some Spark builds you will need to add the hadoop-aws package.
Or, if you're just "kicking the tires" to see how everything works together, maybe I'm pointing you in the wrong direction... Have you considered following our @Iddo Avneri's blog post? I think if you start with a managed Spark service (he uses Databricks I think) we might skip over many of the issues of setting up a Spark environment.
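To add the hadoop-aws package Ariel mentions above, a hedged sketch (the version here is an assumption and should match the Hadoop version bundled with your Spark build; credentials and repository name are placeholders):
pyspark --packages org.apache.hadoop:hadoop-aws:3.3.4 \
  --conf spark.hadoop.fs.s3a.bucket.example-repo.access.key='<lakeFS access key>' \
  --conf spark.hadoop.fs.s3a.bucket.example-repo.secret.key='<lakeFS secret key>' \
  --conf spark.hadoop.fs.s3a.bucket.example-repo.endpoint='http://localhost:8000' \
  --conf spark.hadoop.fs.s3a.path.style.access=true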