# help
a
What are my next steps? Maybe using the Docker version?
a
I am not sure that Docker will help if it's a browser/GUI issue. But obviously I do not know. If you are also planning to access data on lakeFS with tools other than the GUI, this might be a good time to skip ahead and use those. On the other thread I posted the link to how to use Spark to access data on lakeFS.
a
In fact, I have my local version of LakeFS running on Docker Desktop and the UI is not breaking with it. Once I use the EC2-deployed version, it does fail with the same browser. That's why I was suspecting the LakeFS versions. (I deployed the binary on EC2.)
a
Interesting! Do you know if these are the same software versions of lakeFS? It's easiest to see on the dropdown from your username on the top RHS of the lakeFS GUI.
a
Yes, I just checked: I have LakeFS 1.2.0 in Docker and 1.3.0 on EC2
I can definitely try to run the Docker version on EC2 and see the diff
a
Thanks! I guess I'll add it to our issue.
a
Or, I could try to deploy 1.2.0 on my EC2 (I don't know if there is a way to get a previous binary)
a
GitHub releases have it all -- one of the things I love about releasing through GitHub.
a
@Ariel Shaqed (Scolnicov) I re-deployed LakeFS v1.2.0 and the behaviour is the same. I don't believe deploying the Docker version would change anything. May I ask you about permissions: I am not specifying any access keys; I have defined an IAM role that has all rights attached, and my EC2 is assigned that role. Is that sufficient? I guess so, because I can see that the server can create the DynamoDB table, etc.
a
I believe you are correct, and you've probably done everything correctly. I hope we'll be able to resolve this one in short order, or provide a workaround. Anything further is unfortunately speculation on my part.
b
Can you verify if it is related to the browser cache? In the Network tab, you can check 'Disable cache' and run the query again.
a
@Barak Amar I am using Incognito window in Chrome and am expecting that it has no cache, at least when I start it.
b
Thanks @Alexey Shchedrin, is there a way you can use the DuckDB CLI to run the same query, just using the S3 gateway?
See https://docs.lakefs.io/integrations/duckdb.html for setting the right configuration to query the same data from the CLI
a
Yes @Barak Amar. I will try that
b
thank you
a
I tried to run it using the DuckDB CLI and here are the errors I get:
I suspect that I have access-related problems
b
you need to use the s3:// scheme + specify the name of the repository and branch
also need to turn off ssl verification
as you use http
SET s3_use_ssl=false;
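For example, a minimal sketch of what the query path would look like against the S3 gateway (the repository name example-repo, branch main, and object path are placeholders):
-- if the httpfs extension is not already loaded:
INSTALL httpfs;
LOAD httpfs;
-- the repository and branch are the first two path segments:
SELECT count(*) FROM read_parquet('s3://example-repo/main/path/to/data.parquet');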
a
Can you please confirm again, are these things set correctly:
SET s3_region='us-west-2';
SET s3_endpoint='http://10.208.46.44/';
SET s3_url_style='path';
10.208.46.44 is where LakeFS is running
b
add the port, if needed, as part of the endpoint
if it is not 80
you will need to set
s3_access_key_id
and
s3_secret_access_key
with lakeFS access creds
the same ones you use for login
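Putting these points together, a hedged sketch of the full DuckDB CLI setup (the host, port, and key values are placeholders; the port suffix is only needed when lakeFS is not listening on port 80):
SET s3_region='us-west-2';
SET s3_endpoint='10.208.46.44:8000';            -- lakeFS host, plus port if not 80
SET s3_url_style='path';
SET s3_use_ssl=false;                           -- the gateway is plain http here
SET s3_access_key_id='<lakeFS access key>';     -- the credentials you log in to lakeFS with
SET s3_secret_access_key='<lakeFS secret key>';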
a
Well, this is the thing: I asked Ariel a question about access rights, and here is what he responded:
Alexey: May I ask you about permissions: I am not specifying any access keys; I have defined an IAM role that has all rights attached, and my EC2 is assigned that role. Is that sufficient? I guess so, because I can see that the server can create the DynamoDB table, etc.
Ariel Shaqed (Scolnicov) [4:18 PM]: I believe you are correct, and you've probably done everything correctly. I hope we'll be able to resolve this one in short order, or provide a workaround. Anything further is unfortunately speculation on my part.
I do not have any access keys, I am using EC2 with a role where everything is set up to help with access
maybe it is not required for the LakeFS server but only for DuckDB.
a
Sorry for the confusion. lakeFS authenticates itself to AWS using any IAM mechanism. When you access lakeFS using its API (here, the S3 gateway), you need to authenticate yourself to lakeFS. You do this using an access key and a secret key that you got from lakeFS. They will probably be the same ones you used to connect to lakeFS in the first place. It's a bit confusing because the same protocol, "s3", may be used to communicate both with AWS S3 and with the lakeFS S3 gateway. Which credentials you need to use depends on which endpoint you're connecting to.
a
Well, I think I am stuck. I still don't understand (sorry) whether an IAM role assigned to the EC2 instance with all permissions enabled would let the LakeFS server work, provided that I am NOT putting any additional access key information into the config.yaml file. Can you please clarify this?
a
You need two sets of credentials.
1. lakeFS needs credentials to AWS. I think you're doing that part correctly - you do have a lakeFS that has some objects on it, right?
2. DuckDB needs credentials to lakeFS. You can generate these credentials on the Admin tab of the lakeFS GUI. We suspect you may have an issue in that bit.
a
OK, #1: I do have objects in S3, yes, but I cannot access them from the UI (I understand that this could be an issue on your side). #2: I did not know that the credentials used by DuckDB are the ones that LakeFS generates. I will try to re-test that. Thanks
Maybe I got confused because your documentation says:
SET s3_region='us-east-1';
SET s3_endpoint='lakefs.example.com';
SET s3_access_key_id='AKIAIOSFODNN7EXAMPLE';
SET s3_secret_access_key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY';
SET s3_url_style='path';
it is calling them "s3_access_key_id"
Am I confusing something again? What parameter needs to be set to enter the LakeFS credentials that I generated in the LakeFS UI? Is it "s3_access_key_id" and "s3_secret_access_key"?
a
Yes indeed.
a
So, what do I do? Do I use "s3_access_key_id" and put the LakeFS-generated credentials there?
b
yes
lakeFS holds an S3 gateway that communicates using the S3 protocol.
a
OK, with this fix, it is working with the DuckDB CLI
thanks!
b
can you try one more thing on the web, if possible? Go to the Application tab and select Service Workers on the left; if there is one while you are on the lakeFS page, delete it and run the query again. I want to know if the issue is related to an old version of the service worker.
a
I am not sure I understand where to go and how to find "service workers" in the LakeFS UI/Web. I don't see anything that is called "applications" and/or "service workers"... Here is my screen; where do you see the Application tab here?
Maybe you are talking about the Web version of LakeFS that you offer in "Try without installing"?
b
sorry, I wasn't clear. The Service Workers section is found in your Google Chrome DevTools, under the Application tab. I am still trying to find the root cause of why DuckDB in the browser fails.
a
oh, I see
I just did it and I don't have any workers listed on that page
b
can you delete the one you see? They can be reloaded if needed
I just need to know if they are the source of this issue
a
As I said, I DON'T see any workers there
One more quick question: can I run my local (installed on a local machine) pyspark environment against a locally running LakeFS (Docker)? What parameters do I need to provide?
a
Yes, this too will work! You'll need to connect to your machine. Export the port from Docker with e.g. the flag
-p 8000:8000
and then the S3 endpoint to give pyspark will be http://localhost:8000/ . The GUI will be on the same URL.
Let us know how it works!
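For reference, a hedged sketch of starting lakeFS in Docker with the port exposed (this assumes the official treeverse/lakefs image and its quickstart mode; your existing Docker Desktop setup may differ -- the key part is the -p 8000:8000 mapping):
docker run --name lakefs --pull always -p 8000:8000 treeverse/lakefs:latest run --quickstart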
a
So, here is what I tried and how, and here is the result:
(please check my parameters of pyspark first)
pyspark --conf spark.hadoop.fs.s3a.bucket.example-repo.access.key='AKIAJN2IF3P57GKXFCOQ' --conf spark.hadoop.fs.s3a.bucket.example-repo.secret.key='npt58bvqSMwlHK1Nk/X0mkokNNINLGXW0pymvExf' --conf spark.hadoop.fs.s3a.bucket.example-repo.endpoint='http://localhost:8000' --conf spark.hadoop.fs.s3a.path.style.access=true --conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem --conf spark.hadoop.fs.lakefs.access.key=AKIAJN2IF3P57GKXFCOQ --conf spark.hadoop.fs.lakefs.secret.key=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 --conf spark.hadoop.fs.lakefs.endpoint=http://localhost:8000 --packages io.lakefs:hadoop-lakefs-assembly:0.2.1
>> df = spark.read.parquet('lakefs://starlight/v1/credits.parquet')
23/12/08 17:42:53 WARN FileSystem: Failed to initialize fileystem lakefs://starlight/v1/credits.parquet: java.io.IOException: Failed to get lakeFS blockstore type
23/12/08 17:42:53 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: lakefs://starlight/v1/credits.parquet.
java.io.IOException: Failed to get lakeFS blockstore type
    at io.lakefs.LakeFSFileSystem.initializeWithClientFactory(LakeFSFileSystem.java:143)
    at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:113)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.lakefs.hadoop.shade.sdk.ApiException: Message: Content type "text/html; charset=utf-8" is not supported for type: class io.lakefs.hadoop.shade.sdk.model.StorageConfig
HTTP response code: 200
HTTP response body: <!DOCTYPE html> <html lang="en"> <head> <!-- Generated with Vite--> <meta charset="UTF-8" /> <link rel="icon" href="/favicon.ico" /> <meta name="viewport" content="width=device-width, initial-scale=1.0" /> <title>lakeFS</title> <!--Snippets--> <script type="module" crossorigin src="/assets/index-1490b7ac.js"></script> <link rel="stylesheet" href="/assets/index-0308a6b6.css"> </head> <body> <div id="root"></div> </body> </html>
a
Sorry, you seem to be configuring both lakeFSFS and S3A access. That will be hard. To use just S3A:
pyspark --conf spark.hadoop.fs.s3a.bucket.example-repo.access.key='AKIAJN2IF3P57GKXFCOQ' --conf spark.hadoop.fs.s3a.bucket.example-repo.secret.key='npt58bvqSMwlHK1Nk/X0mkokNNINLGXW0pymvExf' --conf spark.hadoop.fs.s3a.bucket.example-repo.endpoint='http://localhost:8000' --conf spark.hadoop.fs.s3a.path.style.access=true
and then access objects as
s3a://sample-repo/main/path/to/object
(note the "s3a://" scheme!) To use lakeFSFS, probably the easiest will be:
pyspark --conf spark.hadoop.fs.lakefs.access.mode=presigned --conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem --conf spark.hadoop.fs.lakefs.access.key=AKIAJN2IF3P57GKXFCOQ --conf spark.hadoop.fs.lakefs.secret.key=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 --conf spark.hadoop.fs.lakefs.endpoint=http://localhost:8000/api/v1 --packages io.lakefs:hadoop-lakefs-assembly:0.2.1
which uses pre-signed mode. And then access objects as
lakefs://sample-repo/main/path/to/object
with the scheme "lakefs".
See also here for a detailed reference about lakeFS and Spark.
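As a quick sanity check after launching pyspark with either configuration, a hedged sketch (repository, branch, and object path are placeholders; pick the scheme matching the configuration you used):
# S3A configuration: note the s3a:// scheme
df = spark.read.parquet("s3a://sample-repo/main/path/to/credits.parquet")
# lakeFSFS (presigned) configuration: note the lakefs:// scheme
# df = spark.read.parquet("lakefs://sample-repo/main/path/to/credits.parquet")
df.show(5)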
b
I think you need to add
/api/v1
to the lakeFS endpoint property (after the port number)
a
thanks, let me try these
it is not working for me. here is the result:
b
The underlying storage of the repository is not an S3 bucket. If you wish to work with lakeFS configured with the local block adapter, you can configure your s3a to point at your lakeFS address (the S3 gateway), as if working with S3 with a custom endpoint. The Hadoop lakeFS library will not work with local.
a
Sorry, @Barak Amar is correct. Use the s3a option that I offered above. lakeFSFS doesn't work with a local backing store - it would have no advantages over s3a.
a
The s3a option is not working, as it seems to lack the s3a library. Do I need to add something in the command line to specify the library?
a
For some Spark builds you will need to add the hadoop-aws package.
Or, if you're just "kicking the tires" to see how everything works together, maybe I'm pointing you in the wrong direction... Have you considered following our @Iddo Avneri's blog post? I think if you start with a managed Spark service (he uses Databricks I think) we might skip over many of the issues of setting up a Spark environment.
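To add the hadoop-aws package Ariel mentions above, a hedged sketch (the version here is an assumption and should match the Hadoop version bundled with your Spark build; credentials and repository name are placeholders):
pyspark --packages org.apache.hadoop:hadoop-aws:3.3.4 \
  --conf spark.hadoop.fs.s3a.bucket.example-repo.access.key='<lakeFS access key>' \
  --conf spark.hadoop.fs.s3a.bucket.example-repo.secret.key='<lakeFS secret key>' \
  --conf spark.hadoop.fs.s3a.bucket.example-repo.endpoint='http://localhost:8000' \
  --conf spark.hadoop.fs.s3a.path.style.access=true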