Alexey Shchedrin  12/05/2023, 2:27 PM
Ariel Shaqed (Scolnicov)  12/05/2023, 2:32 PM
Alexey Shchedrin  12/05/2023, 2:34 PM
Ariel Shaqed (Scolnicov)  12/05/2023, 2:38 PM
Alexey Shchedrin  12/05/2023, 2:39 PM
Alexey Shchedrin  12/05/2023, 2:40 PM
Ariel Shaqed (Scolnicov)  12/05/2023, 2:40 PM
Alexey Shchedrin  12/05/2023, 2:40 PM
Ariel Shaqed (Scolnicov)  12/05/2023, 2:41 PM
Alexey Shchedrin  12/05/2023, 3:15 PM
Ariel Shaqed (Scolnicov)  12/05/2023, 3:18 PM
Barak Amar
Alexey Shchedrin  12/05/2023, 4:16 PM
Barak Amar
Barak Amar
Alexey Shchedrin  12/05/2023, 5:50 PM
Barak Amar
Alexey Shchedrin  12/06/2023, 5:41 PM
Alexey Shchedrin  12/06/2023, 5:42 PM
Barak Amar
Barak Amar
Barak Amar
Barak Amar
SET s3_use_ssl=false;
Alexey Shchedrin  12/06/2023, 5:52 PM
SET s3_region='us-west-2';
SET s3_endpoint='http://10.208.46.44/';
SET s3_url_style='path';
Alexey Shchedrin  12/06/2023, 5:53 PM
10.208.46.44/ is where lakeFS is running
Barak Amar
Barak Amar
Barak Amar
s3_access_key_id and s3_secret_access_key with lakeFS access creds
Barak Amar
Alexey Shchedrin  12/06/2023, 7:14 PM
Alexey Shchedrin  12/06/2023, 7:15 PM
Alexey Shchedrin  12/06/2023, 7:15 PM
Ariel Shaqed (Scolnicov)  12/06/2023, 7:19 PM
Alexey Shchedrin  12/06/2023, 7:22 PM
Ariel Shaqed (Scolnicov)  12/06/2023, 7:26 PM
Alexey Shchedrin  12/06/2023, 7:28 PM
Alexey Shchedrin  12/06/2023, 7:30 PM
Alexey Shchedrin  12/06/2023, 7:30 PM
SET s3_region='us-east-1';
SET s3_endpoint='lakefs.example.com';
SET s3_access_key_id='AKIAIOSFODNN7EXAMPLE';
SET s3_secret_access_key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY';
SET s3_url_style='path';
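For reference, the SET commands from this thread collected into one sketch using the duckdb Python client. Everything here is a placeholder taken from the examples above (endpoint, keys, and a hypothetical repo/branch/object path), and it assumes a reachable lakeFS S3 gateway; note that DuckDB's s3_endpoint takes host[:port] without a scheme, with plain-http selected via s3_use_ssl=false.

```python
# Sketch, assuming a running lakeFS instance -- all values are placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")   # provides the s3:// filesystem
con.execute("LOAD httpfs")
con.execute("SET s3_region='us-east-1'")
con.execute("SET s3_endpoint='lakefs.example.com:8000'")  # host[:port], no scheme
con.execute("SET s3_use_ssl=false")                       # lakeFS served over plain http
con.execute("SET s3_url_style='path'")
con.execute("SET s3_access_key_id='AKIAIOSFODNN7EXAMPLE'")
con.execute("SET s3_secret_access_key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'")

# Through the lakeFS S3 gateway, object paths are s3://<repo>/<branch>/<path>:
rows = con.execute(
    "SELECT count(*) FROM read_parquet('s3://example-repo/main/credits.parquet')"
).fetchall()
```

This is connection configuration, not a runnable test: the read only succeeds against a live lakeFS endpoint with a matching repository and branch.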
Alexey Shchedrin  12/06/2023, 7:30 PM
Alexey Shchedrin  12/06/2023, 7:32 PM
Ariel Shaqed (Scolnicov)  12/06/2023, 7:33 PM
Alexey Shchedrin  12/06/2023, 7:38 PM
Barak Amar
Barak Amar
Alexey Shchedrin  12/06/2023, 7:53 PM
Alexey Shchedrin  12/06/2023, 7:53 PM
Barak Amar
Alexey Shchedrin  12/07/2023, 8:42 AM
Alexey Shchedrin  12/07/2023, 8:43 AM
Barak Amar
Alexey Shchedrin  12/07/2023, 9:00 AM
Alexey Shchedrin  12/07/2023, 9:01 AM
Barak Amar
Barak Amar
Alexey Shchedrin  12/07/2023, 10:47 AM
Alexey Shchedrin  12/08/2023, 4:21 PM
Ariel Shaqed (Scolnicov)  12/08/2023, 4:26 PM
-p 8000:8000, and then your S3 endpoint to give pyspark will be http://localhost:8000/. The GUI will be on the same URL.
Ariel Shaqed (Scolnicov)  12/08/2023, 4:27 PM
Alexey Shchedrin  12/08/2023, 4:50 PM
Alexey Shchedrin  12/08/2023, 4:50 PM
Alexey Shchedrin
12/08/2023, 4:51 PM
>>> df = spark.read.parquet('lakefs://starlight/v1/credits.parquet')
23/12/08 17:42:53 WARN FileSystem: Failed to initialize fileystem lakefs://starlight/v1/credits.parquet: java.io.IOException: Failed to get lakeFS blockstore type
23/12/08 17:42:53 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: lakefs://starlight/v1/credits.parquet.
java.io.IOException: Failed to get lakeFS blockstore type
    at io.lakefs.LakeFSFileSystem.initializeWithClientFactory(LakeFSFileSystem.java:143)
    at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:113)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.lakefs.hadoop.shade.sdk.ApiException: Message: Content type "text/html; charset=utf-8" is not supported for type: class io.lakefs.hadoop.shade.sdk.model.StorageConfig
HTTP response code: 200
HTTP response body: <!DOCTYPE html> <html lang="en"> <head> <!-- Generated with Vite--> <meta charset="UTF-8" /> <link rel="icon" href="/favicon.ico" /> <meta name="viewport" content="width=device-width, initial-scale=1.0" /> <title>lakeFS</title> <!--Snippets--> <script type="module" crossorigin src="/assets/index-1490b7ac.js"></script> <link rel="stylesheet" href="/assets/index-0308a6b6.css"> </head> <body> <div id="root"></div> </body> </html>
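For anyone hitting the same trace: the response body is HTML because the lakeFS server answers both the web GUI and the REST API on the same port, and the API lives under /api/v1. A short sketch of the two URLs (the host is a placeholder):

```python
# If the endpoint points at the bare host, the client gets back the GUI's
# index.html (content type text/html) instead of an API response -- which is
# exactly the error body in the trace above. Host below is a placeholder.
base = "http://localhost:8000"

gui_url = base + "/"          # returns the HTML page seen in the error
api_url = base + "/api/v1"    # where the lakeFS REST API is served

print(api_url)  # http://localhost:8000/api/v1
```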
Ariel Shaqed (Scolnicov)  12/08/2023, 5:01 PM
pyspark --conf spark.hadoop.fs.s3a.bucket.example-repo.access.key='AKIAJN2IF3P57GKXFCOQ' --conf spark.hadoop.fs.s3a.bucket.example-repo.secret.key='npt58bvqSMwlHK1Nk/X0mkokNNINLGXW0pymvExf' --conf spark.hadoop.fs.s3a.bucket.example-repo.endpoint='http://localhost:8000' --conf spark.hadoop.fs.s3a.path.style.access=true
and then access objects as s3a://example-repo/main/path/to/object (note the "s3a://" scheme!)
To use lakeFSFS, probably the easiest will be:
pyspark --conf spark.hadoop.fs.lakefs.access.mode=presigned --conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem --conf spark.hadoop.fs.lakefs.access.key=AKIAJN2IF3P57GKXFCOQ --conf spark.hadoop.fs.lakefs.secret.key=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 --conf spark.hadoop.fs.lakefs.endpoint=http://localhost:8000/api/v1 --packages io.lakefs:hadoop-lakefs-assembly:0.2.1
which uses pre-signed mode. And then access objects as lakefs://example-repo/main/path/to/object, with scheme "lakefs".
Ariel Shaqed (Scolnicov)  12/08/2023, 5:02 PM
Barak Amar
Add /api/v1 to the lakefs endpoint property (after the port number).
Alexey Shchedrin
12/08/2023, 5:23 PM
Alexey Shchedrin  12/08/2023, 5:50 PM
Alexey Shchedrin  12/08/2023, 5:52 PM
Barak Amar
Ariel Shaqed (Scolnicov)  12/08/2023, 6:24 PM
Alexey Shchedrin  12/08/2023, 7:11 PM
Ariel Shaqed (Scolnicov)  12/08/2023, 7:12 PM
Ariel Shaqed (Scolnicov)  12/09/2023, 8:11 AM
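Putting Ariel's lakeFSFS command and Barak's /api/v1 note together, the same configuration can be sketched as a PySpark script instead of pyspark flags. This is a sketch, not a verified recipe: the credentials are the example values from the thread, the host and the repo/branch/object path are hypothetical, and it assumes a lakeFS instance reachable on localhost:8000.

```python
# Sketch, assuming a running lakeFS on localhost:8000 -- all values are
# placeholders or the example credentials quoted earlier in this thread.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.lakefs:hadoop-lakefs-assembly:0.2.1")
    .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
    .config("spark.hadoop.fs.lakefs.access.mode", "presigned")
    .config("spark.hadoop.fs.lakefs.access.key", "AKIAJN2IF3P57GKXFCOQ")
    .config("spark.hadoop.fs.lakefs.secret.key", "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")
    # The REST API root, including the /api/v1 suffix -- leaving it off makes
    # the server return the GUI's HTML page, producing the
    # "Failed to get lakeFS blockstore type" error seen earlier.
    .config("spark.hadoop.fs.lakefs.endpoint", "http://localhost:8000/api/v1")
    .getOrCreate()
)

# Objects are addressed as lakefs://<repo>/<branch>/<path>:
df = spark.read.parquet("lakefs://example-repo/main/credits.parquet")
```

This is connection configuration: the read only succeeds against a live lakeFS endpoint with a matching repository, branch, and object.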