Joe M
05/31/2024, 1:13 PMJoe M
05/31/2024, 1:14 PMwith SparkSession.builder.appName(appName) \
.config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
.config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
.config("spark.hadoop.fs.s3a.path.style.access", "true") \
.config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
.config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
.config("spark.executor.memory", '2g') \
.getOrCreate() as spark:
Joe M
05/31/2024, 1:16 PMJoe M
05/31/2024, 1:19 PMNiro
05/31/2024, 1:53 PMNiro
05/31/2024, 1:54 PMJoe M
05/31/2024, 1:55 PMJoe M
05/31/2024, 1:59 PMNiro
05/31/2024, 2:05 PMJoe M
05/31/2024, 2:07 PMNiro
05/31/2024, 2:08 PMAriel Shaqed (Scolnicov)
05/31/2024, 2:40 PMJoe M
05/31/2024, 2:47 PMJoe M
05/31/2024, 3:03 PM.config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
.config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
.config("spark.hadoop.fs.s3a.path.style.access", "true") \
.config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
.config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey)
I see that the presigned configuration is using lakefs specific properties with fs.lakefs.*
sc._jsc.hadoopConfiguration().set("fs.lakefs.access.mode", "presigned")
sc._jsc.hadoopConfiguration().set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
sc._jsc.hadoopConfiguration().set("fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE")
sc._jsc.hadoopConfiguration().set("fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
sc._jsc.hadoopConfiguration().set("fs.lakefs.endpoint", "<https://example-org.us-east-1.lakefscloud.io/api/v1>")
Do i need to include/install any python packages for this to work?Joe M
05/31/2024, 3:04 PMimport os
import boto3
import logging
from pyspark.sql import SparkSession
from pyspark.sql.functions import year, month, dayofmonth, hour, current_timezone, to_utc_timestamp
Niro
05/31/2024, 3:08 PM<s3a://repo/branch/path-to-object>
use <lakefs://repo/branch/path-to-object/>
Joe M
05/31/2024, 3:08 PMJoe M
05/31/2024, 10:43 PM24/05/31 15:38:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
loading partition from <lakefs://ct-raw/partitioned/partitioned-data/>
24/05/31 15:38:28 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
24/05/31 15:38:28 INFO SharedState: Warehouse path is 'file:/mnt/c/src/ct/badger/badger_be/load-scripts/spark-warehouse'.
24/05/31 15:38:29 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: <lakefs://ct-raw/partitioned/partitioned-data/>.
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class io.lakefs.LakeFSFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)
Joe M
05/31/2024, 10:58 PM.config("spark.jars.packages", "io.lakefs:hadoop-lakefs-assembly:0.2.4")
Joe M
05/31/2024, 10:59 PMJoe M
06/01/2024, 12:12 AM24/05/31 23:58:59 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: <lakefs://ct-raw/partitioned/partitioned-data/>.
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class io.lakefs.LakeFSFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2693) ~[hadoop-client-api-3.3.6-amzn-3.jar:?]
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3628) ~[hadoop-client-api-3.3.6-amzn-3.jar:?]
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3663) ~[hadoop-client-api-3.3.6-amzn-3.jar:?]
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:173) ~[hadoop-client-api-3.3.6-amzn-3.jar:?]
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3767) ~[hadoop-client-api-3.3.6-amzn-3.jar:?]
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3718) ~[hadoop-client-api-3.3.6-amzn-3.jar:?]
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:564) ~[hadoop-client-api-3.3.6-amzn-3.jar:?]
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:373) ~[hadoop-client-api-3.3.6-amzn-3.jar:?]
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53) ~[spark-sql_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
Joe M
06/01/2024, 12:34 AMJoe M
06/01/2024, 4:01 AMJoe M
06/01/2024, 4:13 AMOffir Cohen
06/02/2024, 6:44 AMJoe M
06/02/2024, 1:35 PMAriel Shaqed (Scolnicov)
06/05/2024, 7:10 PMJoe M
06/05/2024, 7:34 PMAriel Shaqed (Scolnicov)
06/05/2024, 7:39 PMJoe M
06/05/2024, 7:40 PMJoe M
06/05/2024, 10:35 PMJoe M
06/05/2024, 10:36 PMNiro
06/06/2024, 2:39 AMAriel Shaqed (Scolnicov)
06/06/2024, 4:47 AMJoe M
06/06/2024, 1:12 PMAriel Shaqed (Scolnicov)
06/06/2024, 1:34 PMJoe M
06/06/2024, 2:32 PMeinat.orr
Joe M
06/06/2024, 4:54 PMJoe M
06/06/2024, 4:54 PMeinat.orr
Joe M
06/06/2024, 4:59 PMJoe M
06/06/2024, 5:02 PMJoe M
06/08/2024, 5:03 PMJoe M
06/08/2024, 5:11 PMeinat.orr
Joe M
06/08/2024, 5:24 PMJoe M
06/08/2024, 5:25 PMAriel Shaqed (Scolnicov)
06/08/2024, 6:04 PMmain@
sees the latest commit on main, but no uncommitted objects. In fact anything other than a branch head ref is immutable, and shows no uncommitted objects. This includes tags, commit digests, abbreviated commit digests, and any ref expression such as main~1
.Joe M
06/08/2024, 6:05 PMJoe M
06/08/2024, 6:05 PMAriel Shaqed (Scolnicov)
06/08/2024, 6:05 PMct-ref@
Joe M
06/08/2024, 6:08 PMAriel Shaqed (Scolnicov)
06/08/2024, 6:09 PMJoe M
06/08/2024, 6:12 PMJoe M
06/08/2024, 6:12 PMAriel Shaqed (Scolnicov)
06/08/2024, 6:14 PMJoe M
06/08/2024, 6:16 PMJoe M
06/08/2024, 6:16 PMJoe M
06/08/2024, 6:20 PMAriel Shaqed (Scolnicov)
06/08/2024, 6:23 PMJoe M
06/08/2024, 6:24 PMAriel Shaqed (Scolnicov)
06/08/2024, 7:19 PMJoe M
06/08/2024, 7:53 PMJoe M
06/08/2024, 7:58 PMJoe M
06/08/2024, 7:59 PMAriel Shaqed (Scolnicov)
06/08/2024, 8:04 PMJoe M
06/08/2024, 8:08 PMJoe M
06/08/2024, 8:10 PMJoe M
06/08/2024, 8:29 PMJoe M
06/08/2024, 8:30 PMJoe M
06/08/2024, 8:31 PMJoe M
06/08/2024, 9:06 PMAriel Shaqed (Scolnicov)
06/09/2024, 6:52 AMmapreduce.fileoutputcommitter.cleanup-failures.ignored
or even mapreduce.fileoutputcommitter.failures.attempts
? (ref)
• Do you want to go back to trying to use S3A?
◦ Can you set fs.s3a.directory.marker.retention
to keep
? (ref)Joe M
06/09/2024, 4:03 PMJoe M
06/09/2024, 4:20 PMJoe M
06/09/2024, 4:22 PMAriel Shaqed (Scolnicov)
06/09/2024, 4:37 PMthe KV is a postgres db, and it's probably a t1.micro
That probably does not have sufficient iops. Also I'd want a more resilient database setup for production.
i don't think ignoring the delete failures is an option. that will leave temporary files in the branch after the final commit.
Well, you'd still be deleting them - only after the Spark job, or by calling
lakectl fs rm
from the driver. In any case failure to delete is a recoverable failure.Joe M
06/09/2024, 4:42 PMJoe M
06/09/2024, 4:58 PMJoe M
06/09/2024, 4:59 PMJoe M
06/09/2024, 5:00 PMJoe M
06/09/2024, 5:03 PMJoe M
06/09/2024, 5:04 PMJoe M
06/09/2024, 5:07 PMAriel Shaqed (Scolnicov)
06/09/2024, 5:39 PM/metrics
endpoint. They're intended for Prometheus, but yours doesn't sound like a k8s setup. However you could periodically fetch those into a sequence of files, and maybe we could fake it from those. Or run lakeFS with DEBUG logging I think, and get durations of many operations.Joe M
06/09/2024, 6:04 PMJoe M
06/09/2024, 6:24 PMJoe M
06/09/2024, 6:25 PMJoe M
06/09/2024, 6:25 PMAriel Shaqed (Scolnicov)
06/09/2024, 6:27 PMJoe M
06/09/2024, 6:31 PMJoe M
06/09/2024, 6:31 PMJoe M
06/09/2024, 8:24 PMJoe M
06/09/2024, 10:59 PMJoe M
06/09/2024, 10:59 PMJoe M
06/09/2024, 11:00 PMJoe M
06/10/2024, 1:36 PMJoe M
06/10/2024, 1:36 PMJoe M
06/10/2024, 1:37 PMJoe M
06/10/2024, 1:37 PMJoe M
06/10/2024, 1:38 PMJoe M
06/10/2024, 1:38 PMJoe M
06/10/2024, 9:29 PMeinat.orr
Joe M
06/11/2024, 1:01 AMeinat.orr
Ariel Shaqed (Scolnicov)
06/11/2024, 8:19 AM