# help
b
Hi everyone! I'm trying to use pyspark to write a new file to a repository, and I can do it, but it's very slow... Checking the logs, I saw this message:
msg="path not found" func="pkg/gateway/operations.(*HeadObject).Handle" file="build/pkg/gateway/operations/headobject.go:30" host="<http://s3.local.lakefs.io:8000|s3.local.lakefs.io:8000>" method=HEAD path=.spark-staging-2d639c01-addd-4495-91a1-37ef56187e69/ ref=develop repository=customers request_id=f7da0575-3295-4fc2-90d0-ba6bb1c96b83 service_name=s3_gateway user=admin
many times... Can someone help me?
Here is my code snippet:
from pyspark.sql import SparkSession


def spark_session():
    # Build a local Spark session with the hadoop-aws and aws-java-sdk jars on the classpath.
    spark: SparkSession = SparkSession \
        .builder \
        .master('local[1]') \
        .appName('lakefs') \
        .config('spark.jars', '../resources/jars/hadoop-aws-3.2.0.jar,../resources/jars/aws-java-sdk-bundle-1.11.375.jar') \
        .getOrCreate()
    # Point the S3A filesystem at the lakeFS S3 gateway.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set('fs.s3a.access.key', 'AKIAJYCCHHXD5F3I3NDQ')
    hadoop_conf.set('fs.s3a.secret.key', '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')
    hadoop_conf.set('fs.s3a.endpoint', 'http://s3.local.lakefs.io:8000')
    hadoop_conf.set('fs.s3a.path.style.access', 'true')
    return spark


def main():
    spark = spark_session()
    df = spark.createDataFrame([(3, 'bern'), (4, 'raf'), (6, 'brun')], ['id', 'value'])
    repo = 'customers'
    branch = 'develop'
    # lakeFS paths over the S3 gateway take the form s3a://<repository>/<branch>/<key>.
    path = f's3a://{repo}/{branch}'
    print(f'writing to {path}')
    df.write.mode('overwrite').parquet(path)
    print('finished')
    spark.stop()


if __name__ == '__main__':
    main()
t
Hi @Bruno Canal, welcome! I’m looking into it and will be in touch with you shortly.
• When you say that the write operation is very slow, how much time does it take?
b
It takes around 20 minutes to write one parquet file containing three rows! Thanks @Tal Sofer
t
Thanks @Bruno Canal, right now I’m unable to reproduce this; let me work on it this morning (GMT+3) and get back to you!
• Do you mind sharing more of your logs? Apart from the message you mentioned, are you seeing any ERROR-level logs?
• What Spark and lakeFS versions are you running?
• Are you getting errors related to the AWS SDK version you are running?
b
Here is the log file with all the messages from after the write operation started
t
Thank you!
I’m looking at it
b
The Spark version is pyspark==3.1.1 and the lakeFS version is the treeverse/lakefs:0.41.1 Docker image
I didn't get any errors from the SDK. I just took the aws-sdk and hadoop-aws versions from the Katacoda tutorial (https://www.katacoda.com/lakefs/scenarios/lakefs-play)
The versions:
• hadoop-aws-3.2.0
• aws-java-sdk-bundle-1.11.375
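As an aside: instead of hard-coding local jar paths in spark.jars, Spark can resolve the same dependencies from Maven at startup. A minimal sketch, assuming network access to Maven Central; hadoop-aws 3.2.0 declares aws-java-sdk-bundle 1.11.375 as a dependency, so one coordinate covers both jars (my_app.py is a placeholder for the script above):

# Let Spark fetch hadoop-aws and its matching aws-java-sdk-bundle from Maven,
# instead of pointing spark.jars at local jar files.
spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.0 my_app.py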
t
Thank you!
b
Thanks @Tal Sofer
t
np 🙂 @Bruno Canal do you still have the logs during the write operation?
b
Hey @Tal Sofer, yes! It's the logs that I sent you... those logs start right after the write begins, so they are from during the write operation
t
Oh ok, I misread your message
@Bruno Canal what does your deployment look like? What DB are you using, and where is it deployed?
b
I'm using Postgres 13 on GCP
t
Just to confirm - is lakefs deployed locally on your machine?
b
yes
the YAML config file is:
listen_address: "0.0.0.0:8000"

logging:
  format: text
  level: DEBUG
  output: "-"

database:
  connection_string: "<postgres://lakefs>:lakefs@<DATABASE IP>)/postgres?sslmode=disable"

auth:
  encrypt:
    secret_key: "lakefs"

blockstore:
  type: gs
  gs:
    credentials_file: "/home/lakefs/.credentials.json"

gateways:
  s3:
    domain_name: "s3.local.lakefs.io:8000"
the docker run command:
docker run \
--name lakefs \
-p 8000:8000 \
--network bridge \
-v $(pwd)/lakefs-config.yaml:/home/lakefs/.lakefs.yaml \
-v $(pwd)/credentials.json:/home/lakefs/.credentials.json \
treeverse/lakefs:0.41.1 run
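One detail this setup depends on: the S3A client in Spark must be able to resolve the gateway hostname to the machine running lakeFS. A minimal sketch, assuming lakeFS listens on localhost:8000 and /etc/hosts handles the resolution:

# Map the lakeFS S3 gateway hostname to the local machine so the
# fs.s3a.endpoint used by Spark resolves correctly.
echo "127.0.0.1 s3.local.lakefs.io" | sudo tee -a /etc/hosts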
t
@Bruno Canal The reason your Spark job is so slow is the distance between the lakeFS process and the database. Going over your logs I saw no errors, but I noticed that each action takes significantly longer than usual. To confirm this, can you either run Postgres locally or deploy lakeFS on GCP, and then test the same simple Spark app? Both ways should work; the important thing is that the database and the lakeFS process share the same network.
And please let me know if you have more questions 🙂
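A minimal sketch of the local test suggested above, assuming Docker is used for both containers; the network and container names (lakefs-net, lakefs-db) are illustrative:

# Create a shared network so lakeFS and Postgres are co-located.
docker network create lakefs-net

# Run Postgres 13 locally (credentials here mirror the config above).
docker run -d --name lakefs-db --network lakefs-net \
  -e POSTGRES_USER=lakefs -e POSTGRES_PASSWORD=lakefs \
  postgres:13

# Update lakefs-config.yaml to point at the local container, e.g.:
#   connection_string: "postgres://lakefs:lakefs@lakefs-db:5432/postgres?sslmode=disable"
docker run --name lakefs -p 8000:8000 --network lakefs-net \
  -v $(pwd)/lakefs-config.yaml:/home/lakefs/.lakefs.yaml \
  -v $(pwd)/credentials.json:/home/lakefs/.credentials.json \
  treeverse/lakefs:0.41.1 run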
b
Thanks @Tal Sofer! I’ll do that and let you know if I run into any problems
t
Cool! Good luck