# help
b
Hi everyone! I'm trying to use pyspark to write a new file to a repository, and I can do it, but it's very slow... Checking the logs, I saw this message:
msg="path not found" func="pkg/gateway/operations.(*HeadObject).Handle" file="build/pkg/gateway/operations/headobject.go:30" host="<http://s3.local.lakefs.io:8000|s3.local.lakefs.io:8000>" method=HEAD path=.spark-staging-2d639c01-addd-4495-91a1-37ef56187e69/ ref=develop repository=customers request_id=f7da0575-3295-4fc2-90d0-ba6bb1c96b83 service_name=s3_gateway user=admin
many times... Can someone help me?
Here is my code snippet:
from pyspark.sql import SparkSession


def spark_session():
    # Build a local Spark session with the hadoop-aws and aws-java-sdk jars on the classpath.
    spark: SparkSession = SparkSession \
        .builder \
        .master('local[1]') \
        .appName('lakefs') \
        .config('spark.jars', '../resources/jars/hadoop-aws-3.2.0.jar,../resources/jars/aws-java-sdk-bundle-1.11.375.jar') \
        .getOrCreate()
    # Point the S3A filesystem at the lakeFS S3 gateway.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set('fs.s3a.access.key', 'AKIAJYCCHHXD5F3I3NDQ')
    hadoop_conf.set('fs.s3a.secret.key', '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')
    hadoop_conf.set('fs.s3a.endpoint', 'http://s3.local.lakefs.io:8000')
    hadoop_conf.set('fs.s3a.path.style.access', 'true')
    return spark


def main():
    spark = spark_session()
    df = spark.createDataFrame([(3, 'bern'), (4, 'raf'), (6, 'brun')], ['id', 'value'])
    repo = 'customers'
    branch = 'develop'
    # lakeFS paths over the S3 gateway take the form s3a://<repository>/<branch>/<key>.
    path = f's3a://{repo}/{branch}'
    print(f'writing to {path}')
    df.write.mode('overwrite').parquet(path)
    print('finished')
    spark.stop()


if __name__ == '__main__':
    main()
t
Hi @Bruno Canal, welcome! I’m looking into it and will be in touch with you shortly.
• When you say that the write operation is very slow, how much time does it take?
b
It takes around 20 minutes to write one parquet file containing three rows! Thanks @Tal Sofer
t
Thanks @Bruno Canal, right now I’m unable to reproduce this; let me work on it this morning (GMT+3) and get back to you!
• Do you mind sharing more of your logs? Apart from the message you mentioned, are you seeing any ERROR-level logs?
• What Spark and lakeFS versions are you running?
• Are you getting errors related to the AWS SDK version you are running?
b
Here is the log file with all the messages from after the write operation started
t
Thank you!
I’m looking at it
b
The Spark version is pyspark==3.1.1 and the lakeFS version is the treeverse/lakefs:0.41.1 Docker image
I didn't get any errors from the SDK. I just took the aws-sdk and hadoop-aws versions from the Katacoda tutorial (https://www.katacoda.com/lakefs/scenarios/lakefs-play)
The versions:
• hadoop-aws-3.2.0
• aws-java-sdk-bundle-1.11.375
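As an aside: instead of hard-coding local jar paths in spark.jars, Spark can resolve the same dependencies from Maven at startup. A minimal sketch, assuming network access to Maven Central; hadoop-aws 3.2.0 declares aws-java-sdk-bundle 1.11.375 as a dependency, so one coordinate covers both jars (my_app.py is a placeholder for the script above):

# Let Spark fetch hadoop-aws and its matching aws-java-sdk-bundle from Maven,
# instead of pointing spark.jars at local jar files.
spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.0 my_app.py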
t
Thank you!
b
Thanks @Tal Sofer
t
np 🙂 @Bruno Canal do you still have the logs during the write operation?
b
Hey @Tal Sofer, yes! It's the logs that I sent you... those logs start right after the write begins, so they are from during the write operation
t
Oh ok, I misread your message
@Bruno Canal what does your deployment look like? What DB are you using, and where is it deployed?
b
I'm using Postgres 13 on GCP
t
Just to confirm - is lakefs deployed locally on your machine?
b
yes
the YAML config file is:
listen_address: "0.0.0.0:8000"

logging:
  format: text
  level: DEBUG
  output: "-"

database:
  connection_string: "<postgres://lakefs>:lakefs@<DATABASE IP>)/postgres?sslmode=disable"

auth:
  encrypt:
    secret_key: "lakefs"

blockstore:
  type: gs
  gs:
    credentials_file: "/home/lakefs/.credentials.json"

gateways:
  s3:
    domain_name: "s3.local.lakefs.io:8000"
the docker run command:
docker run \
--name lakefs \
-p 8000:8000 \
--network bridge \
-v $(pwd)/lakefs-config.yaml:/home/lakefs/.lakefs.yaml \
-v $(pwd)/credentials.json:/home/lakefs/.credentials.json \
treeverse/lakefs:0.41.1 run
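One detail this setup depends on: the S3A client in Spark must be able to resolve the gateway hostname to the machine running lakeFS. A minimal sketch, assuming lakeFS listens on localhost:8000 and /etc/hosts handles the resolution:

# Map the lakeFS S3 gateway hostname to the local machine so the
# fs.s3a.endpoint used by Spark resolves correctly.
echo "127.0.0.1 s3.local.lakefs.io" | sudo tee -a /etc/hosts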
t
@Bruno Canal The reason your Spark job is so slow is the distance between the lakeFS process and the database. Going over your logs I saw no errors, but I noticed that each action takes significantly longer than usual. To confirm this, can you either run Postgres locally or deploy lakeFS on GCP, and then test the same simple Spark app? Both ways should work; the important thing is that the database and the lakeFS process share the same network.
And please let me know if you have more questions 🙂
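A minimal sketch of the local test suggested above, assuming Docker is used for both containers; the network and container names (lakefs-net, lakefs-db) are illustrative:

# Create a shared network so lakeFS and Postgres are co-located.
docker network create lakefs-net

# Run Postgres 13 locally (credentials here mirror the config above).
docker run -d --name lakefs-db --network lakefs-net \
  -e POSTGRES_USER=lakefs -e POSTGRES_PASSWORD=lakefs \
  postgres:13

# Update lakefs-config.yaml to point at the local container, e.g.:
#   connection_string: "postgres://lakefs:lakefs@lakefs-db:5432/postgres?sslmode=disable"
docker run --name lakefs -p 8000:8000 --network lakefs-net \
  -v $(pwd)/lakefs-config.yaml:/home/lakefs/.lakefs.yaml \
  -v $(pwd)/credentials.json:/home/lakefs/.credentials.json \
  treeverse/lakefs:0.41.1 run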
b
Thanks @Tal Sofer! I’ll do that and let you know if I run into any problems
t
Cool! Good luck