Thread
#help
    Guy Hardonag

    1 week ago
    Welcome back 😄 As far as I know the first error has nothing to do with lakeFS; writing without overwrite to an existing path should fail. Looking into the second error… will be back with an answer today.
    Iddo Avneri

    1 week ago
    You might want to use this notebook - could be easier: https://github.com/treeverse/lakeFS-samples/tree/main/03-apache-spark-python-demo
    Robin Moffatt

    1 week ago
    oh cool @Iddo Avneri, that looks useful. thanks!
    Iddo Avneri

    1 week ago
    There is also a YouTube video going over setting up one of these (the Delta one specifically, but Spark will be the same):

    https://youtu.be/knbiXy6mWNg

    Barak Amar

    1 week ago
    @Robin Moffatt in your notebook I've configured the Spark session this way:
    from pyspark.sql.session import SparkSession

    # point s3a at the lakeFS S3 gateway (example credentials)
    spark = SparkSession \
        .builder \
        .config("spark.hadoop.fs.s3a.access.key", "AKIAIOSFODNN7EXAMPLE") \
        .config("spark.hadoop.fs.s3a.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY") \
        .config("spark.hadoop.fs.s3a.endpoint", "http://lakefs:8000") \
        .getOrCreate()
    sc = spark.sparkContext
    With the above, s3a goes through the lakeFS S3 gateway and can write/overwrite the data.
    Robin Moffatt

    1 week ago
    Thanks @Barak Amar. So does that override https://github.com/treeverse/lakeFS/blob/master/deployments/compose/etc/hive-site.xml#L51-L62 and do something different?
    Barak Amar

    1 week ago
    Right, I need to check why Spark inside the notebook doesn't use these values.
    Did my session setup work for you?
    Robin Moffatt

    1 week ago
    It seems to use something because it writes the first dataframe just fine
    I'm trying your config on the notebook now
    No, I get the same error
    Caused by: java.io.FileNotFoundException:
    No such file or directory: s3a://example/remove_pii/demo/users/part-00000-7a0bbe79-a3e2-4355-984e-bd8b950a4e0c-c000.snappy.parquet
    I'm using Everything Bagel setup, if that makes any difference?
    Barak Amar

    1 week ago
    the error is on the second branch
    do you see the branch in the UI?
    Robin Moffatt

    1 week ago
    let me check
    so is something weird happening with the overwrite behaviour?
    it's removing it to overwrite it but expects it to be there when it then writes it again?
    /me grasps at straws with his very limited understanding of this stuff
    Barak Amar

    1 week ago
    Was the last error you pasted from a request to read the file?
    Robin Moffatt

    1 week ago
    no, a write statement
    df2.write.mode('overwrite').parquet('s3a://'+repo+'/'+branch+'/demo/users')
    Barak Amar

    1 week ago
    OK, I'll be next to my laptop a bit later; I'll rerun the notebook and get back with better answers in an hour.
    Robin Moffatt

    1 week ago
    thanks!
    Barak Amar

    1 week ago
    Hi Robin, the problem is that when we (the notebook) remove the PII, we need to read the data and then write it back. In this case you want to read the data from the main branch and write the data to the remove_pii branch. Currently the remove_pii branch holds the right data (as of the point you branched), but we write to the same location we read from.
    Update the notebook to read from 'main' and write to 'remove_pii':
    repo='example'
    xform_df = spark.read.parquet('s3a://'+repo+'/main/demo/users')
    ...
    df2=xform_df.drop('ip_address','birthdate','salary','email')
    and later
    df2.write.parquet('s3a://'+repo+'/remove_pii/demo/users')
    and it should work.
    Let me know if it works for you, and/or whether I should better explain what I found so far.
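    Barak's diagnosis can be sketched without Spark at all. The following is a minimal plain-Python simulation (the function names and paths are illustrative, not lakeFS or Spark APIs) of a lazy read combined with an overwrite that deletes the target before the read actually runs:

```python
# Minimal model: Spark reads are lazy, and mode('overwrite') removes the
# target directory before the read is actually executed.
def lazy_read(storage, path):
    return lambda: storage[path]   # thunk: data is fetched only at write time

def overwrite_write(storage, path, dataset):
    storage.pop(path, None)        # overwrite: delete the target first...
    storage[path] = dataset()      # ...then evaluate the lazy read (too late)

storage = {"example/remove_pii/demo/users": ["row1", "row2"]}

# read and write the same path: the source is gone when the read runs
df = lazy_read(storage, "example/remove_pii/demo/users")
try:
    overwrite_write(storage, "example/remove_pii/demo/users", df)
except KeyError:
    print("analogue of the FileNotFoundException in the thread")

# read from main and write to remove_pii: no conflict
storage["example/main/demo/users"] = ["row1", "row2"]
df2 = lazy_read(storage, "example/main/demo/users")
overwrite_write(storage, "example/remove_pii/demo/users", df2)
```

    The real Spark/s3a behaviour is more involved, but the shape of the failure is the same: the write's source directory is deleted by the overwrite before the lazy read materialises.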
    Robin Moffatt

    1 week ago
    hey @Barak Amar, thanks for this. By using your approach (read from main, write to branch) I got it to work. However I'm still not clear on this, because in my mental model of branching (based on git) I would expect to also read from my branch; main could have changed since I created my remove_pii branch, so reading from it feels incorrect. Can you help clarify this?
    Barak Amar

    1 week ago
    Sure. The issue is not directly related to lakeFS: we try to transform and overwrite the same path, so the data is lost as part of the overwrite process.
    From lakeFS's point of view you can read from the commit ID the branch was created from; that should work too.
    When using the branch name, the read and the write look at the same directory: the staging area.
    Use the reference from https://pydocs.lakefs.io/docs/BranchesApi.html#get_branch for the 'remove_pii' branch before you overwrite. When you think about it, we enable a nice solution to problems like https://community.databricks.com/s/question/0D53f00001HKHkSCAX/spark-how-to-simultaneously-read-from-and-write-to-the-same-parquet-file Hope it makes sense.
    Robin Moffatt

    1 week ago
    I think the issue here is that I am also learning pyspark & parquet files etc. as I go, so apologies for confusing things with general misunderstanding 😉
    Use the reference from https://pydocs.lakefs.io/docs/BranchesApi.html#get_branch from 'remove_pii' branch before you override.
    so in pseudocode I would call get_branch and use the ref it returns to read the file (instead of via the branch name), is that right?
    Barak Amar

    1 week ago
    Yes, about the branch information, I think that would be the right way to go. About Spark: learning the pitfalls with you, so this is great 🙂
    You don't need all the code from the docs; like you did with create branch, you already have a client.
    I'm here if you need anything else to get it working.
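    A hedged sketch of that approach, reading via the branch's head commit so the read target is immutable while the write goes to the mutable branch. The client call is per the pydocs link above (verify against your installed lakefs_client version), and the commit ID here is a made-up placeholder:

```python
# With a real client (per pydocs.lakefs.io, verify locally):
#   ref = client.branches.get_branch(repository='example',
#                                    branch='remove_pii').commit_id
ref = "c5d9f4a"   # placeholder commit ID for illustration
repo = "example"

src = f"s3a://{repo}/{ref}/demo/users"        # read from the immutable commit
dst = f"s3a://{repo}/remove_pii/demo/users"   # write to the mutable branch

# xform_df = spark.read.parquet(src)
# df2 = xform_df.drop('ip_address', 'birthdate', 'salary', 'email')
# df2.write.mode('overwrite').parquet(dst)
```

    Because a commit ID never changes, the overwrite can delete and rewrite the branch path without pulling the source out from under the lazy read.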
    Robin Moffatt

    1 week ago
    thanks, I appreciate it!
    this also seems to work: read from the branch, but use .cache() on the data frame before writing it https://stackoverflow.com/a/65330116/350613
    Barak Amar

    1 week ago
    I saw that one. I don't like to count on cache to prevent data access: you don't know when it won't have enough space and will go to the actual storage, or whether it works by writing the data and moving it at the end. So read/process/write without the cache will probably be faster for larger sets.
    Assume there is no magic in software engineering; I could be wrong.
    Robin Moffatt

    1 week ago
    got it. thanks.