Guy Hardonag
09/15/2022, 1:26 PM
Iddo Avneri
09/15/2022, 2:07 PM
Robin Moffatt
09/15/2022, 2:11 PM
Barak Amar
from pyspark.context import SparkContext
from pyspark import SparkFiles
from pyspark.sql.session import SparkSession
spark = SparkSession \
    .builder \
    .config("spark.hadoop.fs.s3a.access.key", "AKIAIOSFODNN7EXAMPLE") \
    .config("spark.hadoop.fs.s3a.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://lakefs:8000") \
    .getOrCreate()
sc = spark.sparkContext
Robin Moffatt
09/15/2022, 4:11 PM
Barak Amar
Robin Moffatt
09/15/2022, 4:15 PM
Caused by: java.io.FileNotFoundException:
No such file or directory: s3a://example/remove_pii/demo/users/part-00000-7a0bbe79-a3e2-4355-984e-bd8b950a4e0c-c000.snappy.parquet
Barak Amar
Robin Moffatt
09/15/2022, 4:18 PM
/me grasps at straws with his very limited understanding of this stuff
Barak Amar
Robin Moffatt
09/15/2022, 4:22 PM
df2.write.mode('overwrite').parquet('s3a://'+repo+'/'+branch+'/demo/users')
Barak Amar
Robin Moffatt
09/15/2022, 4:28 PM
Barak Amar
repo='example'
xform_df = spark.read.parquet('s3a://'+repo+'/main/demo/users')
...
df2=xform_df.drop('ip_address','birthdate','salary','email')
and later
df2.write.parquet('s3a://'+repo+'/remove_pii/demo/users')
Robin Moffatt
09/16/2022, 8:43 AM
main could have changed since I created my remove_pii branch, so reading from it feels incorrect. Can you help clarify this?
Barak Amar
Robin Moffatt
09/16/2022, 9:40 AM
Use the reference from https://pydocs.lakefs.io/docs/BranchesApi.html#get_branch from ‘remove_pii’ branch before you override.
So in pseudo code I would call get_branch, then use the ref it returns to read the file (instead of reading via the branch). Is that right?
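A minimal sketch of that pseudo code, assuming the `lakefs_client` Python package from the linked docs, where `get_branch(repository, branch)` returns a ref object with a `commit_id` field, and assuming a commit ID can stand in for the branch name in the `s3a://` path. The helper names `ref_path` and `read_at_branch_head` are made up for illustration.

```python
def ref_path(repo, ref, path):
    """Build an s3a URI for `path` at a given lakeFS ref (branch name or commit ID)."""
    return f"s3a://{repo}/{ref}/{path.lstrip('/')}"


def read_at_branch_head(spark, branches_api, repo, branch, path):
    """Resolve the branch to its current commit, then read via that commit.

    `branches_api` is assumed to be the `branches` attribute of a
    lakefs_client LakeFSClient instance (an assumption, not shown in this thread).
    """
    # Look up the commit the branch points at right now.
    commit_id = branches_api.get_branch(repository=repo, branch=branch).commit_id
    # Read through the fixed commit ID rather than the moving branch name.
    return spark.read.parquet(ref_path(repo, commit_id, path))
```

Usage would then look something like `xform_df = read_at_branch_head(spark, client.branches, 'example', 'remove_pii', 'demo/users')`, where `client` is a hypothetical, already configured lakeFS client. The write back to `s3a://example/remove_pii/demo/users` stays as before.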
Barak Amar
Robin Moffatt
09/16/2022, 9:43 AM
.cache() on the data frame before writing it https://stackoverflow.com/a/65330116/350613
Barak Amar
Robin Moffatt
09/16/2022, 10:13 AM