# help
r
using
Copy code
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.access.key", dbutils.secrets.get("development","LAKEFS_ACCESS_KEY"))
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.secret.key", dbutils.secrets.get("development","LAKEFS_SECRET_ACCESS_KEY"))
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "<http://lakefs.dev.company.com/api/v1/>")
This write fails silently: there are no errors, but when I check the location there is no data. I am able to read data using the lakefs:// prefix if it has previously been written using the other method for writing data.
Copy code
val masterPath = s"lakefs://${repo}/${master}/parquet/"
case class Test(story: String)
val masterStory = Seq(Test("I live on mastery")).toDS
println(masterPath)
masterStory.write.mode("overwrite").parquet(masterPath)
👀 1
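(For reference, reading back through the lakeFS Hadoop filesystem uses the same lakefs:// layout; a minimal sketch, reusing the repo and branch placeholders from above.)
Copy code
// Read via LakeFSFileSystem over the same lakefs:// layout as the write above
val readPath = s"lakefs://${repo}/${master}/parquet/"   // placeholder repo/branch names
val readBack = spark.read.parquet(readPath)
readBack.show()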
b
Created a cluster and ran the following:
Copy code
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.access.key", "<lakefs key>")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.secret.key", "<lakefs secret>")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "<https://bfs.lakefs.dev/api/v1>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<s3a key>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<s3a secret>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.region", "us-east-1")

val masterPath = s"<lakefs://test31/master/parquet2/>"
case class Test(story: String)
val masterStory = Seq(Test("Say Yes More")).toDS
println(masterPath)
masterStory.write.mode("overwrite").parquet(masterPath)
I updated the message after the first run, downloaded the parquet it created, and verified that it holds the updated message.
Can you add the s3a properties and restart the cluster?
r
we use instance profiles rather than fs.s3a settings
I’m able to read using
lakefs://repo/branch/path/
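(A minimal sketch of what relying on the instance profile instead of static keys looks like in plain Hadoop S3A; it is an assumption that this applies unchanged on Databricks.)
Copy code
// Use the instance profile credentials instead of fs.s3a.access.key / fs.s3a.secret.key
spark.sparkContext.hadoopConfiguration.set(
  "fs.s3a.aws.credentials.provider",
  "com.amazonaws.auth.InstanceProfileCredentialsProvider")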
b
There is a specific case in the lakeFS implementation where it uses the AWS S3 SDK when a file is written. To verify this issue is related to that client, please access the underlying storage using the fs.s3a properties. I understand instance profiles are something we will need to support; I just want to confirm this is the issue and not something else.
Created this issue after trying to reproduce your sample using a Databricks instance profile. Feel free to add more information. I will look into how to support Databricks instance profiles on Sunday and update when this issue can be resolved.
r
Saw this ticket had been closed, so I assembled this and reran it. I can see the data is written to S3, but it's not registered with lakeFS:
Copy code
2021-07-08 20:15:08 [Executor task launch worker for task 1] INFO S3AFileSystem:V3: FS_OP_CREATE BUCKET[211459479356-databricks-data] FILE[s3a://acc-databricks-data/lakefs-testing/data/testing-master:7815b225-ef96-450a-b79e-08a45a73e112/rkrI2cQ27qEpbJysg7J46] Creating output stream; permission: rw-r--r--, overwrite: false, bufferSize: 4096
2021-07-08 20:15:08 [Executor task launch worker for task 1] INFO InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 25
2021-07-08 20:15:08 [Executor task launch worker for task 1] INFO S3ABlockOutputStream:V3: FS_OP_CREATE BUCKET[acc-databricks-data] FILE[lakefs-testing/data/testing-master:7815b225-ef96-450a-b79e-08a45a73e112/rkrI2cQ27qEpbJysg7J46] Closing stream; size: 579
2021-07-08 20:15:08 [Executor task launch worker for task 1] INFO S3ABlockOutputStream:V3: FS_OP_CREATE BUCKET[acc-databricks-data] FILE[lakefs-testing/data/testing-master:7815b225-ef96-450a-b79e-08a45a73e112/rkrI2cQ27qEpbJysg7J46] Upload complete; size: 579
2021-07-08 20:15:08 [Executor task launch worker for task 1] INFO BasicWriteTaskStatsTracker: Expected 1 files, but only saw 0. This could be due to the output format not writing empty files, or files being not immediately visible in the filesystem.
2021-07-08 20:15:08 [Executor task launch worker for task 1] INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 2114 bytes result sent to driver
I can read that path directly and it has the data; it's just not registered in the lakeFS ecosystem…
g
Hi @Richard Gilmore, I am reopening the ticket. Someone will give it a look tomorrow morning. (It's now 23:50 here.)
👍 1
t
Hi @Richard Gilmore, can you please try removing the trailing slash from the lakeFS endpoint URL:
Copy code
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "http://lakefs.dev.company.com/api/v1/")
It becomes:
Copy code
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "http://lakefs.dev.company.com/api/v1")
Please let us know if this fixes the problem. If it does not, we will investigate further!
Yesterday, we discovered this bug https://github.com/treeverse/lakeFS/issues/2214 and I think that this may be the issue you are experiencing.
r
ah yep that solved it. I can write parquet, and it's 99% of the way there for Delta tables. It fails on the initial write of the Delta table when creating the delta log, but then is fine for subsequent updates to the table…
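(A sketch of the kind of initial Delta write being described, assuming the same lakefs:// layout and the Test case class from above; the path is a placeholder.)
Copy code
// Initial Delta write: this is the step that creates the _delta_log directory
val deltaPath = s"lakefs://${repo}/${master}/deltatest/"   // placeholder path
val deltaStory = Seq(Test("delta on lakeFS")).toDS
deltaStory.write.format("delta").mode("overwrite").save(deltaPath)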
t
Great! Thanks for letting me know. Would you kindly share the logs that report the failure?
r
yeah let me grab them here
t
Thanks
r
logs.txt
t
Thank you, I will look into it and update you on the next steps
b
Can you check that spark.databricks.delta.multiClusterWrites.enabled is set to false? Just to verify it is not related to this issue.
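(Checking and, if needed, setting that flag from the notebook session can be done with the standard spark.conf API; a sketch:)
Copy code
// Verify the current value of the Delta multi-cluster writes flag, then disable it for this session
println(spark.conf.get("spark.databricks.delta.multiClusterWrites.enabled", "not set"))
spark.conf.set("spark.databricks.delta.multiClusterWrites.enabled", "false")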
r
Another snippet further up where it identifies that no delta log exists:
Copy code
2021-07-09 10:19:33 [WRAPPER-ReplId-6237a-3a591-a91d7-7] TRACE LakeFSFileSystem[OPERATION]: open(lakefs://testing/master/deltatest/_delta_log/_last_checkpoint)
2021-07-09 10:19:33 [WRAPPER-ReplId-6237a-3a591-a91d7-7] TRACE LakeFSFileSystem[OPERATION]: exists(lakefs://testing/master/deltatest/_delta_log)
2021-07-09 10:19:33 [rpc-server-4-2]  INFO InitialSnapshot: [tableId=eb8b6bf5-ed4f-4d3f-ba09-b99c769cae20] Created snapshot InitialSnapshot(path=lakefs://testing/master/deltatest/_delta_log, version=-1, metadata=Metadata(7f335ec6-b67a-4445-a135-261caaf54ee7,null,null,Format(parquet,Map()),null,List(),Map(),Some(1625825973686)), logSegment=LogSegment(lakefs://testing/master/deltatest/_delta_log,-1,List(),List(),None,-1), checksumOpt=None)
yeah I had multi cluster writes disabled
👍 1
t
@Richard Gilmore I wanted to confirm - are you running into this failure only while using the lakeFSFileSystem on Databricks, that is, not while using the s3 gateway?
r
yeah, I can do the initial write using the s3 gateway and then subsequent writes using LakeFSFileSystem; it's just the initial creation of the _delta_log that is failing with LakeFSFileSystem
👍 1
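(For comparison, the S3 gateway route mentioned here is plain S3A pointed at the lakeFS server with lakeFS credentials and path-style access; a sketch with placeholder endpoint, keys, and paths, reusing the Test case class from above.)
Copy code
// S3 gateway: talk to the lakeFS server over the S3 protocol, addressed as s3a://<repo>/<branch>/<path>
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "https://bfs.lakefs.dev")   // placeholder lakeFS endpoint
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<lakefs key>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<lakefs secret>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
Seq(Test("via the gateway")).toDS.write.format("delta").mode("overwrite").save("s3a://testing/master/deltatest/")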
t
I opened this issue https://github.com/treeverse/lakeFS/issues/2219 to track that and will investigate it further. It may be related to the other open Databricks issue we are investigating, https://github.com/treeverse/lakeFS/issues/2111. Thanks for reporting it!
r
regarding #2111, I've been able to do reads on Spark 3
t
Were you able to read partitioned parquet? Our suspicion is that the issue only happens with partitioned data
r
ah, will test; my tables were not partitioned
t
Thanks! :)
r
Ok so ran a bit of testing here:
• parquet partitioned write 👎
• parquet partitioned read 👎
• delta partitioned write 👎
• delta partitioned read 👍 (confused about this one 😃)
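(For context, a partitioned round-trip test along these lines could look like the sketch below; the case class, column, and path are placeholders.)
Copy code
// Partitioned parquet round trip over lakefs:// (placeholder repo/branch/path)
case class Part(story: String, day: String)
val partPath = s"lakefs://${repo}/${master}/partitioned/"
val partData = Seq(Part("a", "2021-07-01"), Part("b", "2021-07-02")).toDS
partData.write.mode("overwrite").partitionBy("day").parquet(partPath)   // partitioned write
spark.read.parquet(partPath).where("day = '2021-07-01'").show()         // partitioned read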
t
Thank you so much for running these tests! I will update the issue with your findings (you are welcome to post a comment if you want!).