# help
r
using
Copy code
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.access.key", dbutils.secrets.get("development","LAKEFS_ACCESS_KEY"))
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.secret.key", dbutils.secrets.get("development","LAKEFS_SECRET_ACCESS_KEY"))
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "<http://lakefs.dev.company.com/api/v1/>")
This write fails silently: there are no errors, but when I check the location there is no data. I am able to read data using the lakefs:// prefix if it has previously been written using the other method for writing data.
Copy code
val masterPath = s"lakefs://${repo}/${master}/parquet/"
case class Test(story: String)
val masterStory = Seq(Test("I live on mastery")).toDS
println(masterPath)
masterStory.write.mode("overwrite").parquet(masterPath)
👀 1
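(For reference, reading back through the lakeFS Hadoop filesystem uses the same lakefs:// layout; a minimal sketch, reusing the repo and branch placeholders from above.)
Copy code
// Read via LakeFSFileSystem over the same lakefs:// layout as the write above
val readPath = s"lakefs://${repo}/${master}/parquet/"   // placeholder repo/branch names
val readBack = spark.read.parquet(readPath)
readBack.show()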
b
Created a cluster and ran the following:
Copy code
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.access.key", "<lakefs key>")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.secret.key", "<lakefs secret>")
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "<https://bfs.lakefs.dev/api/v1>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<s3a key>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<s3a secret>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.region", "us-east-1")

val masterPath = s"<lakefs://test31/master/parquet2/>"
case class Test(story: String)
val masterStory = Seq(Test("Say Yes More")).toDS
println(masterPath)
masterStory.write.mode("overwrite").parquet(masterPath)
I updated the message after the first run, downloaded the parquet it created, and verified that it holds the updated message.
Can you add the s3a properties and restart the cluster?
r
we use instance profiles rather than fs.s3a settings
I’m able to read using
lakefs://repo/branch/path/
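(A minimal sketch of what relying on the instance profile instead of static keys looks like in plain Hadoop S3A; it is an assumption that this applies unchanged on Databricks.)
Copy code
// Use the instance profile credentials instead of fs.s3a.access.key / fs.s3a.secret.key
spark.sparkContext.hadoopConfiguration.set(
  "fs.s3a.aws.credentials.provider",
  "com.amazonaws.auth.InstanceProfileCredentialsProvider")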
b
There is a specific case in the lakeFS implementation where it uses the AWS S3 SDK when a file is written. To verify this issue is related to that client, please access the underlying storage using the fs.s3a properties. I understand instance profiles are something we will need to support; I just want to confirm this is the issue and not something else.
Created this issue after trying to reproduce your sample using a Databricks instance profile. Feel free to add more information. I will look into how to support Databricks instance profiles on Sunday and update when this issue can be resolved.
r
Saw this ticket had been closed, so I assembled this and reran it. I can see the data is written to S3, but it's not registered with lakeFS:
Copy code
2021-07-08 20:15:08 [Executor task launch worker for task 1] INFO S3AFileSystem:V3: FS_OP_CREATE BUCKET[211459479356-databricks-data] FILE[s3a://acc-databricks-data/lakefs-testing/data/testing-master:7815b225-ef96-450a-b79e-08a45a73e112/rkrI2cQ27qEpbJysg7J46] Creating output stream; permission: rw-r--r--, overwrite: false, bufferSize: 4096
2021-07-08 20:15:08 [Executor task launch worker for task 1] INFO InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 25
2021-07-08 20:15:08 [Executor task launch worker for task 1] INFO S3ABlockOutputStream:V3: FS_OP_CREATE BUCKET[acc-databricks-data] FILE[lakefs-testing/data/testing-master:7815b225-ef96-450a-b79e-08a45a73e112/rkrI2cQ27qEpbJysg7J46] Closing stream; size: 579
2021-07-08 20:15:08 [Executor task launch worker for task 1] INFO S3ABlockOutputStream:V3: FS_OP_CREATE BUCKET[acc-databricks-data] FILE[lakefs-testing/data/testing-master:7815b225-ef96-450a-b79e-08a45a73e112/rkrI2cQ27qEpbJysg7J46] Upload complete; size: 579
2021-07-08 20:15:08 [Executor task launch worker for task 1] INFO BasicWriteTaskStatsTracker: Expected 1 files, but only saw 0. This could be due to the output format not writing empty files, or files being not immediately visible in the filesystem.
2021-07-08 20:15:08 [Executor task launch worker for task 1] INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 2114 bytes result sent to driver
I can read that path directly and it has the data; it's just not registered in the lakeFS ecosystem…
g
Hi @Richard Gilmore, I am reopening the ticket. Someone will give it a look tomorrow morning. (It's now 23:50 here.)
👍 1
t
Hi @Richard Gilmore, can you please try removing the trailing slash from the lakeFS endpoint URL:
Copy code
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "http://lakefs.dev.company.com/api/v1/")
It becomes:
Copy code
spark.sparkContext.hadoopConfiguration.set("fs.lakefs.endpoint", "http://lakefs.dev.company.com/api/v1")
Please let us know if this fixes the problem. If it does not, we will investigate further!
Yesterday, we discovered this bug https://github.com/treeverse/lakeFS/issues/2214 and I think that this may be the issue you are experiencing.
r
ah yep that solved it. I can write parquet, and it's 99% of the way there for Delta tables. It fails on the initial write of the Delta table when creating the delta log, but then is fine for subsequent updates to the table…
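(A sketch of the kind of initial Delta write being described, assuming the same lakefs:// layout and the Test case class from above; the path is a placeholder.)
Copy code
// Initial Delta write: this is the step that creates the _delta_log directory
val deltaPath = s"lakefs://${repo}/${master}/deltatest/"   // placeholder path
val deltaStory = Seq(Test("delta on lakeFS")).toDS
deltaStory.write.format("delta").mode("overwrite").save(deltaPath)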
t
Great! Thanks for letting me know. Would you kindly share the logs that report the failure?
r
yeah let me grab them here
t
Thanks
r
logs.txt
t
Thank you, I will look into it and update you on the next steps
b
Can you check that spark.databricks.delta.multiClusterWrites.enabled is set to false? Just to verify it is not related to this issue.
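(Checking and, if needed, setting that flag from the notebook session can be done with the standard spark.conf API; a sketch:)
Copy code
// Verify the current value of the Delta multi-cluster writes flag, then disable it for this session
println(spark.conf.get("spark.databricks.delta.multiClusterWrites.enabled", "not set"))
spark.conf.set("spark.databricks.delta.multiClusterWrites.enabled", "false")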
r
Another snippet further up where it identifies that no delta log exists:
Copy code
2021-07-09 10:19:33 [WRAPPER-ReplId-6237a-3a591-a91d7-7] TRACE LakeFSFileSystem[OPERATION]: open(lakefs://testing/master/deltatest/_delta_log/_last_checkpoint)
2021-07-09 10:19:33 [WRAPPER-ReplId-6237a-3a591-a91d7-7] TRACE LakeFSFileSystem[OPERATION]: exists(lakefs://testing/master/deltatest/_delta_log)
2021-07-09 10:19:33 [rpc-server-4-2]  INFO InitialSnapshot: [tableId=eb8b6bf5-ed4f-4d3f-ba09-b99c769cae20] Created snapshot InitialSnapshot(path=lakefs://testing/master/deltatest/_delta_log, version=-1, metadata=Metadata(7f335ec6-b67a-4445-a135-261caaf54ee7,null,null,Format(parquet,Map()),null,List(),Map(),Some(1625825973686)), logSegment=LogSegment(lakefs://testing/master/deltatest/_delta_log,-1,List(),List(),None,-1), checksumOpt=None)
yeah I had multi cluster writes disabled
👍 1
t
@Richard Gilmore I wanted to confirm - are you running into this failure only while using the lakeFSFileSystem on Databricks, that is, not while using the s3 gateway?
r
yeah, I can do the initial write using the s3 gateway and then subsequent writes using LakeFSFileSystem; it's just the initial creation of the _delta_log that is failing with LakeFSFileSystem
👍 1
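(For comparison, the S3 gateway route mentioned here is plain S3A pointed at the lakeFS server with lakeFS credentials and path-style access; a sketch with placeholder endpoint, keys, and paths, reusing the Test case class from above.)
Copy code
// S3 gateway: talk to the lakeFS server over the S3 protocol, addressed as s3a://<repo>/<branch>/<path>
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "https://bfs.lakefs.dev")   // placeholder lakeFS endpoint
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<lakefs key>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<lakefs secret>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
Seq(Test("via the gateway")).toDS.write.format("delta").mode("overwrite").save("s3a://testing/master/deltatest/")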
t
I opened this issue https://github.com/treeverse/lakeFS/issues/2219 to track that and will investigate it further. It may be related to the other open Databricks issue we are investigating, https://github.com/treeverse/lakeFS/issues/2111. Thanks for reporting it!
r
regarding #2111, I've been able to do reads on Spark 3
t
Were you able to read partitioned parquet? Our suspicion is that the issue only happens with partitioned data
r
ah, will test; my tables were not partitioned
t
Thanks! :)
r
Ok so ran a bit of testing here:
• parquet partitioned write 👎
• parquet partitioned read 👎
• delta partitioned write 👎
• delta partitioned read 👍 (confused about this one 😃)
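(For context, a partitioned round-trip test along these lines could look like the sketch below; the case class, column, and path are placeholders.)
Copy code
// Partitioned parquet round trip over lakefs:// (placeholder repo/branch/path)
case class Part(story: String, day: String)
val partPath = s"lakefs://${repo}/${master}/partitioned/"
val partData = Seq(Part("a", "2021-07-01"), Part("b", "2021-07-02")).toDS
partData.write.mode("overwrite").partitionBy("day").parquet(partPath)   // partitioned write
spark.read.parquet(partPath).where("day = '2021-07-01'").show()         // partitioned read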
t
Thank you so much for running these tests! I will update the issue with your findings (you are welcome to post a comment if you want!).