# help
Hello everyone. Please advise what I can investigate in this case.

Given: openjdk-8, Spark 2.4, Scala 2.11, Hadoop 2.7, AWS SDK 1.11, io.lakefs:api-client:0.57.2 and io.lakefs:lakefs-assembly:0.1.6.

The Parquet data in the lakeFS branch totals more than 1.5 TiB; the job reads it, computes, and writes the result of the computation back to lakeFS. This works fine with a Spark cluster of about 35 executors and 900 GB, but the issue appears when working with the full data after a cluster upgrade: part of the data (50-70%) may be processed, yet processing of the full data set is aborted.

New cluster info: openjdk-17 (Amazon Corretto), Spark 3.5 with the Magic committer, Scala 2.12, Hadoop 3.4, AWS SDK 1.12, io.lakefs:api-client:1.15.0 and io.lakefs:lakefs-assembly:0.2.3.
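For context, a minimal sketch of this kind of read/compute/write round-trip against lakeFS, assuming the lakeFS Hadoop FileSystem is configured via the standard fs.lakefs.* properties; the endpoint, credentials, repository, branch, paths, and the computation itself are placeholders, not the real job:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch only: the endpoint, credentials, repository, branch and
// paths below are placeholders.
object LakeFSRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lakefs-parquet-roundtrip")
      // Route lakefs:// URIs through the lakeFS Hadoop FileSystem
      .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
      .config("spark.hadoop.fs.lakefs.endpoint", "https://lakefs.example.com/api/v1")
      .config("spark.hadoop.fs.lakefs.access.key", sys.env("LAKEFS_ACCESS_KEY"))
      .config("spark.hadoop.fs.lakefs.secret.key", sys.env("LAKEFS_SECRET_KEY"))
      .getOrCreate()

    // Read the >1.5 TiB Parquet dataset from the branch
    val input = spark.read.parquet("lakefs://example-repo/main/path/to/input")

    // Stand-in for the real computation
    val result = input.groupBy("some_column").count()

    // Write the result back. With a rename-based output committer, the final
    // commit moves files out of the _temporary directory, which on lakeFS
    // surfaces as renameObject (copy + delete) -- the call that fails in the
    // stack trace below.
    result.write.mode("overwrite").parquet("lakefs://example-repo/main/path/to/output")

    spark.stop()
  }
}
```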
The issue is related to the rename step (copy and delete); the socket is closed:
```
Caused by: java.io.IOException: renameObject: src:<lakefs://some/path/to/temp/_temporary/0/_temporary/attempt_some/path/to/file.parquet>, dst: <lakefs://some/path/to/final/dst/file.parquet>, call to copyObject failed
...
Caused by: io.lakefs.hadoop.shade.sdk.ApiException: Message: java.net.SocketTimeoutException: timeout
...
Caused by: java.net.SocketTimeoutException: timeout
...
Caused by: java.net.SocketException: Socket closed
```
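For what it's worth, the failing operation can also be exercised outside a full Spark job with a plain Hadoop FileSystem rename on a lakefs:// path, which appears to go through the same renameObject (copy + delete) code path shown in the stack trace. This is only a sketch: it assumes the fs.lakefs.* settings above are already present on the classpath (e.g. core-site.xml or Spark defaults), and the repository, branch, and paths are hypothetical:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: assumes fs.lakefs.impl / endpoint / credentials are already
// configured elsewhere; repository, branch and paths are made up.
object RenameCheck {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val fs = FileSystem.get(new URI("lakefs://example-repo/main/"), conf)

    val src = new Path("lakefs://example-repo/main/tmp/part-00000.parquet")
    val dst = new Path("lakefs://example-repo/main/final/part-00000.parquet")

    // rename() on the lakeFS FileSystem is what surfaces as
    // "renameObject ... call to copyObject failed" in the error above.
    val renamed = fs.rename(src, dst)
    println(s"rename returned $renamed")

    fs.close()
  }
}
```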
Hi @Vasyl Klindukhov, that's quite a large Spark cluster! Some questions that may help get the investigation started:
• Is your lakeFS an on-prem or a cloud installation?
• Could you please share some logs and/or configuration from the lakeFS side?
• Do you have any monitoring data from the lakeFS server around the time this happened?
Obviously this kind of thing is easiest if it's a lakeFS Cloud installation, because we monitor and manage that. If you feel more comfortable sending me this information privately, please do!