# help
u
Hello, I am trying to use the Java API in a Scala codebase, and I am getting an error that was already reported by a user: https://lakefs.slack.com/archives/C02CV7MUV4G/p1666253775100359?thread_ts=1666251707.039089&cid=C02CV7MUV4G In my case I am using SBT instead of Maven and don't have any other dependencies using okhttp3. What could be causing this issue? I am using the sbt-assembly plugin to create an uber-jar; I don't know if I need to be careful with some merge rule, but I don't see any dependency conflicts
u
Hey @Alessandro Mangone, I'm looking into it. Which package did you specify in the SBT build? Is it io.lakefs:api-client:0.86.0?
u
I am using these libraries:
"io.lakefs" % "hadoop-lakefs-assembly" % "0.1.9",
"io.lakefs" %% "lakefs-spark-client-312-hadoop3" % "0.5.1"
which include version 0.56.0 of the API client
u
But trying to run sbt assembly with those settings I run out of heap space, maybe there's something wrong
u
Are you going to use the above API inside a Spark job?
u
yes
u
The shading instructions were to prevent conflicts with other okhttp implementations when supplying the jar as a library
u
You can first try to use the client package without any shading and submit the job
u
so you suggest directly importing the API client instead of using the Hadoop/Spark integrations?
u
but even that dependency will contain okhttp3
u
Are you going to use the lakeFS hadoop implementation for direct upload?
u
if it is not the case - yes, I suggest you include the client first
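A minimal build.sbt sketch of that suggestion, assuming the plain Java client artifact mentioned earlier in the thread (io.lakefs:api-client) at the 0.56.0 version quoted above; adjust the version to whatever matches your lakeFS server:
// build.sbt (sketch): depend on the plain lakeFS API client only,
// without the Hadoop/Spark integrations, to verify the job submits and runs.
libraryDependencies += "io.lakefs" % "api-client" % "0.56.0"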
u
There will be no shading, and the package and its dependencies will be used inside your Spark job
u
I assume the job will work directly with lakeFS through the S3 gateway
u
I am using S3 so I wanted to try the optimized lakefs upload
u
In such a way that lakefs only handles metadata but uploads are directly submitted to s3
u
I see - so can you try to import the hadoop-lakefs-assembly
u
The jar will include the client and the shaded dependency
u
when you import the assembly the client should be shaded under io.lakefs.shaded.api
u
I assume that if you instantiate this one and call the API it should work, as all the dependencies are shaded.
u
Ok, I'll give it a try. Some methods seem to require some extra arguments. I'll get back as soon as I manage to assemble and test the code
u
After some tests I managed to configure the job, but I am getting a new error:
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: fs.AbstractFileSystem.lakefs.impl=null: No AbstractFileSystem configured for scheme: lakefs
But it is set in the sparkconfig?
("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
u
If you are using spark-submit --conf ..., the key is spark.hadoop.fs.lakefs.impl. From inside Spark, when you set the key directly (on the Hadoop configuration), it is fs.lakefs.impl.
u
it depends on how you set conf keys: whether you do it directly in the Hadoop configuration or in the configuration of the SparkSession
u
yes
u
The tabs in the documentation link I posted show some examples
u
SparkSession.builder().master("local[2]").config("spark.hadoop.fs.lakefs.impl","io.lakefs.LakeFSFileSystem")
this should work as well
u
but I’ll try the other way with hadoopConfiguration
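For reference, a minimal Scala sketch of that other way, setting the keys on the Hadoop configuration of an existing session (note that the spark.hadoop. prefix is dropped there); the access key, secret key and endpoint below are placeholders:
// Sketch: configure the lakeFS FileSystem directly on the Hadoop configuration
// of an existing SparkSession instead of via spark.hadoop.* properties.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
hadoopConf.set("fs.lakefs.access.key", "<lakefs-access-key>")   // placeholder
hadoopConf.set("fs.lakefs.secret.key", "<lakefs-secret-key>")   // placeholder
hadoopConf.set("fs.lakefs.endpoint", "https://lakefs.example.com/api/v1")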
u
It will probably do the same - I wasn't sure what 'sparkconfig' meant for you.
u
So the question is why the spark cluster doesn't get these properties
u
or maybe I should also ask if I'm implementing it the right way, because it is a little bit confusing when I should use lakefs://, s3:// or s3a://. When I initialize the ApiClient, the scheme is set to lakefs, but my namespace is set to s3://bucket…/mycollection
u
and then when I use Spark read/write, I use lakefs://
u
The API client works over HTTP - so it will get https://<lakefs endpoint>
u
The s3 and s3a schemes should be configured to work directly with S3 and have credentials to access the lakeFS bucket
u
lakefs:// should be used when reading/writing data from Spark - it will use io.lakefs.LakeFSFileSystem to get the metadata from lakeFS and s3a for the data
u
so the first test I'll write is to just write some data (csv/parquet) to lakeFS using lakefs://<repo>/<branch>/location
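As a sketch, that first test could look like this in Scala, assuming the fs.lakefs.* credentials and endpoint are configured as discussed just below; the repository ("example-repo") and branch ("main") are hypothetical:
// Sketch: write a tiny DataFrame as Parquet to a lakeFS path and read it back.
import spark.implicits._
val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
df.write.mode("overwrite").parquet("lakefs://example-repo/main/first-test/")
spark.read.parquet("lakefs://example-repo/main/first-test/").show()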
u
lakefs:// should work only if we set the credentials and endpoint (spark.hadoop.fs.lakefs.endpoint=https://lakefs.example.com/api/v1)
u
spark-shell --conf spark.hadoop.fs.s3a.access.key='AKIAIOSFODNN7EXAMPLE' \
              --conf spark.hadoop.fs.s3a.secret.key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY' \
              --conf spark.hadoop.fs.s3a.endpoint='https://s3.eu-central-1.amazonaws.com' \
              --conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem \
              --conf spark.hadoop.fs.lakefs.access.key=AKIAlakefs12345EXAMPLE \
              --conf spark.hadoop.fs.lakefs.secret.key=abc/lakefs/1234567bPxRfiCYEXAMPLEKEY \
              --conf spark.hadoop.fs.lakefs.endpoint=https://lakefs.example.com/api/v1 \
              --packages io.lakefs:hadoop-lakefs-assembly:0.1.8
              ...
u
something like the example should work
u
yes, this works quite well with your playground, but I am struggling to make it work with AWS because of the lakefs filesystem 🙂
u
and in aws you have so many different prefixes
u
the lakefs:// implementation uses s3a:// addressing for reading/writing the data
u
you can drop the 'spark.hadoop.fs.s3a.endpoint' from the above example - it will use the default
u
from the last error you posted it looks like the property we set wasn't passed to where the code runs.
u
yes I suspect that too
u
can you try to submit your code with spark submit?
u
I am recompiling with the config set up using hadoopConfiguration
u
and will submit asap
u
now I am getting a new error:
Caused by: java.lang.ClassNotFoundException: Class io.lakefs.LakeFSFileSystem not found
u
can we apply these settings on the cluster level or using spark-submit?
u
I’ll try with the spark submit
u
with spark submit I’m getting the previous error:
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: fs.AbstractFileSystem.lakefs.impl=null: No AbstractFileSystem configured for scheme: lakefs
--conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem \
--conf spark.hadoop.fs.lakefs.access.key=XXX \
--conf "spark.hadoop.fs.lakefs.secret.key=XXX" \
--conf spark.hadoop.fs.lakefs.endpoint=http://lakefs-int.eu-central-1.dataservices.xxx.cloud/api/v1 \
--conf spark.hadoop.lakefs.api.url=http://lakefs-int.eu-central-1.dataservices.xxx.cloud/api/v1 \
--conf spark.hadoop.lakefs.api.access_key=XXX \
--conf "spark.hadoop.lakefs.api.secret_key=XXX" \
u
Looks like the same issue, as the cluster doesn't locate the implementation - can you paste the complete spark-submit command (without secrets)?
u
/opt/entrypoint.sh ${SPARK_HOME}/bin/spark-submit \
  --name {{workflow.name}} \
  --deploy-mode client \
  --master k8s://https://kubernetes.default.svc:443 \
  --conf spark.kubernetes.node.selector.NodeGroup=spark-worker \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=datalake-sa \
  --conf spark.kubernetes.authenticate.executor.serviceAccountName=datalake-sa \
  --conf spark.hadoop.fs.s3a.assumed.role.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
  --conf spark.kubernetes.container.image={{inputs.parameters.spark-image}} \
  --conf spark.kubernetes.driver.pod.name=${POD_NAME} \
  --conf spark.kubernetes.namespace=${POD_NAMESPACE} \
  --conf spark.driver.host=${POD_IP} \
  --conf spark.driver.port=7077 \
  --conf spark.driver.cores={{inputs.parameters.driver-cores}} \
  --conf spark.driver.memory={{inputs.parameters.driver-memory}} \
  --conf spark.kubernetes.driver.limit.cores={{inputs.parameters.driver-cores}} \
  --conf spark.kubernetes.driver.request.cores={{inputs.parameters.driver-cores}} \
  --conf spark.executor.instances={{inputs.parameters.executor-instances}} \
  --conf spark.executor.memory={{inputs.parameters.executor-memory}} \
  --conf spark.executor.cores={{inputs.parameters.executor-cores}} \
  --conf spark.kubernetes.executor.limit.cores={{inputs.parameters.executor-cores}} \
  --conf spark.kubernetes.executor.request.cores={{inputs.parameters.executor-cores}} \
  --conf spark.kubernetes.executor.label.workflow-name={{workflow.name}} \
  --conf spark.kubernetes.executor.deleteOnTermination=true \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --conf spark.ui.retainedTasks=10 \
  --conf spark.sql.ui.retainedExecutions=1 \
  --conf spark.kubernetes.memoryOverheadFactor=0.2 \
  --conf spark.sql.streaming.metricsEnabled=true \
  --class {{inputs.parameters.spark-class}} \
  --conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem \
  --conf spark.hadoop.fs.lakefs.access.key=XXX \
  --conf "spark.hadoop.fs.lakefs.secret.key=XXX" \
  --conf spark.hadoop.fs.lakefs.endpoint=http://lakefs-int.eu-central-1.dataservices.xxx.cloud/api/v1 \
  --conf spark.hadoop.lakefs.api.url=http://lakefs-int.eu-central-1.dataservices.xxx.cloud/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key=XXX \
  --conf "spark.hadoop.lakefs.api.secret_key=XXX" \
  local:///opt/spark/work-dir/test.jar \
  {{inputs.parameters.job-arguments}}
u
Is the test jar an uber-jar, or do we also need to pass --packages io.lakefs:hadoop-lakefs-assembly:0.1.9?
u
it’s the uber jar
u
Hey @Alessandro Mangone, are you accessing the FileSystem using the FileContext abstraction? If so, you would need to change all spark.hadoop.fs.lakefs.* configurations to spark.hadoop.fs.AbstractFileSystem.lakefs.*. We haven't tested the lakeFS Hadoop FileSystem with this abstraction, but I hope it all works out 🙂
u
no I am not, I am configuring it the way I pasted here - same as in your tutorials - but tomorrow I'll perform some additional tests
u
I'm asking about the code accessing the FileSystem, not the configurations. Can you share the code?
u
Alternatively, the full stacktrace leading to:
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: fs.AbstractFileSystem.lakefs.impl=null: No AbstractFileSystem configured for scheme: lakefs
u
I am not doing anything strange with the filesystem except configuring it in the spark config and then using it to write some data as in
df.write.format("delta").save("lakefs://myrepo/mybranch/sometable")
Here you can find the stacktrace:
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:245)
        at org.apache.spark.sql.delta.files.TransactionalWrite.$anonfun$writeFiles$3(TransactionalWrite.scala:342)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles(TransactionalWrite.scala:297)
        at org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles$(TransactionalWrite.scala:245)
        at org.apache.spark.sql.delta.OptimisticTransaction.writeFiles(OptimisticTransaction.scala:101)
        at org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles(TransactionalWrite.scala:212)
        at org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles$(TransactionalWrite.scala:209)
        at org.apache.spark.sql.delta.OptimisticTransaction.writeFiles(OptimisticTransaction.scala:101)
        at org.apache.spark.sql.delta.commands.WriteIntoDelta.write(WriteIntoDelta.scala:317)
        at org.apache.spark.sql.delta.commands.WriteIntoDelta.$anonfun$run$1(WriteIntoDelta.scala:98)
        at org.apache.spark.sql.delta.commands.WriteIntoDelta.$anonfun$run$1$adapted(WriteIntoDelta.scala:91)
        at org.apache.spark.sql.delta.DeltaLog.withNewTransaction(DeltaLog.scala:221)
        at org.apache.spark.sql.delta.commands.WriteIntoDelta.run(WriteIntoDelta.scala:91)
        at org.apache.spark.sql.delta.sources.DeltaDataSource.createRelation(DeltaDataSource.scala:159)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
        at org.ap…
u
Hey @Alessandro Mangone, thank you for the details! This seems like a different error, and it means that the uber-jar is not properly loaded into the classpath. Could you please try to download it from the following link, and include it in your submit command using the --jars flag? http://treeverse-clients-us-east.s3-website.us-east-1.amazonaws.com/hadoop/hadoop-lakefs-assembly-0.1.9.jar
u
I definitely need to review my assembly strategy, because by adding it with the --jars flag I get new nonsense errors 😐
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
u
The lakeFS FileSystem relies on having hadoop-aws in your classpath. We don't include it in our uber-jar since it is specific to your Hadoop version.
u
but it’s there 🙂 I am already using this codebase without any problem and I’m trying to integrate lakefs
u
I’ll review how I create the uberjar
u
is there maybe any file in META-INF that should be preserved?
u
You're creating an uber-jar for your project which includes the lakeFS FileSystem as a dependency?
u
yes
u
Perhaps you are shading hadoop-aws?
u
val devDependencies = Seq(
  "org.apache.hadoop" % "hadoop-aws" % "3.3.2",
  "org.apache.hadoop" % "hadoop-common" % "3.3.2",
  "com.amazonaws" % "aws-java-sdk" % awsSDKVersion,
  "com.amazonaws" % "aws-java-sdk-bundle" % awsSDKVersion,
  "org.apache.spark" %% "spark-core" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-streaming" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion,
  "org.apache.spark" %% "spark-avro" % sparkVersion,
  //"org.apache.spark" %% "spark-hadoop-cloud" % sparkVersion,
  "io.delta" %% "delta-core" % "2.1.0",
  "io.confluent" % "kafka-schema-registry-client" % "7.2.2",
  "com.lihaoyi" %% "requests" % "0.7.1",
  "com.lihaoyi" %% "upickle" % "2.0.0",
  "com.lihaoyi" %% "os-lib" % "0.8.1",
  "com.github.scopt" %% "scopt" % "4.1.0",
  "software.amazon.msk" % "aws-msk-iam-auth" % "1.1.5",
  "io.lakefs" % "hadoop-lakefs-assembly" % "0.1.9",
  //"io.lakefs" %% "lakefs-spark-client-312-hadoop3" % "0.5.1"
)
u
no not really, I don’t have any shade rules in place
u
Can you check that hadoop-aws is indeed included in your final jar?
u
I agree that it should be, according to the dependencies.
u
And you're running spark-submit with --master local?
u
no, I’m running spark on k8s, so
--master k8s://https://kubernetes.default.svc:443
u
Right, sorry, I misread the thread. Thank you for confirming. I would like to take some time to try and reproduce this. I will get back to you here with an answer.
u
By the way, can you explain why you are using --deploy-mode client?
u
to have the driver running in the pod itself
u
Can you please share the stacktrace leading to the error?
u
I want to determine whether it happens on the driver or workers
u
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
        at org.apache.spark.util.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:317)
        at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:273)
        at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:271)
        at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
        at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
        at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
        at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
        at org.apache.spark.util.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:271)
        at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$4(SparkSubmit.scala:364)
        at scala.Option.map(Option.scala:230)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:364)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:901)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
u
It seems like a failure while preparing the submit environment. Are you providing any of the spark-submit arguments as paths under s3a?
u
yes, the jar of lakefs as you suggested, I’m trying to load it from s3
u
If I use a path like s3:// I get the error “unsupported filesystem s3”
u
For now, let's try to download it to your local machine and include it as a local path
u
Once we get it working we will solve the download thing
u
actually
u
I tried using the http link and it works now
u
--jars http://treeverse-clients-us-east.s3-website.us-east-1.amazonaws.com/hadoop/hadoop-lakefs-assembly-0.1.9.jar \
u
so pre-loading it inside my docker image and using --jars local:// might work
u
actually, it failed again complaining about Delta writing LogStore on a non-hadoop filesystem
u
22/12/06 09:01:35 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
Exception in thread "main" java.io.IOException: The error typically occurs when the default LogStore implementation, that
is, HDFSLogStore, is used to write into a Delta table on a non-HDFS storage system.
In order to get the transactional ACID guarantees on table updates, you have to use the
correct implementation of LogStore that is appropriate for your storage system.
See https://docs.delta.io/latest/delta-storage.html for details.

....

Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: fs.AbstractFileSystem.lakefs.impl=null: No AbstractFileSystem configured for scheme: lakefs
        at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:177)
        at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:266)
        at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:342)
        at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:339)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.Subject.doAs(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
        at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:339)
        at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:465)
        at io.delta.storage.HDFSLogStore.writeInternal(HDFSLogStore.java:90)
        ... 94 more
u
It seems that this version of delta does use the FileContext abstraction I mentioned before. I haven't run into this before. You will need to add spark.hadoop.fs.AbstractFileSystem.lakefs.* parameters with the same values as your corresponding spark.hadoop.fs.lakefs.* configurations.
u
Sorry it took me this long to retry, but after also configuring the AbstractFileSystem properties I get a new error:
Caused by: java.lang.NoSuchMethodException: io.lakefs.LakeFSFileSystem.<init>(java.net.URI, org.apache.hadoop.conf.Configuration)
        at java.base/java.lang.Class.getConstructor0(Unknown Source)
        at java.base/java.lang.Class.getDeclaredConstructor(Unknown Source)
        at org.apache.hadoop.fs.AbstractFileSystem.newInstance(AbstractFileSystem.java:139)
        ... 104 more
u
If that matters, I am using the 2.1.0 open-source version of Delta
u
It seems like Delta uses this FileContext thing in order to achieve atomicity over HDFS, when writing the delta logstore. Let's try to use another logstore.
u
Can you try to add the following configuration:
spark.delta.logStore.lakefs.impl=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
u
If it doesn't work, I'll try to spin up a cluster with open source delta on my end
u
Will do, I was looking at the same docs page 🙂
u
Thanks so much! It is working now!
u
We've come a long way 🙂
u
You're welcome
u
I think this should be added to the docs as an additional setting for when the lakeFS Hadoop FileSystem is used with Delta?
u
Absolutely, I will open an issue for this
u
I removed all the additional --jars/--conf, etc. and can confirm it's not a problem with the uber-jar. Adding the logStore implementation is sufficient
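For anyone landing here later, a consolidated Scala sketch of the setup that ended up working in this thread, expressed as SparkSession configuration; the endpoint, keys, repository and branch are placeholders, and the logStore value is the one suggested above for open-source Delta 2.1.0:
// Sketch: lakeFS Hadoop FileSystem + open-source Delta, per this thread.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
  .config("spark.hadoop.fs.lakefs.access.key", "<lakefs-access-key>")   // placeholder
  .config("spark.hadoop.fs.lakefs.secret.key", "<lakefs-secret-key>")   // placeholder
  .config("spark.hadoop.fs.lakefs.endpoint", "https://lakefs.example.com/api/v1")
  // LogStore for the lakefs:// scheme, so Delta does not go through the
  // FileContext/AbstractFileSystem code path (the setting that resolved this thread).
  .config("spark.delta.logStore.lakefs.impl", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
  .getOrCreate()

// Example write of a Delta table to a hypothetical repository/branch.
spark.range(5).write.format("delta").save("lakefs://example-repo/main/delta-table")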