# help
a
Hello! I'm trying to query data from lakeFS, running locally, using PySpark. However, I get "java.lang.RuntimeException: java.lang.ClassNotFoundException: Class io.lakefs.LakeFSFileSystem not found". I am running a Spark shell with --packages io.lakefs:hadoop-lakefs-assembly:0.1.12 as per the doc https://docs.lakefs.io/integrations/spark.html. Could someone explain why LakeFSFileSystem might not be found?
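For context, a spark-shell invocation following that doc would look roughly like this (a sketch; the endpoint, credentials, and values below are placeholders, not settings from this thread):

```shell
spark-shell --packages io.lakefs:hadoop-lakefs-assembly:0.1.12 \
  --conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem \
  --conf spark.hadoop.fs.lakefs.endpoint=https://lakefs.example.com/api/v1 \
  --conf spark.hadoop.fs.lakefs.access.key=AKIAlakefs12345EXAMPLE \
  --conf spark.hadoop.fs.lakefs.secret.key=abc/lakefs/1234567bPxRfiCYEXAMPLEKEY
```

Without the fs.lakefs.impl setting, Hadoop has no mapping from the lakefs:// URI scheme to io.lakefs.LakeFSFileSystem, which can also surface as a ClassNotFoundException.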
y
Hey @Asher Song, just to rule out any maven issues, could you try replacing the
--packages <...>
flag with:
// wrong jar was here
@Asher Song I used the wrong jar above, please use:
--jars http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/hadoop/hadoop-lakefs-assembly-0.1.12.jar
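One way to rule out a bad or empty jar is to download it and check that the class is actually inside the assembly (a debugging sketch; assumes curl and unzip are available locally):

```shell
# Fetch the assembly jar directly
curl -sO http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/hadoop/hadoop-lakefs-assembly-0.1.12.jar

# A jar is a zip archive; list its entries and look for the filesystem class
unzip -l hadoop-lakefs-assembly-0.1.12.jar | grep LakeFSFileSystem
```

If the grep finds io/lakefs/LakeFSFileSystem.class, the jar is fine and the problem is how it reaches the Spark classpath.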
a
I still seem to get the same error
c
I haven't tried building the jar, but I have been able to copy the assembled jar into the jars/ folder to access lakeFS paths from Spark. Documentation here: https://docs.lakefs.io/reference/spark-client.html
y
@Chandra Akkinepalli note that this doc refers to the lakeFS metadata client, which allows listing files from lakeFS but not reading file contents
c
@Yoni Augarten, thanks for mentioning that.
y
Sure šŸ˜Š
@Asher Song, thanks for letting me know. Let me try to reproduce this
@Asher Song I managed to get to the same error simply by chance, when I specified more than a single
--packages
flag in my command. Could this be your case as well?
c
@Yoni Augarten, I am able to run something like this
spark-submit spark_changes.py --jars http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/hadoop/hadoop-lakefs-assembly-0.1.12.jar
and read from lakeFS. I was able to follow the instructions on https://docs.lakefs.io/integrations/spark.html and created an hdfs-site.xml with the appropriate settings.
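For reference, a sketch of what such an hdfs-site.xml could contain (property names follow the lakeFS Spark integration doc linked above; the endpoint and credentials are placeholders):

```xml
<configuration>
  <!-- Map the lakefs:// scheme to the lakeFS Hadoop FileSystem -->
  <property>
    <name>fs.lakefs.impl</name>
    <value>io.lakefs.LakeFSFileSystem</value>
  </property>
  <!-- lakeFS API endpoint and credentials (placeholder values) -->
  <property>
    <name>fs.lakefs.endpoint</name>
    <value>https://lakefs.example.com/api/v1</value>
  </property>
  <property>
    <name>fs.lakefs.access.key</name>
    <value>AKIAlakefs12345EXAMPLE</value>
  </property>
  <property>
    <name>fs.lakefs.secret.key</name>
    <value>abc/lakefs/1234567bPxRfiCYEXAMPLEKEY</value>
  </property>
</configuration>
```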
šŸ™šŸ» 1
a
@Yoni Augarten I'm only using a single --packages flag and I'm still getting the same error.
Does my data have to be hosted on an S3 platform for the configuration I am using (Hadoop filesystem)?
a
Hi, sorry to hear you're having difficulties. Right now lakeFSFS is only supported when lakeFS uses S3 or an S3-compatible backing store: lakeFSFS itself will directly access that backing store. What backing store do you have? We're actively seeking people to run an upcoming beta (don't tell anyone... šŸ˜‡). Also, it's a holiday over here, so I apologize in advance for longer than usual response times.
a
Oh I see. Currently my data is stored locally on my machine
Is there a way to connect lakefs to spark if your data is stored locally?
a
Local data storage will not work with a distributed system like Spark, at least not without a significant amount of configuration work. You might look at the lakeFS "Everything Bagel", or a playground account. Again, sorry, but I don't expect us to provide good response times for the upcoming day.
e
Another possibility (if your situation allows it) is to use AWS, or any other cloud provider with an S3-compatible object store (which is more or less all of them), and rely on the free tier for evaluation purposes.
e
You can also consider a local installation of MinIO.
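A sketch of what that could look like: run MinIO locally as an S3-compatible object store and point lakeFS at it via its s3 blockstore settings. The Docker flags and environment variable names below are illustrative; check the MinIO and lakeFS configuration docs for the authoritative ones.

```shell
# Start MinIO locally (default root credentials shown are for local testing only)
docker run -d -p 9000:9000 --name minio \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio server /data

# Configure lakeFS to use the s3 blockstore with a custom endpoint
export LAKEFS_BLOCKSTORE_TYPE=s3
export LAKEFS_BLOCKSTORE_S3_ENDPOINT=http://localhost:9000
export LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true
export LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID=minioadmin
export LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY=minioadmin
lakefs run
```

With this setup, Spark reads go through lakeFSFS to lakeFS, which stores objects in the local MinIO instance instead of AWS S3.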