# help
u
Hi everyone, is there a way we can get the BigDL tool set up with lakeFS?
u
Hi Jude 🙂, Do you mean something like running DLlib (Spark) with lakeFS?
u
Yes, exactly
u
Cool, did you get a chance to read about our Spark integration?
u
I have a Jupyter notebook configured to communicate with lakeFS, where I can run Spark and other machine learning libraries. But I'm finding it difficult to set up DLlib to also run in that environment (the Jupyter notebook).
u
Yes, I did
u
Can you elaborate on what makes it difficult to set up DLlib to run in your Jupyter notebook?
u
I followed the guidelines on how to install DLlib and discovered I can install it with pip. The drawback is that I would have to unset SPARK_HOME in my .bashrc file, and if I do that I can no longer use Spark to communicate with lakeFS. So this way of setting up DLlib with pip obviously can't work for my case.
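For reference, this is roughly how I check which Spark my notebook actually picks up (just a sketch of my environment; paths and versions will differ):
```python
import os

# Show which Spark installation the notebook is pointed at via SPARK_HOME
print("SPARK_HOME:", os.environ.get("SPARK_HOME"))

# Show which pyspark the Python environment resolves to and its version;
# a mismatch between this and the SPARK_HOME install is what I'm worried about
import pyspark
print("pyspark:", pyspark.__version__, "from", pyspark.__file__)
```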
u
Do you mind sharing the command you use in your notebook to run the DLlib Spark job?
u
I used "pip install bigdl-spark3" (bigdl built on spark3). Then after it successfully installed I tried to verify the installation using the below command "from bigdl.orca import init_orca_context sc = init_orca_context() " After I run I get module not found error
u
Meanwhile, JAVA_HOME is already set in my environment, since as I understand it that is also required to run DLlib successfully.
u
So according to the official docs of DLlib, you must first start your program with something like:
```scala
val conf = Engine.createSparkConf()
  .setAppName("Train Lenet on MNIST")
  .set("spark.task.maxFailures", "1")
val sc = new SparkContext(conf)
Engine.init
```
Do you mind trying the following?
```scala
val conf = Engine.createSparkConf()
  .setAppName("Train Lenet on MNIST")
  .set("spark.task.maxFailures", "1")
val sc = new SparkContext(conf)

// Point the S3A filesystem at your lakeFS installation
sc.hadoopConfiguration.set("fs.s3a.endpoint", "<your lakeFS server endpoint>")
sc.hadoopConfiguration.set("fs.s3a.access.key", "<your lakeFS access key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<your lakeFS secret key>")
sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")

Engine.init
```
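And since you're working from a Python notebook, the equivalent in PySpark would look roughly like this (the endpoint and keys are placeholders, and the exact BigDL/Orca init call may differ by version, so treat it as a sketch):
```python
from bigdl.orca import init_orca_context

# Start BigDL/Orca on a local Spark and grab the SparkContext it creates
sc = init_orca_context(cluster_mode="local")

# Point the S3A filesystem at lakeFS, mirroring the Scala snippet above
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", "<your lakeFS server endpoint>")
hadoop_conf.set("fs.s3a.access.key", "<your lakeFS access key>")
hadoop_conf.set("fs.s3a.secret.key", "<your lakeFS secret key>")
hadoop_conf.set("fs.s3a.path.style.access", "true")
```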
u
Okay, I will look into this. But I'm more concerned about whether BigDL actually works after installation. Can you check out the Python guide doc to see what I mean? I'll try this and give you feedback.
u
The change I'm suggesting will only direct Spark to the right location (lakeFS's endpoint). Unfortunately, I'm no BigDL expert, nor am I familiar enough with the library to help you verify that it works after installation... Do you have a DLlib example app that runs successfully with S3 as its backing object store?
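For reference, once the S3A settings above are in place, reading from lakeFS looks roughly like reading from S3, with the repository and branch as the first path components (the repo, branch, and path names here are just made-up examples):
```python
# lakeFS exposes an S3-compatible endpoint, so paths take the form
# s3a://<repository>/<branch>/<path>
rdd = sc.textFile("s3a://example-repo/main/datasets/sample.csv")
print(rdd.take(5))
```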
u
Actually, I don't have any at the moment; I'm still trying to figure that out. I think the issue here is that DLlib runs its own separate Spark and Jupyter notebook once it is installed, and I already have those running in my environment.
u
So it sounds like you've got some discovery work to do 🧐 I think that it would be useful for you to reach out to the BigDL community to get help initializing your environment. When you have a simple app that works with S3 or other object stores, please reach out, and we would be glad to help you integrate your app with lakeFS 🙂
u
Thank you very much for the tip. I will surely do that.