# help
t
Hi everyone, glad to be here. I want to deploy lakeFS on DigitalOcean using Kubernetes or Docker. Is there any documentation that can guide me on achieving that? I intend to have it point to storage on DigitalOcean and also have Delta Lake integrated.
t
Hi @Temilola Onaneye! Welcome to the lake! These would be the relevant documentation pages for deploying lakeFS:
• Deploy lakeFS on Kubernetes
• Deploy lakeFS on Docker
As for connecting lakeFS to (object) storage on DigitalOcean - we haven't tested it, and we would love to hear how it goes and assist in any way possible. What storage are you planning to use?
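If it helps, here is a minimal sketch of the Docker route, assuming the official treeverse/lakefs image and the environment-variable form of the same config keys; the PostgreSQL connection string and secret below are placeholders, not values from this thread:
Copy code
docker run --name lakefs -p 8000:8000 \
  -e LAKEFS_DATABASE_CONNECTION_STRING="postgres://user:password@<postgres-host>:5432/postgres" \
  -e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="<some-random-secret>" \
  -e LAKEFS_BLOCKSTORE_TYPE="s3" \
  treeverse/lakefs:latest run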
You mentioned that you want Delta Lake integrated, so you may want to check out our Delta Lake-lakeFS integration docs.
t
Yeah sure, I'll be sure to use this thread for updates. Many thanks!
t
Sure, thank you!
t
@Sadiq here is the thread to use
s
Hi @Tal Sofer, for more context: the config.yml file has been modified and a bucket object has been created.
We have been able to get the lakefs service running
Here's the status when this command is run, connecting to the storage bucket on DigitalOcean:
Copy code
INFO[0000] /home/runner/work/lakeFS/lakeFS/cmd/lakefs/cmd/root.go:61 github.com/treeverse/lakefs/cmd/lakefs/cmd.initConfig() Configuration file fields.file=/home/esther/config.yaml file=/home/esther/config.yaml phase=startup
INFO [2022-03-21T174019Z] lakeFS/cmd/lakefs/cmd/root.go:103 cmd/lakefs/cmd.initConfig Config loaded fields.file=/home/esther/config.yaml file=/home/esther/config.yaml phase=startup
INFO [2022-03-21T174019Z] lakeFS/cmd/lakefs/cmd/root.go:110 cmd/lakefs/cmd.initConfig Config actions.enabled=true auth.cache.enabled=true auth.cache.jitter=3s auth.cache.size=1024 auth.cache.ttl=20s auth.encrypt.secret_key="******" blockstore.azure.auth_method=access-key blockstore.azure.storage_access_key="" blockstore.azure.storage_account="" blockstore.azure.try_timeout=10m0s blockstore.default_namespace_prefix="" blockstore.gs.credentials_file="" blockstore.gs.credentials_json="" blockstore.gs.s3_endpoint="https://storage.googleapis.com" blockstore.local.path="~/data/lakefs/block" blockstore.s3.credentials_file="" blockstore.s3.discover_bucket_region=true blockstore.s3.endpoint="" blockstore.s3.force_path_style=false blockstore.s3.max_retries=5 blockstore.s3.profile="" blockstore.s3.region=us-east-1 blockstore.s3.streaming_chunk_size=1048576 blockstore.s3.streaming_chunk_timeout=1s blockstore.type=nyc3.digitaloceanspaces.com committed.block_storage_prefix=_lakefs committed.local_cache.dir="~/data/lakefs/cache" committed.local_cache.max_uploaders_per_writer=10 committed.local_cache.metarange_proportion=0.1 committed.local_cache.range_proportion=0.9 committed.local_cache.size_bytes=1073741824 committed.permanent.max_range_size_bytes=20971520 committed.permanent.min_range_size_bytes=0 committed.permanent.range_raggedness_entries=50000 committed.sstable.memory.cache_size_bytes=400000000 database.connection_max_lifetime=0s database.connection_string="******" database.max_idle_connections=0 database.max_open_connections=0 fields.file=/home/esther/config.yaml file=/home/esther/config.yaml gateways.s3.domain_name="[s3.local.lakefs.io]" gateways.s3.fallback_url="" gateways.s3.region=us-east-1 installation.fixed_id="" listen_address="0.0.0.0:8000" logging.file_max_size_mb=0 logging.files_keep=100 logging.format=text logging.level=INFO logging.output="[-]" logging.trace_request_headers=false phase=startup security.audit_check_interval=12h0m0s security.audit_check_url="https://audit.lakefs.io/audit" stats.address="https://stats.treeverse.io" stats.enabled=true stats.flush_interval=30s
INFO [2022-03-21T174019Z] lakeFS/cmd/lakefs/cmd/run.go:104 cmd/lakefs/cmd.glob..func8 lakeFS run version=0.61.0
INFO [2022-03-21T174019Z] lakeFS/pkg/db/connect.go:54 pkg/db.ConnectDBPool Connecting to the DB conn_max_lifetime=5m0s driver=pgx max_idle_conns=25 max_open_conns=
Can you guide us so we know whether we are on the right path to making this work?
g
Hey @Sadiq, I believe you mixed up the configuration a bit
blockstore.type should be one of ["local", "s3", "gs", "azure", "mem"]
I assume you are using S3; if so, you don't need the Azure configuration or the Google Storage configuration. You can check out https://docs.lakefs.io/reference/configuration.html#reference for more information about the configuration.
s
So the deal is actually to make it work using a different block store from the regular ones.
What's your advice on that?
g
IIUC you are using DigitalOcean Spaces. In that case you would want to configure S3 as your block store and use the DigitalOcean endpoint (https://nyc3.digitaloceanspaces.com).
A good example for that would be MinIO. In your case:
Copy code
endpoint: https://nyc3.digitaloceanspaces.com
access_key_id: <SPACES_KEY>
secret_access_key: <SPACES_SECRET>
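Putting those together, a sketch of the blockstore section of config.yaml for DigitalOcean Spaces might look like this (the key placeholders are assumptions, not values from this thread):
Copy code
blockstore:
  type: s3
  s3:
    force_path_style: true
    endpoint: https://nyc3.digitaloceanspaces.com
    discover_bucket_region: false
    credentials:
      access_key_id: <SPACES_KEY>
      secret_access_key: <SPACES_SECRET>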
t
Hi @Guy Hardonag and @Tal Sofer. For an update: we have made significant progress, but we are currently unable to create a repo - we get an error every time we try, and it is due to the storage namespace. We put in our block storage endpoint but it doesn't work. See the screenshot below.
b
"When using S3-focused tools, keep in mind that S3 terminology differs from DigitalOcean terminology. An S3 “bucket” is the equivalent of an individual Space and an S3 “key” is the name of a file."
Can you try setting the repository's storage namespace to 's3://<space>' or 's3://<space>/<path under the space where you'd like lakeFS to store your data>'?
@Temilola Onaneye I think 'talent-graph-storage-lakefs' is your Space, so you will need to remove it from the endpoint address.
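For example, a hedged sketch with lakectl (the repo name and path are placeholders; 'talent-graph-storage-lakefs' is the Space mentioned above):
Copy code
lakectl repo create lakefs://my-repo s3://talent-graph-storage-lakefs/lakefs-data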
t
Hi @Barak Amar, we tried this and it still didn't work. See screenshot below.
Screen Shot 2022-03-30 at 11.07.25.png
b
For this case - can you share the endpoint configured and the log lakefs generated after this request? It will help identify why it failed.
t
The structure of our YAML file while trying to use DigitalOcean looks like this:
Copy code
database:
  connection_string: "<connection-strings>"
auth:
  encrypt:
    secret_key: "[Secret-keys]"
blockstore:
  type: s3
  s3:
    force_path_style: true
    endpoint: nyc3.digitaloceanspaces.com
    discover_bucket_region: false
    credentials:
      access_key_id: [key-id]
      secret_access_key: [access-key]
b
Can you update the endpoint to
Copy code
https://nyc3.digitaloceanspaces.com
The scheme should matter there.
t
Hi @Barak Amar, is it possible to get on a call with you, as we have been on this for days? You can share your calendar privately so we know what time works for you.
Thanks @Barak Amar @Tal Sofer @Guy Hardonag, it works now! We have been able to set up lakeFS on DigitalOcean using DigitalOcean Spaces storage.
@Nnolum Esther here is the thread
Hi @Barak Amar @Tal Sofer @Guy Hardonag The journey continues. The next line of action for us is to have Spark and Delta Lake integrated with lakeFS. We followed the setup on the page but it doesn't work.
MicrosoftTeams-image (9).png
@Nnolum Esther @Sadiq
l
Hi, correct me if I'm wrong, but I can see that you are using the lakeFS HadoopFS. Can you please tell me what your usage of Spark was before lakeFS? What did the path URI look like? Does it start with s3a://? I'm asking since I'd like to know if you are using the S3A file system that the HadoopFS supports. (If not, you should connect Spark with lakeFS through the S3 gateway.)
n
We first tried following the steps in "Accessing lakeFS using the S3A gateway" (https://docs.lakefs.io/integrations/spark.html#access-lakefs-using-the-s3a-gateway) and it showed a similar error while trying to read the files (image below).
t
Hi @Lynn Rozen , so this is the first time we are trying to use the S3 storage on Digital Ocean with Spark and LakeFS. Our endpoint looks like "https://nyc3.digitaloceanspaces.com". We tried using the s3a connection but we get these errors.
l
Hi, thanks for the update! I have some follow-up questions in order to fully understand the issue. 🙂 Which Spark version and Hadoop version do you use? How does your environment look without lakeFS? Did you manage to run the same Spark command directly against your storage? It seems like you need to add the hadoop-aws package (which has to have the same version as hadoop-common).
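For example, a sketch of pulling it in at launch time (the version below is an assumption - it should match the Hadoop version bundled with your Spark distribution, e.g. 3.3.1 for Spark 3.2.1):
Copy code
spark-shell --packages org.apache.hadoop:hadoop-aws:3.3.1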
t
Spark version - 3.2.1
Hadoop version - 3.3. Without lakeFS, we have block storage on DigitalOcean and Spark installed. We first started out by deploying lakeFS to work with our object storage on DigitalOcean, which was successful after many trials. Now we want to integrate Spark with lakeFS, and also integrate Delta Lake with lakeFS.
l
Can you please share your spark configuration (specifically lakeFS configuration)? I'd also like to know if adding the hadoop-aws package helped.
t
Hi @Lynn Rozen, after adding the hadoop-aws package to the jars folder we have this new error:
Screen Shot 2022-04-05 at 19.03.53.png,Screen Shot 2022-04-05 at 19.04.09.png
And here is the Spark configuration template:
Copy code
spark-shell --conf spark.hadoop.fs.s3a.access.key='S3-STORAGE-ACCESS-KEY' \
  --conf spark.hadoop.fs.s3a.secret.key='S3-STORAGE-SECRET-KEY' \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.endpoint='https://nyc3.digitaloceanspaces.com' ...
l
OK, let's configure Spark and lakeFS through the S3 gateway. In that case, the credentials and the endpoint should be those of lakeFS. You should include these properties:
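Roughly, a sketch of the usual S3-gateway properties (the lakeFS endpoint and key placeholders are assumptions - use your lakeFS server's address and an access key created in lakeFS):
Copy code
spark.hadoop.fs.s3a.access.key=<LAKEFS_ACCESS_KEY_ID>
spark.hadoop.fs.s3a.secret.key=<LAKEFS_SECRET_ACCESS_KEY>
spark.hadoop.fs.s3a.endpoint=https://<lakefs-endpoint>
spark.hadoop.fs.s3a.path.style.access=true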
Please make sure that you don't include lakeFS hadoop FS in your configuration 🙂
t
Hi @Lynn Rozen, thanks so much, it works now! Many thanks for the help. What advice can you share if we want to have this saved permanently somewhere and used at a larger scale?
We also want to be able to use this when writing PySpark from a Jupyter notebook, and we want to integrate with Delta Lake too.
l
How do you run Spark now?
t
Hi Lynn
So we are currently able to run Spark from our Jupyter notebook using PySpark.
See the screenshot below to check whether we are on track with the configuration needed to access lakeFS.
Below is the log
l
I believe that if the configuration worked for you, it should also work with PySpark. Is the log above a new error you are getting? Are you also trying to access AWS from the notebook?
t
I get the error above when I run that cell of the notebook
We are not using AWS in any way; our block storage is on DigitalOcean.
I think setting the lakeFS credentials permanently in a config file should handle this.
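One hedged way to make that permanent, assuming a standard Spark install, is to put the same S3A properties into conf/spark-defaults.conf so every session (spark-shell, PySpark, Jupyter) picks them up; the endpoint and keys below are placeholders:
Copy code
spark.hadoop.fs.s3a.access.key         <LAKEFS_ACCESS_KEY_ID>
spark.hadoop.fs.s3a.secret.key         <LAKEFS_SECRET_ACCESS_KEY>
spark.hadoop.fs.s3a.endpoint           https://<lakefs-endpoint>
spark.hadoop.fs.s3a.path.style.access  true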
Hi @Lynn Rozen @Barak Amar @Guy Hardonag, still stuck on this. Please, how can I access lakeFS using PySpark?
l
Hi 🙂 Can you please tell me what changed between the first time it worked and now? Where did you run it the first time?
t
So the first time it worked, I ran it using spark-shell in the terminal. Right now I am trying to access it from inside a Jupyter notebook using PySpark.
I have this now
l
Can you please tell me what command you are running? I can't reproduce the error. Also, I suggest you print the values of the configs to make sure they are set correctly.
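For reference, a sketch of setting and then printing those configs from PySpark in a notebook (the endpoint, keys, repo and branch names are placeholders, not values from this thread):
Copy code
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakefs-s3a-check")
    # Point S3A at the lakeFS S3 gateway, not at DigitalOcean Spaces
    .config("spark.hadoop.fs.s3a.access.key", "<LAKEFS_ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<LAKEFS_SECRET_ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.endpoint", "https://<lakefs-endpoint>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Print the effective values back to verify they were set
hconf = spark.sparkContext._jsc.hadoopConfiguration()
for key in ["fs.s3a.endpoint", "fs.s3a.path.style.access"]:
    print(key, "=", hconf.get(key))

# Example read through the gateway: s3a://<repo>/<branch>/path
# df = spark.read.parquet("s3a://<repo>/<branch>/path/to/data")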
t
Hi @Lynn Rozen, so apparently I was missing something in the Spark configuration. It works now! I can access data in a lakeFS repo using PySpark from a Jupyter notebook.
l
Great, thanks for the update!
t
I'll continue the journey tomorrow as we look at how to get Delta Lake configured.
Many thanks for the help
Thanks everyone, we have also been able to integrate Delta Lake and use it with lakeFS.
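For anyone following along, a rough sketch of what a Delta-on-lakeFS PySpark session can look like through the S3 gateway (the delta-core version is an assumption for Spark 3.2.x, and the endpoint, keys, repo and branch names are placeholders):
Copy code
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-on-lakefs")
    # Delta Lake support (delta-core 1.2.x targets Spark 3.2.x)
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.2.1")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # lakeFS S3 gateway instead of the underlying object store
    .config("spark.hadoop.fs.s3a.access.key", "<LAKEFS_ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<LAKEFS_SECRET_ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.endpoint", "https://<lakefs-endpoint>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Write and read back a small Delta table on a lakeFS branch
spark.range(10).write.format("delta").mode("overwrite").save("s3a://<repo>/<branch>/tables/demo")
spark.read.format("delta").load("s3a://<repo>/<branch>/tables/demo").show()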
Thanks @Lynn Rozen @Barak Amar @Guy Hardonag and @Tal Sofer for the journey, from setting up lakeFS on DigitalOcean to integrating with Spark and Delta Lake.
l
Sure!