# help
t
Hi everyone, glad to be here. I want to deploy lakeFS on DigitalOcean using Kubernetes or Docker. Is there any documentation that can guide me on achieving that? I intend to have it point to storage on DigitalOcean and also have Delta Lake integrated.
t
Hi @Temilola Onaneye! Welcome to the lake! These would be the relevant documentation pages for deploying lakeFS:
• Deploy lakeFS on Kubernetes
• Deploy lakeFS on Docker
As for connecting lakeFS to (object) storage on DigitalOcean - we haven't tested it, and we would love to hear how it goes and assist in any way possible. What storage are you planning to use?
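If it helps, here is a minimal sketch of the Docker route, assuming the official treeverse/lakefs image and the environment-variable form of the same config keys; the PostgreSQL connection string and secret below are placeholders, not values from this thread:
Copy code
docker run --name lakefs -p 8000:8000 \
  -e LAKEFS_DATABASE_CONNECTION_STRING="postgres://user:password@<postgres-host>:5432/postgres" \
  -e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="<some-random-secret>" \
  -e LAKEFS_BLOCKSTORE_TYPE="s3" \
  treeverse/lakefs:latest run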
You mentioned that you want Delta Lake integrated, so you may want to check out our Delta Lake-lakeFS integration docs.
t
Yeah sure, I'll be sure to use this thread for updates. Many thanks!
t
Sure, thank you!
t
@Sadiq here is the thread to use
s
Hi @Tal Sofer, for more context: the config.yml file has been modified and a bucket object has been created.
We have been able to get the lakefs service running
Here's the status when this command is run, connecting to the storage bucket on DigitalOcean:
Copy code
INFO[0000] /home/runner/work/lakeFS/lakeFS/cmd/lakefs/cmd/root.go:61 github.com/treeverse/lakefs/cmd/lakefs/cmd.initConfig() Configuration file fields.file=/home/esther/config.yaml file=/home/esther/config.yaml phase=startup
INFO [2022-03-21T174019Z] lakeFS/cmd/lakefs/cmd/root.go:103 cmd/lakefs/cmd.initConfig Config loaded fields.file=/home/esther/config.yaml file=/home/esther/config.yaml phase=startup
INFO [2022-03-21T174019Z] lakeFS/cmd/lakefs/cmd/root.go:110 cmd/lakefs/cmd.initConfig Config actions.enabled=true auth.cache.enabled=true auth.cache.jitter=3s auth.cache.size=1024 auth.cache.ttl=20s auth.encrypt.secret_key="******" blockstore.azure.auth_method=access-key blockstore.azure.storage_access_key="" blockstore.azure.storage_account="" blockstore.azure.try_timeout=10m0s blockstore.default_namespace_prefix="" blockstore.gs.credentials_file="" blockstore.gs.credentials_json="" blockstore.gs.s3_endpoint="https://storage.googleapis.com" blockstore.local.path="~/data/lakefs/block" blockstore.s3.credentials_file="" blockstore.s3.discover_bucket_region=true blockstore.s3.endpoint="" blockstore.s3.force_path_style=false blockstore.s3.max_retries=5 blockstore.s3.profile="" blockstore.s3.region=us-east-1 blockstore.s3.streaming_chunk_size=1048576 blockstore.s3.streaming_chunk_timeout=1s blockstore.type=nyc3.digitaloceanspaces.com committed.block_storage_prefix=_lakefs committed.local_cache.dir="~/data/lakefs/cache" committed.local_cache.max_uploaders_per_writer=10 committed.local_cache.metarange_proportion=0.1 committed.local_cache.range_proportion=0.9 committed.local_cache.size_bytes=1073741824 committed.permanent.max_range_size_bytes=20971520 committed.permanent.min_range_size_bytes=0 committed.permanent.range_raggedness_entries=50000 committed.sstable.memory.cache_size_bytes=400000000 database.connection_max_lifetime=0s database.connection_string="******" database.max_idle_connections=0 database.max_open_connections=0 fields.file=/home/esther/config.yaml file=/home/esther/config.yaml gateways.s3.domain_name="[s3.local.lakefs.io]" gateways.s3.fallback_url="" gateways.s3.region=us-east-1 installation.fixed_id="" listen_address="0.0.0.0:8000" logging.file_max_size_mb=0 logging.files_keep=100 logging.format=text logging.level=INFO logging.output="[-]" logging.trace_request_headers=false phase=startup security.audit_check_interval=12h0m0s security.audit_check_url="https://audit.lakefs.io/audit" stats.address="https://stats.treeverse.io" stats.enabled=true stats.flush_interval=30s
INFO [2022-03-21T174019Z] lakeFS/cmd/lakefs/cmd/run.go:104 cmd/lakefs/cmd.glob..func8 lakeFS run version=0.61.0
INFO [2022-03-21T174019Z] lakeFS/pkg/db/connect.go:54 pkg/db.ConnectDBPool Connecting to the DB conn_max_lifetime=5m0s driver=pgx max_idle_conns=25 max_open_conns=
Can you guide us so we know whether we are on the right path to making this work?
g
Hey @Sadiq, I believe you mixed up the configuration a bit
blockstore.type should be one of ["local", "s3", "gs", "azure", "mem"]
I assume you are using S3; if so, you don't need the Azure configuration or the Google Storage configuration. You can check out https://docs.lakefs.io/reference/configuration.html#reference for more information about the configuration.
s
So the deal is actually to make it work using a different block store from the regular ones.
What's your advice on that?
g
IIUC you are using DigitalOcean Spaces. In that case you would want to configure S3 as your block store and use the DigitalOcean endpoint (https://nyc3.digitaloceanspaces.com).
A good example for that would be MinIO. In your case:
Copy code
endpoint: https://nyc3.digitaloceanspaces.com
access_key_id: <SPACES_KEY>
secret_access_key: <SPACES_SECRET>
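Putting those together, a sketch of the blockstore section of config.yaml for DigitalOcean Spaces might look like this (the key placeholders are assumptions, not values from this thread):
Copy code
blockstore:
  type: s3
  s3:
    force_path_style: true
    endpoint: https://nyc3.digitaloceanspaces.com
    discover_bucket_region: false
    credentials:
      access_key_id: <SPACES_KEY>
      secret_access_key: <SPACES_SECRET>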
t
Hi @Guy Hardonag and @Tal Sofer. For an update: we have made significant progress, but we are currently unable to create a repo - we get an error every time we try, and it is due to the storage namespace. We put in our block storage endpoint but it doesn't work. See the screenshot below.
b
"When using S3-focused tools, keep in mind that S3 terminology differs from DigitalOcean terminology. An S3 “bucket” is the equivalent of an individual Space and an S3 “key” is the name of a file."
Can you try setting the repository's storage namespace to 's3://<space>' or 's3://<space>/<path under the space where you'd like lakeFS to store your data>'?
@Temilola Onaneye I think 'talent-graph-storage-lakefs' is your Space, so you will need to remove it from the endpoint address.
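For example, a hedged sketch with lakectl (the repo name and path are placeholders; 'talent-graph-storage-lakefs' is the Space mentioned above):
Copy code
lakectl repo create lakefs://my-repo s3://talent-graph-storage-lakefs/lakefs-data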
t
Hi @Barak Amar, we tried this and it still didn't work. See screenshot below.
Screen Shot 2022-03-30 at 11.07.25.png
b
For this case - can you share the endpoint configured and the log lakefs generated after this request? It will help identify why it failed.
t
The structure of our YAML file while trying to use DigitalOcean looks like this:
Copy code
database:
  connection_string: "<connection-strings>"
auth:
  encrypt:
    secret_key: "[Secret-keys]"
blockstore:
  type: s3
  s3:
    force_path_style: true
    endpoint: nyc3.digitaloceanspaces.com
    discover_bucket_region: false
    credentials:
      access_key_id: [key-id]
      secret_access_key: [access-key]
b
Can you update the endpoint to
Copy code
https://nyc3.digitaloceanspaces.com
The scheme should matter there.
t
Hi @Barak Amar, is it possible to get on a call with you, as we have been on this for days? You can share your calendar privately so we know what time works for you.
Thanks @Barak Amar @Tal Sofer @Guy Hardonag, it works now! We have been able to set up lakeFS on DigitalOcean using DigitalOcean Spaces storage.
@Nnolum Esther here is the thread
Hi @Barak Amar @Tal Sofer @Guy Hardonag The journey continues. The next line of action for us is to have Spark and Delta Lake integrated with lakeFS. We followed the setup on the page but it doesn't work.
MicrosoftTeams-image (9).png
@Nnolum Esther @Sadiq
l
Hi, correct me if I'm wrong, but I can see that you are using the lakeFS HadoopFS. Can you please tell me what your usage of Spark was before lakeFS? What did the path URI look like? Does it start with s3a://? I'm asking since I'd like to know if you are using the S3A file system that the HadoopFS supports. (If not, you should connect Spark with lakeFS through the S3 gateway.)
n
We first tried following the steps in "Accessing lakeFS using the S3A gateway" (https://docs.lakefs.io/integrations/spark.html#access-lakefs-using-the-s3a-gateway) and it showed a similar error while trying to read the files (image below).
t
Hi @Lynn Rozen , so this is the first time we are trying to use the S3 storage on Digital Ocean with Spark and LakeFS. Our endpoint looks like "https://nyc3.digitaloceanspaces.com". We tried using the s3a connection but we get these errors.
l
Hi, thanks for the update! I have some follow-up questions in order to fully understand the issue. 🙂 Which Spark version and Hadoop version do you use? How does your environment look without lakeFS? Did you manage to run the same Spark command directly against your storage? It seems like you need to add the hadoop-aws package (which has to have the same version as hadoop-common).
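For example, a sketch of pulling it in at launch time (the version below is an assumption - it should match the Hadoop version bundled with your Spark distribution, e.g. 3.3.1 for Spark 3.2.1):
Copy code
spark-shell --packages org.apache.hadoop:hadoop-aws:3.3.1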
t
Spark version - 3.2.1
Hadoop version - 3.3. Without lakeFS, we have block storage on DigitalOcean and Spark installed. We first started out by deploying lakeFS to work with our object storage on DigitalOcean, which was successful after many trials. Now we want to integrate Spark with lakeFS, and also integrate Delta Lake with lakeFS.
l
Can you please share your spark configuration (specifically lakeFS configuration)? I'd also like to know if adding the hadoop-aws package helped.
t
Hi @Lynn Rozen, after adding the hadoop-aws package to the jars folder we have this new error:
Screen Shot 2022-04-05 at 19.03.53.png,Screen Shot 2022-04-05 at 19.04.09.png
And here is the Spark configuration template:
Copy code
spark-shell --conf spark.hadoop.fs.s3a.access.key='S3-STORAGE-ACCESS-KEY' \
  --conf spark.hadoop.fs.s3a.secret.key='S3-STORAGE-SECRET-KEY' \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.endpoint='https://nyc3.digitaloceanspaces.com' ...
l
OK, let's configure Spark and lakeFS through the S3 gateway. In that case, the credentials and the endpoint should be those of lakeFS. You should include these properties:
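Roughly, a sketch of the usual S3-gateway properties (the lakeFS endpoint and key placeholders are assumptions - use your lakeFS server's address and an access key created in lakeFS):
Copy code
spark.hadoop.fs.s3a.access.key=<LAKEFS_ACCESS_KEY_ID>
spark.hadoop.fs.s3a.secret.key=<LAKEFS_SECRET_ACCESS_KEY>
spark.hadoop.fs.s3a.endpoint=https://<lakefs-endpoint>
spark.hadoop.fs.s3a.path.style.access=true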
Please make sure that you don't include lakeFS hadoop FS in your configuration 🙂
t
Hi @Lynn Rozen, thanks so much, it works now! Many thanks for the help. What advice can you share if we want to have this saved permanently somewhere and used at a larger scale?
We also want to be able to use this when writing PySpark from a Jupyter notebook, and we want to integrate with Delta Lake too.
l
How do you run Spark now?
t
Hi Lynn
So we are currently able to run Spark from our Jupyter notebook using PySpark.
See the screenshot below to check whether we are on track with the configuration needed to access lakeFS.
Below is the log
l
I believe that if the configuration worked for you, it should also work with PySpark. Is the log above a new error you are getting? Are you also trying to access AWS from the notebook?
t
I get the error above when I run that cell of the notebook
We are not using AWS in any way; our block storage is on DigitalOcean.
I think setting the lakeFS credentials permanently in a config file should handle this.
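One hedged way to make that permanent, assuming a standard Spark install, is to put the same S3A properties into conf/spark-defaults.conf so every session (spark-shell, PySpark, Jupyter) picks them up; the endpoint and keys below are placeholders:
Copy code
spark.hadoop.fs.s3a.access.key         <LAKEFS_ACCESS_KEY_ID>
spark.hadoop.fs.s3a.secret.key         <LAKEFS_SECRET_ACCESS_KEY>
spark.hadoop.fs.s3a.endpoint           https://<lakefs-endpoint>
spark.hadoop.fs.s3a.path.style.access  true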
Hi @Lynn Rozen @Barak Amar @Guy Hardonag, still stuck on this. Please, how can I access lakeFS using PySpark?
l
Hi 🙂 Can you please tell me what changed between the first time it worked and now? Where did you run it the first time?
t
So the first time it worked, I ran it using spark-shell in the terminal. Right now I am trying to access it from inside a Jupyter notebook using PySpark.
I have this now
l
Can you please tell me what command you are running? I can't reproduce the error. Also, I suggest you print the values of the configs to make sure they are set correctly.
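For reference, a sketch of setting and then printing those configs from PySpark in a notebook (the endpoint, keys, repo and branch names are placeholders, not values from this thread):
Copy code
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakefs-s3a-check")
    # Point S3A at the lakeFS S3 gateway, not at DigitalOcean Spaces
    .config("spark.hadoop.fs.s3a.access.key", "<LAKEFS_ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<LAKEFS_SECRET_ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.endpoint", "https://<lakefs-endpoint>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Print the effective values back to verify they were set
hconf = spark.sparkContext._jsc.hadoopConfiguration()
for key in ["fs.s3a.endpoint", "fs.s3a.path.style.access"]:
    print(key, "=", hconf.get(key))

# Example read through the gateway: s3a://<repo>/<branch>/path
# df = spark.read.parquet("s3a://<repo>/<branch>/path/to/data")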
t
Hi @Lynn Rozen, so apparently I was missing something in the Spark configuration. It works now! I can access data in a lakeFS repo using PySpark from a Jupyter notebook.
l
Great, thanks for the update!
t
I'll continue the journey tomorrow as we look at how to get Delta Lake configured.
Many thanks for the help
Thanks everyone, we have also been able to integrate Delta Lake and use it with lakeFS.
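For anyone following along, a rough sketch of what a Delta-on-lakeFS PySpark session can look like through the S3 gateway (the delta-core version is an assumption for Spark 3.2.x, and the endpoint, keys, repo and branch names are placeholders):
Copy code
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-on-lakefs")
    # Delta Lake support (delta-core 1.2.x targets Spark 3.2.x)
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.2.1")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # lakeFS S3 gateway instead of the underlying object store
    .config("spark.hadoop.fs.s3a.access.key", "<LAKEFS_ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<LAKEFS_SECRET_ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.endpoint", "https://<lakefs-endpoint>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Write and read back a small Delta table on a lakeFS branch
spark.range(10).write.format("delta").mode("overwrite").save("s3a://<repo>/<branch>/tables/demo")
spark.read.format("delta").load("s3a://<repo>/<branch>/tables/demo").show()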
Thanks @Lynn Rozen @Barak Amar @Guy Hardonag and @Tal Sofer for the journey, from setting up lakeFS on DigitalOcean to integrating with Spark and Delta Lake.
l
Sure!