# help
u
Hi all, I'm currently trying to automate data ingestion using Airflow. Basically, I want to be able to use Spark to read a sample database and then write it to my lakeFS S3 bucket. Any advice on how I can achieve this?
u
Hi @Jude. Welcome aboard. You can definitely get this done. I would suggest having a look at the doc that walks through setting up Spark with lakeFS and your S3 bucket.
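If it helps, a rough PySpark sketch of that flow would look something like this (the lakeFS endpoint, JDBC connection details, credentials, repo and paths below are all placeholders, not your actual setup):
Copy code
# Minimal sketch: point the S3A filesystem at the lakeFS S3 gateway and
# address objects as s3a://<repo>/<branch>/<path>. All values are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakefs-ingest")
    .config("spark.hadoop.fs.s3a.endpoint", "http://<your-lakefs-host>:8000")
    .config("spark.hadoop.fs.s3a.access.key", "<LAKEFS_ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<LAKEFS_SECRET_ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the sample database over JDBC (connection details are placeholders).
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<db-host>:5432/<db-name>")
    .option("dbtable", "users")
    .option("user", "<db-user>")
    .option("password", "<db-password>")
    .load()
)

# Write to a lakeFS repo/branch through the S3 gateway.
df.write.mode("overwrite").parquet("s3a://<repo>/<branch>/foldername/users.parquet")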
u
Okay, I'll take a look at it and give you feedback.
u
After you add Spark, you might want to look into https://docs.lakefs.io/integrations/airflow.html
u
Hello everyone, regarding my question above from yesterday: I've been able to implement what I described. Thanks to you guys for the materials you shared, they really helped a lot. However, I'm facing one little problem. I want to be able to sense whether the file I uploaded to my lakeFS S3 bucket exists, using the LakeFSFileSensor from the lakefs_provider package. I've been able to successfully connect to lakeFS through the Airflow UI, but I still get the error `['branch', 'LakeFS conn id', 'msg', 'repo'] is required`. I then added these parameters to my sense-file DAG like this:
Copy code
Task_sense_file = LakeFSFileSensor(
    task_id='sense_file',
    repo='test',
    branch='main',
    lakefs_conn_id='conn_1',
    path='s3a://test/main/foldername/users.parquet'
)
After trying this I still get the same error.
u
Hi @Jude. Please let me take a look at this error and get back to you
u
Alright, thank you @Itai David
u
Hi @Jude. Sorry for the long delay - I had to get my setup in place. At what phase are you getting the error?
u
Either way, I believe your task definition contains an error in the `path` - the repo and the branch should not be part of the `path` field, nor should the `s3a://` prefix. Can you please try the following instead:
Copy code
task_sense_file = LakeFSFileSensor(
   task_id='sense_file',
   repo='test',
   branch='main',
   lakefs_conn_id='conn_1',
   path='foldername/users.parquet'
)
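And just for reference, wrapped in a minimal DAG it would look roughly like this (the DAG id and schedule are arbitrary, and the import path is assumed from the lakefs_provider package, so it may differ in your installed version):
Copy code
from datetime import datetime

from airflow import DAG
# Import path assumed; adjust to match your installed lakefs_provider version.
from lakefs_provider.sensors.file_sensor import LakeFSFileSensor

with DAG(
    dag_id="sense_lakefs_file",   # arbitrary DAG id for this sketch
    start_date=datetime(2022, 6, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    task_sense_file = LakeFSFileSensor(
        task_id="sense_file",
        repo="test",
        branch="main",
        lakefs_conn_id="conn_1",
        path="foldername/users.parquet",
    )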
u
Hi @Itai David, sorry for my late reply and thanks for looking into the error. I was actually able to discover my mistake: I was trying to run the LakeFSFileSensor operator against a Python operator, since I was communicating with lakeFS via the Python client, hence the error. Update: I've been able to resolve that issue and my whole pipeline now runs successfully as expected. Thanks again ❤️ and have a wonderful weekend.
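For anyone who finds this later, a rough sketch of checking whether an object exists via the Python client looks something like the following (host, credentials and the exact Configuration/exception names are assumptions that may differ by lakefs_client version):
Copy code
# Rough sketch: check whether an object exists on a lakeFS branch using the
# lakefs_client package. Host and credentials are placeholders; exact class
# and exception names may vary between client versions.
import lakefs_client
from lakefs_client.client import LakeFSClient
from lakefs_client.exceptions import NotFoundException

configuration = lakefs_client.Configuration()
configuration.host = "http://<your-lakefs-host>:8000"  # some versions expect an /api/v1 suffix
configuration.username = "<LAKEFS_ACCESS_KEY_ID>"
configuration.password = "<LAKEFS_SECRET_ACCESS_KEY>"
client = LakeFSClient(configuration)

try:
    # stat_object raises if the object is missing, otherwise returns its metadata.
    client.objects.stat_object(repository="test", ref="main", path="foldername/users.parquet")
    print("object exists")
except NotFoundException:
    print("object not found")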
u
Thank you @Jude, for this update 🙂
u
INFO   [2022-06-18T10:29:54Z]lakeFS/pkg/auth/authenticator.go:54 pkg/auth.ChainAuthenticator.AuthenticateUser Failed to authenticate user                   error="2 errors occurred:\n\t* built in authenticator: could not decrypt value\n\t* email authenticator: not found: no rows in result set\n\n" host="137.184.147.128:8000" method=GET path=/api/v1/repositories/test/branches request_id=23de81ad-71e8-48ed-847c-fb688ccd61c6 service_name=rest_api username=AKIAJSEJPLE4NOPT4KEQ
ERROR  [2022-06-18T10:29:54Z]lakeFS/pkg/api/auth_middleware.go:157 pkg/api.userByAuth authenticate                                  error="2 errors occurred:\n\t* built in authenticator: could not decrypt value\n\t* email authenticator: not found: no rows in result set\n\n" host="137.184.147.128:8000" method=GET path=/api/v1/repositories/test/branches request_id=23de81ad-71e8-48ed-847c-fb688ccd61c6 service=api_gateway service_name=rest_api user=AKIAJSEJPLE4NOPT4KEQ
Hi @Barak Amar, can you please help me look at this error? I am having issues uploading my data to lakeFS storage using the lakefs client. It used to work before, until I generated new access keys to access the UI. Is there something I am doing wrong?
u
Hi @Jude, from the error it looks like you are using the old key `AKIAJSEJPLE4NOPT4KEQ`. Did you use the new key created by the superuser command?
u
Yes, I used it but got the same error. I even updated the secret in the lakeFS config files, but still got the same error.
u
Previously we added a new superuser - the command should print out a new key/secret.
u
With the new key/secret you can access the lakeFS UI.
u
At this point we should not change any secret in lakefs.yaml
u
The secret there is used to encrypt/decrypt the keys used to log in to the system.
u
You can remove/add the previous user's key after logging in to the UI with the new admin.
u
But the old key/secret is not valid anymore
u
I thought about it and have reverted the config files back to the original state they were in. I made sure I had a backup of those files before making the changes.
u
Just to understand the current state better: you have recovered an old lakefs.yaml with the original encrypt secret key, and you are trying to access lakeFS using the original key/secret?
u
Yes, I have recovered the old lakefs.yaml. Now I am trying to access lakeFS using the newly generated keys.
u
If you are using an old lakefs.yaml it means that the encryption key changed again and new keys will no longer be valid.
u
You will need the admin key/secret that was created under the original encryption key (first setup) to access lakeFS.
u
Or create a new admin user using the superuser command we used before.
u
The lakefs.yaml encryption key goes together with the credential keys created under it - changing it invalidates those credentials.
u
Okay, to be clear: I create a new admin using the superuser command and then use the generated keys to access lakeFS, without making any modification to any files?
u
Right. lakeFS's configuration file holds the encryption key - it is used to encrypt/decrypt the key/secret information used by the system.
u
You can use the new admin's key/secret generated by the superuser command in your client and in the UI.
u
If you still need to use the old user - you can access the UI with the new admin and delete + add a new key/secret for the old user.
u
Alright, it's clear now. Let me try that. Thank you for the prompt response.
u
Just tried it now and it's still giving me an invalid access key ID.
u
Did you use the new key/secret the superuser command created?
u
Yes, I used it. I can also see the new admin created in the UI, with the same access key ID as the one I'm using.
u
Hi @Jude, when you receive the invalid access key ID error, which access key is logged - the one you just created, or `AKIAJSEJPLE4NOPT4KEQ`?
u
Let me check that
u
So I tried to check the logged access_key_id but couldn't find it, just some random code. Below is the error message from my terminal:
u
ERROR [2022-06-18T115124Z]lakeFS/pkg/block/s3/adapter.go:250 pkg/block/s3.(*Adapter).streamToS3 bad S3 PutObject response error="s3 error: <?xml version=\"1.0\" encoding=\"UTF-8\"?><Error><Code>InvalidAccessKeyId</Code><RequestId>tx000000000000035f2b2d8-0062adbc3c-319abd43-nyc3c</RequestId><HostId>319abd43-nyc3c-nyc3-zg03</HostId></Error>" host="137.184.147.128:8000" method=POST operation=PutObject path="/api/v1/repositories/test/branches/main/objects?path=bronzelayer%2FWEB%2Ffootball-data.parquet" request_id=2ab0fe50-8595-412d-9004-8cfcb7c7abae service_name=rest_api status_code=403 url="https://nyc3.digitaloceanspaces.com/sonalysis-lakefs/a136c647db1445b6bc0a4dd261469bd0"
ERROR [2022-06-18T115124Z]lakeFS/pkg/block/s3/adapter.go:250 pkg/block/s3.(*Adapter).streamToS3 bad S3 PutObject response error="s3 error: <?xml version=\"1.0\" encoding=\"UTF-8\"?><Error><Code>InvalidAccessKeyId</Code><RequestId>tx000000000000035f2b306-0062adbc3c-319abd43-nyc3c</RequestId><HostId>319abd43-nyc3c-nyc3-zg03</HostId></Error>" host="137.184.147.128:8000" method=POST operation=PutObject path="/api/v1/repositories/test/branches/main/objects?path=bronzelayer%2FAPIs%2Fevents.parquet" request_id=5f122245-3e58-4f35-8c67-4462b9f92dc0 service_name=rest_api status_code=403 url="https://nyc3.digitaloceanspaces.com/sonalysis-lakefs/7163853e2b6f4c03b7d786d7f4229594"
ERROR [2022-06-18T115149Z]lakeFS/pkg/block/s3/adapter.go:250 pkg/block/s3.(*Adapter).streamToS3 bad S3 PutObject response error="s3 error: <?xml version=\"1.0\" encoding=\"UTF-8\"?><Error><Code>InvalidAccessKeyId</Code><RequestId>tx000000000000035f2c1e2-0062adbc55-319abd43-nyc3c</RequestId><HostId>319abd43-nyc3c-nyc3-zg03</HostId></Error>" host="137.184.147.128:8000" method=POST operation=PutObject path="/api/v1/repositories/test/branches/main/objects?path=bronzelayer%2FWEB%2Ffootball-data.parquet" request_id=844b22e7-19bd-4173-8437-096c4a9a299e service_name=rest_api status_code=403 url="https://nyc3.digitaloceanspaces.com/sonalysis-lakefs/bd6a164656c3403d8da51fb094c6c193"
ERROR [2022-06-18T115150Z]lakeFS/pkg/block/s3/adapter.go:250 pkg/block/s3.(*Adapter).streamToS3 bad S3 PutObject response error="s3 error: <?xml version=\"1.0\" encoding=\"UTF-8\"?><Error><Code>InvalidAccessKeyId</Code><RequestId>tx000000000000035f2c22e-0062adbc56-319abd43-nyc3c</RequestId><HostId>319abd43-nyc3c-nyc3-zg03</HostId></Error>" host="137.184.147.128:8000" method=POST operation=PutObject path="/api/v1/repositories/test/branches/main/objects?path=bronzelayer%2FAPIs%2Fevents.parquet" request_id=d0edbe98-3d66-4b86-ac0e-2f3bf640a11a service_name=rest_api status_code=403 url="https://nyc3.digitaloceanspaces.com/sonalysis-lakefs/4ed253f54094421289fe09d5fb9143b5"
ERROR [2022-06-18T115217Z]lakeFS/pkg/block/s3/adapter.go:250 pkg/block/s3.(*Adapter).streamToS3 bad S3 PutObject response error="s3 error: <?xml version=\"1.0\" encoding=\"UTF-8\"?><Error><Code>InvalidAccessKeyId</Code><RequestId>tx0000000000000368ca475-0062adbc71-319989f1-nyc3c</RequestId><HostId>319989f1-nyc3c-nyc3-zg03</HostId></Error>" host="137.184.147.128:8000" method=POST operation=PutObject path="/api/v1/repositories/test/branches/main/objects?path=bronzelayer%2FAPIs%2Fevents.parquet" request_id=ccb88755-abd8-4ee7-a30d-ff2c30e4d0f1 service_name=rest_api status_code=403 url="https://nyc3.digitaloceanspaces.com/sonalysis-lakefs/5c926e48222644b698084857ff059a06"
ERROR [2022-06-18T115218Z]lakeFS/pkg/block/s3/adapter.go:250 pkg/block/s3.(*Adapter).streamToS3 bad S3 PutObject response error="s3 error: <?xml version=\"1.0\" encoding=\"UTF-8\"?><…
u
Can you please check in `lakefs.yaml` whether the values for `s3.credentials.access_key_id` and `s3.credentials.secret_access_key` are the values provided by Digital Ocean?
u
Alright
u
Trying to get in touch with our DevOps team; once they provide me with the info I will update you on this. @Guy Hardonag
u
Hi @Guy Hardonag, so I have been able to verify that the S3 credentials provided by Digital Ocean are the same as what I have in the `lakefs.yaml` file.
u
Hi @Jude, it seems like lakeFS can't access Digital Ocean. Can you please try to access the requested location using the secret and access key provided in `lakefs.yaml`, using the aws-cli?
u
You could run this command:
Copy code
echo "" | aws s3 cp - <s3://sonalysis-lakefs/test> --endpoint-url "<https://nyc3.digitaloceanspaces.com>"
u
Okay let me quickly check that out
u
Make sure to set the required access key and secret key:
u
Copy code
AWS_SECRET_ACCESS_KEY=<YOUR-SECRET> AWS_ACCESS_KEY_ID=<YOUR-ACCESS> bash -c 'echo "" | aws s3 cp - s3://sonalysis-lakefs/test --endpoint-url "https://nyc3.digitaloceanspaces.com"'
u
Copy code
database:
  connection_string: "postgres://postgres:xxxxxxxxxxxxxxxxxx"
  #max_open_connections: 115
  #max_idle_connections: 25
  #connection_max_lifetime: 5m

auth:
  encrypt:
    secret_key: "MxxxxxVKXIZCPMI3SAZA2A"

blockstore:
  type: s3
  s3:
    force_path_style: true
    #endpoint: nyc3.digitaloceanspaces.com
    endpoint: https://nyc3.digitaloceanspaces.com
    discover_bucket_region: false
    credentials:
      access_key_id: AKIAJSEJPLE4xxxxxxxx
      secret_access_key: 5HMeHfyE8cal28rH+0zc8ECxr3bxxxxxxxxxxxxx
Now, just to be clear: this is my config file, and I have intentionally changed some of the secrets for security reasons. Regarding the second command you dropped - I have already created a new admin using the lakeFS superuser command, and the access key credentials it generated for me are different from what I have in the last two lines of the config file. So my question is: should I run this command using the access secrets for the new admin I created, or stick with the ones I have in my config file? @Guy Hardonag
u
Hi @Jude, yes, please run the command using the credentials in the `lakefs.yaml` file.
u
Update: I was able to get it to work after running those commands, and everything was working well - the pipeline successfully pushed data to lakeFS. Then, after about an hour, I got disconnected from Postgres with this error message: `panic: error while connecting to DB: failed to connect to `host=137.184.147.128 user=postgres database=postgres`: server error (FATAL: password authentication failed for user "postgres" (SQLSTATE 28P01))` This is what my team and I have been working on for some hours now, trying to resolve.
u
Hi @Jude, did something change that might have caused that connection failure suddenly?
u
Hi @Lynn Rozen, the only thing that changed were the new access credentials, which were provided to me by one of my team members to use in the config.yaml file. Everything was working just fine and lakeFS was starting without issues until this error occurred.
u
And can you connect to the connection string configured in lakefs.yaml using some SQL client with those credentials?
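For example, a quick check along these lines should tell us whether the credentials themselves still work (just a sketch - psycopg2 stands in for whatever SQL client you prefer, and the connection string is a placeholder based on the error above, assuming the default port 5432):
Copy code
# Quick sanity check (sketch): open a connection with the same connection string
# lakeFS uses, to see whether the Postgres credentials still work.
# The connection string is a placeholder - fill in the real password/host/port.
import psycopg2

conn = psycopg2.connect("postgres://postgres:<password>@137.184.147.128:5432/postgres")
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())   # should print (1,) if authentication succeeded
conn.close()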
u
It looks like it's a Postgres issue. I actually tried connecting to the database connection string using an SQL client but couldn't log in. Just trying to figure out what must have caused the Postgres password to change.
u
Ok, looking forward to an update 🙏
u
@Lynn Rozen we have been able to fix the Postgres issue and lakeFS works as expected now. Thanks to you and the team for all the support.
u
Glad to hear!