# help
g
Hi team, I have a question about using a service account to deploy the service. To use S3 as the blockstore, my current helm config file looks like below
Copy code
secrets:
  authEncryptSecretKey: "123"
  # Use the following to fetch PostgreSQL connection string from an existing secret:
  databaseConnectionString: "postgres://***"

lakefsConfig: |
  database:
    type: "postgres"
  blockstore:
    type: "s3"
    s3:
      region: "us-west-2"
      credentials:
        access_key_id: "***"
        secret_access_key: "***"
If I want to switch to a service account to access the S3 bucket, do you have an example helm config file? Would it be something like below?
Copy code
serviceAccount: <service-account-name>
secrets:
  authEncryptSecretKey: "123"
  # Use the following to fetch PostgreSQL connection string from an existing secret:
  databaseConnectionString: "postgres://***"

lakefsConfig: |
  database:
    type: "postgres"
  blockstore:
    type: "s3"
    s3:
      region: "us-west-2"
a
lakeFS supports using a credentials file via the `blockstore.s3.credentials_file` and `blockstore.s3.profile` configurations. You can give it a path to a configuration file that will look something like this:
Copy code
[lakefs]
role_arn = <YOUR_ROLE_ARN>
web_identity_token_file = /var/run/secrets/eks.amazonaws.com/serviceaccount/token
role_session_name = <ROLE_SESSION_NAME>
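The lakeFS config would then point at that file. A minimal sketch, assuming the file is mounted at /etc/aws/credentials (the path is just an illustration, not a required location):
Copy code
lakefsConfig: |
  # database section omitted for brevity
  blockstore:
    type: "s3"
    s3:
      region: "us-west-2"
      credentials_file: "/etc/aws/credentials"
      profile: "lakefs"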
g
Thanks for the quick reply! I’ll give the profile approach a try. But a serviceAccount is the preferred way for us. I saw the helm values can take a serviceAccount value, so I guess that’s supported; I’m just trying to find an example to follow.
a
@gang ye FYI: I’ve asked our engineers to review this request and provide an example of using a service account.
g
OK! Thanks a lot!
i
@gang ye the lakeFS server uses the Go AWS SDK. If you configured the service account for the lakeFS pod/deployment properly, it should work out of the box. So this config
Copy code
secrets:
  authEncryptSecretKey: "123"
  # Use the following to fetch PostgreSQL connection string from an existing secret:
  databaseConnectionString: "postgres://***"

lakefsConfig: |
  database:
    type: "postgres"
  blockstore:
    type: "s3"
    s3:
      region: "us-west-2"
will default to the AWS SDK credentials lookup, which will find the service account creds.
g
I think I figured out how to configure the service account in the helm config file, like below
Copy code
serviceAccount:
  name: <service-account-name>
🤘 1
One quick question: does lakeFS use the AWS `DefaultAWSCredentialsProviderChain`? After setting the service account in the helm deployment, the pod env variables have the configuration below
Copy code
AWS_DEFAULT_REGION=us-west-2
AWS_REGION=us-west-2
AWS_ROLE_ARN=arn:aws:iam::***:role/data-experimentation
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
AWS_STS_REGIONAL_ENDPOINTS=regional
But the lakeFS server cannot create the block adapter as expected.
Hi @Itai Admi I was able to set the service account in the deployment. The pod contains the above env variables, and I verified the service account credentials work by running another pod with awscli to list files on S3. But the lakeFS server still cannot initialize the block adapter. Is there any other missing configuration? Do you have any clue? Thanks!
i
Hey @gang ye, can you share the lakeFS config that you’re using now, including the env vars for it, if any (starting with `LAKEFS_`)? Also, can you share the logs from the server?
g
Sure. Helm config file
Copy code
image:
  repository: docker.io/treeverse/lakefs
  pullPolicy: IfNotPresent

# Keys used for existingSecret
secrets:
  authEncryptSecretKey: "123"
  
lakefsConfig: |
  logging.level: TRACE
  stats.enabled: false
  database:
    type: local
  blockstore:
    type: s3
serviceAccount: 
  name: data-experimentation-sa
env variable
Copy code
~ $ printenv | grep lakefs
HOSTNAME=lakefsingress-cc96645cf-46xs7
HOME=/home/lakefs
PWD=/home/lakefs
~ $ printenv | grep AWS
AWS_ROLE_ARN=arn:aws:iam::***:role/data-experimentation
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
AWS_STS_REGIONAL_ENDPOINTS=regional
AWS_DEFAULT_REGION=us-west-2
AWS_REGION=us-west-2
server pod log message
Copy code
time="2023-10-20T21:08:50Z" level=info msg="Configuration file" func=github.com/treeverse/lakefs/cmd/lakefs/cmd.initConfig file="/build/cmd/lakefs/cmd/root.go:109" fields.file=/etc/lakefs/config.yaml file="/build/cmd/lakefs/cmd/root.go:109" phase=startup
time="2023-10-20T21:08:50Z" level=info msg="Config loaded" func=cmd/lakefs/cmd.initConfig file="cmd/root.go:151" fields.file=/etc/lakefs/config.yaml file="cmd/root.go:151" phase=startup
time="2023-10-20T21:08:50Z" level=info msg=Config func=cmd/lakefs/cmd.initConfig file="cmd/root.go:159" actions.enabled=true actions.lua.net_http_enabled=false auth.api.endpoint="" auth.api.supports_invites=false auth.api.token=------ auth.cache.enabled=true auth.cache.jitter=3s auth.cache.size=1024 auth.cache.ttl=20s auth.cookie_auth_verification.auth_source="" auth.cookie_auth_verification.default_initial_groups="[]" auth.cookie_auth_verification.external_user_id_claim_name="" auth.cookie_auth_verification.friendly_name_claim_name="" auth.cookie_auth_verification.initial_groups_claim_name="" auth.cookie_auth_verification.validate_id_token_claims="map[]" auth.encrypt.secret_key="******" auth.login_duration=168h0m0s auth.logout_redirect_url=/auth/login auth.oidc.default_initial_groups="[]" auth.oidc.friendly_name_claim_name="" auth.oidc.initial_groups_claim_name="" auth.oidc.validate_id_token_claims="map[]" auth.remote_authenticator.default_user_group=Viewers auth.remote_authenticator.enabled=false auth.remote_authenticator.endpoint="" auth.remote_authenticator.request_timeout=10s auth.ui_config.login_cookie_names="[internal_auth_session]" auth.ui_config.login_failed_message="The credentials don't match." auth.ui_config.login_url="" auth.ui_config.logout_url="" auth.ui_config.rbac=simplified blockstore.azure.auth_method="" blockstore.azure.disable_pre_signed=false blockstore.azure.disable_pre_signed_ui=true blockstore.azure.pre_signed_expiry=15m0s blockstore.azure.storage_access_key="" blockstore.azure.storage_account="" blockstore.azure.test_endpoint_url="" blockstore.azure.try_timeout=10m0s blockstore.gs.credentials_file="" blockstore.gs.credentials_json="" blockstore.gs.disable_pre_signed=false blockstore.gs.disable_pre_signed_ui=true blockstore.gs.pre_signed_expiry=15m0s blockstore.gs.s3_endpoint="<https://storage.googleapis.com>" blockstore.local.allowed_external_prefixes="[]" blockstore.local.import_enabled=false blockstore.local.import_hidden=false blockstore.local.path="~/lakefs/data/block" blockstore.s3.client_log_request=false blockstore.s3.client_log_retries=false blockstore.s3.credentials_file="" blockstore.s3.disable_pre_signed=false blockstore.s3.disable_pre_signed_ui=true blockstore.s3.discover_bucket_region=true blockstore.s3.endpoint="" blockstore.s3.force_path_style=false blockstore.s3.max_retries=5 blockstore.s3.pre_signed_expiry=15m0s blockstore.s3.profile="" blockstore.s3.region=us-east-1 blockstore.s3.server_side_encryption="" blockstore.s3.server_side_encryption_kms_key_id="" blockstore.s3.skip_verify_certificate_test_only=false blockstore.s3.web_identity.session_duration=0s blockstore.s3.web_identity.session_expiry_window=5m0s blockstore.type=s3 committed.block_storage_prefix=_lakefs committed.local_cache.dir="~/lakefs/data/cache" committed.local_cache.max_uploaders_per_writer=10 committed.local_cache.metarange_proportion=0.1 committed.local_cache.range_proportion=0.9 committed.local_cache.size_bytes=1073741824 committed.permanent.max_range_size_bytes=20971520 committed.permanent.min_range_size_bytes=0 committed.permanent.range_raggedness_entries=50000 committed.sstable.memory.cache_size_bytes=400000000 database.drop_tables=false database.dynamodb.aws_access_key_id=------ database.dynamodb.aws_profile="" database.dynamodb.aws_region="" database.dynamodb.aws_secret_access_key=------ database.dynamodb.endpoint="" database.dynamodb.health_check_interval=0s database.dynamodb.scan_limit=1024 database.dynamodb.table_name=kvstore database.local.enable_logging=false 
database.local.path="~/lakefs/metadata" database.local.prefetch_size=256 database.local.sync_writes=true database.postgres.connection_max_lifetime=5m0s database.postgres.connection_string=------ database.postgres.max_idle_connections=25 database.postgres.max_open_connections=25 database.postgres.metrics=false database.postgres.scan_page_size=0 database.type=local diff.delta.plugin="" email_subscription.enabled=true fields.file=/etc/lakefs/config.yaml file="cmd/root.go:159" gateways.s3.domain_name="[s3.local.lakefs.io]" gateways.s3.fallback_url="" gateways.s3.region=us-east-1 graveler.background.rate_limit=0 graveler.batch_dbio_transaction_markers=false graveler.commit_cache.expiry=10m0s graveler.commit_cache.jitter=2s graveler.commit_cache.size=50000 graveler.ensure_readable_root_namespace=true graveler.repository_cache.expiry=5s graveler.repository_cache.jitter=2s graveler.repository_cache.size=1000 installation.access_key_id=------ installation.fixed_id="" installation.secret_access_key=------ installation.user_name="" listen_address="0.0.0.0:8000" logging.audit_log_level=DEBUG logging.file_max_size_mb=102400 logging.files_keep=100 logging.format=text logging.level=TRACE logging.output="[-]" logging.trace_request_headers=false phase=startup plugins.default_path="~/.lakefs/plugins" plugins.properties="map[]" security.audit_check_interval=24h0m0s security.audit_check_url="<https://audit.lakefs.io/audit>" security.check_latest_version=true security.check_latest_version_cache=1h0m0s stats.address="<https://stats.lakefs.io>" stats.enabled=false stats.extended=false stats.flush_interval=30s stats.flush_size=100 tls.cert_file="" tls.enabled=false tls.key_file="" ugc.prepare_interval=1m0s ugc.prepare_max_file_size=20971520 ui.enabled=true ui.snippets="[]"
time="2023-10-20T21:08:50Z" level=info msg="lakeFS run" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:91" version=0.113.0
time="2023-10-20T21:08:50Z" level=info msg="initialized Auth service" func=pkg/auth.NewAuthService file="build/pkg/auth/service.go:187" service=auth_service
time="2023-10-20T21:08:50Z" level=debug msg="failed to collect account metadata" func=pkg/stats.NewMetadata file="build/pkg/stats/metadata.go:34" error="not found"
i
And then does it exit?
g
the pod keeps restarting with readiness probe and liveness probe failures
i
Gotcha. I don’t think it has to do with the block adapter.
Do you have logs from the failing probes?
g
The above log is from the failing pod. For the probes, it shows
Copy code
Liveness probe failed: Get "http://172.16.95.57:8000/_health": dial tcp 172.16.95.57:8000: connect: connection refused
I guess it’s because the server isn’t running, so the connection is refused
i
What version are you using? I’ll try to reproduce it
g
helm chart version 0.13.3, lakefs image version 0.113.0
I have enabled TRACE logging, but there is no very helpful clue in the logs. The reason I think it’s related to loading AWS credentials (especially in the service account case) is that if I set the blockstore to local, or set the AWS credentials (access key, secret key) manually in lakefsConfig, it works.
i
OK, that’s weird. I’m also unable to reproduce at the moment
When the AWS creds are not set properly, lakeFS normally: 1. Exits on startup with the blockstore error 2. Fails to create a repo / perform any data operation on existing repos.
g
Is there any config we can set to enable AWS debug logs? It looks like `logging.level` in lakefsConfig only enables the lakeFS server debug logs.
i
Checking
Yes, can you try this lakeFS config?
Copy code
lakefsConfig: |
  logging.level: TRACE
  stats.enabled: false
  database:
    type: local
  blockstore:
    type: s3
    s3:
      client_log_retries: true
      client_log_request: true
Also, please share
Copy code
printenv | grep LAKEFS
g
Sure, one sec
Copy code
~ $ printenv | grep LAKEFS
LAKEFSINGRESS_PORT=tcp://10.100.146.226:80
LAKEFS_SERVICE_HOST=10.100.145.100
LAKEFSINGRESS_SERVICE_PORT=80
LAKEFSINGRESS_PORT_80_TCP_ADDR=10.100.146.226
LAKEFS_SERVICE_PORT=80
LAKEFS_PORT=tcp://10.100.145.100:80
LAKEFSINGRESS_PORT_80_TCP_PORT=80
LAKEFSINGRESS_PORT_80_TCP_PROTO=tcp
LAKEFS_PORT_80_TCP_ADDR=10.100.145.100
LAKEFS_PORT_80_TCP_PORT=80
LAKEFS_PORT_80_TCP_PROTO=tcp
LAKEFSINGRESS_PORT_80_TCP=tcp://10.100.146.226:80
LAKEFS_PORT_80_TCP=tcp://10.100.145.100:80
LAKEFSINGRESS_SERVICE_PORT_HTTP=80
LAKEFS_AUTH_ENCRYPT_SECRET_KEY=123
LAKEFS_SERVICE_PORT_HTTP=80
LAKEFSINGRESS_SERVICE_HOST=10.100.146.226
the pod log message is the same as before
Copy code
time="2023-10-20T21:34:36Z" level=info msg="Configuration file" func=<http://github.com/treeverse/lakefs/cmd/lakefs/cmd.initConfig|github.com/treeverse/lakefs/cmd/lakefs/cmd.initConfig> file="/build/cmd/lakefs/cmd/root.go:109" fields.file=/etc/lakefs/config.yaml file="/build/cmd/lakefs/cmd/root.go:109" phase=startup
time="2023-10-20T21:34:36Z" level=info msg="Config loaded" func=cmd/lakefs/cmd.initConfig file="cmd/root.go:151" fields.file=/etc/lakefs/config.yaml file="cmd/root.go:151" phase=startup
time="2023-10-20T21:34:36Z" level=info msg=Config func=cmd/lakefs/cmd.initConfig file="cmd/root.go:159" actions.enabled=true actions.lua.net_http_enabled=false auth.api.endpoint="" auth.api.supports_invites=false auth.api.token=------ auth.cache.enabled=true auth.cache.jitter=3s auth.cache.size=1024 auth.cache.ttl=20s auth.cookie_auth_verification.auth_source="" auth.cookie_auth_verification.default_initial_groups="[]" auth.cookie_auth_verification.external_user_id_claim_name="" auth.cookie_auth_verification.friendly_name_claim_name="" auth.cookie_auth_verification.initial_groups_claim_name="" auth.cookie_auth_verification.validate_id_token_claims="map[]" auth.encrypt.secret_key="******" auth.login_duration=168h0m0s auth.logout_redirect_url=/auth/login auth.oidc.default_initial_groups="[]" auth.oidc.friendly_name_claim_name="" auth.oidc.initial_groups_claim_name="" auth.oidc.validate_id_token_claims="map[]" auth.remote_authenticator.default_user_group=Viewers auth.remote_authenticator.enabled=false auth.remote_authenticator.endpoint="" auth.remote_authenticator.request_timeout=10s auth.ui_config.login_cookie_names="[internal_auth_session]" auth.ui_config.login_failed_message="The credentials don't match." auth.ui_config.login_url="" auth.ui_config.logout_url="" auth.ui_config.rbac=simplified blockstore.azure.auth_method="" blockstore.azure.disable_pre_signed=false blockstore.azure.disable_pre_signed_ui=true blockstore.azure.pre_signed_expiry=15m0s blockstore.azure.storage_access_key="" blockstore.azure.storage_account="" blockstore.azure.test_endpoint_url="" blockstore.azure.try_timeout=10m0s blockstore.gs.credentials_file="" blockstore.gs.credentials_json="" blockstore.gs.disable_pre_signed=false blockstore.gs.disable_pre_signed_ui=true blockstore.gs.pre_signed_expiry=15m0s blockstore.gs.s3_endpoint="<https://storage.googleapis.com>" blockstore.local.allowed_external_prefixes="[]" blockstore.local.import_enabled=false blockstore.local.import_hidden=false blockstore.local.path="~/lakefs/data/block" blockstore.s3.client_log_request=true blockstore.s3.client_log_retries=true blockstore.s3.credentials_file="" blockstore.s3.disable_pre_signed=false blockstore.s3.disable_pre_signed_ui=true blockstore.s3.discover_bucket_region=true blockstore.s3.endpoint="" blockstore.s3.force_path_style=false blockstore.s3.max_retries=5 blockstore.s3.pre_signed_expiry=15m0s blockstore.s3.profile="" blockstore.s3.region=us-east-1 blockstore.s3.server_side_encryption="" blockstore.s3.server_side_encryption_kms_key_id="" blockstore.s3.skip_verify_certificate_test_only=false blockstore.s3.web_identity.session_duration=0s blockstore.s3.web_identity.session_expiry_window=5m0s blockstore.type=s3 committed.block_storage_prefix=_lakefs committed.local_cache.dir="~/lakefs/data/cache" committed.local_cache.max_uploaders_per_writer=10 committed.local_cache.metarange_proportion=0.1 committed.local_cache.range_proportion=0.9 committed.local_cache.size_bytes=1073741824 committed.permanent.max_range_size_bytes=20971520 committed.permanent.min_range_size_bytes=0 committed.permanent.range_raggedness_entries=50000 committed.sstable.memory.cache_size_bytes=400000000 database.drop_tables=false database.dynamodb.aws_access_key_id=------ database.dynamodb.aws_profile="" database.dynamodb.aws_region="" database.dynamodb.aws_secret_access_key=------ database.dynamodb.endpoint="" database.dynamodb.health_check_interval=0s database.dynamodb.scan_limit=1024 database.dynamodb.table_name=kvstore database.local.enable_logging=false 
database.local.path="~/lakefs/metadata" database.local.prefetch_size=256 database.local.sync_writes=true database.postgres.connection_max_lifetime=5m0s database.postgres.connection_string=------ database.postgres.max_idle_connections=25 database.postgres.max_open_connections=25 database.postgres.metrics=false database.postgres.scan_page_size=0 database.type=local diff.delta.plugin="" email_subscription.enabled=true fields.file=/etc/lakefs/config.yaml file="cmd/root.go:159" gateways.s3.domain_name="[<http://s3.local.lakefs.io|s3.local.lakefs.io>]" gateways.s3.fallback_url="" gateways.s3.region=us-east-1 graveler.background.rate_limit=0 graveler.batch_dbio_transaction_markers=false graveler.commit_cache.expiry=10m0s graveler.commit_cache.jitter=2s graveler.commit_cache.size=50000 graveler.ensure_readable_root_namespace=true graveler.repository_cache.expiry=5s graveler.repository_cache.jitter=2s graveler.repository_cache.size=1000 installation.access_key_id=------ installation.fixed_id="" installation.secret_access_key=------ installation.user_name="" listen_address="0.0.0.0:8000" logging.audit_log_level=DEBUG logging.file_max_size_mb=102400 logging.files_keep=100 logging.format=text logging.level=TRACE logging.output="[-]" logging.trace_request_headers=false phase=startup plugins.default_path="~/.lakefs/plugins" plugins.properties="map[]" security.audit_check_interval=24h0m0s security.audit_check_url="<https://audit.lakefs.io/audit>" security.check_latest_version=true security.check_latest_version_cache=1h0m0s stats.address="<https://stats.lakefs.io>" stats.enabled=false stats.extended=false stats.flush_interval=30s stats.flush_size=100 tls.cert_file="" tls.enabled=false tls.key_file="" ugc.prepare_interval=1m0s ugc.prepare_max_file_size=20971520 ui.enabled=true ui.snippets="[]"
time="2023-10-20T21:34:36Z" level=info msg="lakeFS run" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:91" version=0.113.0
time="2023-10-20T21:34:36Z" level=info msg="initialized Auth service" func=pkg/auth.NewAuthService file="build/pkg/auth/service.go:187" service=auth_service
time="2023-10-20T21:34:36Z" level=debug msg="failed to collect account metadata" func=pkg/stats.NewMetadata file="build/pkg/stats/metadata.go:34" error="not found"
i
Not sure it’s related, but still worth a shot: how many lakeFS pod instances are you running?
g
only one instance
👍 1
i
Oh, what are the values you’re using for the probes? Did you uncomment the `livenessProbe` & `readinessProbe` section in the `values.yaml` file?
g
I didn’t uncomment the livenessProbe and readinessProbe part.
i
Can you try that? Perhaps it takes a bit longer for lakeFS to spin up with the service account.
g
I uncommented it
Copy code
extraEnvVars:
# Override K8S defaults for readinessProbe
  readinessProbe:
    failureThreshold: 10
    periodSeconds: 5
    successThreshold: 4
    timeoutSeconds: 1
# Override K8S defaults for livenessProbe
  livenessProbe:
    failureThreshold: 20
    periodSeconds: 5
    successThreshold: 4
    timeoutSeconds: 1
    initialDelaySeconds: 5
but there is a formatting issue
If it takes time to spin up with the service account, then I think it should work once the pod restarts. But what I saw is that it never succeeds.
i
It’s the initialization of the S3 blockstore with the service account that takes longer, and that happens with every pod restart. At least that’s my theory..
I think you need to indent it to the left (no whitespace before `readinessProbe` & `livenessProbe`)
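For example, the probe overrides would sit at the top level of values.yaml rather than nested under extraEnvVars; here is a sketch of the indentation fix using the thresholds from the snippet above (successThreshold omitted, since Kubernetes requires it to be 1 for liveness probes):
Copy code
# Override K8S defaults for readinessProbe
readinessProbe:
  failureThreshold: 10
  periodSeconds: 5
  timeoutSeconds: 1
# Override K8S defaults for livenessProbe
livenessProbe:
  initialDelaySeconds: 5
  failureThreshold: 20
  periodSeconds: 5
  timeoutSeconds: 1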
g
these are the default values for the probes; I will update them manually in the k8s deployment
Copy code
livenessProbe:
            httpGet:
              path: /_health
              port: http
              scheme: HTTP
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /_health
              port: http
              scheme: HTTP
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
👍 1
updated to the below values and restarted the deployment
Copy code
livenessProbe:
            httpGet:
              path: /_health
              port: http
              scheme: HTTP
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 10
          readinessProbe:
            httpGet:
              path: /_health
              port: http
              scheme: HTTP
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 10
same error
Copy code
Liveness probe failed: Get "http://172.16.150.9:8000/_health": dial tcp 172.16.150.9:8000: connect: connection refused
i
I’m out of ideas at the moment. I’ll consult my colleagues and provide an answer early next week. I understand you have a workaround for now, right? (the static creds)
👌 1
g
we don’t have long-lived S3 creds (security rule), only a temporary one that expires every few hours.
i
Got it, so it’s indeed a blocker
g
yeah, thanks very much for the help
🙌 1
i
Hey @gang ye Regarding the credentials, you don’t need to use static credentials. It depends on how you perform auth: we internally run on EKS with IRSA, for example, which means that by adding an annotation to the serviceAccount attached to the lakeFS Deployment we give it all the permissions it needs. It’s basically based on IAM AssumeRole, and there are no static credentials involved. You can even configure this in the Helm chart itself (including the SA creation). For example, that’s how that part would look in `values.yaml`:
Copy code
serviceAccount:
  name: my-svc-acc

extraManifests:
- apiVersion: v1
  kind: ServiceAccount
  metadata:
    name: '{{ .Values.serviceAccount.name }}'
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/my-pod-role"
g
Hi @Isan Rivkin yes, we used a service account with the right annotation settings. The issue is that once we use the service account, the lakeFS server cannot start as expected. Logs are attached earlier.
i
So if I understood correctly, the issue is that the liveness probe causes restarts for some unknown reason (I don’t see anything helpful in the logs), is that correct?
• If so, can you maybe try overriding the lakeFS container command to “sleep 1d” through the chart and give the readiness probe an effectively infinite threshold? Then once it’s up, exec into the container and try to run lakefs, and from outside the container try to reach it with curl or something like that.
• Also, do you have an HTTP proxy in your environment?
• Is the blockstore S3 or some S3-compatible interface?
g
I think the liveness probe and readiness probe themselves are not the root cause of the problem. The root cause is that the lakeFS server cannot start when using the service account (the log messages don’t provide a helpful clue, and the blockstore I’m using is S3). And I have used the same service account in another pod to access S3, which works as expected.
i
Could be, let’s explore that option. Can you please make sure that your namespace and service account information is correct?
• The SA needs to be in the same namespace, and the namespace needs to exist before the SA is created.
• If you configured it to use a default SA (which I don’t think is your case, but just to make sure), then that has to exist as well, and the SA should be created after the namespace.
• Finally, if none of that helps, can you please attach the events of the resources? I’m pretty sure we will see the answer there:
Copy code
k describe deploy <lakefs-deploy> 
k describe pod <lakefs-pod> # try to catch pod events before it restarts 
k describe replicaset <lakefs-replicaset> 
k describe sa <your service account> 
k describe svc <lakefs service>
g
Hi @Isan Rivkin Thanks for following up on the issue.
• The SA needs to be in the same namespace, and the namespace needs to exist before the SA is created.
  ◦ Yes, the SA exists in the same namespace.
• If you configured it to use a default SA, then that has to exist as well, and the SA should be created after the namespace.
  ◦ We don’t use the default SA.
describe deploy
Copy code
kubectl describe deploy lakefstest -n data-experimentation
Name:                   lakefstest
Namespace:              data-experimentation
CreationTimestamp:      Mon, 23 Oct 2023 17:33:35 -0700
Labels:                 app=lakefs
                        app.kubernetes.io/instance=lakefstest
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=lakefs
                        app.kubernetes.io/version=0.113.0
                        helm.sh/chart=lakefs-0.13.3
Annotations:            deployment.kubernetes.io/revision: 1
                        meta.helm.sh/release-name: lakefstest
                        meta.helm.sh/release-namespace: data-experimentation
Selector:               app=lakefs,app.kubernetes.io/instance=lakefstest,app.kubernetes.io/name=lakefs
Replicas:               1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=lakefs
                    app.kubernetes.io/instance=lakefstest
                    app.kubernetes.io/name=lakefs
  Annotations:      checksum/config: 5b7e985282116d067aa462b11debaeb3fc43e4fcdd55194645756e057e9bcc89
  Service Account:  data-experimentation-sa
  Containers:
   lakefs:
    Image:      docker.apple.com/aiml-datainfra/lakefs:0.113.0-amd64
    Port:       8000/TCP
    Host Port:  0/TCP
    Args:
      run
      --config
      /etc/lakefs/config.yaml
    Liveness:   http-get http://:http/_health delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:http/_health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      LAKEFS_AUTH_ENCRYPT_SECRET_KEY:  <set to the key 'auth_encrypt_secret_key' in secret 'lakefstest'>  Optional: false
    Mounts:
      /etc/lakefs from config-volume (rw)
  Volumes:
   config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      lakefstest
    Optional:  false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      False   MinimumReplicasUnavailable
  Progressing    True    ReplicaSetUpdated
OldReplicaSets:  <none>
NewReplicaSet:   lakefstest-5c9c5b454b (1/1 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  28s   deployment-controller  Scaled up replica set lakefstest-5c9c5b454b to 1
Describe pod
Copy code
kubectl describe pod $POD_NAME -n data-experimentation
Name:             lakefstest-5c9c5b454b-hd7bj
Namespace:        data-experimentation
Priority:         0
Service Account:  data-experimentation-sa
Node:             ip-172-16-162-137.us-west-2.compute.internal/172.16.162.137
Start Time:       Mon, 23 Oct 2023 17:33:36 -0700
Labels:           app=lakefs
                  app.kubernetes.io/instance=lakefstest
                  app.kubernetes.io/name=lakefs
                  pod-template-hash=5c9c5b454b
Annotations:      checksum/config: 5b7e985282116d067aa462b11debaeb3fc43e4fcdd55194645756e057e9bcc89
                  kubernetes.io/psp: 00-fully-open
Status:           Running
IP:               172.16.134.189
IPs:
  IP:           172.16.134.189
Controlled By:  ReplicaSet/lakefstest-5c9c5b454b
Containers:
  lakefs:
    Container ID:  docker://0395f05274ef10d047b1b779d344f163656bb7f3157aad4ad63d03ba4ab0b7e1
    Image:         docker.apple.com/aiml-datainfra/lakefs:0.113.0-amd64
    Image ID:      docker-pullable://docker.apple.com/aiml-datainfra/lakefs@sha256:2b98fe0283384197441d83d8ac6f25014df0b16f39dd611bb48063b489940255
    Port:          8000/TCP
    Host Port:     0/TCP
    Args:
      run
      --config
      /etc/lakefs/config.yaml
    State:          Running
      Started:      Mon, 23 Oct 2023 17:34:37 -0700
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Mon, 23 Oct 2023 17:33:36 -0700
      Finished:     Mon, 23 Oct 2023 17:34:36 -0700
    Ready:          False
    Restart Count:  1
    Liveness:       http-get http://:http/_health delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:http/_health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      LAKEFS_AUTH_ENCRYPT_SECRET_KEY:  <set to the key 'auth_encrypt_secret_key' in secret 'lakefstest'>  Optional: false
      AWS_STS_REGIONAL_ENDPOINTS:      regional
      AWS_DEFAULT_REGION:              us-west-2
      AWS_REGION:                      us-west-2
      AWS_ROLE_ARN:                    arn:aws:iam::xxx:role/aiml-data-experimentation
      AWS_WEB_IDENTITY_TOKEN_FILE:     /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /etc/lakefs from config-volume (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-49t2q (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      lakefstest
    Optional:  false
  kube-api-access-49t2q:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  84s                default-scheduler  Successfully assigned data-experimentation/lakefstest-5c9c5b454b-hd7bj to ip-172-16-162-137.us-west-2.compute.internal
  Normal   Killing    54s                kubelet            Container lakefs failed liveness probe, will be restarted
  Normal   Pulled     23s (x2 over 84s)  kubelet            Container image "docker.apple.com/aiml-datainfra/lakefs:0.113.0-amd64" already present on machine
  Normal   Created    23s (x2 over 84s)  kubelet            Created container lakefs
  Normal   Started    23s (x2 over 84s)  kubelet            Started container lakefs
  Warning  Unhealthy  4s (x13 over 83s)  kubelet            Readiness probe failed: Get "http://172.16.134.189:8000/_health": dial tcp 172.16.134.189:8000: connect: connection refused
  Warning  Unhealthy  4s (x5 over 74s)   kubelet            Liveness probe failed: Get "http://172.16.134.189:8000/_health": dial tcp 172.16.134.189:8000: connect: connection refused
describe svc
Copy code
kubectl describe svc lakefstest -n data-experimentation
Name:              lakefstest
Namespace:         data-experimentation
Labels:            app=lakefs
                   app.kubernetes.io/instance=lakefstest
                   app.kubernetes.io/managed-by=Helm
                   app.kubernetes.io/name=lakefs
                   app.kubernetes.io/version=0.113.0
                   helm.sh/chart=lakefs-0.13.3
Annotations:       meta.helm.sh/release-name: lakefstest
                   meta.helm.sh/release-namespace: data-experimentation
Selector:          app.kubernetes.io/instance=lakefstest,app.kubernetes.io/name=lakefs,app=lakefs
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.100.178.193
IPs:               10.100.178.193
Port:              http  80/TCP
TargetPort:        http/TCP
Endpoints:         
Session Affinity:  None
Events:            <none>
describe sa
Copy code
kubectl describe sa data-experimentation-sa -n data-experimentation
Name:                data-experimentation-sa
Namespace:           data-experimentation
Labels:              <none>
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::xxx:role/aiml-data-experimentation
Image pull secrets:  <none>
Mountable secrets:   data-experimentation-sa-token-4m9d5
Tokens:              data-experimentation-sa-token-4m9d5
Events:              <none>
If you need it, here is the helm config file
Copy code
image:
  repository: docker.apple.com/aiml-datainfra/lakefs
  #repository: docker.io/treeverse/lakefs
  tag: 0.113.0-amd64
  pullPolicy: IfNotPresent

# Keys used for existingSecret
secrets:
  authEncryptSecretKey: "123"

lakefsConfig: |
  logging.level: TRACE
  stats.enabled: false
  database:
    type: local
  blockstore:
    type: s3
    s3:
      client_log_retries: true
      client_log_request: true

serviceAccount: 
  name: data-experimentation-sa
And in the pod log, there is no message like the below indicating the blockstore was set up successfully
Copy code
time="2023-10-21T00:15:18Z" level=info msg="initialize blockstore adapter" func=pkg/block/factory.BuildBlockAdapter file="build/pkg/block/factory/build.go:32" type=s3
time="2023-10-21T00:15:18Z" level=info msg="initialized blockstore adapter" func=pkg/block/factory.buildS3Adapter file="build/pkg/block/factory/build.go:111" type=s3
i
Hey @gang ye thanks for getting back to me, and for your patience with the delayed response; it’s due to the time zone difference. After inspecting the additional info you provided, I have noticed a few “suspects”, but nothing concrete.
1. First, I must say that if you prefer, we can do a short meeting to troubleshoot (see DM). 💻
2. I have released a lakeFS version (1.0.0) from a branch with one change: very extensive logging of all the steps in the setup process, so we can see where it gets stuck. 🔨
  a. Can you please update the Helm chart values with the docker image `docker.io/treeverse/experimental-lakefs` and tag `1.0.0-vvv1` (full `values.yaml` attached below)?
  b. All the logs I added contain the field `trace_flow=true`. Please run the new chart and attach the logs. If possible, re-run twice so we can make sure we don’t crash in random places in the code, based on the last log message we see.
3. Regarding the issue itself: 🕵️‍♂️ from the latest logs you sent me, I think that K8S cannot locate runtime dependencies (i.e., the /var/run/secrets/kubernetes.io or service account files are missing).
  a. What K8S version is the cluster?
  b. Are you using any special plugins for authentication/authz between pods (e.g. Calico)?
  c. This might occur when some containers inside the pod attempt to interact with an API without the default access token.
  d. I suspect it because the ServiceAccount you attached has values set in Mountable secrets, and that’s an old way of accessing SA tokens.
  e. If that’s the case, the error can be fixed by allowing all new mount creations to adhere to the default access level throughout the pod space, and ensuring that new pods using custom tokens comply with this access level to prevent continuous startup failures. This can be done by setting `.Values.podSecurityContext` and `.Values.securityContext`; please refer to this Stack Overflow post as a reference for the values, and maybe check other charts you have in the cluster.
Copy code
image:
  repository: docker.io/treeverse/experimental-lakefs
  tag: 1.0.0-vvv1
  pullPolicy: IfNotPresent

# Keys used for existingSecret
secrets:
  authEncryptSecretKey: "123"

lakefsConfig: |
  logging.level: DEBUG
  stats.enabled: false
  database:
    type: local
  blockstore:
    type: s3
    s3:
      client_log_retries: true
      client_log_request: true

# 90 seconds grace to start, maybe something will pop up in the logs
livenessProbe:
  initialDelaySeconds: 90

serviceAccount: 
  name: data-experimentation-lakefs-sa

# this will create a new service account
extraManifests:
- apiVersion: v1
  kind: ServiceAccount
  metadata:
    name: data-experimentation-lakefs-sa
    annotations:
      # set the correct ARN
      eks.amazonaws.com/role-arn: "arn:aws:iam::xxx:role/aiml-data-experimentation"
g
Hi @Isan Rivkin Thanks for the details. I will redeploy with image 1.0.0-vvv1 today and then get back to you to see if we can find any clue. Regarding the issue, we don’t use any special authn/authz. For the service account token, I logged into the pod once and could print it out (/var/run/secrets/eks.amazonaws.com/serviceaccount/token) successfully. K8s version:
Copy code
kubectl version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.25.9
Kustomize Version: v4.5.7
Server Version: v1.23.17-eks-2d98532
Hi @Isan Rivkin One quick question, does the new image 1.0.0-vvv1 support amd64?
i
I built it specifically for amd64, yes. Why?
g
Got it. Then it should work. Our k8s arch is amd64.
🙏 1
Hi @Isan Rivkin Below are the log messages running on 1.0.0-vvv1
Copy code
time="2023-10-24T18:04:07Z" level=info msg="starting build of block adapter" func=pkg/block/factory.BuildBlockAdapter file="build/pkg/block/factory/build.go:31" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="blockstore type: s3" func="pkg/logging.(*logrusEntryWrapper).Infof" file="build/pkg/logging/logger.go:272" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="initialize blockstore adapter" func=pkg/block/factory.BuildBlockAdapter file="build/pkg/block/factory/build.go:36" type=s3
time="2023-10-24T18:04:07Z" level=info msg="initialized blockstore adapter" func=pkg/block/factory.buildS3Adapter file="build/pkg/block/factory/build.go:115" type=s3
time="2023-10-24T18:04:07Z" level=info msg="finished block adapter build, starting runtime collector" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:164" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="post SetRuntimeCollector" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:166" trace_flow=true
time="2023-10-24T18:04:07Z" level=trace msg="dummy sender received metadata" func="pkg/stats.(*dummySender).UpdateMetadata" file="build/pkg/stats/sender.go:147" metadata="{InstallationID:d42db8a3-e6ec-4fdd-b5d1-fa06564a2e8b Entries:[{Name:is_docker Value:true} {Name:instrumentation Value:Run} {Name:lakefs_version Value:dev} {Name:lakefs_kv_type Value:local} {Name:golang_version Value:go1.20.6} {Name:os Value:linux} {Name:architecture Value:amd64} {Name:is_k8s Value:true} {Name:installation_id Value:d42db8a3-e6ec-4fdd-b5d1-fa06564a2e8b} {Name:blockstore_type Value:s3}]}" service=stats_collector
time="2023-10-24T18:04:07Z" level=info msg="post CollectMetadata, initiating catalog" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:169" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="head build block adapter trace_flow true" func="pkg/logging.(*logrusEntryWrapper).Infof" file="build/pkg/logging/logger.go:272"
time="2023-10-24T18:04:07Z" level=info msg="starting build of block adapter" func=pkg/block/factory.BuildBlockAdapter file="build/pkg/block/factory/build.go:31" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="blockstore type: s3" func="pkg/logging.(*logrusEntryWrapper).Infof" file="build/pkg/logging/logger.go:272" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="initialize blockstore adapter" func=pkg/block/factory.BuildBlockAdapter file="build/pkg/block/factory/build.go:36" type=s3
time="2023-10-24T18:04:07Z" level=info msg="initialized blockstore adapter" func=pkg/block/factory.buildS3Adapter file="build/pkg/block/factory/build.go:115" type=s3
time="2023-10-24T18:04:07Z" level=info msg="Post Catalog Initialization" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:178" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="Pre scheduler" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:180" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="Pre scheduler: deleteScheduler.StartAsync" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:186" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="Pre new actions service" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:202" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="Post new actions service" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:212" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="pre middlewareAuthenticator" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:217" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="post middlewareAuthenticator" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:221" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="pre NewDefaultAuditChecker" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:231" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="post NewDefaultAuditChecker" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:235" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="pre checkRepos" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:245" trace_flow=true
time="2023-10-24T18:04:07Z" level=debug msg="lakeFS isn't initialized, skipping mismatched adapter checks" func=cmd/lakefs/cmd.checkRepos file="cmd/run.go:387"
time="2023-10-24T18:04:07Z" level=info msg="post checkRepos" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:247" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="pre: updating SetHealthHandlerInfo" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:249" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="post: updating SetHealthHandlerInfo" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:252" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="pre: init api.Serve" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:255" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="initialize OpenAPI server" func=pkg/api.Serve file="build/pkg/api/serve.go:38" service=api_gateway
time="2023-10-24T18:04:07Z" level=info msg="post: init api.Serve" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:275" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="pre SSO auth middlewares" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:285" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="post SSO auth middlewares" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:300" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="initialized S3 Gateway handler" func=pkg/gateway.NewHandler file="build/pkg/gateway/handler.go:124" s3_bare_domain="[s3.local.lakefs.io]" s3_region=us-east-1
time="2023-10-24T18:04:07Z" level=info msg="pre apiAuthenticator" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:314" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="post apiAuthenticator" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:316" trace_flow=true
time="2023-10-24T18:04:07Z" level=info msg="starting HTTP server" func=cmd/lakefs/cmd.glob..func8 file="cmd/run.go:322" listen_address="0.0.0.0:8000"
It looks like the server starts successfully. I will verify the S3 access.
I verified, and it works now. It looks like bumping the version from 0.113.0 to 1.0.0 solved the issue.
i
Yyyyeess 🙌
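As a wrap-up, a minimal values.yaml reflecting the working setup described in this thread might look like the sketch below (the image tag is a placeholder, and the local database was only used for this test):
Copy code
image:
  repository: docker.io/treeverse/lakefs
  tag: <1.x-tag>
  pullPolicy: IfNotPresent

secrets:
  authEncryptSecretKey: "123"

lakefsConfig: |
  database:
    type: local          # use postgres or dynamodb for a real deployment
  blockstore:
    type: s3
    s3:
      region: us-west-2

serviceAccount:
  # the SA must exist in the same namespace and carry the
  # eks.amazonaws.com/role-arn annotation (IRSA)
  name: data-experimentation-sa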