
Helm Chart: Running the launcher replication-orchestrator failed after upgrade #32203

Closed
joeybenamy opened this issue Nov 6, 2023 · 20 comments
Labels: area/platform, community, team/platform-move, type/bug

Comments

@joeybenamy commented Nov 6, 2023

What method are you using to run Airbyte?

Kubernetes

Platform Version or Helm Chart Version

All Helm chart versions later than 0.49.6

What step the error happened?

Upgrading the Platform or Helm Chart

Relevant information

Several of us in Slack have reported a variety of issues on Helm charts newer than 0.49.6. For each person, downgrading to 0.49.6 resolved the issues and Airbyte is stable; on versions later than 0.49.6, Airbyte is not stable, and connectors, tests, etc. fail with various errors. Slack thread: https://airbytehq.slack.com/archives/C021JANJ6TY/p1698930804469959

Relevant log output

message='io.temporal.serviceclient.CheckedExceptionWrapper: io.airbyte.workers.exception.WorkerException: Running the launcher replication-orchestrator failed', type='java.lang.RuntimeException', nonRetryable=false
@tanderson-hp

Confirmed that I experienced the same, and downgrading to 0.49.6 resolved my issue.

@tobiastroelsen

Same as above! The downgrade worked as a fix for me as well.

@cappadona

Confirming that we were unable to deploy Airbyte when upgrading beyond 0.49.6.

Also came across this other issue which may be related:

@GKTheOne commented Nov 8, 2023

still bad: 0.49.21, 0.49.19

@heruscode

Same here, I can't upgrade to the latest version

@PurseChicken

Same issue here. Had to roll the chart back to 0.49.6.

@storytel-siudzinskim

2 weeks ago I also reported a bug here: #32544

It looks like the issue came with Airbyte 0.50.33 and is present in 0.50.34.

@marcosmarxm (Member)

Hello all 👋 the team did some investigation and found a workaround.

For now, this is the step to fix the issue:
kubectl exec -it airbyte-minio-0 bash -n <your-namespace>

Then run the following commands:

mc alias set myminio http://localhost:9000 minio minio123
mc mb myminio/state-storage

This creates the missing bucket. The team is working to release a fix for future upgrades. Thanks for your patience.
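If it is more convenient, the same workaround can be run as a single non-interactive command, which also avoids the deprecated `kubectl exec [POD] [COMMAND]` form. This is a sketch: the pod name `airbyte-minio-0` and the `minio`/`minio123` credentials are the chart defaults, and `--ignore-existing` makes `mc mb` safe to re-run if the bucket already exists:

```shell
# Replace with your actual namespace before running.
NAMESPACE=your-airbyte-namespace

# Set up the mc alias and create the state-storage bucket in one shot.
kubectl exec airbyte-minio-0 -n "$NAMESPACE" -- \
  sh -c 'mc alias set myminio http://localhost:9000 minio minio123 && mc mb --ignore-existing myminio/state-storage'
```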

@marcosmarxm changed the title from "Recent Helm chart versions buggy" to "Helm Chart: Running the launcher replication-orchestrator failed after upgrade" on Nov 29, 2023
octavia-squidington-iii pushed a commit to airbytehq/airbyte-platform that referenced this issue Nov 30, 2023
This closes airbytehq/airbyte#32203.

#9469 turned on the orchestrator by default for OSS Kube deployments. After this, OSS Kube jobs would fail whenever Airbyte was redeployed.

When we turned this on, it did not occur to me to test the upgrade path. Our Helm charts recreate the airbyte-db and minio pods each time, which wipes the state bucket, so jobs are unable to run after an upgrade.

This PR cleans up the airbyte-db and airbyte-minio behaviour, with the side effect of fixing this bug.

- Instead of recreating the airbyte db and the minio pod each time, we only create these critical resources on install. Once Airbyte is running, there is no situation where recreating these resources on upgrade is needed. In fact, this is harmful, since all jobs running at that time will fail. It also slows down the upgrade, since these resources are required before the actual Airbyte application can start up.
- Pin minio to a specific version instead of always pulling the latest. Although we haven't yet seen bugs caused by minio version drift, pinning to a specific version provides more stability.
- Do the same for kubectl.
@davinchia (Contributor) commented Dec 1, 2023

Hi guys, we figured out what was happening:

  • previously the minio and database pods were reconfigured to always recreate on every helm deployment, be it an install or an upgrade. However, we were not correctly creating the bucket on an upgrade. This resulted in syncs failing.
  • this behaviour has been fixed as of helm version 0.50.3 - upgrading to 0.50.3 should no longer present this issue.
  • the minio and database pods are now only created on install (not upgrade), and we always attempt to create the required default buckets, regardless of install/upgrade. This should result in an overall more stable Airbyte experience.
  • please give it a shot and post any feedback here.

Thank you for your patience!

@DSamuylov

Hi guys, thank you so much for your effort! Unfortunately, I just tried to deploy the newest version, but I still have the same issue. After starting any new job (deployment with chart version 0.50.3), I still get:

message='io.temporal.serviceclient.CheckedExceptionWrapper: io.airbyte.workers.exception.WorkerException: Running the launcher replication-orchestrator failed', type='java.lang.RuntimeException', nonRetryable=false

I just rolled back to version 0.49.6, and everything works as expected.

@davinchia (Contributor) commented Dec 2, 2023

@DSamuylov interesting! I just tested upgrading from 0.49.6 to 0.50.3 and was able to run a job before and after. Can you show me how you are deploying 0.50.3?

@DSamuylov

@davinchia, yes, sure. I deploy with Terraform; here is the file defining the full configuration:

resource "helm_release" "airbyte" {
  name       = "airbyte"
  repository = "https://airbytehq.github.io/helm-charts"
  chart      = "airbyte"
  version    = var.chart_version
  namespace  = var.k8s_namespace

  # Global environment variables:

  set {
    name  = "global.env_vars.DATABASE_URL"
    value = "jdbc:postgresql://${var.external_database_host}:${var.external_database_port}/${var.external_database_database}?ssl=true&sslmode=require"
  }

  # Global database settings:

  set {
    name  = "global.database.secretName"
    value = var.postgres_secrets_name
  }
  set {
    name  = "global.database.secretValue"
    value = "password"
  }
  set {
    name  = "global.database.host"
    value = var.external_database_host
  }
  set {
    name  = "global.database.port"
    value = var.external_database_port
  }

  # Global logs settings:

  # - storage type config:

  set {
    name  = "global.state.storage.type"
    value = "S3"
  }
  set {
    name  = "global.logs.storage.type"
    value = "S3"
  }
  set {
    name  = "minio.enabled"
    value = false
  }
  set {
    name  = "global.logs.minio.enabled"
    value = false
  }

  # - access key config:

  # Some pods in the deployment use a password variable, and some read the value from a k8s secret, which is why we are forced to set both:
  set {
    name  = "global.logs.accessKey.password"
    value = var.aws_access_key_id
  }
  set {
    name  = "global.logs.accessKey.existingSecret"
    value = var.aws_secrets_name
  }
  set {
    name  = "global.logs.accessKey.existingSecretKey"
    value = "AWS_ACCESS_KEY_ID"
  }

  # - secret key config:

  # Some pods in the deployment use a password variable, and some read the value from a k8s secret, which is why we are forced to set both:
  set {
    name  = "global.logs.secretKey.password"
    value = var.aws_secret_access_key
  }
  set {
    name  = "global.logs.secretKey.existingSecret"
    value = var.aws_secrets_name
  }
  set {
    name  = "global.logs.secretKey.existingSecretKey"
    value = "AWS_SECRET_ACCESS_KEY"
  }

  # - bucket config:

  set {
    name  = "global.logs.s3.enabled"
    value = true
  }
  set {
    name  = "global.logs.s3.bucket"
    value = var.aws_bucket
  }
  set {
    name  = "global.logs.s3.bucketRegion"
    value = var.aws_region
  }

  # Temporal:

  set {
    name  = "temporal.env_vars.SQL_TLS"
    value = "true"
  }
  set {
    name  = "temporal.env_vars.SQL_TLS_DISABLE_HOST_VERIFICATION"
    value = "true"
  }
  set {
    name  = "temporal.env_vars.SQL_TLS_ENABLED"
    value = "true"
  }
  set {
    name  = "temporal.env_vars.SQL_TLS_ENABLE"
    value = "true"
  }
  set {
    name  = "temporal.env_vars.SSL"
    value = "true"
  }

  # External database settings:

  set {
    name  = "postgresql.enabled"
    value = false
  }

  set {
    name  = "externalDatabase.host"
    value = var.external_database_host
  }

  set {
    name  = "externalDatabase.user"
    value = var.external_database_user
  }

  set {
    name  = "externalDatabase.password"
    value = var.external_database_password
  }

  set {
    name  = "externalDatabase.existingSecret"
    value = var.postgres_secrets_name
  }

  set {
    name  = "externalDatabase.existingSecretPasswordKey"
    value = "password"
  }

  set {
    name  = "externalDatabase.database"
    value = var.external_database_database
  }

  set {
    name  = "externalDatabase.port"
    value = var.external_database_port
  }

  set {
    # When using SSL, it is mandatory to specify the URL parameters `?ssl=true&sslmode=require`!
    name  = "externalDatabase.jdbcUrl"
    value = "jdbc:postgresql://${var.external_database_host}:${var.external_database_port}/${var.external_database_database}?ssl=true&sslmode=require"
  }

  # Worker:

  set {
    name  = "worker.extraEnv[0].name"
    value = "STATE_STORAGE_S3_ACCESS_KEY"
  }
  set {
    name  = "worker.extraEnv[0].value"
    value = var.aws_access_key_id
  }

  set {
    name  = "worker.extraEnv[1].name"
    value = "STATE_STORAGE_S3_SECRET_ACCESS_KEY"
  }
  set {
    name  = "worker.extraEnv[1].value"
    value = var.aws_secret_access_key
  }

  set {
    name  = "worker.extraEnv[2].name"
    value = "STATE_STORAGE_S3_BUCKET_NAME"
  }
  set {
    name  = "worker.extraEnv[2].value"
    value = var.aws_bucket
  }

  set {
    name  = "worker.extraEnv[3].name"
    value = "STATE_STORAGE_S3_REGION"
  }
  set {
    name  = "worker.extraEnv[3].value"
    value = var.aws_secrets_name
  }

}

Please let me know if I could further support you with the investigation.

@davinchia (Contributor)

@DSamuylov what is your state storage bucket name variable set to?

@DSamuylov commented Dec 4, 2023

@davinchia, do you mean the environment variable STATE_STORAGE_S3_BUCKET_NAME from here?

  set {
    name  = "worker.extraEnv[2].name"
    value = "STATE_STORAGE_S3_BUCKET_NAME"
  }
  set {
    name  = "worker.extraEnv[2].value"
    value = var.aws_bucket
  }

The variable var.aws_bucket is set to my bucket name, something like: "my-project-name". Should I provide it in a different format?

@davinchia (Contributor)

@DSamuylov

Yes that was what I was referring to.

Follow up questions:

  1. Were you using the orchestrator before this? We turned on the orchestrator by default in 0.49.7. It allows jobs to survive between updates, and does so via an intermediate job state storage location. The charts default to the state_storage bucket hosted on minio.
  2. If you were, what kind of state storage configuration were you doing? e.g. was it S3? We support using S3 and GCS as the state storage layer.

@DSamuylov

@davinchia

Sorry for the delay in my reply; the last few days were extremely busy.

  1. Do you mean triggering jobs and successfully syncing data? In that case, yes; otherwise likely not. Does it require setting some variables in values.yaml? I only set the values as in my previous message; for example, I think I completely disable minio with this:
  set {
    name  = "minio.enabled"
    value = false
  }
  set {
    name  = "global.logs.minio.enabled"
    value = false
  }

So if some pods require access to it, they will fail.

  2. I use AWS S3, but maybe I am not customising some of the variables that I should? I sent you the full list of what I customise in the previous message. I would highly appreciate it if you could take a look and let me know whether I should define some additional variables; I will report the results back.
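One thing worth double-checking in the Terraform above: `worker.extraEnv[3]` sets `STATE_STORAGE_S3_REGION` to `var.aws_secrets_name`, which looks like it could be a copy-paste slip. The state storage region generally needs to be an actual AWS region, matching `global.logs.s3.bucketRegion`. A hedged sketch, reusing the variable names from that example:

```hcl
# Sketch only: STATE_STORAGE_S3_REGION should be an AWS region, not a secret name.
set {
  name  = "worker.extraEnv[3].name"
  value = "STATE_STORAGE_S3_REGION"
}
set {
  name  = "worker.extraEnv[3].value"
  value = var.aws_region # e.g. "eu-west-1", mirroring global.logs.s3.bucketRegion
}
```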

@sebaap commented Jan 10, 2024

We are also facing this issue after upgrading to 0.50.20; all jobs fail to start with the same error. Our instance is deployed on GKE using Helm, and minio is configured to use GCS.
We tried the workaround here, but it didn't work.
Everything worked fine after downgrading to 0.49.6.

@yorjaggy

We were facing the same issue. We updated to Airbyte 0.50.43 with chart 0.50.21 and the error came up again; this time the error log was more detailed:

Caused by: io.temporal.failure.ApplicationFailure: message='io.temporal.serviceclient.CheckedExceptionWrapper: io.airbyte.workers.exception.WorkerException: Running the launcher replication-orchestrator failed', type='java.lang.RuntimeException', nonRetryable=false
	at io.airbyte.commons.temporal.HeartbeatUtils.withBackgroundHeartbeat(HeartbeatUtils.java:62) ~[io.airbyte-airbyte-commons-temporal-core-0.50.43.jar:?]
	...
Caused by: io.temporal.failure.ApplicationFailure: message='Running the launcher replication-orchestrator failed', type='io.airbyte.workers.exception.WorkerException', nonRetryable=false
	at io.airbyte.workers.sync.LauncherWorker.run(LauncherWorker.java:262) ~[io.airbyte-airbyte-commons-worker-0.50.43.jar:?]
	...
Caused by: io.temporal.failure.ApplicationFailure: message='The specified bucket does not exist (Service: S3, Status Code: 404, Request ID: 17AD5DF2AB182273, Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8)', type='software.amazon.awssdk.services.s3.model.NoSuchBucketException', nonRetryable=false
	at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleErrorResponse(CombinedResponseHandler.java:125) ~[sdk-core-2.20.162.jar:?]
	...

We checked the Minio version and updated to the latest, and then executed the workaround mentioned by @marcosmarxm, and it worked.

Thanks @marcosmarxm

Hello all 👋 the team made some investigation and found a workaround

For now this is the step to fix the issue: kubectl exec -it airbyte-minio-0 bash -n <your-namespace>

After run the following commands:

mc alias set myminio http://localhost:9000 minio minio123
mc mb myminio/state-storage

This will create a missing bucket. The team is working to release a fix for future upgrades. Thanks for the patience.

@sg-danl commented Feb 28, 2024

I've been pinning version 0.49.6 to get around this for the past month and a half.
(Running Airbyte OSS on AWS EKS cluster, default values.yaml for ease of replication while trying to fix.)

The fix suggested by @marcosmarxm doesn't work for me. I have been attempting to upgrade from 0.49.6 to latest since mid-January (so 0.50.22+), and it has never fixed the issue.

Running the minio config in bash returns:


helm % kubectl exec -it airbyte-minio-0 bash -n default
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
bash-5.1# mc alias set myminio http://localhost:9000 minio minio123
mc: Configuration written to `/tmp/.mc/config.json`. Please update your access credentials.
mc: Successfully created `/tmp/.mc/share`.
mc: Initialized share uploads `/tmp/.mc/share/uploads.json` file.
mc: Initialized share downloads `/tmp/.mc/share/downloads.json` file.
Added `myminio` successfully.
bash-5.1# mc mb myminio/state-storage
mc: <ERROR> Unable to make bucket `myminio/state-storage`. Your previous request to create the named bucket succeeded and you already own it.

Not an expert in any of this at all, but it looks like the creation of the bucket isn't entirely the issue. Just wanted to provide additional info as this has been a long-open issue!

Edited to add:
Force-removing the bucket (on 0.54.15) shows it being forcefully recreated almost instantaneously:


bash-5.1# mc rb myminio/state-storage
mc: <ERROR> `myminio/state-storage` is not empty. Retry this command with ‘--force’ flag if you want to remove `myminio/state-storage` and all its contents 
bash-5.1# mc rb myminio/state-storage --force
Removed `myminio/state-storage` successfully.
bash-5.1# mc mb myminio/state-storage
mc: <ERROR> Unable to make bucket `myminio/state-storage`. Your previous request to create the named bucket succeeded and you already own it.
bash-5.1# mc rb myminio/state-storage --force
Removed `myminio/state-storage` successfully.
bash-5.1# mc rb myminio/state-storage --force
Removed `myminio/state-storage` successfully.
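If the bucket keeps reappearing after a force-remove, something is recreating it, so the more useful check may be whether the platform is looking at the bucket you think it is. A couple of read-only sanity checks, sketched under the assumption of the default pod name `airbyte-minio-0` in namespace `default`, a worker Deployment named `airbyte-worker`, and the `myminio` alias created by the workaround above:

```shell
# List the buckets minio actually has (uses the alias set up earlier).
kubectl exec airbyte-minio-0 -n default -- mc ls myminio

# Show which state-storage settings the worker container was started with.
kubectl exec deploy/airbyte-worker -n default -- env | grep -i STATE_STORAGE
```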

@marcosmarxm (Member)

Folks, everyone having this issue: please open a new issue and report which values and version you're using.
