
Helm Chart: Running the launcher replication-orchestrator failed after upgrade #32203

Closed
joeybenamy opened this issue Nov 6, 2023 · 20 comments
Labels: area/platform, community, team/platform-move, type/bug

Comments

@joeybenamy commented Nov 6, 2023

What method are you using to run Airbyte?

Kubernetes

Platform Version or Helm Chart Version

All Helm chart versions later than 0.49.6

What step the error happened?

Upgrading the Platform or Helm Chart

Relevant information

Several of us in Slack have reported a variety of issues on Helm charts newer than 0.49.6. For each person, downgrading to 0.49.6 resolved the issues and Airbyte is stable; on versions later than 0.49.6, Airbyte is not stable, and connectors, tests, etc. fail with various errors. Slack thread: https://airbytehq.slack.com/archives/C021JANJ6TY/p1698930804469959

Relevant log output

message='io.temporal.serviceclient.CheckedExceptionWrapper: io.airbyte.workers.exception.WorkerException: Running the launcher replication-orchestrator failed', type='java.lang.RuntimeException', nonRetryable=false
@tanderson-hp

Confirmed that I experienced the same, and downgrading to 0.49.6 resolved my issue.

@tobiastroelsen

Same as above! The downgrade worked as a fix for me as well.

@cappadona

Confirming that we were unable to deploy Airbyte when upgrading beyond 0.49.6.

Also came across this other issue which may be related:

@GKTheOne commented Nov 8, 2023

still bad: 0.49.21, 0.49.19

@heruscode

Same here, I can't upgrade to the latest version

@PurseChicken

Same issue here. Had to roll the chart back to 0.49.6.

@storytel-siudzinskim

2 weeks ago I also reported a bug here: #32544

It looks like the issue came with Airbyte 0.50.33 and is present in 0.50.34.

@marcosmarxm (Member)

Hello all 👋 the team did some investigation and found a workaround.

For now, this is the step to fix the issue:
kubectl exec -it airbyte-minio-0 bash -n <your-namespace>

Then run the following commands:

mc alias set myminio http://localhost:9000 minio minio123
mc mb myminio/state-storage

This creates the missing bucket. The team is working to release a fix for future upgrades. Thanks for your patience.
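If it is more convenient, the same workaround can be run as a single non-interactive command, which also avoids the deprecated `kubectl exec [POD] [COMMAND]` form. This is a sketch: the pod name `airbyte-minio-0` and the `minio`/`minio123` credentials are the chart defaults, and `--ignore-existing` makes `mc mb` safe to re-run if the bucket already exists:

```shell
# Replace with your actual namespace before running.
NAMESPACE=your-airbyte-namespace

# Set up the mc alias and create the state-storage bucket in one shot.
kubectl exec airbyte-minio-0 -n "$NAMESPACE" -- \
  sh -c 'mc alias set myminio http://localhost:9000 minio minio123 && mc mb --ignore-existing myminio/state-storage'
```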

@marcosmarxm changed the title from "Recent Helm chart versions buggy" to "Helm Chart: Running the launcher replication-orchestrator failed after upgrade" on Nov 29, 2023
octavia-squidington-iii pushed a commit to airbytehq/airbyte-platform that referenced this issue Nov 30, 2023
This closes airbytehq/airbyte#32203.

#9469 turned on the orchestrator by default for OSS Kube deployments. After this, OSS Kube jobs would fail whenever Airbyte was redeployed.

When we turned this on, it did not occur to me to test the upgrade path. Our Helm charts recreate the airbyte-db and minio pods each time, which wipes the state bucket, so jobs are unable to run after an upgrade.

This PR cleans up the airbyte-db and airbyte-minio behaviour, with the side effect of fixing this bug.

- Instead of recreating the airbyte db and the minio pod each time, we only create these critical resources on install. Once Airbyte is running, there is no situation where recreating these resources on upgrade is needed. In fact, this is harmful, since all jobs running at that time will fail. It also slows down the upgrade, since these resources are required before the actual Airbyte application can start up.
- Pin minio to a specific version instead of always pulling the latest. Although we haven't yet seen bugs caused by minio version drift, pinning to a specific version provides more stability.
- Do the same for kubectl.
@davinchia (Contributor) commented Dec 1, 2023

Hi guys, we figured out what was happening:

  • previously the minio and database pods were reconfigured to always recreate on every helm deployment, be it an install or an upgrade. However, we were not correctly creating the bucket on an upgrade. This resulted in syncs failing.
  • this behaviour has been fixed as of helm version 0.50.3 - upgrading to 0.50.3 should no longer present this issue.
  • the minio and database pods are now only created on install (not upgrade), and we always attempt to create the required default buckets, regardless of install/upgrade. This should result in an overall more stable Airbyte experience.
  • please give it a shot and post any feedback here.

Thank you for your patience!

@DSamuylov

Hi guys, thank you so much for your effort! Unfortunately, I just tried to deploy the newest version, but I still have the same issue. After starting any new job (deployment with chart version 0.50.3), I still get:

message='io.temporal.serviceclient.CheckedExceptionWrapper: io.airbyte.workers.exception.WorkerException: Running the launcher replication-orchestrator failed', type='java.lang.RuntimeException', nonRetryable=false

I just rolled back to version 0.49.6, and everything works as expected.

@davinchia (Contributor) commented Dec 2, 2023

@DSamuylov interesting! I just tested upgrading from 0.49.6 to 0.50.3 and was able to run a job before and after. Can you show me how you are deploying 0.50.3?

@DSamuylov

@davinchia, yes, sure. I deploy with Terraform; here is the file defining the full configuration:

resource "helm_release" "airbyte" {
  name       = "airbyte"
  repository = "https://airbytehq.github.io/helm-charts"
  chart      = "airbyte"
  version    = var.chart_version
  namespace  = var.k8s_namespace

  # Global environment variables:

  set {
    name  = "global.env_vars.DATABASE_URL"
    value = "jdbc:postgresql://${var.external_database_host}:${var.external_database_port}/${var.external_database_database}?ssl=true&sslmode=require"
  }

  # Global database settings:

  set {
    name  = "global.database.secretName"
    value = var.postgres_secrets_name
  }
  set {
    name  = "global.database.secretValue"
    value = "password"
  }
  set {
    name  = "global.database.host"
    value = var.external_database_host
  }
  set {
    name  = "global.database.port"
    value = var.external_database_port
  }

  # Global logs settings:

  # - storage type config:

  set {
    name  = "global.state.storage.type"
    value = "S3"
  }
  set {
    name  = "global.logs.storage.type"
    value = "S3"
  }
  set {
    name  = "minio.enabled"
    value = false
  }
  set {
    name  = "global.logs.minio.enabled"
    value = false
  }

  # - access key config:

  # Some pods in the deployment use a password variable, and some read the value from a k8s secret, which is why we are forced to set both:
  set {
    name  = "global.logs.accessKey.password"
    value = var.aws_access_key_id
  }
  set {
    name  = "global.logs.accessKey.existingSecret"
    value = var.aws_secrets_name
  }
  set {
    name  = "global.logs.accessKey.existingSecretKey"
    value = "AWS_ACCESS_KEY_ID"
  }

  # - secret key config:

  # Some pods in the deployment use a password variable, and some read the value from a k8s secret, which is why we are forced to set both:
  set {
    name  = "global.logs.secretKey.password"
    value = var.aws_secret_access_key
  }
  set {
    name  = "global.logs.secretKey.existingSecret"
    value = var.aws_secrets_name
  }
  set {
    name  = "global.logs.secretKey.existingSecretKey"
    value = "AWS_SECRET_ACCESS_KEY"
  }

  # - bucket config:

  set {
    name  = "global.logs.s3.enabled"
    value = true
  }
  set {
    name  = "global.logs.s3.bucket"
    value = var.aws_bucket
  }
  set {
    name  = "global.logs.s3.bucketRegion"
    value = var.aws_region
  }

  # Temporal:

  set {
    name  = "temporal.env_vars.SQL_TLS"
    value = "true"
  }
  set {
    name  = "temporal.env_vars.SQL_TLS_DISABLE_HOST_VERIFICATION"
    value = "true"
  }
  set {
    name  = "temporal.env_vars.SQL_TLS_ENABLED"
    value = "true"
  }
  set {
    name  = "temporal.env_vars.SQL_TLS_ENABLE"
    value = "true"
  }
  set {
    name  = "temporal.env_vars.SSL"
    value = "true"
  }

  # External database settings:

  set {
    name  = "postgresql.enabled"
    value = false
  }

  set {
    name  = "externalDatabase.host"
    value = var.external_database_host
  }

  set {
    name  = "externalDatabase.user"
    value = var.external_database_user
  }

  set {
    name  = "externalDatabase.password"
    value = var.external_database_password
  }

  set {
    name  = "externalDatabase.existingSecret"
    value = var.postgres_secrets_name
  }

  set {
    name  = "externalDatabase.existingSecretPasswordKey"
    value = "password"
  }

  set {
    name  = "externalDatabase.database"
    value = var.external_database_database
  }

  set {
    name  = "externalDatabase.port"
    value = var.external_database_port
  }

  set {
    # When using SSL, it is mandatory to specify the URL parameters `?ssl=true&sslmode=require`!
    name  = "externalDatabase.jdbcUrl"
    value = "jdbc:postgresql://${var.external_database_host}:${var.external_database_port}/${var.external_database_database}?ssl=true&sslmode=require"
  }

  # Worker:

  set {
    name  = "worker.extraEnv[0].name"
    value = "STATE_STORAGE_S3_ACCESS_KEY"
  }
  set {
    name  = "worker.extraEnv[0].value"
    value = var.aws_access_key_id
  }

  set {
    name  = "worker.extraEnv[1].name"
    value = "STATE_STORAGE_S3_SECRET_ACCESS_KEY"
  }
  set {
    name  = "worker.extraEnv[1].value"
    value = var.aws_secret_access_key
  }

  set {
    name  = "worker.extraEnv[2].name"
    value = "STATE_STORAGE_S3_BUCKET_NAME"
  }
  set {
    name  = "worker.extraEnv[2].value"
    value = var.aws_bucket
  }

  set {
    name  = "worker.extraEnv[3].name"
    value = "STATE_STORAGE_S3_REGION"
  }
  set {
    name  = "worker.extraEnv[3].value"
    value = var.aws_secrets_name
  }

}

Please let me know if I could further support you with the investigation.

@davinchia (Contributor)

@DSamuylov what is your state storage bucket name variable set to?

@DSamuylov commented Dec 4, 2023

@davinchia, do you mean the environment variable STATE_STORAGE_S3_BUCKET_NAME from here?

  set {
    name  = "worker.extraEnv[2].name"
    value = "STATE_STORAGE_S3_BUCKET_NAME"
  }
  set {
    name  = "worker.extraEnv[2].value"
    value = var.aws_bucket
  }

The variable var.aws_bucket is set to my bucket name, something like: "my-project-name". Should I provide it in a different format?

@davinchia (Contributor)

@DSamuylov

Yes that was what I was referring to.

Follow up questions:

  1. Were you using the orchestrator before this? We turned on the orchestrator by default in 0.49.7. It allows jobs to survive between updates, and does so via an intermediate job state storage location. The charts default to the state_storage bucket hosted on minio.
  2. If you were, what kind of state storage configuration were you doing? e.g. was it S3? We support using S3 and GCS as the state storage layer.

@DSamuylov

@davinchia

Sorry for the delay in my reply; the last few days were extremely busy.

  1. Do you mean triggering jobs and successfully syncing data? In that case, yes; otherwise likely not. Does it require setting some variables in values.yaml? I only set the values as in my previous message; for example, I think I completely disable minio with this:
  set {
    name  = "minio.enabled"
    value = false
  }
  set {
    name  = "global.logs.minio.enabled"
    value = false
  }

So if some pods require access to it, they will fail.

  2. I use AWS S3, but maybe I am not customising some of the variables that I should? I sent you the full list of what I customise in the previous message. I would highly appreciate it if you could take a look and let me know whether I should define some additional variables; I will report the results back.
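One thing worth double-checking in the Terraform above: `worker.extraEnv[3]` sets `STATE_STORAGE_S3_REGION` to `var.aws_secrets_name`, which looks like it could be a copy-paste slip. The state storage region generally needs to be an actual AWS region, matching `global.logs.s3.bucketRegion`. A hedged sketch, reusing the variable names from that example:

```hcl
# Sketch only: STATE_STORAGE_S3_REGION should be an AWS region, not a secret name.
set {
  name  = "worker.extraEnv[3].name"
  value = "STATE_STORAGE_S3_REGION"
}
set {
  name  = "worker.extraEnv[3].value"
  value = var.aws_region # e.g. "eu-west-1", mirroring global.logs.s3.bucketRegion
}
```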

@sebaap commented Jan 10, 2024

We are also facing this issue after upgrading to 0.50.20; all jobs fail to start with the same error. Our instance is deployed on GKE using Helm, and minio is configured to use GCS.
We tried the workaround here, but it didn't work.
Everything worked fine after downgrading to 0.49.6.

@yorjaggy

We were facing the same issue. We updated to Airbyte 0.50.43 with chart 0.50.21 and the error came up again; this time the error log was more detailed:

Caused by: io.temporal.failure.ApplicationFailure: message='io.temporal.serviceclient.CheckedExceptionWrapper: io.airbyte.workers.exception.WorkerException: Running the launcher replication-orchestrator failed', type='java.lang.RuntimeException', nonRetryable=false
	at io.airbyte.commons.temporal.HeartbeatUtils.withBackgroundHeartbeat(HeartbeatUtils.java:62) ~[io.airbyte-airbyte-commons-temporal-core-0.50.43.jar:?]
	...
Caused by: io.temporal.failure.ApplicationFailure: message='Running the launcher replication-orchestrator failed', type='io.airbyte.workers.exception.WorkerException', nonRetryable=false
	at io.airbyte.workers.sync.LauncherWorker.run(LauncherWorker.java:262) ~[io.airbyte-airbyte-commons-worker-0.50.43.jar:?]
	...
Caused by: io.temporal.failure.ApplicationFailure: message='The specified bucket does not exist (Service: S3, Status Code: 404, Request ID: 17AD5DF2AB182273, Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8)', type='software.amazon.awssdk.services.s3.model.NoSuchBucketException', nonRetryable=false
	at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleErrorResponse(CombinedResponseHandler.java:125) ~[sdk-core-2.20.162.jar:?]
	...

We checked the Minio version and updated to the latest, and then executed the workaround mentioned by @marcosmarxm, and it worked.

Thanks @marcosmarxm

Hello all 👋 the team made some investigation and found a workaround

For now this is the step to fix the issue: kubectl exec -it airbyte-minio-0 bash -n <your-namespace>

After run the following commands:

mc alias set myminio http://localhost:9000 minio minio123
mc mb myminio/state-storage

This will create a missing bucket. The team is working to release a fix for future upgrades. Thanks for the patience.

@sg-danl commented Feb 28, 2024

I've been pinning version 0.49.6 to get around this for the past month and a half.
(Running Airbyte OSS on AWS EKS cluster, default values.yaml for ease of replication while trying to fix.)

The fix suggested by @marcosmarxm doesn't work for me. I have been attempting to upgrade from 0.49.6 to latest since mid-January (so 0.50.22+), and it has never fixed the issue.

Running the minio config in bash returns:


helm % kubectl exec -it airbyte-minio-0 bash -n default
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
bash-5.1# mc alias set myminio http://localhost:9000 minio minio123
mc: Configuration written to `/tmp/.mc/config.json`. Please update your access credentials.
mc: Successfully created `/tmp/.mc/share`.
mc: Initialized share uploads `/tmp/.mc/share/uploads.json` file.
mc: Initialized share downloads `/tmp/.mc/share/downloads.json` file.
Added `myminio` successfully.
bash-5.1# mc mb myminio/state-storage
mc: <ERROR> Unable to make bucket `myminio/state-storage`. Your previous request to create the named bucket succeeded and you already own it.

Not an expert in any of this at all, but it looks like the creation of the bucket isn't entirely the issue. Just wanted to provide additional info as this has been a long-open issue!

Edited to add:
Force-removing the bucket (on 0.54.15) shows it being forcefully recreated almost instantaneously:


bash-5.1# mc rb myminio/state-storage
mc: <ERROR> `myminio/state-storage` is not empty. Retry this command with ‘--force’ flag if you want to remove `myminio/state-storage` and all its contents 
bash-5.1# mc rb myminio/state-storage --force
Removed `myminio/state-storage` successfully.
bash-5.1# mc mb myminio/state-storage
mc: <ERROR> Unable to make bucket `myminio/state-storage`. Your previous request to create the named bucket succeeded and you already own it.
bash-5.1# mc rb myminio/state-storage --force
Removed `myminio/state-storage` successfully.
bash-5.1# mc rb myminio/state-storage --force
Removed `myminio/state-storage` successfully.
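If the bucket keeps reappearing after a force-remove, something is recreating it, so the more useful check may be whether the platform is looking at the bucket you think it is. A couple of read-only sanity checks, sketched under the assumption of the default pod name `airbyte-minio-0` in namespace `default`, a worker Deployment named `airbyte-worker`, and the `myminio` alias created by the workaround above:

```shell
# List the buckets minio actually has (uses the alias set up earlier).
kubectl exec airbyte-minio-0 -n default -- mc ls myminio

# Show which state-storage settings the worker container was started with.
kubectl exec deploy/airbyte-worker -n default -- env | grep -i STATE_STORAGE
```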

@marcosmarxm (Member)

Folks, everyone having this issue: please open a new issue and report which values and version you're using.
