Implement worker polling backoff for REST API client #370

Open
jwulf opened this issue Feb 4, 2025 · 3 comments

@jwulf
Member

jwulf commented Feb 4, 2025

A recent customer support issue highlighted the scenario of a worker with expired or misconfigured credentials hosing the gateway with unrelenting poll requests. See #366.

As a consequence, I implemented backoff on 16 UNAUTHENTICATED for the gRPC worker.

This now needs to be implemented for the REST API as well.

@jwulf
Member Author

jwulf commented Feb 4, 2025

OK, both this feature and the gRPC one in #366 need further work.

It works as expected when the auth strategy is set to NONE, but when passed an invalid secret, the backoff actually needs to take place in the token provider code.

There is backoff logic in there, but it doesn't have test coverage - yet.

@jwulf jwulf self-assigned this Feb 4, 2025
@jwulf
Member Author

jwulf commented Feb 4, 2025

OK, here is the problem:

The OAuth provider makes a debounced token request, and throws asynchronously when credentials are invalid. Currently, this doesn't propagate to the worker polling code that ultimately called the token request. As a result, the exception handler never activates the backoff, the success handler never receives a response, and the polling lock is never released, stalling the worker perpetually.

The token debounce was implemented to stop exactly the same scenario in a different part of the system: requests with misconfigured credentials should not be retried against the token endpoint immediately. Instead, subsequent requests should back off, to avoid a DoS of the token endpoint.
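
A minimal sketch of the propagation described above, not the SDK's actual code: `PollingWorker`, `getToken`, `activateJobs`, `pollLocked` and `failureCount` are all hypothetical names. It assumes the token request is awaited inside the poll, so that a rejection reaches the worker's error path and the lock is released whatever the outcome.

```typescript
// Hypothetical names throughout; an illustration, not the SDK implementation.
type Job = { key: string };

class PollingWorker {
  pollLocked = false; // true while a poll is in flight
  failureCount = 0; // would drive the worker-side backoff

  private getToken: () => Promise<string>;
  private activateJobs: (token: string) => Promise<Job[]>;

  constructor(
    getToken: () => Promise<string>,
    activateJobs: (token: string) => Promise<Job[]>,
  ) {
    this.getToken = getToken;
    this.activateJobs = activateJobs;
  }

  async pollOnce(): Promise<Job[]> {
    if (this.pollLocked) return []; // a poll is already running
    this.pollLocked = true;
    try {
      // Awaiting here means a rejected token request lands in the catch below.
      const token = await this.getToken();
      const jobs = await this.activateJobs(token);
      this.failureCount = 0; // success resets the backoff
      return jobs;
    } catch {
      this.failureCount++; // failure feeds the backoff calculation
      return [];
    } finally {
      this.pollLocked = false; // released on success and on failure
    }
  }
}
```

With this shape, a token provider that rejects leaves the worker unlocked with an incremented failure count, rather than stalled.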

The interaction between the two mechanisms needs to be worked out.

Valid credentials may be exchanged for a token, but the polling call may be denied due to not having a token that is valid for Zeebe.

Or the credentials may be invalid and no token returned.

So there are two endpoints that need to back off independently involved in a worker polling call.

@jwulf
Copy link
Member Author

jwulf commented Feb 5, 2025

OK, the two backoffs compound. The token endpoint backoff is linear: FailureCount * 1000.

The backoff on the worker poll is also linear, but with a steeper slope: 2 * 1000 * FailureCount, bounded by CAMUNDA_JOB_WORKER_MAX_BACKOFF_MS (defaults to 15000).

In the case where the error is due to not being able to get a token from the token endpoint, the backoff was unbounded.

The worker poll backoff is bounded by the setting of CAMUNDA_JOB_WORKER_MAX_BACKOFF_MS. If the failure of the worker poll is due to a propagated token endpoint error, then the increasing token endpoint backoff is added to the worker poll backoff.

I have bounded the token endpoint backoff to 15s.
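
The two formulas can be written out as a rough sketch. The function names are hypothetical, and using `Math.min` to apply each bound is an assumption about how "bounded" is implemented; the worker cap is passed in by the caller rather than read from CAMUNDA_JOB_WORKER_MAX_BACKOFF_MS.

```typescript
// Hypothetical helpers; a sketch of the formulas quoted in this thread.
const TOKEN_BACKOFF_CAP_MS = 15_000; // the 15s bound on the token endpoint backoff

// Linear: FailureCount * 1000, capped at 15s.
function tokenEndpointBackoffMs(failureCount: number): number {
  return Math.min(failureCount * 1000, TOKEN_BACKOFF_CAP_MS);
}

// Linear with a steeper slope: 2 * 1000 * FailureCount, capped by the caller's max.
function workerPollBackoffMs(failureCount: number, maxBackoffMs: number): number {
  return Math.min(2 * 1000 * failureCount, maxBackoffMs);
}

// Example cap only; the real value would come from CAMUNDA_JOB_WORKER_MAX_BACKOFF_MS.
const exampleCapMs = 15_000;
console.log(tokenEndpointBackoffMs(3)); // 3000
console.log(workerPollBackoffMs(3, exampleCapMs)); // 6000
```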

So in the case that a token cannot be secured, the following happens:

The OAuthTokenProvider throws a failure, and starts backing off. The worker catches this throw, throws a polling failure, and starts backing off.

The token provider will back off up to 15s, and the worker will back off by up to 15s (by default). So by default, in a failure state of invalid credentials, the combined backoff will be 30s.
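
As a worked illustration of how the combined delay grows to that 30s ceiling, here is a small sketch. It assumes both failure counters advance in lockstep and that both caps are 15s; `combinedBackoffMs` is a hypothetical name, and the real components may count failures differently.

```typescript
// Assumptions: one shared failure count, and a 15s cap on each backoff.
const CAP_MS = 15_000;

function combinedBackoffMs(failureCount: number): number {
  const token = Math.min(failureCount * 1000, CAP_MS); // token endpoint backoff
  const worker = Math.min(2 * 1000 * failureCount, CAP_MS); // worker poll backoff
  return token + worker; // the token backoff is added to the worker backoff
}

for (const n of [1, 5, 8, 15, 20]) {
  console.log(`${n} failures -> ${combinedBackoffMs(n)} ms`);
}
// 1 failures -> 3000 ms
// 5 failures -> 15000 ms
// 8 failures -> 23000 ms
// 15 failures -> 30000 ms
// 20 failures -> 30000 ms
```

Under these assumptions the worker backoff saturates first (at 8 failures), and the total only reaches 30s once the token backoff hits its own cap at 15 failures.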

I have added log warning messages in both the token provider and the worker to let the user know what is going on.
