Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Temporal client method to restart failed workflow #15051

Merged
merged 14 commits into from
Aug 11, 2022

Conversation

benmoriceau
Copy link
Contributor

@benmoriceau benmoriceau commented Jul 26, 2022

What

Described in #14970

Add a method in TemporalClient that will allow to restart all the failed temporal workflow.

It also make ConnectionManagerUtils to be injectable (i.e. remove the statics).

@github-actions github-actions bot added area/platform issues related to the platform area/worker Related to worker labels Jul 26, 2022
@benmoriceau benmoriceau temporarily deployed to more-secrets July 26, 2022 23:16 Inactive
Copy link
Contributor

@evantahler evantahler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not personally too worried about the static methods, but I am worried about a very large set of failing jobs that need to be restarted.

@@ -446,6 +451,79 @@ public ManualOperationResult synchronousResetConnection(final UUID connectionId,
Optional.of(resetJobId), Optional.empty());
}

public void restartWorkflowByStatus(final WorkflowExecutionStatus executionStatus) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on executionStatus being an argument

UUID.fromString(stringUUID));
}

Set<String> fetchWorkflowByStatus(final WorkflowExecutionStatus executionStatus) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Set<String> fetchWorkflowByStatus(final WorkflowExecutionStatus executionStatus) {
Set<String> fetchWorkflowsByStatus(final WorkflowExecutionStatus executionStatus) {

This is a plural method, that returns a set

Comment on lines 455 to 457
final Set<String> workflowExecutionInfos = fetchWorkflowByStatus(executionStatus);

final Set<String> nonRunningWorkflow = filterOutRunningWorkspaceId(workflowExecutionInfos);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How large can these sets be? Are we worried about running out of memory? Should we be iterating in batches?

Maybe the method signature could be fetchWorkflowByStatus(executionStatus, limit=100) and we keep calling it until we get 0 results. In the method below, you are already paging though results with nextPagetoken - perhaps we don't need to do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The limiting number here is number of connection ID. I don't expect it to be a concern in the short term.

In order to limit the memory consumption, I will change this to use UUID instead of String.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still concerned about this method returning quite a lot of data once we have 1000s of workspaces and connections, but I agree for now this is probably OK.

+ "unreachable state before starting a new workflow for this connection");
ConnectionManagerUtils.startConnectionManagerNoSignal(client, maybeConnectionId.get());
} else {
log.error("Malformated workflow id, won't terminate of restart");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gramatical Nit:

Suggested change
log.error("Malformated workflow id, won't terminate of restart");
log.error("Malformated workflow id, won't terminate or restart");

Comment on lines 481 to 483
System.err.println(workflowId);
System.err.println(StringUtils.removeStart(workflowId, "connection_manager_"));
System.err.flush();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use log() elsewhere... can the logger not accept a log-level to print to stdout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the miss-understanding. This is a draft and is not meant to be merge, so it contains debug print. This will be clean up. The main thing to look at is the test and the static way the mock are done.

@benmoriceau benmoriceau temporarily deployed to more-secrets August 1, 2022 21:08 Inactive
@github-actions github-actions bot added the area/connectors Connector related issues label Aug 2, 2022
@benmoriceau benmoriceau temporarily deployed to more-secrets August 2, 2022 21:37 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets August 4, 2022 17:44 Inactive
@github-actions github-actions bot removed the area/connectors Connector related issues label Aug 4, 2022
@benmoriceau benmoriceau temporarily deployed to more-secrets August 4, 2022 17:48 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets August 4, 2022 20:11 Inactive
@benmoriceau benmoriceau marked this pull request as ready for review August 4, 2022 20:14
@benmoriceau benmoriceau temporarily deployed to more-secrets August 4, 2022 20:16 Inactive
@benmoriceau benmoriceau changed the title Implementation and test Temporal client method to restart failed workflow Aug 4, 2022
@benmoriceau benmoriceau temporarily deployed to more-secrets August 4, 2022 20:45 Inactive
Copy link
Contributor

@evantahler evantahler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Looks like you sorted out the instance/static method problem too!

Copy link
Contributor

@gosusnp gosusnp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When trying to understand the scope of the ConnectionManagerUtils, it feels like we should look a bit closer at the scope of the ConnectorManagerWorkflow and the WorkflowClient. It is not super clear what's the responsibility of each object, this will be helpful in order to get a cleaner dependency injection.

@benmoriceau benmoriceau temporarily deployed to more-secrets August 11, 2022 17:03 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets August 11, 2022 17:24 Inactive
@benmoriceau benmoriceau merged commit 246df1f into master Aug 11, 2022
@benmoriceau benmoriceau deleted the bmoric/temporal-client-restart-by-status branch August 11, 2022 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/platform issues related to the platform area/worker Related to worker connectors/destination/mqtt
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants