Distinguish between "script error" and "system failure" #28

Closed
jnimis opened this issue Aug 20, 2024 · 1 comment · Fixed by #29

Comments

jnimis commented Aug 20, 2024

We use GitLab's retry:when rule with runner_system_failure to retry jobs that fail because of an infrastructure problem, and we also have metrics and alerts for system failures that depend on GitLab's classification. So far, with the anka-cloud-gitlab-executor, every failure (including system failures) returns ERROR: Job failed: exit status 1, which GitLab registers as a script error. With the anka-gitlab-runner, these errors were classified correctly in GitLab.

For example, a recent job ended with this output:

2024/08/19 20:22:18 instance 16210410-9555-42d7-42a8-3a6c6740be8c is in state "Scheduling"
2024/08/19 20:22:21 instance 16210410-9555-42d7-42a8-3a6c6740be8c is in state "Error"
2024/08/19 20:22:21 error: failed to wait for instance "16210410-9555-42d7-42a8-3a6c6740be8c" to be scheduled: instance 16210410-9555-42d7-42a8-3a6c6740be8c is in an unexpected state: Error
ERROR: Job failed: exit status 1

When we query the GitLab API for this job, we expect the field "failure_reason" to have the value "runner_system_failure", but it comes back as "script_failure" instead.
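To make that check concrete, here is a minimal Go sketch of how a monitoring script might classify a job from the API response. The `job` struct, the `isSystemFailure` helper, and the sample payload are illustrative assumptions, not taken from our tooling; only the "status" and "failure_reason" field names and the two failure values come from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// job holds the few fields of a job API response that this sketch needs.
// The struct itself is hypothetical; only the JSON field names matter.
type job struct {
	ID            int64  `json:"id"`
	Status        string `json:"status"`
	FailureReason string `json:"failure_reason"`
}

// isSystemFailure reports whether the payload describes a failed job that
// was classified as runner_system_failure.
func isSystemFailure(payload []byte) (bool, error) {
	var j job
	if err := json.Unmarshal(payload, &j); err != nil {
		return false, err
	}
	return j.Status == "failed" && j.FailureReason == "runner_system_failure", nil
}

func main() {
	// Made-up payload mirroring what we currently see for the job above.
	sample := []byte(`{"id": 1, "status": "failed", "failure_reason": "script_failure"}`)

	ok, err := isSystemFailure(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("system failure:", ok)
}
```

With the misclassified payload this reports no system failure, which is why our retry rule and alerts never fire.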

While researching the problem, we came across these docs on how to return exit codes from a custom executor so that GitLab reports the failure as a system failure.

NorseGaud (Member) commented

Fortunately it looks like our dev team added a bit of support for this:

	if err := command.Execute(ctx); err != nil {
		log.Printf("error: %s", err)
		if errors.Is(err, gitlab.ErrTransient) {
			return systemFailureExitCode
		}
		return buildFailureExitCode
	}

So for the "failed to wait for instance" error, we'd need to use gitlab.TransientError to get it reported as a system failure.

Seems fairly straightforward. I'll see if we can get a patch out soon.
