Distinguish between "script error" and "system failure" #28

jnimis · 2024-08-20T22:26:42Z

We use GitLab's retry:when rule with runner_system_failure to retry jobs that fail because of an infrastructure problem, and we also have metrics and alerts for system failures that depend on GitLab's classification. So far, with the anka-cloud-gitlab-executor, all failures (including system failures) return "ERROR: Job failed: exit status 1", which registers in GitLab as a script error. When we used the anka-gitlab-runner instead, these errors were classified correctly in GitLab.

For example, a recent job ended with this output:

2024/08/19 20:22:18 instance 16210410-9555-42d7-42a8-3a6c6740be8c is in state "Scheduling"
2024/08/19 20:22:21 instance 16210410-9555-42d7-42a8-3a6c6740be8c is in state "Error"
2024/08/19 20:22:21 error: failed to wait for instance "16210410-9555-42d7-42a8-3a6c6740be8c" to be scheduled: instance 16210410-9555-42d7-42a8-3a6c6740be8c is in an unexpected state: Error
ERROR: Job failed: exit status 1

When we query the GitLab API for this job, we expect the field "failure_reason" to have the value "runner_system_failure", but it comes back as "script_failure" instead.
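For anyone wanting to reproduce the check, here is a minimal Go sketch of reading "failure_reason" from a job payload. It assumes the JSON comes from the GitLab Jobs API (GET /projects/:id/jobs/:job_id); the sample payload, the job struct, and the helper names failureReason and isSystemFailure are hypothetical and trimmed for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// job holds only the fields this check needs; the real API response
// contains many more.
type job struct {
	ID            int    `json:"id"`
	Status        string `json:"status"`
	FailureReason string `json:"failure_reason"`
}

// failureReason extracts the failure_reason field from a job payload.
func failureReason(body []byte) (string, error) {
	var j job
	if err := json.Unmarshal(body, &j); err != nil {
		return "", fmt.Errorf("decode job: %w", err)
	}
	return j.FailureReason, nil
}

// isSystemFailure reports whether GitLab classified the job as a
// runner system failure, which is what retry:when and our alerts key on.
func isSystemFailure(reason string) bool {
	return reason == "runner_system_failure"
}

func main() {
	// Hypothetical, trimmed payload resembling what we currently get back.
	sample := []byte(`{"id": 123, "status": "failed", "failure_reason": "script_failure"}`)

	reason, err := failureReason(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("failure_reason=%s system_failure=%t\n", reason, isSystemFailure(reason))
}
```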

While researching the problem, we came across these docs about how to send exit codes from a custom executor to GitLab so that the failure appears as a system failure.
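As far as we understand those docs, GitLab Runner passes the expected exit codes to a custom executor through the SYSTEM_FAILURE_EXIT_CODE and BUILD_FAILURE_EXIT_CODE environment variables. Below is a rough Go sketch of reading them. The helper exitCodeFromEnv and the fallback values 2 and 1 are our own assumptions for illustration, not something taken from the executor's source.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// exitCodeFromEnv reads an exit code from the named environment variable
// and falls back to the given default when the variable is unset or is
// not a number. getenv is injected so the lookup is easy to test.
func exitCodeFromEnv(getenv func(string) string, name string, fallback int) int {
	code, err := strconv.Atoi(getenv(name))
	if err != nil {
		return fallback
	}
	return code
}

func main() {
	// The fallbacks (2 and 1) are assumptions for running outside GitLab Runner.
	systemFailure := exitCodeFromEnv(os.Getenv, "SYSTEM_FAILURE_EXIT_CODE", 2)
	buildFailure := exitCodeFromEnv(os.Getenv, "BUILD_FAILURE_EXIT_CODE", 1)

	// A real executor would call os.Exit with one of these values,
	// depending on whether the failure was infrastructure related.
	fmt.Printf("system failure exit code: %d\n", systemFailure)
	fmt.Printf("build failure exit code: %d\n", buildFailure)
}
```

Exiting with the system failure code is what should make GitLab record "runner_system_failure" rather than "script_failure".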

NorseGaud · 2024-08-20T23:01:14Z

Fortunately it looks like our dev team added a bit of support for this:

	if err := command.Execute(ctx); err != nil {
		log.Printf("error: %s", err)
		if errors.Is(err, gitlab.ErrTransient) {
			return systemFailureExitCode
		}
		return buildFailureExitCode
	}

So for the "failed to wait for instance" error, we'd need to wrap it with gitlab.TransientError so that it is reported as a system failure.

Seems fairly straightforward. I'll see if we can get a patch out soon.
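To illustrate the idea, here is a self-contained sketch of how marking an error as transient changes the exit code chosen by a check like the one above. Everything in it is a hypothetical stand-in: errTransient and markTransient mimic gitlab.ErrTransient and gitlab.TransientError, whose real definitions may differ, and the exit code values 2 and 1 are assumed.

```go
package main

import (
	"errors"
	"fmt"
)

// Assumed values; the real executor may take them from the environment.
const (
	systemFailureExitCode = 2
	buildFailureExitCode  = 1
)

// errTransient is a stand-in sentinel for gitlab.ErrTransient.
var errTransient = errors.New("transient error")

// errUnexpectedState is a sample cause, modeled on the log output above.
var errUnexpectedState = errors.New("instance is in an unexpected state: Error")

// transientErr wraps a cause and matches errTransient via errors.Is.
type transientErr struct{ cause error }

func (e transientErr) Error() string        { return e.cause.Error() }
func (e transientErr) Unwrap() error        { return e.cause }
func (e transientErr) Is(target error) bool { return target == errTransient }

// markTransient is a stand-in for gitlab.TransientError.
func markTransient(cause error) error { return transientErr{cause: cause} }

// exitCodeFor mirrors the check in the snippet above.
func exitCodeFor(err error) int {
	if err == nil {
		return 0
	}
	if errors.Is(err, errTransient) {
		return systemFailureExitCode
	}
	return buildFailureExitCode
}

func main() {
	plain := fmt.Errorf("failed to wait for instance: %w", errUnexpectedState)
	marked := fmt.Errorf("failed to wait for instance: %w", markTransient(errUnexpectedState))

	fmt.Println("unmarked error exit code:", exitCodeFor(plain))
	fmt.Println("transient error exit code:", exitCodeFor(marked))
}
```

Because the wrapper implements Unwrap, errors.Is still finds the original cause, so callers that check for the underlying error keep working.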

NorseGaud added a commit that referenced this issue Aug 20, 2024

https://github.com/veertuinc/anka-cloud-gitlab-executor/issues/28

695de7c

NorseGaud mentioned this issue Aug 20, 2024

https://github.com/veertuinc/anka-cloud-gitlab-executor/issues/28 #29

Merged

NorseGaud added a commit that referenced this issue Aug 21, 2024

Merge pull request #29 from veertuinc/release/v1.2.2

20a527a

NorseGaud closed this as completed in #29 Aug 21, 2024

tejassharma96 mentioned this issue Oct 18, 2024

Mark more errors as runner system failures #32

Merged

14 tasks
