Emit failure resolve event #604
Conversation
This should improve the user experience when facing intermittent issues (e.g. during GitHub clone operations). Nice work @darkowlzz
LGTM
That screenshot is confusing: why is the error shown twice? Are we sending two events now instead of one?
Maybe it could be something like:
The error in the normal event is there to show what the failure was before, in case the recovery took place after some time, giving the user some context about what failed. But now I don't think it's needed. It would be simpler to just say something like:
Happy to have some better messaging.
This event is emitted in
I'm not sure we should inject the previous failure into the metadata. Also, do we send a second message after this with the revision? Does this change mean people will receive two info messages every time SC recovers? For example, right now SC errors out for each HelmChart at startup; I have 10 charts, so 10 error messages and 10 success messages at every upgrade. Will I receive 30 messages every time after this change?
Yes, it'll result in double notifications every time there's a failure.
Force-pushed from ce5f508 to b18a7d2
LGTM
Thanks @darkowlzz
Force-pushed from b18a7d2 to 5921dbf
Force-pushed from 5921dbf to 2a9ac47
When a source fetch failure happens, a warning event is emitted, but when the failure is resolved, no resolved event is emitted. This change introduces a new ResultProcessor, NotifySuccess, for the SummarizeAndPatch helper, which consolidates all the messages to be emitted into a single event. It checks for any recovery from a failure condition and emits an event about the recovery. When there is both a failure recovery and a new artifact, the success event message mentions only the new artifact; the recovery is implied. This avoids mixing event types, keeping the emitted events friendly to Kubernetes-native Events and their event counters.

The gitrepo reconciler is modified to use the new NotifySuccess ResultProcessor. A change in the artifact is checked in reconcile() and a Notification object is returned, populated with the success information. When the Notification is empty/zero and there is no failure recovery, no event is emitted. A failure recovery without any Notification results in an event carrying only the recovery information.

Introduce a new condition, "StorageOperationFailed", for use with any failures due to storage operations. With this, every failure is associated with some condition, which is used to detect a failure recovery and report it as a notification.

Signed-off-by: Sunny <darkowlzz@protonmail.com>
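For illustration, such a processor could look roughly like the sketch below. This is not the actual source-controller code: the `Notification` type, the `NotifySuccess` signature, the hard-coded condition name, and the event messages are all assumptions made for this example.

```go
package summarize

import (
	corev1 "k8s.io/api/core/v1"
	kuberecorder "k8s.io/client-go/tools/record"

	"github.com/fluxcd/pkg/apis/meta"
	"github.com/fluxcd/pkg/runtime/conditions"
)

// Notification carries the success message assembled in reconcile().
// A zero Notification with no failure recovery emits nothing.
type Notification struct {
	Reason  string
	Message string
}

// NotifySuccess consolidates the new-artifact and failure-recovery
// messages into a single Normal event (names are illustrative).
func NotifySuccess(recorder kuberecorder.EventRecorder, oldObj, newObj conditions.Getter, n Notification) {
	// A recovery means the previous status carried the failure
	// condition and the current status no longer does.
	recovered := conditions.Has(oldObj, "StorageOperationFailed") &&
		!conditions.Has(newObj, "StorageOperationFailed")

	switch {
	case n.Message != "":
		// A new artifact implies a successful reconciliation; the
		// recovery is implied, so only the artifact message is sent.
		recorder.Event(newObj, corev1.EventTypeNormal, n.Reason, n.Message)
	case recovered:
		recorder.Event(newObj, corev1.EventTypeNormal, meta.SucceededReason,
			"resolved storage operation failure")
	}
}
```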
Update bucket, helmrepo and helmchart reconcilers to use NotifySuccess ResultProcessor and StorageOperationFailedCondition. Signed-off-by: Sunny <darkowlzz@protonmail.com>
Since the old version of the object is now kept in a separate variable, the old object can be passed to the summarize helper to create the patch helper. Signed-off-by: Sunny <darkowlzz@protonmail.com>
Replace `StorageOperationFailedReason` with granular reasons for the failure. Update docs with `StorageOperationFailedCondition`. Signed-off-by: Sunny <darkowlzz@protonmail.com>
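As a hedged sketch, the API constants might be declared along these lines; the specific granular reason names below are assumptions for illustration, not necessarily the ones this commit introduces:

```go
package v1beta1

const (
	// StorageOperationFailedCondition indicates a failure in a storage
	// operation, e.g. while creating a directory, archiving an artifact
	// or updating a symlink.
	StorageOperationFailedCondition string = "StorageOperationFailed"

	// Granular reasons attached to StorageOperationFailedCondition,
	// replacing the single StorageOperationFailedReason (illustrative
	// names only).
	DirCreationFailedReason      string = "DirectoryCreationFailed"
	ArchiveOperationFailedReason string = "ArchiveOperationFailed"
	SymlinkUpdateFailedReason    string = "SymlinkUpdateFailed"
)
```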
Force-pushed from f13b1a1 to 73d8dff
Summarizing how things eventually changed...
@stefanprodan please have a look at this and see if you agree with it from a UX perspective. I directed the changes this way because I felt that "polluting" the event data with conditional strings would not be beneficial to native Kubernetes Events; it causes issues, for example, with keeping counts of repetitive messages. By emitting the same event again after "recovering" from a failure, we still produce a signal for the user while allowing Kubernetes to keep count.
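To illustrate the counting point: client-go's event correlator deduplicates events that are identical across object, type, reason and message, incrementing the existing Event's count instead of creating a new object. The function name and message string below are made up for this sketch.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	kuberecorder "k8s.io/client-go/tools/record"
)

// emitTwice shows why stable messages are counter-friendly: repeating
// an identical event bumps the recorded Event's count, while embedding
// variable detail (such as the previous error) in the message would
// produce a distinct Event on every occurrence.
func emitTwice(recorder kuberecorder.EventRecorder, obj runtime.Object) {
	msg := "stored artifact for revision 'main/<digest>'" // made-up message
	recorder.Event(obj, corev1.EventTypeNormal, "Succeeded", msg) // count: 1
	recorder.Event(obj, corev1.EventTypeNormal, "Succeeded", msg) // deduplicated, count: 2
}
```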
Putting this on hold for now. I would like to implement it in another way. |
#624 is ready: a new, simpler implementation. Closing this.
When a source reconciliation failure happens, a warning event is emitted, but when the failure is resolved, no resolved event is emitted, leaving the user uninformed about the status of the operation.

This change checks the object status for failure conditions and, on a successful reconciliation, emits a normal event to notify that the failure is resolved. If a failure is resolved along with a new artifact, a "stored artifact" event is emitted and the failure-resolved message is suppressed; a new artifact event implies that the reconciliation was successful.
When using notification-controller, this appears as: [screenshot omitted]
Console logs: [output omitted]
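A minimal sketch of that check, assuming a helper that compares the object's Ready condition before and after reconciliation; the helper name and the event message are assumptions, not the actual implementation:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	kuberecorder "k8s.io/client-go/tools/record"

	"github.com/fluxcd/pkg/apis/meta"
	"github.com/fluxcd/pkg/runtime/conditions"
)

// notifyFailureResolved emits a Normal event when a previously failing
// object becomes Ready again. If a new artifact was stored in the same
// reconciliation, the caller would emit the "stored artifact" event
// instead, since a new artifact implies success.
func notifyFailureResolved(recorder kuberecorder.EventRecorder, oldObj, newObj conditions.Getter) {
	if !conditions.IsReady(oldObj) && conditions.IsReady(newObj) {
		recorder.Event(newObj, corev1.EventTypeNormal, meta.SucceededReason,
			"reconciliation failure resolved")
	}
}
```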