Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make disk full errors more prominent #330

Merged
merged 5 commits into from
Feb 2, 2023

Conversation

spilchen
Copy link
Collaborator

This change makes disk full errors more prominent in the operator. The focus is on checking the disk space for the local PV. We handle this with a couple of strategies:

  • parsing the output from admintools. Sometimes, disk full errors will be seen in the output.
  • add a reconcile actor that will check the disk space available in the local PV and log a k8s event if its below a threshold (currently 10mb)

I did the second approach because there are some times when vertica may have trouble starting and the error doesn't surface in the admintools output.

@spilchen spilchen self-assigned this Jan 31, 2023
@spilchen spilchen requested a review from roypaulin January 31, 2023 20:15
Copy link
Collaborator

@roypaulin roypaulin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

return stdout, err
}
r.VRec.Eventf(r.Vdb, corev1.EventTypeNormal, events.NodeRestartSucceeded,
"Successfully called 'admintools -t restart_node' and it took %ds", int(elapsedTimeInSeconds))
return stdout, nil
}

// logATFailureEvent will log k8s events for the admintools restart error.
func (r *RestartReconciler) logATFailureEvent(atCmd, op string) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you going to use this for other admintools commands? Like if there is a need to have the same kind of function for another reconciler.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a bunch of parsing we do for create_db. I debated about throwing everything into a new package that they both use. Do you think I should do that cleanup? It will probably make things easier when we move away from admintools.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I do like that idea.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you going to do that on this PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'll include it in this PR. I'm working on this change now.

VRec *VerticaDBReconciler
Vdb *vapi.VerticaDB
PFacts *PodFacts
NumEvents int // Number of events written
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you use this for?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the unit test to prove that we recorded the correct number of events.

Copy link
Collaborator

@roypaulin roypaulin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

ctrl "sigs.k8s.io/controller-runtime"
)

type ATErrors struct {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a comment for this struct? Same for some functions below.

RestartNodesNotDownRequeueWaitTimeInSeconds = 10
)

// LogFailureError is called when admintools had attempted an option but
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function's name is LogFailure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. This was an early name of this function. I forget to change this.

@spilchen spilchen merged commit 0270376 into vertica:main Feb 2, 2023
@spilchen spilchen deleted the disk-full-handling-jan23 branch February 2, 2023 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants