Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the stability of the operator in big clusters #283

Merged
merged 10 commits into from
Nov 9, 2022

Conversation

spilchen
Copy link
Collaborator

@spilchen spilchen commented Nov 7, 2022

This change improves the stability of the operator for big clusters. When used to create big clusters (> 40 nodes), the number of things it had to do per pod starts adding up. Calls to 'kubectl exec' to get various state (cmds package) can take a while. There is also a lot of memory pressure when 'kubectl exec' is called frequently -- we send the command to the REST endpoint that uses gzip to compress/uncompress the input/output. The compression library, which we don't have any control over, uses a lot of memory. Go is a garbage collection language, so the memory is eventually freed, but the GC can take some time to kick in.

The approach taken in this change is to batch as many 'kubectl exec' commands as we can. This change impacts the podfacts collector, installer reconciler and create db reconciler. The end result is that there is less memory pressure and it is quicker for a reconcile iteration to get through its various state checks.

Refactor of the main.go module. Most of the command line argument handling was moved to a separate package called opcfg. This was changed because I wanted to flow down the command line arguments into the controllers so that we can log things only when in dev mode. So, we need to put the arguments in an exported struct from a new package that the controllers would have access to.

Included in this is a change to make it easier to run the operator locally. A new script was added that handles proper setup and teardown.

Matt Spilchen added 10 commits November 6, 2022 08:29
- This adds the initPolicy of CreateSkipPackageInstall. This option will
  allow us to create a DB and skip the package installation.
- Batch up commands to the pods for directory prep. This will make it
  faster to setup the pods for the create.
This hides some chatty messages from the operator log. We normally would
dump out the contents of the admintools.conf. But this is suppressed,
unless the devMode flag is set in the operator.

Changes to the operator main.go to flow down the entire config struct
(now called OperatorConfig).
ExecInPod is quite expensive both from a runtime perspective and a
memory perspective. Batching up as many of those has shown that
the operator is more stable on large clusters.
@spilchen spilchen requested a review from roypaulin November 7, 2022 14:40
@spilchen spilchen self-assigned this Nov 7, 2022
Copy link
Collaborator

@roypaulin roypaulin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just have a few comments, more like questions.

var (
execOut bytes.Buffer
execErr bytes.Buffer
)

// Copying a file is simply a cat of the contents from stdin
command := []string{"sh", "-c", fmt.Sprintf("cat > %s", destFile)}
var sb strings.Builder
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain me the following lines? Do we always expect a single command. Looks like with several commands it will crash.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

executeCmd could be empty, in which case the nil check would skip things. If there are multiple things passed in for executCmd, they just get concatenated together to form the command to run. Think of it like extra arguments for the command.

The end result will look like:
cat > destFile && cmd arg1 arg2 ...

}
return nil
// Copy and execute the script
_, _, err = pRunner.CopyToPod(ctx, pf.name, names.ServerContainer, tmp.Name(), paths.EulaAcceptanceScript,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So now you copy and execute the script in a single function. Is that as you mentioned to reduce the number of "exec"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Before we would do a 'kubectl exec' to copy the file in. Then another 'kubectl exec' to run the command we copied in. This just combines the two.

mkdir -p $INT_TEST_OUTPUT_DIR
OP="${INT_TEST_OUTPUT_DIR}/local-verticadb-operator.log"

if helm list -o json | jq --exit-status '.[] | select(.chart|test("^verticadb-operator")) | .name'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we check this if we run the operator in our local env?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just to make sure the operator isn't already deployed through a helm chart. I wasted a bunch of time because I was running the operator locally and had it deployed in helm too. This is just a check to make sure two operators aren't running at the same time.

@spilchen spilchen merged commit efb566b into vertica:main Nov 9, 2022
@spilchen spilchen deleted the big-cluster-investigation branch November 9, 2022 12:16
spilchen pushed a commit that referenced this pull request Nov 18, 2022
This fixes a bug that was introduced earlier in this release (#283) when we did
the podfacts refactor. We were failing to create the http conf dir in the
installer reconciler if the http server is enabled. The plan is enable the http
e2e tests, which would have caught this, when we use the 12.0.2 server release
in the CI.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants