stabilize e2e test case sandbox-basic #962

Merged: 10 commits into main from VER-95582-debug-daily-build-failure, Oct 29, 2024

Conversation


@LiboYu2 (Collaborator) commented Oct 17, 2024

Increased the timeout for steps that frequently fail.
Added some latency to stabilize the cluster before moving to the unsandboxing step.
I ran the test cases locally 5 times in a row and they all passed.

@LiboYu2 changed the title from "VER-95582 stabilize e2e test case sandbox-basic" to "stabilize e2e test case sandbox-basic" on Oct 17, 2024
@@ -13,6 +13,7 @@

apiVersion: kuttl.dev/v1beta1
kind: TestSuite
kindNodeCache: true
Collaborator Author

This allows downloaded images to be cached on the node, which speeds up test case execution.

@@ -15,3 +15,4 @@ apiVersion: kuttl.dev/v1beta1
kind: TestStep
commands:
- command: bash -c "../../../scripts/wait-for-verticadb-steady-state.sh -n verticadb-operator -t 360 $NAMESPACE"
- command: sleep 120
Collaborator Author

Added this latency to stabilize the cluster before step 60 starts.

@roypaulin (Collaborator) commented Oct 17, 2024

You don't need to add this sleep; it only makes the test take longer to complete. If this step frequently fails because the timeout is too short, just increase the timeout in the script call above.

Collaborator Author

This is different from the timeout. The timeout is the maximum wait time for the whole test step to finish, while this sleep gives the cluster some time to stabilize its state before the next step starts. It makes the test run longer, but it also makes the test pass.

Collaborator

The script on line 17 waits for the operator to be steady (it gives the cluster time to stabilize its state before the next step), meaning there is no error and nothing is going on. There is no benefit in adding another wait after it. If this step fails and more time is needed, you should increase the time passed as an argument to the script.

Collaborator Author

This is from wait-for-verticadb-steady-state.sh:

timeout $TIMEOUT bash -c -- "while ! $LOG_CMD | \
    grep $WEBHOOK_FILTER | \
    grep $DEPRECATION_FILTER | \
    grep $VDB_FILTER | \
    tail -1 | grep --quiet '\"result\": {\"Requeue\":false,\"RequeueAfter\":0}, \"err\": null'; do sleep 1; done" &
pid=$!
wait $pid

This is from "man timeout":
timeout - run a command with a time limit

If the script runs longer than $TIMEOUT, an error is reported.
If the script finishes within $TIMEOUT, the next step starts right away.
What I want to achieve is to add some latency between the two steps so that the latter step is not impacted by the previous one. Increasing the timeout will not achieve that.

@cchen-vertica (Collaborator)

As discussed, we should use an environment variable for all extended timeout values (900), and try to print all the events on vdb.

@roypaulin (Collaborator)

> As discussed, we should use an environment variable for all extended timeout values (900), and try to print all the events on vdb.

What do you mean by "printing all the events on vdb"?

1. removed 2 min sleep added previously
2. increased low disk space to avoid low disk volume event
3. bumped the spam filter threshold from 25 to 100 to make sure the kuttl framework will receive all the events
@LiboYu2 (Collaborator Author) commented Oct 23, 2024

> As discussed, we should use an environment variable for all extended timeout values (900), and try to print all the events on vdb.

> What do you mean by "printing all the events on vdb"?

Cai's point is to let vdb receive all the events.

@LiboYu2 (Collaborator Author) commented Oct 23, 2024

> As discussed, we should use an environment variable for all extended timeout values (900), and try to print all the events on vdb.

Totally forgot about that :) I just tried it out. It turns out environment variables can only be used in commands: https://kuttl.dev/docs/testing/steps.html#running-commands

We cannot use an environment variable as a configuration parameter. Here is the error I got after trying one:

harness.go:397: loading /home/lyu/test-repos/vertica-kubernetes/tests/e2e-leg-10/sandbox-basic/60-assert.yaml: error converting unstructured object TestAssert:/ (/home/lyu/test-repos/vertica-kubernetes/tests/e2e-leg-10/sandbox-basic/60-assert.yaml): error converting TestAssert:/ from unstructured error: unrecognized type: int

@@ -23,7 +23,7 @@ spec:
initPolicy: CreateSkipPackageInstall
communal: {}
local:
requestSize: 250Mi
requestSize: 270Mi
Collaborator

Will 20Mi make any difference? We shouldn't make this change.

Collaborator Author

I noticed that the vdb received a low disk space event when I ran the test locally. That event is sent when the free disk space is less than 10MB.

Collaborator

Then give it more, like 500Mi.

Collaborator Author

500Mi is fine for the CI server, but when we run the test locally it may be a challenge if the free disk space is not large enough. When parallel is set to 2, there will be up to 20 pods running at the same time. How about 300Mi?

@@ -10,7 +10,10 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: kuttl.dev/v1beta1
Collaborator

The changes in this file should be removed since you already extended the timeout in 60-assert.yaml.

@@ -239,7 +240,7 @@ func main() {
	if opcfg.GetLoggingFilePath() != "" {
		log.Printf("Now logging in file %s", opcfg.GetLoggingFilePath())
	}

	var multibroadcaster = record.NewBroadcasterWithCorrelatorOptions(record.CorrelatorOptions{BurstSize: 100})
Collaborator

Let's just make this configurable through an env var. Look at the examples in opcfg/config.go (DEPLOY_WITH, WEBHOOKS_ENABLED). See how and where they are changed and do the same for the burst size. By default it will be 25 (the k8s default); if a value lower than 25 is passed, we still pick 25.

Collaborator Author

Made the change. Please take a look.

Collaborator

@LiboYu2, I do not see the change.

Collaborator Author

Forgot to push it. My bad. Now you should be able to see it.

# this will set up the threshold used by the spam filter
# default threshold is 25, which is too low to run the test cases
# some test cases rely on event verification
BROADCASTER_BURST_SIZE=100
Collaborator

It is not needed here.

@@ -14,3 +14,4 @@ CONCURRENCY_VERTICARESTOREPOINTSQUERY
CONCURRENCY_VERTICASCRUTINIZE
CONCURRENCY_SANDBOXCONFIGMAP
CONCURRENCY_VERTICAREPLICATOR
BROADCASTER_BURST_SIZE
Collaborator

Sorry, I forgot to tell you: you also need to add this to template-helm-chart.sh, and add a new parameter to values.yaml. Look at WEBHOOKS_ENABLED (line 227 in template-helm-chart.sh) as an example and do the same for BROADCASTER_BURST_SIZE.

Collaborator Author

No worries. I added a new commit to address that; you can take a look. I built locally and deployed to my local kind cluster: the burst size is 100 and no external environment variable was used.

// GetBroadcasterBurstSize returns the customizable burst size for broadcaster.
func GetBroadcasterBurstSize() int {
	burstSize := lookupIntEnvVar("BROADCASTER_BURST_SIZE", envCanNotExist)
	if burstSize < 25 {
@HaoYang0000 (Collaborator) commented Oct 28, 2024

Minor: we can make a constant DEFAULT_BURST_SIZE = 25 to hold that value, and we don't need an else block here; we can do:

if burstSize >= DEFAULT_BURST_SIZE {
	return burstSize
}
return DEFAULT_BURST_SIZE

Collaborator Author

Made the change.
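
Putting the thread together, the finished helper and its call site presumably end up looking something like the self-contained sketch below. This is an approximation, not the repo's actual code: the internal lookupIntEnvVar/envCanNotExist helpers are replaced with os.LookupEnv and strconv, and the constant follows HaoYang0000's DEFAULT_BURST_SIZE suggestion in Go naming style.

package main

import (
	"fmt"
	"os"
	"strconv"

	"k8s.io/client-go/tools/record"
)

// defaultBurstSize mirrors the k8s default burst size of the event spam filter
// (the DEFAULT_BURST_SIZE constant suggested above).
const defaultBurstSize = 25

// getBroadcasterBurstSize approximates the opcfg getter discussed in this thread:
// it reads BROADCASTER_BURST_SIZE from the environment and never returns a value
// below the k8s default.
func getBroadcasterBurstSize() int {
	raw, found := os.LookupEnv("BROADCASTER_BURST_SIZE")
	if !found {
		return defaultBurstSize
	}
	burstSize, err := strconv.Atoi(raw)
	if err != nil || burstSize < defaultBurstSize {
		return defaultBurstSize
	}
	return burstSize
}

func main() {
	// Wire the configurable value into the event broadcaster, replacing the
	// hard-coded 100 from the earlier main.go diff.
	broadcaster := record.NewBroadcasterWithCorrelatorOptions(record.CorrelatorOptions{
		BurstSize: getBroadcasterBurstSize(),
	})
	defer broadcaster.Shutdown()

	fmt.Println("broadcaster burst size:", getBroadcasterBurstSize())
}

Clamping to 25 keeps the operator at client-go's default event-spam behavior when the env var is unset, malformed, or set too low, which is the behavior the reviewer asked for.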

@@ -11,6 +11,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: kuttl.dev/v1beta1
kind: TestAssert
timeout: 900
Collaborator

I remember the default timeout for all jobs is 1000; is there any reason we want to reduce the timeout here?

Collaborator

The default is 600s in kuttl-test.yaml.

@@ -86,6 +86,10 @@ webhook:
# can be used to skip that and still deploy the operator.
enable: true

# this will increase the default threshold used by spam filter in controller runtime from 25 to 100
# this is important for running e2e test as some test cases require verification of events
burstSize: 100
Collaborator

Move this under controllers.

Collaborator Author

Moved.

LiboYu2 and others added 3 commits October 28, 2024 11:55
Co-authored-by: Roy Paulin <rnguetsopken@opentext.com>
…ica/vertica-kubernetes into VER-95582-debug-daily-build-failure
@cchen-vertica (Collaborator)

The unit tests failed. Some go-lint errors exist. You need to fix those errors.

@roypaulin roypaulin merged commit 4ace14d into main Oct 29, 2024
39 checks passed
@roypaulin roypaulin deleted the VER-95582-debug-daily-build-failure branch October 29, 2024 17:57