-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stabilize e2e test case sandbox-basic #962
Conversation
@@ -13,6 +13,7 @@ | |||
|
|||
apiVersion: kuttl.dev/v1beta1 | |||
kind: TestSuite | |||
kindNodeCache: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows the downloaded images to be cached on the node. This will speed up test case execution.
@@ -15,3 +15,4 @@ apiVersion: kuttl.dev/v1beta1 | |||
kind: TestStep | |||
commands: | |||
- command: bash -c "../../../scripts/wait-for-verticadb-steady-state.sh -n verticadb-operator -t 360 $NAMESPACE" | |||
- command: sleep 120 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add this latency to stabilize the cluster before step 60 starts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to add this sleep. It will take the test longer to complete. If this frequently fails due to insufficient timeout, then just increase the timeout in the script call above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is different from timeout. The timeout is the maximum wait time for the whole test step to finish. This sleep call gives the cluster some time to stabilize its state before the next step starts. It makes the test run longer but it makes the test pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script on line 17 waits for the operator to be steady(gives the cluster some time to stabilize its state before the next step) meaning there is no error and nothing going on. There is no benefit in adding another wait after. If this step fails and more time is needed, you should increase the time passed as argument to the script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is from that wait-for-verticadb-steady-state.sh
timeout $TIMEOUT bash -c -- "while ! $LOG_CMD | \ grep $WEBHOOK_FILTER | \ grep $DEPRECATION_FILTER | \ grep $VDB_FILTER | \ tail -1 | grep --quiet '\"result\": {\"Requeue\":false,\"RequeueAfter\":0}, \"err\": null'; do sleep 1; done" & pid=$! wait $pid
This is from "man timeout":
timeout - run a command with a time limit
If the script runs longer than the $TIMEOUT, an error will be reported.
If the script finishes within the $TIMEOUT, the next step starts right away.
What I want to achieve is to add a latency between the two steps to make sure the latter
step is not impacted by the previous step. Increasing the timeout will not achieve that.
As discussed, we should use an environment variable for all extended timeout values (900), and try to print all the events on vdb. |
What do you mean by "printing all the events on vdb" |
1 removed 2 min sleep added previously 2 increased low disk space to avoid low disk volume event 3 bumped the spam filter threshold from 25 to 100 to make sure kuttl framework will receive all the events
Cai's point is to let vdb receive all the events. |
Totally forgot about that:) I just tried it out. It turns out environmental variables can only be used in commands. https://kuttl.dev/docs/testing/steps.html#running-commands We cannot use the environmental variables as a configuration parameter. Here is the error I got after using environmental variables: harness.go:397: loading /home/lyu/test-repos/vertica-kubernetes/tests/e2e-leg-10/sandbox-basic/60-assert.yaml: error converting unstructured object TestAssert:/ (/home/lyu/test-repos/vertica-kubernetes/tests/e2e-leg-10/sandbox-basic/60-assert.yaml): error converting TestAssert:/ from unstructured error: unrecognized type: int |
@@ -23,7 +23,7 @@ spec: | |||
initPolicy: CreateSkipPackageInstall | |||
communal: {} | |||
local: | |||
requestSize: 250Mi | |||
requestSize: 270Mi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will 20Mi make any difference? We shouldn't make this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed the vdb received low disk space event when I ran the test locally. When free disk space is less than 10MB, that event will be sent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then give it more like 500Mi.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
500Mi is great for ci server. But when we run the test locally, it may be a challenge if the free disk space is not large enough. When parallel is set to 2, there will 20 pods at most at the same time. How about 300Mi?
@@ -10,7 +10,10 @@ | |||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
|
|||
apiVersion: kuttl.dev/v1beta1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes in this file should be removed since you already extended the timeout in 60-assert.yaml.
cmd/operator/main.go
Outdated
@@ -239,7 +240,7 @@ func main() { | |||
if opcfg.GetLoggingFilePath() != "" { | |||
log.Printf("Now logging in file %s", opcfg.GetLoggingFilePath()) | |||
} | |||
|
|||
var multibroadcaster = record.NewBroadcasterWithCorrelatorOptions(record.CorrelatorOptions{BurstSize: 100}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just make this configurable through an env var. Look at examples in opcfg/config.go5(DEPLOY_WITH? WEBHOOKS_ENABLED). See how and where they are changed and do the same for the burstsize. By default it will be 25(k8s default), if a value lower than 25 is passed we still pick 25.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the change. Please take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LiboYu2, I do not see the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to push it. My bad. Now you should be able to see it.
tests/kustomize-defaults.cfg
Outdated
# this will set up the threshold used by the spam filer | ||
# default threshold is 25, which is too low to run the test cases | ||
# some test cases rely on event verification | ||
BROADCASTER_BURST_SIZE=100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not needed here.
@@ -14,3 +14,4 @@ CONCURRENCY_VERTICARESTOREPOINTSQUERY | |||
CONCURRENCY_VERTICASCRUTINIZE | |||
CONCURRENCY_SANDBOXCONFIGMAP | |||
CONCURRENCY_VERTICAREPLICATOR | |||
BROADCASTER_BURST_SIZE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I forgot to tell you. You also need to add this to template-helm-chart.sh
. Also need to add a new parameter to values.yaml. Look at WEBHOOKS_ENABLED
(line 227 in template-helm-chart.sh) as an example and do the same for BROADCASTER_BURST_SIZE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries. I added a new commit to address that. You can take a look. I built locally and deployed to my local kind cluster. The burst size is 100. No external environmental variable was used.
pkg/opcfg/config.go
Outdated
// GetBroadcasterBurstSize returns the customizable burst size for broadcaster. | ||
func GetBroadcasterBurstSize() int { | ||
burstSize := lookupIntEnvVar("BROADCASTER_BURST_SIZE", envCanNotExist) | ||
if burstSize < 25 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: We can make a constant DEFAULT_BURST_SIZE = 25 for maintaining the int, and we don't need an else block here, we can do:
if burstSize >= DEFAULT_BURST_SIZE {
return burstSize
}
return DEFAULT_BURST_SIZE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made the change.
@@ -11,6 +11,10 @@ | |||
# See the License for the specific language governing permissions and | |||
# limitations under the License. | |||
|
|||
apiVersion: kuttl.dev/v1beta1 | |||
kind: TestAssert | |||
timeout: 900 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember the default timeout for all job is 1000, any reason we want to reduce the time out here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default is 600s in kuttl-test.yaml
@@ -86,6 +86,10 @@ webhook: | |||
# can be used to skip that and still deploy the operator. | |||
enable: true | |||
|
|||
# this will increase the default threshold used by spam filter in controller runtime from 25 to 100 | |||
# this is important for running e2e test as some test cases require verification of events | |||
burstSize: 100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this under controllers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved.
Co-authored-by: Roy Paulin <rnguetsopken@opentext.com>
…ica/vertica-kubernetes into VER-95582-debug-daily-build-failure
The unit tests failed. Some go-lint errors exist. You need to fix those errors. |
Increased the timeout for steps that frequently fail.
Add some latency to stabilize the cluster before moving to the unsandboxing step.
I ran the test cases locally for 5 times in a roll and they all passed.