Immediately scale up autoscaler for cold-start #2963
Conversation
@greghaynes: 0 warnings.
In response to this:
When the autoscaler gets a stat to scale a service up from 0, we should act on this immediately in order to decrease cold-start time.
Closes #2961
Release Note
Immediately scale from zero in autoscaler.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Approach is LGTM, but I think there's a bug here with regard to how we get to know whether it's the activator sending the metrics.
On that note: is it even important that the activator is sending the metric in the end? Is there any harm in dropping that requirement and forcing a tick on any incoming metric when at 0?
pkg/autoscaler/multiscaler.go
Outdated
scaler.scaler.Record(ctx, stat)
if scaler.metric.Status.DesiredScale == 0 && stat.PodName == ActivatorPodName {
Since we're scaling the activator, this won't work anymore, I believe. You presumably need to do the same check as the autoscaler does, which is: if strings.HasPrefix(stat.PodName, ActivatorPodName) {
Has this worked locally?
Can the first half of this be scaler.getLatestScale() == 0?
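To see why the prefix check matters, here is a small standalone sketch (the ActivatorPodName value and the sample pod name are assumptions for illustration, not the real constants):

package main

import (
	"fmt"
	"strings"
)

// ActivatorPodName stands in for the shared name prefix of activator pods.
// Once the activator itself is scaled out, each replica carries a generated
// suffix, so an exact match against stat.PodName never fires.
const ActivatorPodName = "activator"

func isActivatorStat(podName string) bool {
	return strings.HasPrefix(podName, ActivatorPodName)
}

func main() {
	pod := "activator-7d4b8c-xk2p9" // hypothetical replica name
	fmt.Println(isActivatorStat(pod))    // true: prefix match catches it
	fmt.Println(pod == ActivatorPodName) // false: equality check misses it
}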
Good find - I removed the pod name check entirely, since it really shouldn't matter where the stat came from (as you point out).
+1 to @markusthoemmes's comment, but when we reverse the flow of metrics (@yanweiguo is working on this) it shouldn't matter (any push comes from the activator then), so will it matter? @greghaynes I put this in 0.4, so LMK if that timeframe is a problem.
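A rough sketch of that simplified shape, with stand-in types (Stat, scalerRunner, getLatestScale, and the tick callback are assumptions modeled on this discussion, not the upstream API):

package main

import (
	"fmt"
	"sync"
)

// Stat is a minimal stand-in for an autoscaler metric report.
type Stat struct {
	PodName string
}

// scalerRunner is a minimal stand-in for the per-revision scaler wrapper.
type scalerRunner struct {
	mu          sync.RWMutex
	latestScale int32
}

func (sr *scalerRunner) getLatestScale() int32 {
	sr.mu.RLock()
	defer sr.mu.RUnlock()
	return sr.latestScale
}

// recordStat sketches the simplified path: if the revision currently sits at
// zero, force an immediate scaling evaluation instead of waiting for the next
// tick, regardless of which pod sent the stat.
func recordStat(sr *scalerRunner, stat Stat, tick func()) {
	// (the real code also records the stat with the underlying scaler here)
	if sr.getLatestScale() == 0 {
		tick()
	}
}

func main() {
	sr := &scalerRunner{latestScale: 0}
	recordStat(sr, Stat{PodName: "activator-abc"}, func() {
		fmt.Println("immediate tick: scale up from zero now")
	})
}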
Force-pushed from 4be0da9 to fcaf0c1.
Thanks for making the change. Can you add a test for this to bump coverage up again?
/retest
@greghaynes I'm curious what speedup you see with this?
@mattmoor In theory it should be ~0.5 sec on average (0-1 sec depending on when a stat is sent). I've been trying to test this out the past day on my shared k8s cluster, and unfortunately my k8s control plane is way too noisy to get useful data out of (another interesting argument for us doing local scheduling). I'll try to set up a local private cluster to see, but if anyone has one and wants to check, we now have a perf test for this:
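For reference, a rough sketch of where the ~0.5 s figure comes from, assuming the autoscaler previously acted only on a tick of length T = 1 s and the stat arrives uniformly at random within that tick:

E[\text{saved wait}] = \frac{1}{T} \int_0^T (T - t)\,dt = \frac{T}{2} = 0.5\,\text{s}, \qquad \text{ranging from } 0 \text{ to } T = 1\,\text{s}.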
Ping @greghaynes, I think there's one outstanding comment to merge this.
For clarity, that comment is:
/assign @markusthoemmes
So it appears in my review list and I keep track of it. We should land this in 0.4 (as scheduled).
Thanks for the reminder - I'll get to this today.
Force-pushed from fcaf0c1 to 17e9996.
/test pull-knative-serving-go-coverage
I think not commenting may mean same coverage? If so, we should file a bug that it doesn't remove the old comment. I reran it to test.
I think we should refine the test a bit to prevent regression in the future.
pkg/autoscaler/multiscaler_test.go
Outdated
	// We got the signal!
case <-time.After(30 * time.Millisecond):
	t.Fatalf("timed out waiting for Watch()")
}
I think this test would also pass without the changes made? I get coverage on the newly added path only intermittently. The multiscaler runs on a tick interval of 1ms, so it will be quick either way. Should we increase the tick interval of the multiscaler and assert that we get a tick in a substantially shorter time? (For example: a tickInterval of 100ms, make the others wait for 200ms and this one wait for 30ms.)
+1 let's add a parallel test to this with a tick interval larger than the overall test timeout. Then we should be able to properly test this, right?
Hrm, I am having a hard time coming up with a way to do this without chopping things up or adding some multiscaler access. When we create a metric, which in turn creates a scaler, we need to go through one tickScaler cycle in order for the scaler's desiredScale to be set to 0. If I up the tick interval, then we'll have to wait for that time period. AFAICT there's no good interface for me to inject this tick cycle; do you have any thoughts on this, @markusthoemmes?
@greghaynes yeah, I realize it's not as easy. The following adjustments trigger the path we want in the test pretty reliably:
diff --git a/pkg/autoscaler/multiscaler_test.go b/pkg/autoscaler/multiscaler_test.go
index a39ba0d7..840220d5 100644
--- a/pkg/autoscaler/multiscaler_test.go
+++ b/pkg/autoscaler/multiscaler_test.go
@@ -110,9 +110,11 @@ func TestMultiScalerScaling(t *testing.T) {
}
func TestMultiScalerScaleToZero(t *testing.T) {
+ tickInterval := 50 * time.Millisecond
+
ctx := context.TODO()
ms, stopCh, uniScaler := createMultiScaler(t, &autoscaler.Config{
- TickInterval: time.Millisecond * 1,
+ TickInterval: tickInterval,
EnableScaleToZero: true,
})
defer close(stopCh)
@@ -160,18 +162,10 @@ func TestMultiScalerScaleToZero(t *testing.T) {
select {
case <-done:
// We got the signal!
- case <-time.After(30 * time.Millisecond):
+ case <-time.After(tickInterval * 2):
t.Fatalf("timed out waiting for Watch()")
}
- // Verify that we stop seeing "ticks"
- select {
- case <-done:
- t.Fatalf("Got unexpected tick")
- case <-time.After(30 * time.Millisecond):
- // We got nothing!
- }
-
desiredScaleMutex.Lock()
desiredScale = 1
desiredScaleMutex.Unlock()
@@ -188,7 +182,7 @@ func TestMultiScalerScaleToZero(t *testing.T) {
select {
case <-done:
// We got the signal!
- case <-time.After(30 * time.Millisecond):
+ case <-time.After(tickInterval / 2):
t.Fatalf("timed out waiting for Watch()")
}
Does that make sense?
@markusthoemmes hrm, check out what I just pushed - I added a method to directly set the desiredScale of a scalerRunner, which lets us actually test this code path 100% of the time.
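A minimal sketch of the kind of test hook described here, mirroring the stand-in scalerRunner from the earlier sketch (field and method names are assumptions, not the upstream code):

package main

import (
	"fmt"
	"sync"
)

// scalerRunner is again a minimal stand-in, just enough state to show the hook.
type scalerRunner struct {
	mu           sync.Mutex
	desiredScale int32
}

// setScale drives the runner to a known desired scale directly, so a test can
// start from zero deterministically instead of waiting for a tick cycle to
// bring desiredScale down to 0.
func (sr *scalerRunner) setScale(scale int32) {
	sr.mu.Lock()
	defer sr.mu.Unlock()
	sr.desiredScale = scale
}

func main() {
	sr := &scalerRunner{desiredScale: 1}
	sr.setScale(0) // put the runner at zero before sending the stat under test
	fmt.Println(sr.desiredScale)
}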
Ping, I'd like to see this land in 0.4.
Force-pushed from 17e9996 to 000e179.
Force-pushed from 9616881 to b691dff.
The following is the coverage report on pkg/.
Force-pushed from 723019a to 3b777b3.
When the autoscaler gets a stat to scale a service up from 0 we should act on this immediately in order to decrease cold-start time. Closes knative#2961
Force-pushed from 3b777b3 to ebc0a58.
Thank you for doing these changes
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: greghaynes, markusthoemmes
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.