[WIP] Activator daemonset #2557
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: greghaynes. If they are not already assigned, you can assign the PR to them by writing. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@greghaynes: 2 warnings.
In response to this:
Proposed Changes
- Run activator as a daemonset
- Create a handler for the activator which creates a revision pod using the static-pod kubelet API.
Release Note
Local scheduling of pods on cold start for a faster cold-start path.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
"github.com/knative/serving/pkg/localscheduler"
)

type LocalSchedulerHandler struct {
Golint comments: exported type LocalSchedulerHandler should have comment or be unexported. More info.
pkg/localscheduler/schedule.go
}
}

func Schedule(rev *servingv1.Revision, logger *zap.SugaredLogger) error {
Golint comments: exported function Schedule should have comment or be unexported. More info.
This still has a ways to go - meant as a strawman POC to explore the implementation.
name: config-observability
- name: localpods
  hostPath:
    path: /etc/kubernetes/manifests
Is this safe? What I mean is, would this be considered a security risk? Don't we have the k8s backplane and all that jazz in that directory?
Also, another question (most likely due to my ignorance here): do we need a local-mount PV for this?
I am not sure we need a PV since we don't really need to do any magic to persist the data - the hypervisor does it for us. There are definitely some issues that need to be worked out here though:
- This path isn't something we can depend on (it's operator specific / set via kubelet config).
- As you point out, this allows us to create (and read) arbitrary static pod definitions on the cluster. Potential security implications here.
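For reference, the static pod directory is just per-node kubelet configuration, not a stable API - each kubelet picks it up from its KubeletConfiguration. A minimal sketch of the relevant fragment (the value shown is only a common default, not something a DaemonSet can rely on across clusters):

```yaml
# KubeletConfiguration fragment (per node, operator controlled).
# /etc/kubernetes/manifests is a common default; other clusters may
# use a different path or disable static pods entirely.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests
```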
f, err := os.Create(fmt.Sprintf("/etc/localpods/%s", podName))
if err != nil {
	logger.Errorf("Opening local pod definition (%q): %v", rev.Name, err)
If there was an error in creating the file, shouldn't we bail out here?
The localscheduler creates a pod using the static pod kubelet API immediately upon activator receiving a request for a revision.
@greghaynes: The following tests failed. Full PR test history. Your PR dashboard.
I did some testing of this locally on a 1-worker cluster (with a separate controller node) using Calico and got cold-start times of about 2.5 seconds. On this same cluster, using knative master with istio sidecar injection disabled, I got a cold-start time of about 6 seconds. This test was performed by running
Raw data: Daemonset + static-pod cold start times:
Using master (25656e4):
Some details of what this patch does currently: This runs the activator as a daemonset. Once a request is received it writes out a pod using the static-pod definition API (https://kubernetes.io/docs/tasks/administer-cluster/static-pod/). This pod contains only a user container (no queue-proxy nor istio-proxy). A watch is also created for the creation of this pod; once it is seen with Status.Phase == "Running", a request is made to Status.PodIP of the created pod.
@greghaynes can you please elaborate on where the gain in cold-start latency would come from?
@tcnghia Off the top of my head - we eliminate the autoscaler waiting to detect the request via stat, knative controllers operating on the revision being set active, the k8s replicaset controller, the k8s scheduler, and any potential queue-proxy startup cost. It's very possible I'm missing some other things we're avoiding here too, if anyone can think of others...
You said the 6 seconds is with istio-injection? What's that number without istio injection for comparison?
@markusthoemmes The 6 seconds is without istio injection.
@greghaynes Whoops, sorry. I shall not read complicated issues before 8am. That's a fairly big shave off the cold-start time indeed. Interesting stuff, thanks for the prototyping 🎉
I assume you mean via API Server. Couldn't the pod be ready before it's reflected in the API? I know the kubelet handles readiness probes, perhaps we could have our local daemonset find/probe the user container directly (as an approximation of what a kubelet-level API might give).
Why no queue-proxy?
Also 🎉
@mattmoor Yes, I expect the pod is ready earlier than we see via the API, as our first request to the pod succeeds. I'm not sure what you have in mind about finding the pod directly: I don't know of an API the kubelet exposes for us to read directly from it, and any kind of checking of the underlying container system I think will be a pain to support with the various CRI backends. Maybe we could add another container to the pod which requests out to the activator on startup to signal it is ready... The main reason for not having queue-proxy is I just didn't bother to add it to the pod definition :). I doubt there's much cost to having it, so this isn't a big issue, but I'm not sure it buys us anything here as our activator is/can be essentially doing the same job.
@greghaynes Having the queue-proxy would enable us to have cooperative readiness without too much CRI-incompatible hackery. If we have the queue-proxy report readiness to the DaemonSet on startup (after doing its own readiness probing of the user container), then the activator would effectively know in ~realtime that the pod is ready for traffic.
@mattmoor Ah, good idea - I was thinking of a similar pattern but putting it in queue-proxy makes a lot of sense.
@greghaynes, #2557 (comment) this is really great to see. At the end of the day, the work that needs to be done (if the container is disk warm) is two network hops and starting a process: client to ingress, ingress to activator. This should be 100s of ms. Anything more is an opportunity for optimization. So we should keep digging until we understand exactly where all this extra time comes from (docker, k8s controllers, retries ...). If we find the k8s control plane is getting in our way, which is our current theory, then we may indeed need local scheduling support in k8s. But we'll want to run it by SIG Arch first.
@josephburnett I agree we should isolate this down to where (in us or the upstream) it's being introduced. I think docker is out of the picture here as all of that happens within kubelet (although it is likely a large part of the remaining 2.5s cold-start delay). AFAIK the potential sources here are: autoscaler delay detecting request stat > 0; knative controllers (I think just revision); k8s controllers (deployment, replicaset, scheduler). I can try whipping up a test to get a good measurement for just scaling a deployment from 0, and then we can see what part of this time savings is knative vs k8s. Once we get some validation that there's k8s control plane delay here, I think we'll need to do our own local workaround in parallel to driving better API support upstream - I worry that the timeline for being able to get and rely on any fix we drive into k8s is going to be very long (not only do we have to actually get the changes accepted/released, we also have to wait for k8s platforms to roll out the new k8s). ISTM that while we are driving these changes upstream, local scheduling should be an optional/feature-flagged thing we support, as we know it's not a great API to rely on, and we can then make it opt-in + marked as unstable?
@greghaynes sounds good. FYI: there is no reason for the autoscaler to wait for the activator to update the stat. The activator should push a stat immediately upon getting a request for a revision scaled to zero (like this) and the autoscaler should immediately recompute the desired scale for the same (like this). Some of this was just added latency from moving activation to the autoscaler. Previously the activator flipped serving state, which immediately triggered scale-up.
@greghaynes another middle-ground option you might try is to have the Activator create the Pod by calling the Master, but populating the Spec.NodeName with the current node and Spec.SchedulerName="Activator". So the Activator does the scheduling but loops in the master so it has a chance to do Envoy injection. Then the Activator can just wait for the Kubelet to pick up and start the Pod, then proxy traffic to it. This might reduce the difference between the two serving paths while maintaining the local-scheduling decision.
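Sketching that middle-ground pod (all names below are illustrative, not from this PR): the activator would POST something like this to the master, pre-binding the pod to its own node via nodeName and claiming it via schedulerName so the default scheduler leaves it alone, while the master still gets a chance to run admission (e.g. Envoy injection):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-revision-0001-pod       # illustrative
spec:
  nodeName: worker-1               # the activator's own node; bypasses scheduling
  schedulerName: Activator         # marks the pod as scheduled by the activator
  containers:
  - name: user-container
    image: example.com/app:latest  # illustrative
```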
I finally got some time to dig into where the delay is coming from. I upgraded my cluster and reran the previous testing, and things are overall a bit faster afterwards. I was seeing avg cold-start times (using plain master) of about 3.6s, and with this patch rebased on master was seeing about 1.8s (so still ~50%, or 1.8s, gains). Looking at logs of the plain master run, out of the 3 test runs an average of 1.6s was spent between the time when ingressgateway saw the request and when the autoscaler attempted to scale from 0 to 1. This means #2659 and #2615 should get us ~90% of the way to the perf we see here.
Neato, thanks for the insights @greghaynes. Glad to hear we can drive things into the existing architecture without having to work around too much.
@greghaynes Were you able to prototype the tighter feedback loop from the queue-proxy?
@mattmoor We created #2615 for that issue and @imikushin mentioned he'd like to poke at it.
@greghaynes could you give a status update on this one? Which state is it in; is it being worked on actively?
@markusthoemmes Hey, sure. My plan was to get tracing (#2726 and #3491) merged first, then to get some new data on what this actually buys us in terms of performance to send upstream and ask sig-arch/k8s if anyone had any thoughts. Based on my one-off measurements this doesn't buy us a lot of performance in the cluster-without-load case, but it is needed because control-plane latency is extremely variable based on load. Beyond that planning there hasn't been any progress on this, as tracing has taken (much) longer than expected to get merged...
@greghaynes are we still following through on this? We have activator scalability coming up on the scaling roadmap and I was wondering about this PR in relation to that.
I feel like this is vastly outdated and the method is now more or less accepted as one of autoscaling's feature streams going forward. I expect a lot of PRs falling out of that work that'll supersede this one, so I'll naively close it for now. /close If you feel differently, please feel free to reopen @greghaynes
@markusthoemmes: Closed this PR. In response to this:
Fixes #2497
Fixes #2498