
[WIP] Activator daemonset #2557

Closed
greghaynes wants to merge 2 commits

Conversation

greghaynes
Contributor

Fixes #2497
Fixes #2498

Proposed Changes

  • Run activator as a daemonset
  • Create a handler for the activator which creates a revision pod using the static-pod kubelet API.

Release Note

Local scheduling of pods on cold start for a faster cold-start path.

@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 27, 2018
@knative-prow-robot knative-prow-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 27, 2018
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: greghaynes
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: mattmoor

If they are not already assigned, you can assign the PR to them by writing /assign @mattmoor in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@knative-prow-robot left a comment

@greghaynes: 2 warnings.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

"github.com/knative/serving/pkg/localscheduler"
)

type LocalSchedulerHandler struct {

Contributor
Golint comments: exported type LocalSchedulerHandler should have comment or be unexported. More info.

}
}

func Schedule(rev *servingv1.Revision, logger *zap.SugaredLogger) error {

Contributor

Golint comments: exported function Schedule should have comment or be unexported. More info.
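
Both golint warnings boil down to adding doc comments on the exported identifiers. A minimal sketch, reusing the signatures from the snippets above (the comment wording is illustrative, not from this PR):

```go
// LocalSchedulerHandler handles activator traffic by locally scheduling a
// revision pod via the kubelet static-pod directory before proxying to it.
type LocalSchedulerHandler struct {
	// fields elided
}

// Schedule writes a static-pod manifest for the given revision so that the
// node-local kubelet starts it immediately.
func Schedule(rev *servingv1.Revision, logger *zap.SugaredLogger) error {
	// body elided
	return nil
}
```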

@greghaynes
Contributor Author

This still has a ways to go - meant as a strawman POC to explore the implementation.

    name: config-observability
- name: localpods
  hostPath:
    path: /etc/kubernetes/manifests

Contributor

Is this safe? What I mean is, would this be considered a security risk? Don't we have the k8s control-plane manifests and all that jazz in that directory?

Contributor

Another question, most likely due to my ignorance here, but do we need a local-mount PV for this?

Contributor Author

I am not sure we need a PV, since we don't really need any magic to persist the data - the hypervisor does it for us. There are definitely some issues that need to be worked out here though:

  • This path isn't something we can depend on (it's operator-specific / set via kubelet config).
  • As you point out, this allows us to create (and read) arbitrary static-pod definitions on the cluster - there are potential security implications here.


f, err := os.Create(fmt.Sprintf("/etc/localpods/%s", podName))
if err != nil {
	logger.Errorf("Opening local pod definition (%q): %v", rev.Name, err)

Contributor

If there was an error in creating the file, shouldn't we bail out here?
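
For reference, a minimal sketch of the suggested bail-out, continuing the snippet above (the bare return of err is an assumption about the surrounding function's signature):

```go
f, err := os.Create(fmt.Sprintf("/etc/localpods/%s", podName))
if err != nil {
	logger.Errorf("Opening local pod definition (%q): %v", rev.Name, err)
	// Bail out here; continuing would operate on a nil *os.File.
	return err
}
defer f.Close()
```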

The localscheduler creates a pod using the static-pod kubelet API
immediately upon the activator receiving a request for a revision.
@knative-prow-robot
Contributor

@greghaynes: The following tests failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
--- | --- | --- | ---
pull-knative-serving-go-coverage | 3187edc | link | /test pull-knative-serving-go-coverage
pull-knative-serving-build-tests | 3187edc | link | /test pull-knative-serving-build-tests
pull-knative-serving-upgrade-tests | 3187edc | link | /test pull-knative-serving-upgrade-tests
pull-knative-serving-unit-tests | 3187edc | link | /test pull-knative-serving-unit-tests
pull-knative-serving-integration-tests | 3187edc | link | /test pull-knative-serving-integration-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@greghaynes
Contributor Author

greghaynes commented Nov 29, 2018

I did some testing of this locally on a 1 worker cluster (with a separate controller node) using calico and got cold-start times of about 2.5 seconds. On this same cluster using knative master with istio sidecar injection disabled I got a cold start time of about 6 seconds. This test was performed by running time curl <hello-world-go endpoint> after the application had scaled down to 0.

Raw data:

Daemonset + static-pod cold start times (seconds):

  1. 2.364
  2. 2.426
  3. 2.304

Using master (25656e4), in seconds:

  1. 5.966
  2. 5.605
  3. 6.372

Some details of what this patch does currently:

This runs the activator as a daemonset. Once a request is received, it writes out a pod using the static-pod definition API (https://kubernetes.io/docs/tasks/administer-cluster/static-pod/). This pod contains only a user container (no queue-proxy nor istio-proxy). A watch is also created for this pod; once it is seen with Status.Phase == "Running", a request is made to the Status.PodIP of the created pod.
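
A rough sketch of the watch half of that flow, assuming client-go of that era; the function name, namespace handling, and field selector are illustrative, not the exact code in this PR:

```go
package localscheduler

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
)

// waitForLocalPod watches the API server for the statically created pod and
// returns its IP once it reports Running, so the activator can proxy the
// pending request straight to it.
func waitForLocalPod(kube kubernetes.Interface, ns, podName string) (string, error) {
	w, err := kube.CoreV1().Pods(ns).Watch(metav1.ListOptions{
		FieldSelector: fields.OneTermEqualSelector("metadata.name", podName).String(),
	})
	if err != nil {
		return "", err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		if pod.Status.Phase == corev1.PodRunning && pod.Status.PodIP != "" {
			return pod.Status.PodIP, nil
		}
	}
	return "", fmt.Errorf("watch for pod %q ended before it was Running", podName)
}
```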

@tcnghia
Contributor

tcnghia commented Nov 29, 2018

@greghaynes can you please elaborate on where the gain in cold start latency would come from?

@greghaynes
Contributor Author

@tcnghia Off the top of my head - we eliminate: the autoscaler waiting to detect the request via stats, the knative controllers acting on the revision being set active, the k8s ReplicaSet controller, the k8s scheduler, and any potential queue-proxy startup cost.

It's very possible I'm missing some other things we're avoiding here too, if anyone can think of others...

@markusthoemmes
Contributor

You said the 6 seconds is with istio-injection? What's that number without istio injection for comparison?

@greghaynes
Contributor Author

@markusthoemmes The 6 seconds is without istio injection

@markusthoemmes
Contributor

@greghaynes Whoops, sorry. I shall not read complicated issues before 8am.

That's a fairly big shave off the cold-start time indeed. Interesting stuff, thanks for the prototyping 🎉

@mattmoor
Member

A watch is also created

I assume you mean via API Server. Couldn't the pod be ready before it's reflected in the API?

I know the kubelet handles readiness probes, perhaps we could have our local daemonset find/probe the user container directly (as an approximation of what a kubelet-level API might give).

no queue-proxy nor istio-proxy

Why no queue-proxy?

@mattmoor
Member

Also 🎉

@greghaynes
Contributor Author

@mattmoor Yes, I expect the pod is ready earlier than we see via the API, since our first request to the pod succeeds. I'm not sure what you have in mind about finding/probing the pod directly: I don't know of an API the kubelet exposes for us to read from directly, and any kind of check against the underlying container system will, I think, be a pain to support across the various CRI backends. Maybe we could add another container to the pod which calls out to the activator on startup to signal that it is ready...

The main reason for not having queue-proxy is that I just didn't bother to add it to the pod definition :). I doubt there's much cost to having it, so this isn't a big issue, but I'm not sure it buys us anything here, as our activator is/can be essentially doing the same job.

@mattmoor
Member

@greghaynes Having the queue-proxy would enable us to have cooperative readiness without too much CRI-incompatible hackery. If we have the queue-proxy report readiness to the DaemonSet on startup (after doing its own readiness probing of the user container), then the activator would effectively know in ~realtime that the pod is ready for traffic.
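
A hypothetical sketch of that cooperative-readiness loop; the endpoint path, and how the queue-proxy learns the node-local activator's address, are made up for illustration rather than taken from this PR:

```go
package queue

import (
	"fmt"
	"net/http"
	"time"
)

// notifyActivatorReady would be called by the queue-proxy once its own
// readiness probe of the user container has succeeded. Instead of waiting
// for the pod status to round-trip through the API server, it tells the
// node-local activator directly that the revision can take traffic.
func notifyActivatorReady(activatorHostPort, revisionKey string) error {
	url := fmt.Sprintf("http://%s/ready?revision=%s", activatorHostPort, revisionKey)
	client := &http.Client{Timeout: time.Second}
	resp, err := client.Post(url, "text/plain", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("activator readiness callback returned %s", resp.Status)
	}
	return nil
}
```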

@greghaynes
Contributor Author

@mattmoor Ah, good idea - I was thinking of a similar pattern but putting it in queue proxy makes a lot of sense.

@josephburnett
Contributor

@greghaynes, #2557 (comment) this is really great to see. At the end of the day, the work that needs to be done (if the container is disk warm) is two network hops and starting a process. Client to ingress, ingress to activator. This should be 100s of ms. Anything more is an opportunity for optimization.

So we should keep digging until we understand exactly where all this extra time comes from (docker, k8s controllers, retries ...). If we find the k8s control plane is getting in our way, which is our current theory, then we may indeed need local scheduling support in k8s. But we'll want to run it by SIG Arch first.

@greghaynes
Contributor Author

@josephburnett I agree we should isolate this down to where it's being introduced - in us or upstream. I think docker is out of the picture here, as all of that happens within the kubelet (although it is likely a large part of the remaining ~2.5s cold-start delay). AFAIK the potential sources here are: autoscaler delay detecting request stat > 0; knative controllers (I think just revision); k8s controllers (deployment, replicaset, scheduler). I can try whipping up a test to get a good measurement for just scaling a deployment from 0, and then we can see what part of this time savings is knative vs k8s.

Once we get some validation that there's k8s control-plane delay here, I think we'll need to do our own local workaround in parallel to driving better API support upstream - I worry that the timeline for being able to get and rely on any fix we drive into k8s is going to be very long (not only do we have to actually get the changes accepted/released, we also have to wait for k8s platforms to roll out the new k8s). ISTM that while we are driving these changes upstream, local scheduling should be an optional/feature-flagged thing we support - since we know it's not a great API to rely on, we can make it opt-in + marked as unstable?

@josephburnett
Contributor

@greghaynes sounds good.

FYI: there is no reason for the autoscaler to wait for the activator to update the stat. The activator should push a stat immediately upon getting a request for a revision scaled to zero (like this), and the autoscaler should immediately recompute the desired scale for that revision (like this).

Some of this was just added latency from moving activation to the autoscaler. Previously the activator flipped the serving state, which immediately triggered scale-up.
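
A hypothetical illustration of that push-based flow; the Stat struct and StatSink interface below are stand-ins rather than knative/serving's actual types, and how the stat reaches the autoscaler is elided:

```go
package activator

import "time"

// Stat and StatSink are illustrative stand-ins for the autoscaler's metric
// plumbing.
type Stat struct {
	RevisionKey        string
	ConcurrentRequests int
	Time               time.Time
}

type StatSink interface {
	Send(Stat) error
}

// onColdStartRequest pushes a non-zero concurrency stat the moment a request
// arrives for a scaled-to-zero revision, so the autoscaler can recompute the
// desired scale immediately instead of waiting for the next reporting tick.
func onColdStartRequest(sink StatSink, revisionKey string) error {
	return sink.Send(Stat{
		RevisionKey:        revisionKey,
		ConcurrentRequests: 1,
		Time:               time.Now(),
	})
}
```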

@josephburnett
Contributor

@greghaynes another middle-ground option you might try is to have the Activator create the Pod by calling the Master, but populate Spec.NodeName with the current node and set Spec.SchedulerName="Activator". So the Activator does the scheduling but loops in the master, so the master has a chance to do Envoy injection. Then the Activator can just wait for the Kubelet to pick up and start the Pod, then proxy traffic to it.

This might reduce the difference between the two serving paths while maintaining the local-scheduling decision.
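
A sketch of what that could look like, assuming client-go of that era; the function name and pod metadata are illustrative:

```go
package activator

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createLocallyScheduledPod creates the revision pod through the API server
// (so admission webhooks such as sidecar injection still run), but pre-binds
// it to this node and tags it with a scheduler name the default scheduler
// will not act on.
func createLocallyScheduledPod(kube kubernetes.Interface, ns, name, nodeName string, spec corev1.PodSpec) (*corev1.Pod, error) {
	spec.NodeName = nodeName         // kubelet on this node picks the pod up directly
	spec.SchedulerName = "Activator" // keeps the default scheduler from also binding it
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
		Spec:       spec,
	}
	return kube.CoreV1().Pods(ns).Create(pod)
}
```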

@greghaynes
Contributor Author

I finally got some time to dig on where the delay is coming from. I upgraded my cluster and reran the previous testing and things are overall a bit faster afterwards. I was seeing avg cold start times (using plain master) of about 3.6s and with this patch rebased on master was seeing about 1.8s (so still ~50% or 1.8s gains). Looking at logs of the plain master run, out of the 3 test runs an average of 1.6s was spent between the time when ingressgateway saw the request and when autoscaler attempted to scale from 0 to 1. This means #2659 and #2615 should get us ~90% of the way to the perf we see here.

@markusthoemmes
Contributor

Neato, thanks for the insights @greghaynes. Glad to hear we can drive things into the existing architecture without having to work around too much.

@mattmoor
Member

@greghaynes Were you able to prototype the tighter feedback loop from the queue-proxy?

@greghaynes
Contributor Author

@mattmoor We created #2615 for that issue, and @imikushin mentioned he'd like to poke at it.

@markusthoemmes
Contributor

@greghaynes could you give a status update on this one? What state is it in? Is it being worked on actively?

@greghaynes
Contributor Author

@markusthoemmes Hey, sure. My plan was to get tracing (#2726 and #3491) merged first, then get some new data on what this actually buys us in terms of performance, to send upstream and ask sig-arch/k8s if anyone had any thoughts. Based on my one-off measurements this doesn't buy us a lot of performance in the no-load case, but it is needed because control-plane latency is extremely variable under load. Beyond that planning there hasn't been any progress on this, as tracing has taken (much) longer than expected to get merged...

@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Apr 15, 2019
@markusthoemmes
Contributor

@greghaynes are we still following through on this? We have activator scalability coming up on the scaling roadmap and I was wondering about this PR in relation to that.

@markusthoemmes
Contributor

I feel like this is vastly outdated, and the approach is kinda accepted as one of autoscaling's feature streams going forward. I expect a lot of PRs to fall out of that work that'll supersede this one, so I'll naively close it for now.

/close

If you feel differently, please feel free to reopen @greghaynes

@knative-prow-robot
Contributor

@markusthoemmes: Closed this PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
cla: yes Indicates the PR's author has signed the CLA.
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prototype local scheduling of pods
Remove k8s controlplane performance variance from our cold-start time
8 participants