
Investigate removing Activator from data path #3276

Closed · 4 tasks done
tanzeeb opened this issue Feb 19, 2019 · 20 comments

Labels: area/networking, kind/feature (Well-understood/specified features, ready for coding.), lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)

Comments

@tanzeeb
Contributor

tanzeeb commented Feb 19, 2019

In what area(s)?

/area networking

Describe the feature

Currently, the activator has the following responsibilities:

  1. Receiving & buffering requests for inactive Revisions.
  2. Reporting metrics to the autoscaler.
  3. Retrying requests to a Revision after the autoscaler scales that Revision based on the reported metrics.

Arguably, only "Reporting metrics to the autoscaler" is a problem specific to Serving.

Are there Istio/Envoy features we can leverage to reduce the work done by the activator?

The receiving, buffering and retrying of requests accounts for the majority of the code in the activator. Since the activator is in the data path on cold start, there are a lot of edge cases that the activator has to account for (e.g. handling both HTTP/1 and HTTP/2, retrying on 503s, buffering and rewinding request bodies) that are handled by Istio/Envoy in the warm-start data path.
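
To make one of those edge cases concrete, here is a minimal Go sketch (hypothetical code, not the actual activator implementation) of why body buffering and rewinding matters: a request body stream is read-once, so it has to be captured up front to be resent on each retry while the revision scales up.

package sketch

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// proxyWithRetries sketches the kind of work the activator does by hand:
// buffer the request body once (the stream is read-once) so it can be
// rewound and resent on every retry attempt.
func proxyWithRetries(r *http.Request, target string, attempts int) (*http.Response, error) {
	body, err := io.ReadAll(r.Body) // buffer the body up front
	if err != nil {
		return nil, err
	}
	for i := 0; i < attempts; i++ {
		req, err := http.NewRequest(r.Method, target, bytes.NewReader(body)) // rewind
		if err != nil {
			return nil, err
		}
		req.Header = r.Header.Clone()
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			continue // transport error while the pod comes up: retry
		}
		if resp.StatusCode != http.StatusServiceUnavailable {
			return resp, nil
		}
		resp.Body.Close() // 503: revision not ready yet, retry
	}
	return nil, fmt.Errorf("no healthy upstream after %d attempts", attempts)
}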

Some approaches to explore:

  - [x] HTTPRoute mirroring
  - [x] Internal redirection
  - [x] Lua filters
  - [x] Route retries

/assign
cc @tcnghia @markusthoemmes

Edit: Updated with more approaches

@tanzeeb tanzeeb added the kind/feature Well-understood/specified features, ready for coding. label Feb 19, 2019
@markusthoemmes
Contributor

/cc @vvraskin

I agree, let's try to do better 🙂

@mattmoor
Member

This has been on @tcnghia's TODO list for a while, I'd love to see something better :)

@mattmoor mattmoor added this to the Needs Triage milestone Feb 19, 2019
@tcnghia
Contributor

tcnghia commented Feb 20, 2019

Thanks for tackling this.

Yes, I think a lot of the existing quirks from golang's ReverseProxy would have been avoided if the activator / queue-proxy were Envoys instead of our own homebrew proxies. So this is certainly a good direction to go.

@tanzeeb
Contributor Author

tanzeeb commented Feb 21, 2019

Update:

HTTPRoute mirroring is a dead end.

The idea was to add the Activator as a mirror and leave the main routes unchanged, like so:

    route:
    - destination:
        host: helloworld-test-image-w5cvg-service.default.svc.cluster.local
        port:
          number: 80
      weight: 100
    mirror:
      host: activator-service.knative-serving.svc.cluster.local
      port:
        number: 80

The activator would call ActivateEndpoint and return immediately, without proxying the request. The hope was that the autoscaler would scale up the instance before Envoy ran out of retries.

Unfortunately, this returns a 503 "no healthy upstream" from Envoy immediately.

I'm getting the impression that Istio/Envoy retries apply to the selected upstream host, and not to the route as a whole. That is, if there are 3 endpoints available for a route, and endpoint 2 is chosen for a given request, then Envoy will only retry the request on endpoint 2.

Will investigate the retries a bit more: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/http_routing#retry-semantics

@tanzeeb
Contributor Author

tanzeeb commented Feb 23, 2019

Internal redirection is a no-go. From the conditions:

For a redirect to be handled successfully it must pass the following checks:

  1. Be a 302 response.
  2. Have a location header with a valid, fully qualified URL matching the scheme of the original request.
  3. The request must have been fully processed by Envoy.
  4. The request must not have a body.
  5. The request must have not been previously redirected, as determined by the presence of an x-envoy-original-url header.

We cannot guarantee the fourth condition.

@tanzeeb
Contributor Author

tanzeeb commented Feb 28, 2019

Lua filters are promising, but with a few caveats and obstacles:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: activator
  namespace: default
spec:
  filters:
  - filterConfig:
      inlineCode: |
        function envoy_on_request(handle)
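          -- Call the activator synchronously; httpCall blocks this request
          -- until the activator responds or the 60000 ms timeout fires,
          -- i.e. until the target revision is active.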
          headers, body = handle:httpCall(
            "outbound|80||activator-service.knative-serving.svc.cluster.local",
            {
              [":method"] = "GET",
              [":path"] = "/",
              [":authority"] = handle:headers():get(":authority"),
              ["knative-serving-revision"] = "helloworld-test-image-w5cvg",
              ["knative-serving-namespace"] = "default",
            },
            "",
            60000)
        end
    filterName: envoy.lua
    filterType: HTTP
    listenerMatch:
      listenerProtocol: HTTP
      listenerType: GATEWAY

This filter will execute before a request and block until the target revision is activated.

Obstacles:

  1. Revision/Namespace are hard-coded in the proof of concept above. We need to find a way to pass this metadata dynamically from the VirtualService.
  2. All requests, cold or warm, require a call to the activator. We need to find a way to skip these calls for warm requests.

Caveats:

  1. We're introducing a direct dependency on an Envoy-based ingress.

Update:
With Istio 1.1, we can truly place the filter at the end of the chain. This means we can read the namespace/revision headers appended to the request, and activate dynamically. This will address obstacle 1, and we can use another header for obstacle 2.

@tanzeeb
Contributor Author

tanzeeb commented Mar 1, 2019

Route retries are not as promising.

Modifying the activator to 1) activate the revision and then 2) return a 503 does indeed trigger a retry. However, the updated VirtualService configuration is not picked up by the router, so the retries are sent right back to the activator.

Going to explore the Envoy code to better understand if and how the cluster configuration can be updated mid-route.

Update:

  1. Cluster configuration can't be updated with Istio. In our usage, the activator and the revision k8s service are two separate clusters.
  2. Route configuration does not get updated mid-route.

Looks like route retries are a dead-end as well.

@tanzeeb
Contributor Author

tanzeeb commented Mar 20, 2019

Update: Pausing until Istio 1.1 lands.

@tanzeeb
Contributor Author

tanzeeb commented May 7, 2019

The following Lua filter works with Istio 1.1:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: lua-test
  namespace: istio-system
spec:
  filters:
  - listenerMatch:
      listenerType: GATEWAY
      listenerProtocol: HTTP
    filterName: envoy.lua
    insertPosition:
      index: BEFORE
      relativeTo: envoy.router
    filterType: HTTP
    filterConfig:
      inlineCode: |
        function envoy_on_request(handle)
          op = handle:headers():get("x-envoy-decorator-operation")

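          -- The decorator operation string has the form "host:port/path",
          -- where the host's first two labels are the revision name and
          -- its namespace; extract both pieces below.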
          rev, ns = string.match(op, "([^.]+)%.([^.]+)")
          host, port = string.match(op, "([^:]+)%:([^/]+)/")

          headers, body = handle:httpCall(
            "outbound|" .. port .. "||" .. host,
            {
              [":method"] = "GET",
              [":path"] = "/",
              [":authority"] = handle:headers():get(":authority"),
              ["knative-serving-revision"] = rev,
              ["knative-serving-namespace"] = ns,
            },
            "",
            60000)
        end

Unfortunately, the obstacles mentioned earlier still apply. The Lua filter is not able to get route information from the virtual service.

@tanzeeb
Contributor Author

tanzeeb commented May 10, 2019

Interesting update: with SKS, http retries work very well:

Replacing the current reverse proxy:

  httpStatus = a.proxyRequest(w, r.WithContext(reqCtx), target)

with

  httpStatus = http.StatusServiceUnavailable

gets us the exact same functionality.
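
For illustration, a hypothetical sketch of what the simplified handler amounts to (reportRequest is an assumed stand-in for the stat the activator pushes to the autoscaler, not an actual function in the codebase):

package sketch

import "net/http"

type activator struct{}

// reportRequest stands in for the metric the activator reports to the
// autoscaler so it scales the revision from zero (assumed helper).
func (a *activator) reportRequest(ns, rev string) {}

// ServeHTTP sketches the simplified data path: no proxying, buffering, or
// retry loop in the activator itself. Record the request, answer 503, and
// let the ingress-level retry policy re-send it once the revision is up.
func (a *activator) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	a.reportRequest(
		r.Header.Get("knative-serving-namespace"),
		r.Header.Get("knative-serving-revision"))
	w.WriteHeader(http.StatusServiceUnavailable)
}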

Will send in a PR to discuss.

@lvjing2
Contributor

lvjing2 commented May 12, 2019

It is attractive to remove both the activator and the queue-proxy if we can, but then we would rely on the Envoy sidecar too much. What should we do if we change the sidecar?

@tanzeeb
Contributor Author

tanzeeb commented May 13, 2019

It is attractive to remove both the activator and the queue-proxy if we can, but then we would rely on the Envoy sidecar too much. What should we do if we change the sidecar?

Yes, this is the discussion I want to provoke. Should we lean into Istio/Envoy and take advantage of all of its features, or do we decouple from Istio/Envoy and re-implement the features ourselves? There are pros and cons to both.

That being said, it's a bit too early to say we can replace both the queue-proxy and activator:

  1. I've been playing with the Lua filters for a while; they seem pretty unstable. I see a lot of crashes in the ingressgateway logs.
  2. The http retries are breaking the gRPC cold-start e2e tests.

@tcnghia
Contributor

tcnghia commented Jun 26, 2019

@tanzeeb given the blockers I'd put this in Ice Box and will revisit after 0.8/v1.

@tcnghia tcnghia modified the milestones: Needs Triage, Ice Box Jun 26, 2019
@jchesterpivotal
Contributor

jchesterpivotal commented Jul 19, 2019

@tanzeeb
Contributor Author

tanzeeb commented Jul 19, 2019

Looking at the presentation, an adapter for Mixer sounds very promising. This would be similar to the Envoy filter approach, but would have the additional route metadata from Istio that is missing in Envoy.

@knative-housekeeping-robot

Issues go stale after 90 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Stale issues rot after an additional 30 days of inactivity and eventually close.
If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle stale

@knative-prow-robot knative-prow-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 26, 2019
@knative-housekeeping-robot

Stale issues rot after 30 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.
Rotten issues close after an additional 30 days of inactivity.
If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle rotten

@knative-prow-robot knative-prow-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 25, 2020
@vagababov
Contributor

@tanzeeb I think our general architecture is geared towards keeping activators and enhancing their functionality.
If you're not really pursuing this, we should close it out.

@knative-housekeeping-robot

Rotten issues close after 30 days of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/close

@knative-prow-robot
Contributor

@knative-housekeeping-robot: Closing this issue.

In response to this:

Rotten issues close after 30 days of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dprotaso dprotaso removed this from the Ice Box milestone Oct 6, 2021