Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Translate to tcp probe in queue-proxy when exec probe was used for readinessprobe in user-container #5712

Merged
merged 5 commits into from
Dec 12, 2019

Conversation

nak3
Copy link
Contributor

@nak3 nak3 commented Sep 28, 2019

Proposed Changes

When users defined exec probe for readinessprobe, it becomes nil and
get unexpected error. To make matters worse, the pod never becomes ready.

This patch changes to translates exec probe in user-container to tcp
probe as per following code comment.

// Should never be reachable. Exec probes to be translated to
// TCP probes when container is built.

/lint

Fixes #5711

Release Note

NONE

@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Sep 28, 2019
@knative-prow-robot knative-prow-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. area/API API objects and controllers labels Sep 28, 2019
Copy link
Contributor

@knative-prow-robot knative-prow-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nak3: 0 warnings.

In response to this:

Proposed Changes

When users defined exec probe for readinessprobe, it becomes nil and
get unexpected error. To make matters worse, the pod never becomes ready.

This patch changes to translates exec probe in user-container to tcp
probe as per following code comment.

// Should never be reachable. Exec probes to be translated to
// TCP probes when container is built.

/lint

Fixes #5711

Release Note

NONE

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mattmoor
Copy link
Member

@nak3 What's the unexpected error you see?

Would it be possible to create an e2e test that exhibits the behavior you describe? I can think of ways we could abuse the relationship and the exec probe's wait period to create a solid window of 5xx errors.

@knative-prow-robot knative-prow-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/test-and-release It flags unit/e2e/conformance/perf test issues for product features and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 30, 2019
@nak3
Copy link
Contributor Author

nak3 commented Sep 30, 2019

@nak3 What's the unexpected error you see?

The queue-proxy keeps producing no probe found. I've also described it in #5711.

I added the e2e test for exec prober for the readiness probe.

Copy link
Member

@mattmoor mattmoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nak3 Does "no probe found" mean this never becomes ready?

for _, tc := range testCases {
tc := tc
t.Run(tc.name, func(t *testing.T) {
t.Parallel()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will detach from the logstream, I'd move that in here as part of the change to t.Run

corev1.Handler{
Exec: &corev1.ExecAction{
Command: []string{"/ko-app/runtime", "probe"},
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a case for TCP while you are in here?

I feel like for conformance we should have variants of these tests that FAIL the readiness probes. Could you either add that or open an issue to follow up?

cc @dgerd

Copy link
Contributor

@joshrider joshrider Sep 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've sort of been here before, but the runtime contract suggests that only httpGet and tcpSocket probes should be used. We've quietly left exec probes in, but do we want to codify their allowance into our conformance tests?

@@ -355,7 +355,12 @@ func applyReadinessProbeDefaults(p *corev1.Probe, port int32) {
p.TCPSocket.Host = localAddress
p.TCPSocket.Port = intstr.FromInt(int(port))
case p.Exec != nil:
//User-defined ExecProbe will still be run on user-container.
// User-defined ExecProbe will still be run on user-container.
// Translate to TCP probes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we initially had something like this on #4731 but it was decided to leave it out since it didn't necessarily tell us what we needed to know. See: #4731 (comment)

@joshrider
Copy link
Contributor

@nak3 Does "no probe found" mean this never becomes ready?

It looks like something with an execProbe will never become ready.

There's an (outdated) comment in the readinessProbe code that suggests that an execProbe will be translated to a tcp probe, but I think we just missed pulling that out when we reverted the functionality linked above. I think we missed changing pkg/queue/readiness/probe.go and, as a result, left in a bug.

A possible fix that lets us maintain the undocumented support of execProbes would be to succeed in cases where the queue-proxy does not find an httpGet, tcpSocket, or execAction probe, having ProbeContainer return true instead of false.

@tcnghia
Copy link
Contributor

tcnghia commented Sep 30, 2019

It looks like the runtime contract forbids execProbe, so we should just refuse

If specified, liveness and readiness probes are REQUIRED to be of the httpGet or tcpSocket types, and MUST target the inbound container port; platform providers SHOULD disallow other probe methods

https://github.com/knative/serving/blob/master/docs/runtime-contract.md#meta-requests

@toVersus
Copy link
Contributor

Yeah, but how can I probe the gRPC server? Current widespread strategy is to use execProbe with grpc-health-probe bundled in the image.

@joshuarider
Copy link

I brought up the case of the runtime contract vs. execProbes a few months ago and the consensus was to leave them be. Is it time to either revisit this or consider loosening the runtime contract?

@tcnghia
Copy link
Contributor

tcnghia commented Oct 1, 2019

/cc @evankanderson

@nak3
Copy link
Contributor Author

nak3 commented Oct 2, 2019

/test pull-knative-serving-unit-tests

@nak3
Copy link
Contributor Author

nak3 commented Oct 20, 2019

So, there are three options for the change. When exec probe is used in user container,

a. Translate it to TCP probe in queue-proxy (current PR)
b. Don't translate it for queue-proxy, but return true instead of false here when no probe to let pod ready.
c. Do not allow exec probe anyway as documented. (current code)

Even if we do not support exec probe, I would like to change the current error message. The current error #5711 is very unclear error.

For a and b, I can open an issue to track it if it is difficult to make the decision.

@markusthoemmes
Copy link
Contributor

FWIW I do prefer the current implementation (option A above). It makes sure that we test for general connectivity on the configured ports regardless on the user's probes.

I do wonder why we forbid exec probes or at least discourage them. @mattmoor any idea on that? Do we need to take this to the API WG in that regard?

@nak3
Copy link
Contributor Author

nak3 commented Nov 6, 2019

As per @dgerd 's #4086 (comment), I think option B above was the current expected workaround discussed in #4086 actually.

@dgerd any thoughts for this?

@knative-prow-robot knative-prow-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 11, 2019
@tcnghia
Copy link
Contributor

tcnghia commented Nov 26, 2019

/test pull-knative-serving-smoke-tests

Copy link
Contributor

@markusthoemmes markusthoemmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 28, 2019
@markusthoemmes
Copy link
Contributor

/assign @dgerd

@dgerd
Copy link

dgerd commented Dec 10, 2019

/hold cancel

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 10, 2019
@dgerd
Copy link

dgerd commented Dec 10, 2019

/test pull-knative-serving-upgrade-tests

2 similar comments
@nak3
Copy link
Contributor Author

nak3 commented Dec 10, 2019

/test pull-knative-serving-upgrade-tests

@nak3
Copy link
Contributor Author

nak3 commented Dec 10, 2019

/test pull-knative-serving-upgrade-tests

@nak3
Copy link
Contributor Author

nak3 commented Dec 11, 2019

/test pull-knative-serving-upgrade-tests

    probe_test.go:71: CheckSLO() = SLI for "TestProbe" = 0.912052, wanted >= 1.000000

I think this is not related to this change. I will rebase if next re-test failed.

nak3 added 5 commits December 11, 2019 10:42
…adinessprobe in user-container

When users defined exec probe for readinessprobe, it becomes `nil` and
get unexpected error. To make matters worse, the pod never becomes ready.

This patch changes to translates exec probe in user-container to tcp
probe.
@knative-prow-robot knative-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Dec 11, 2019
@nak3
Copy link
Contributor Author

nak3 commented Dec 11, 2019

/retest

@nak3
Copy link
Contributor Author

nak3 commented Dec 11, 2019

/test pull-knative-serving-upgrade-tests

It looks like TestProbe in pull-knative-serving-upgrade-tests is broken. Same error happens on other PRs.

#6162

Let's confirm it once again and will file the issue if it fail.

@nak3
Copy link
Contributor Author

nak3 commented Dec 11, 2019

/retest

Copy link
Contributor

@markusthoemmes markusthoemmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 12, 2019
@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: markusthoemmes, nak3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-test-reporter-robot

The following jobs failed:

Test name Triggers Retries
pull-knative-serving-integration-tests 0/3

Failed non-flaky tests preventing automatic retry of pull-knative-serving-integration-tests:

test/conformance/api/v1.TestUpdateConfigurationMetadata

@nak3
Copy link
Contributor Author

nak3 commented Dec 12, 2019

/test pull-knative-serving-integration-tests

@knative-prow-robot knative-prow-robot merged commit 353f8fb into knative:master Dec 12, 2019
@nak3 nak3 deleted the fix-exec-probe branch December 12, 2019 10:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/API API objects and controllers area/test-and-release It flags unit/e2e/conformance/perf test issues for product features cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ReadinesProbe with exec probe never become ready