Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Pod tracing #3491

Closed
wants to merge 2 commits into from
Closed

[WIP] Pod tracing #3491

wants to merge 2 commits into from

Conversation

greghaynes
Copy link
Contributor

@greghaynes greghaynes commented Mar 21, 2019

When performing cold start the majority of our time is spent waiting for
our pod to be created. Unfortunately, because we do not control anything
beyond talking to kube-api, we have to perform our own resource tracking
to create spans which tie in to our existing cold-start tracing.

This depends on #2726 and #3881

Release Note

NONE

@knative-prow-robot knative-prow-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 21, 2019
@knative-prow-robot knative-prow-robot added area/networking area/test-and-release It flags unit/e2e/conformance/perf test issues for product features labels Mar 21, 2019
Copy link
Contributor

@knative-prow-robot knative-prow-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@greghaynes: 20 warnings.

In response to this:

Release Note

NONE

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

if enable, ok := cfgMap[enableKey]; ok {
if enableBool, err := strconv.ParseBool(enable); err != nil {
return nil, fmt.Errorf("Failed parsing tracing config %q: %v", enableKey, err)
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Golint indent: if block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary). More info.

if debug, ok := cfgMap[debugKey]; ok {
if debugBool, err := strconv.ParseBool(debug); err != nil {
return nil, fmt.Errorf("Failed parsing tracing config %q", debugKey)
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Golint indent: if block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary). More info.

}
}

func CreateTracer(cfg *config.Config, reporter zipkinreporter.Reporter, serviceName, hostPort string) (*zipkin.Tracer, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Golint comments: exported function CreateTracer should have comment or be unexported. More info.

"go.uber.org/zap"
)

type TracerRefGetter func(context.Context) (*TracerRef, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Golint comments: exported type TracerRefGetter should have comment or be unexported. More info.

}, nil
}

type ReporterFactory func(cfg *config.Config) (zipkinreporter.Reporter, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Golint comments: exported type ReporterFactory should have comment or be unexported. More info.

reporterRef *refcountedReporter
}

func (tr *TracerRef) Ref() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Golint comments: exported method TracerRef.Ref should have comment or be unexported. More info.

tr.reporterRef.refs.RLock()
}

func (tr *TracerRef) Done() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Golint comments: exported method TracerRef.Done should have comment or be unexported. More info.

@greghaynes greghaynes changed the title Pod tracing [WIP] Pod tracing Mar 21, 2019
@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 21, 2019
@greghaynes greghaynes force-pushed the pod-tracing branch 2 times, most recently from d09b9e2 to 7eef4d1 Compare March 28, 2019 22:52
@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Mar 28, 2019
@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: greghaynes
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: mattmoor

If they are not already assigned, you can assign the PR to them by writing /assign @mattmoor in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/monitoring and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 24, 2019
@greghaynes greghaynes force-pushed the pod-tracing branch 2 times, most recently from 06f7e61 to d90bb9a Compare April 24, 2019 21:37
Gregory Haynes added 2 commits May 15, 2019 09:38
This span clearly shows time spent waiting for our endpoint to become
ready
During cold start it is very useful to see spans on how long the k8s
resources we wait for (pods) take in various states relative to our
cold-start request.
@knative-prow-robot knative-prow-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 15, 2019
@knative-metrics-robot
Copy link

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/tracing/config/tracing.go 82.6% 80.0% -2.6
pkg/tracing/pod_tracer.go Do not exist 100.0%

@knative-prow-robot
Copy link
Contributor

@greghaynes: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-knative-serving-unit-tests c17b16f link /test pull-knative-serving-unit-tests
pull-knative-serving-build-tests c17b16f link /test pull-knative-serving-build-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@markusthoemmes
Copy link
Contributor

This seems vastly out of date. Any status @greghaynes?

@greghaynes
Copy link
Contributor Author

I've gotten a lot of mileage out of this PR which is why I've left it (lots of testing features / fixes using this). The reason it has remained WIP though is I am not sure how to do this in a way where it isn't very fragile. Right now we are manually finding a pod and tracing it. If a pod is scaling down while the request comes in to scale up from 0 it is difficult to pick the right pod. Theres also some potential races where requests come in to activator but the pod is already scaled up.

My main question is - if I were to write some more code to pick the correct pod and handle this race, is this something we'd want to maintain? I have been sort of on the fence about this - its optional code and I'd worry itll be rarely run and has a somewhat complex state machine...

@jonjohnsonjr
Copy link
Contributor

image

@markusthoemmes
Copy link
Contributor

Is this approach valid at this point? We're running however many requests through the activator as we want and even potentially always keep it in path. It feels like this is generating a lot of overhead and the traces cannot be mapped one to one to the requests anyway.

So I think the races you mentioned are amplified at this point. I'm not sure how we'd go about making this as robust as we'd like it to be 🤔. Is this still producing very vital information for us?

@markusthoemmes
Copy link
Contributor

I'm inclined to close this. It's vastly outdated and if we wanted to adopt it, I think we'd need to do some designwork around this.

@nimakaviani
Copy link
Contributor

I have done a rebase fairly recently that I have been using for some performance assessment: https://github.com/nimakaviani/serving/tree/pod-tracing-redux

But @greghaynes 's points on the complex state machine and whether or not we want to maintain it still hold. Lets decide on its faith in the next WG meeting. I add it to the agenda.

@markusthoemmes
Copy link
Contributor

Going to close this as @nimakaviani seems to have a working version himself. Feel free to reopen once we determined the future of this?

/close

@knative-prow-robot
Copy link
Contributor

@markusthoemmes: Closed this PR.

In response to this:

Going to close this as @nimakaviani seems to have a working version himself. Feel free to reopen once we determined the future of this?

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/monitoring area/networking area/test-and-release It flags unit/e2e/conformance/perf test issues for product features cla: yes Indicates the PR's author has signed the CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants