Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect /var/log without fluentd sidecar #4156

Merged

Conversation

JRBANCEL
Copy link
Contributor

@JRBANCEL JRBANCEL commented May 23, 2019

Fixes #818

Proposed Changes

Have fluend daemon set capture user-container's /var/log logs directly instead of injecting a fluentd sidecar.

Changes:

  • Replace fluentd sidecar container by init container that simply creates a symlink so that the host node has a path to user-container's /var/log volume with the information needed by kubernetes_metadata_filter. See this for more details.
  • Add necessary config to fluend daemon set config to capture the logs directly from the Knative pod's volume

Result

Given some container logging to stdout/stderr and various files inside /var/log:
Code

The resulting logs are captured and enriched by fluentd daemon set.
I also added some fluentd magic so that the logs coming from /var/log have a stream field populated with the path of the log file so that it matches what docker does with stream being either stdout or stderr.

ES

Next

Add E2E test as part of #647

@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label May 23, 2019
@knative-prow-robot knative-prow-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 23, 2019
@knative-prow-robot knative-prow-robot added area/API API objects and controllers area/monitoring labels May 23, 2019
Copy link
Contributor

@yanweiguo yanweiguo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM. Left some nits.

@evankanderson
Copy link
Member

/approve

/hold

You can /hold cancel when you have an LGTM and are ready for this to go it. I just didn't want Yanwei to be caught by surprise and have this go in with pending comments.

@knative-prow-robot knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 24, 2019
@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: evankanderson, JRBANCEL

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 24, 2019
@JRBANCEL
Copy link
Contributor Author

JRBANCEL commented May 28, 2019

Please review again.
What has not changed since initial PR:

  • fluentd sidecar code removal
  • fluentd daemon set config

What has changed:

  • the init container increases startup latency, the symlink creation logic is now done by queue-proxy after it is done with its startup

@JRBANCEL JRBANCEL force-pushed the collect-var-log-without-sidecar branch from aa4b3c7 to 0988310 Compare May 29, 2019 00:30
@JRBANCEL
Copy link
Contributor Author

/retest

@yanweiguo
Copy link
Contributor

/lgtm
/retest

@knative-prow-robot knative-prow-robot added lgtm Indicates that a PR is ready to be merged. and removed lgtm Indicates that a PR is ready to be merged. labels May 29, 2019
@JRBANCEL JRBANCEL force-pushed the collect-var-log-without-sidecar branch from 4a9250f to 338183c Compare May 29, 2019 17:48
Copy link
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test -- I'm wondering if the linking logic it should live in cmd or pkg.

Also, I'm wondering if a small design doc would help explain this -- doing this from another container is very clever, but I worry that future readers will be confused.

It's particularly nice that this doesn't require additional privilege for the queue-proxy, just a shared volume.

@JRBANCEL
Copy link
Contributor Author

Also, I'm wondering if a small design doc would help explain this -- doing this from another container is very clever, but I worry that future readers will be confused.

Yes, I was thinking about updating the doc and add more details about how and why it works.

Right now, this is blocked because we want the default GKE fluentd configuration to capture those /var/log logs out of the box. I'll update their fluentd config and submit a PR.

@@ -99,6 +106,17 @@ func initEnv() {
servingService = os.Getenv("SERVING_SERVICE") // KService is optional
userTargetPort = util.MustParseIntEnvOrFatal("USER_PORT", logger)
userTargetAddress = fmt.Sprintf("127.0.0.1:%d", userTargetPort)
userContainerName = util.GetRequiredEnvOrFatal("USER_CONTAINER_NAME", logger)

enableVarLogCollection, _ = strconv.ParseBool(os.Getenv("ENABLE_VAR_LOG_COLLECTION")) // Optional, default is false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the simplicity of this trick, what are the trade-offs that would lead us to not simply enable this by default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Cost: why pay for logs you don't need by default?
  • Privacy: I am not sure collecting everything in /var/log by default is safe, who knows what is logged there
  • Backwards compatibility

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't the log collection and so forth be controlled by either the fluentd config or by deleting the daemonset?

Copy link
Contributor Author

@JRBANCEL JRBANCEL Jun 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, one can modify the fluentd config, but:

  1. It is global, there was a discussion about making varlog collection a flag per revision instead of globally. This makes the implementation of such feature easy.
  2. It is not always possible (GKE default fluentd config cannot be modified for example). More generally, it is not necessarily the same people operating Knative and fluentd (the infra).

@JRBANCEL JRBANCEL force-pushed the collect-var-log-without-sidecar branch from 766e689 to 3f9b182 Compare June 5, 2019 17:13
@knative-metrics-robot
Copy link

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/metrics/config.go 90.5% 86.7% -3.8
pkg/reconciler/revision/reconcile_resources.go 91.7% 92.2% 0.6

JRBANCEL added a commit to JRBANCEL/docs that referenced this pull request Jun 5, 2019
@JRBANCEL
Copy link
Contributor Author

JRBANCEL commented Jun 5, 2019

Can someone give this a new /lgtm before new conflicts show up?

@n3wscott
Copy link
Contributor

n3wscott commented Jun 5, 2019

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 5, 2019
@JRBANCEL
Copy link
Contributor Author

JRBANCEL commented Jun 5, 2019

/hold cancel

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 5, 2019
@JRBANCEL
Copy link
Contributor Author

JRBANCEL commented Jun 5, 2019

/retest

1 similar comment
@JRBANCEL
Copy link
Contributor Author

JRBANCEL commented Jun 6, 2019

/retest

@knative-prow-robot knative-prow-robot merged commit a1cd971 into knative:master Jun 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/API API objects and controllers area/monitoring area/networking cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Long-term story for /var/log support