-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Task Manager instrumentation #99160
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Left a comment about adding some tests. Do you think you can provide instructions for generating the charts you have in your PR locally? Is that possible? I'd love to see it all working.
Hi @elastic/kibana-alerting-services Did anyone discuss the state of this PR with @dgieselaar ? |
Sorry, I dropped the ball on that. I'm happy to add unit tests for the apm calls but I still don't know how to get APM running to see the traces. I can look through the docs |
@elasticmachine merge upstream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@elasticmachine merge upstream |
/** | ||
* The serialized traceparent string of the current APM transaction or span. | ||
*/ | ||
traceparent?: string; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add a note in the Readme.md about using this field to trace the execution of tasks in apm.
Other plugins might want to utilise this.
return await this.store.schedule(modifiedTask); | ||
return await this.store.schedule({ | ||
...modifiedTask, | ||
traceparent: agent.currentTraceparent ?? '', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does an empty string tell APM that there is no trace here? Or is this going to cause all tasks that don't have a traceparent to appear in APM as related because they all get grouped under an "empty string" transaction? 😬
@elasticmachine merge upstream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but added a couple of questions.
I'd appreciate if another member of @elastic/kibana-alerting-services took a look as @ymao1 as not just reviewer here but also contributer. :)
try { | ||
this.task = this.definition.createTaskRunner(modifiedContext); | ||
const result = await this.task.run(); | ||
const result = await withSpan({ name: 'run', type: 'task manager' }, () => this.task!.run()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, why is this.task!
necessary in this PR, but not in master? (on the above line)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Pretty cool to see how useful this integration is and will continue to be for us and users!
@elasticmachine merge upstream |
💚 Build SucceededMetrics [docs]History
To update your PR or re-run it, just comment with: |
* Instrument task manager * Don't refresh after SO updates * Update snapshot test, remove beforeRun instrumentation * Revert "Don't refresh after SO updates" This reverts commit 54f848d. * Fix task_store unit test * Adding tests and updating ConcreteTaskInstance interface with traceparent * Reverting unnecessary changes Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Ying Mao <ying.mao@elastic.co>
💚 Backport successful
This backport PR will be merged automatically after passing CI. |
* Instrument task manager * Don't refresh after SO updates * Update snapshot test, remove beforeRun instrumentation * Revert "Don't refresh after SO updates" This reverts commit 54f848d. * Fix task_store unit test * Adding tests and updating ConcreteTaskInstance interface with traceparent * Reverting unnecessary changes Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Ying Mao <ying.mao@elastic.co> Co-authored-by: Dario Gieselaar <dario.gieselaar@elastic.co> Co-authored-by: Ying Mao <ying.mao@elastic.co>
* master: (90 commits) Fix UI breaks on providing long search keyword in 'Search Box' (elastic#101385) Adds css class to EuiDescriptionListDescription in order to break word on exception details card (elastic#101481) [Lens] Increase timings for drag and drop tests (elastic#101380) [Lens] Fix editor react error on configuration panel (elastic#101367) [Fleet] Move integrations to a separate app (elastic#99848) Fix incorrect message displayed on importing Timeline Templates (elastic#101288) [Cases] RBAC (elastic#95058) [APM] Visual improvements for new APM layout with left navigation (elastic#101360) [master] More precise alerts matching (elastic#99820) [Lens] Value in legend (elastic#101353) Revert "[Reporting] ILM policy for managing reporting indices (elastic#100130)" (elastic#101358) [Discover] Fix header row of data grid in Firefox (elastic#101374) Add link to advanced setting in Discover (elastic#101154) Url service locators (elastic#101045) [Timelion] Update the removal message to mention the exact version (elastic#100994) [Security Solution][Detection Engine] Test cases for alias failure test cases where we don't copy aliases correctly (elastic#101437) [Event Log] Adding `type_id` to saved object array in event log (elastic#100939) [Reporting] Add `location.url` info to console message logs (elastic#101427) [Security Solutions][Detection Engine] Fixes timestamp bugs within source indexes when the formats are not ISO8601 format (elastic#101349) Improve Task Manager instrumentation (elastic#99160) ...
In #90955, I dialed down the cardinality of Task Manager transaction names due to the performance impact of breakdown metrics. Now that some performance issues are fixed in the APM Agent, and the need for better observability of the Task Manager is evident, it makes sense to revert those changes and add additional instrumentation.
The following changes were made:
run
,markAvailableTasksAsClaimed
,markTaskAsRunning
run
transaction typesAdditionally, I've setmoved to #99919refresh
tofalse
in some situations, as by default the SO client will usewait_for
, and wait on a shard refresh. That could unnecessarily delay task completion, and consume resources that could otherwise be freed up.For the screenshots, I disabled http/https instrumentation to prevent the http spans from showing up, and I locally worked around an instrumentation issue in the Node.js agent that messes up span relationships (see elastic/apm-agent-nodejs#1964 for a description of the problem).
Before:
After: