[Security Solution] ML rule can miss anomaly documents if its interval/lookback is too short #158152

Open
Tracked by #165878
banderror opened this issue May 21, 2023 · 6 comments
Labels
bug Fixes for quality problems that affect the customer experience Feature:ML Rule Security Solution Machine Learning rule type impact:medium Addressing this issue will have a medium level of impact on the quality/strength of our product. sdh-linked Team:Detection Engine Security Solution Detection Engine Area Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc.

Comments

@banderror
Contributor

Summary

A user had an ML rule with interval = 15 mins and lookback time 1 min. This rule was based on an ML job with a fixed interval of 15 mins. The rule had anomaly_threshold = 90.

Although there were anomaly documents with record_score >= 90, the rule missed them and generated no alerts.

The user increased the rule interval to 22 mins, which fixed the issue: the rule started generating alerts.

This feels like a bug in the ML rule type/executor. If the lookback time for a given ML rule depends on the corresponding job parameters and has to be higher than a certain value, our app should tell the user about that and/or set its value automatically.

@banderror banderror added bug Fixes for quality problems that affect the customer experience triage_needed impact:medium Addressing this issue will have a medium level of impact on the quality/strength of our product. Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Feature:ML Rule Security Solution Machine Learning rule type Team:Detection Engine Security Solution Detection Engine Area labels May 21, 2023
@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@yctercero
Contributor

This seems like one we should either fix or triage, so that we know how to document it as a known issue for 8.9. Depending on how far back the issue goes, it may be worth adding it as a known issue for older releases too.

cc @peluja1012

@rylnd
Contributor

rylnd commented May 26, 2023

After some discussion with @marshallmain and the rest of the team, I think I can describe a generalization of this issue (or at least the hypothesis):

# 15m job, anomalies A, B
|-A--|B---|----|----|----|

# 15m interval rule, no lookback, offset from job
# Xs represent execution windows for executions 1, 2
--1----2----3----|----|
XXX
   XXXXX

In the above diagram, we can see the execution of both an ML job and an ML rule. In rule execution 1, the rule will be looking in the range of anomaly A; however, the ML job may still be processing that 15m bucket, and anomaly A may not yet exist or it might be an "interim" anomaly; in either case, it will not be alerted upon. Similarly, in execution 2, anomaly B may not be found for the same reasons, and anomaly A is now outside of execution 2's window. Neither of these anomalies will be found by the rule.

By increasing the rule lookback time (as was the fix in the referenced SDH), we ensure that subsequent rule execution will alert on a finalized anomaly from a previous bucket window.
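
To make the hypothesis above concrete, here is a toy simulation in minutes. It is only a sketch under assumptions that are not confirmed anywhere in this thread: each anomaly is timestamped at the start of its bucket, it becomes final a fixed `finalizeDelay` after the bucket ends, and a rule execution at time `t` searches the range `[t - (interval + lookback), t]`. All names are hypothetical.

```typescript
// Toy model, all times in minutes. Names and the finalization delay are
// illustrative assumptions, not Kibana or ML APIs.
interface SimulationOptions {
  bucketSpan: number; // ML job bucket length
  finalizeDelay: number; // assumed delay after bucket end until the anomaly is final
  interval: number; // rule interval
  lookback: number; // additional rule lookback
  offset: number; // first rule execution, relative to the first bucket start
  horizon: number; // simulate rule executions up to this time
}

interface SimulationResult {
  total: number;
  alerted: number;
}

function simulate(opts: SimulationOptions): SimulationResult {
  const { bucketSpan, finalizeDelay, interval, lookback, offset, horizon } = opts;

  // Times at which the rule executes.
  const executions: number[] = [];
  for (let t = offset; t <= horizon; t += interval) {
    executions.push(t);
  }

  let total = 0;
  let alerted = 0;
  // One anomaly per bucket in the first half of the horizon, so that later
  // executions still have a chance to see it.
  for (let bucketStart = 0; bucketStart < horizon / 2; bucketStart += bucketSpan) {
    const finalizedAt = bucketStart + bucketSpan + finalizeDelay;
    total += 1;
    // The anomaly is alerted on only if some execution both covers its
    // timestamp and runs after it has been finalized.
    const found = executions.some(
      (t) => bucketStart >= t - (interval + lookback) && bucketStart <= t && finalizedAt <= t
    );
    if (found) {
      alerted += 1;
    }
  }
  return { total, alerted };
}

const base = { bucketSpan: 15, finalizeDelay: 1, interval: 15, offset: 5, horizon: 240 };
const shortLookback = simulate({ ...base, lookback: 1 });
const longLookback = simulate({ ...base, lookback: 16 });
console.log(shortLookback, longLookback);
```

With these particular numbers the 1 min lookback finds none of the 8 simulated anomalies, while a 16 min lookback finds all of them. The outcome depends on the offset between the rule and the job, so this illustrates the failure mode rather than proving it.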

Assuming the above is correct, I would agree that we should validate that a rule's execution window (interval + lookback) is at least double the job's bucket_span, and issue a warning if that's not the case.
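
A minimal sketch of what that validation could look like, assuming durations arrive as simple strings such as `15m`. The helper names (`parseDurationMs`, `checkExecutionWindow`) and the warning text are hypothetical and are not existing Kibana functions.

```typescript
// Hypothetical helpers for illustration only; not part of Kibana.
type Unit = 's' | 'm' | 'h' | 'd';

const UNIT_MS: Record<Unit, number> = { s: 1000, m: 60000, h: 3600000, d: 86400000 };

// Parses durations such as "15m" or "1h" into milliseconds.
function parseDurationMs(value: string): number {
  const match = /^(\d+)([smhd])$/.exec(value.trim());
  if (match === null) {
    throw new Error(`Unsupported duration: ${value}`);
  }
  return Number(match[1]) * UNIT_MS[match[2] as Unit];
}

interface WindowCheck {
  windowMs: number; // interval + lookback
  requiredMs: number; // 2 * bucket_span
  ok: boolean;
  warning: string | null;
}

// Checks that the rule's execution window is at least double the job's bucket_span.
function checkExecutionWindow(interval: string, lookback: string, bucketSpan: string): WindowCheck {
  const windowMs = parseDurationMs(interval) + parseDurationMs(lookback);
  const requiredMs = 2 * parseDurationMs(bucketSpan);
  const ok = windowMs >= requiredMs;
  const warning = ok
    ? null
    : `Rule window (${windowMs} ms) is shorter than twice the job's bucket_span (${requiredMs} ms); anomalies may be missed.`;
  return { windowMs, requiredMs, ok, warning };
}

// The rule from the summary: 15m interval + 1m lookback against a 15m job.
const reported = checkExecutionWindow('15m', '1m', '15m');
console.log(reported.ok, reported.warning);
```

Here the reported configuration gives a 16 min window against a required 30 min, so the check fails and a warning string is produced.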

Another note: if the above is true, then #90316 arguably exacerbated this problem (assuming that interim anomaly scores are always <= finalized scores).

@yctercero yctercero removed their assignment May 31, 2023
@rylnd
Contributor

rylnd commented Jun 2, 2023

I was discussing this with @yctercero and I think we've come up with a potential solution: if ML were to maintain an updated_at or finalized_at (name TBD) time field, that was updated at the same time as is_interim on anomaly documents, the detection rules could use that field to sort results and eliminate the "late arrival/finalization" problem described above.

@marshallmain does that make sense to you? @jgowdyelastic would that be a reasonable request/task for the ML team?
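
To illustrate the idea, here is a rough in-memory sketch of selecting anomalies by when they were finalized rather than by their bucket timestamp. `finalizedAt` stands in for the proposed field, which does not exist today, and the record shape and function name are hypothetical simplifications of real anomaly documents.

```typescript
// Simplified, hypothetical record shape; times are in minutes for readability.
interface AnomalyRecord {
  id: string;
  timestamp: number; // bucket time of the anomaly
  recordScore: number;
  isInterim: boolean;
  finalizedAt: number | null; // proposed field (name TBD), null while interim
}

// Returns final anomalies at or above the threshold that were finalized since
// the previous rule execution, ordered by finalization time.
function selectNewlyFinalized(
  records: AnomalyRecord[],
  lastRunAt: number,
  now: number,
  threshold: number
): AnomalyRecord[] {
  return records
    .filter(
      (r) =>
        !r.isInterim &&
        r.finalizedAt !== null &&
        r.finalizedAt > lastRunAt &&
        r.finalizedAt <= now &&
        r.recordScore >= threshold
    )
    .sort((a, b) => (a.finalizedAt ?? 0) - (b.finalizedAt ?? 0));
}

const records: AnomalyRecord[] = [
  { id: 'A', timestamp: 0, recordScore: 95, isInterim: false, finalizedAt: 16 },
  { id: 'B', timestamp: 15, recordScore: 92, isInterim: true, finalizedAt: null },
  { id: 'C', timestamp: 0, recordScore: 40, isInterim: false, finalizedAt: 16 },
];

// Rule executions at minute 5 and minute 20: anomaly A's bucket timestamp (0)
// is before minute 5, but it was finalized at minute 16, inside (5, 20].
const selected = selectNewlyFinalized(records, 5, 20, 90);
console.log(selected.map((r) => r.id));
```

Because the selection keys on the finalization time, anomaly A is picked up by the second execution even though its bucket timestamp precedes the first one.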

@darnautov
Contributor

Hi @rylnd, I reckon it's worth first trying to align the lookback interval logic with the approach we use in the Anomaly Detection rule type.

There, the lookback interval is set to double the bucket span plus the query delay.

You can read more about it in this blog post.
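
As a rough sketch of that calculation (not the actual code of the Anomaly Detection rule type; the function names and the 60s query delay used in the example are assumptions):

```typescript
// Illustrative only: lookback = 2 * bucket_span + query_delay, in milliseconds.
function anomalyLookbackMs(bucketSpanMs: number, queryDelayMs: number): number {
  return 2 * bucketSpanMs + queryDelayMs;
}

// Derives the time range a rule execution at `now` would search.
function anomalyTimeRange(
  now: Date,
  bucketSpanMs: number,
  queryDelayMs: number
): { from: Date; to: Date } {
  const lookbackMs = anomalyLookbackMs(bucketSpanMs, queryDelayMs);
  return { from: new Date(now.getTime() - lookbackMs), to: now };
}

const FIFTEEN_MINUTES_MS = 15 * 60 * 1000;
const ASSUMED_QUERY_DELAY_MS = 60 * 1000; // assumed value for the example

const range = anomalyTimeRange(
  new Date('2023-06-02T12:00:00.000Z'),
  FIFTEEN_MINUTES_MS,
  ASSUMED_QUERY_DELAY_MS
);
console.log(range.from.toISOString(), range.to.toISOString());
```

For a 15m bucket span and the assumed 60s query delay, the lookback comes to 31 minutes, so an execution at 12:00 would search from 11:29.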

We also share the alerting service from the ML app, perhaps you could reuse it as well. Let me know if I can help you with this.
