Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configurable metric reporting interval #284

Merged
merged 1 commit into from
Apr 24, 2024

Conversation

jw3
Copy link
Contributor

@jw3 jw3 commented Jan 26, 2024

Provides recurring metric reports at a specified interval.

Uses a nonblocking timerfd alongside pthread_cond_timedwait to create timed reporting intervals. The interval is configured through the report_interval key in fapolicyd.conf. This value represents the length of interval in seconds between reports. If set to 0 reporting intervals are disabled, this is the default.

Reports are written to the fapolicyd.state file. The output format and contents of this file are unchanged by this PR. This is the same file that was written at shutdown and upon receiving SIGUSR1. The signal handling function remains unchanged.

Error handling is based around disabling interval reporting to preserve the decision thread. When an unrecoverable reporting error occurs the interval reports are disabled and the decision thread will use the non-interval report handler for the remainder of it's execution.

The structuring of the reporting loop gives priority to the fan handling loop. When a reporting interval expires, the fan handler runs prior to writing the report to prioritize the fan handling and also to ensure the cache stats are as recent as possible. Also the reporting loop is only entered when no fan events have been recorded.

This PR also restructure the initialization of the decision thread and reporting loop to ensure a fresh fapolicyd.state file when fapolicyd starts. The previous behavior was that the file from the last shutdown was still present until either a SIGUSR1 or shutdown.

@stevegrubb
Copy link
Member

Hmm. If you send SIGUSR1 to fapolicyd, the results are in the file /run/fapolicyd/fapolicyd.state.

@jw3
Copy link
Contributor Author

jw3 commented Jan 27, 2024

Ah, 👀, thanks @stevegrubb.. thought that was only done at shutdown.

Is there value in having the decision log interleaved with cache stats? This PR would allow logging that type of consolidated timeline.

I'm also a little concerned with adding noise to the logs when using SIGUSR1, (1) when bumping the event loop to ensure stats are not stale (see fapolicyd-cli ~line 850) and (2) when reading the contents of the state file.

The noise from both of these can be filtered out, though that creates a tricky case where log entries are interesting but match the filter, eg. profiling a profiler.

Are there any technical issues with the implementation provided in this PR?


Ill take a look at revising the cli logic, adding the new command doesn't seem right.

@jw3 jw3 force-pushed the cmd_dump_stats branch 2 times, most recently from 4e060d3 to e2b2801 Compare January 27, 2024 21:13
@stevegrubb
Copy link
Member

"fapolicyd-cli --check-status" does the SIGUSR1 thing and then displays the resulting file. It does not dump the LRU's. That only happens at shutdown and if fapolicyd.conf has do_stat_report = 1, which is the default. When set it to 0, you get just the metrics on shutdown. But no matter the value, SIGUSR1 is a concise report.

@jw3
Copy link
Contributor Author

jw3 commented Jan 30, 2024

Thanks for clarifying that @stevegrubb. I may have been using the term stats incorrectly. My stats term usage was based on lru.c function dump_queue_stats.

So with that said, I'm not looking to use values from the fapolicyd-access.log report. My only interest is in the metrics written to the fapolicyd.state file.

Those metrics are also logged at DEBUG level on shutdown. I was hoping they could also be logged upon request during runtime, that would allow application to continuously poll the cache state during runtime.

The fallback if this PR cannot merge, is to poll with the SIGUSR1 signal, which is feasible. The downside of the fallback is the file access operations of (1) a contrived file read to bump the event loop (see fapolicyd-cli ~line 850), and (2) reading the metrics from disk; each of which adds irrelevant file accesses, ie. noise, to the access log.

@jw3
Copy link
Contributor Author

jw3 commented Jan 30, 2024

Background in fapolicy-analyzer issue

@stevegrubb
Copy link
Member

I like the idea of making periodic metrics available. There are systems like statsd/prometheus/collectd/fluentd/etc that could help gain visibility into how the daemon (and by extention the system) is performing. By sending this to syslog it is not trivial to dig out the last occurrence. The intention of the SIGUSR1 interface was to have an on-demand way of seeing the daemon's internal metrics. The output is short and always the latest status. This would be ideal for collecting metrics.

The fapolicyd-cli app looks at the pid file, checks what it points to, if it is fapolicyd, it sends a sigusr1. Fapolicyd receives the signal on the inbound/main thread. The report is run by the decision thread so that it is serialized with any updates to the metrics. If the queue is empty, it sleeps indefinitely on a conditional wait. In order to wake it up, the cli app accesses a file (the fapolicyd config file). This can skew the statistics for the access report. (This is only written when the daemon exits and has no effect on performance data.) I suppose we can change the file it accesses, but to which one?

It might be possible to trigger the conditional wait from the signal handler or the main loop without needing a file access. This would let the report run immediately without skewing access statistics. We might need to do something in the decision thread to be OK if the queue is still empty.

Another possibility would be to add a configuration item for periodic reports. What this would do is setup a periodic timerfd which could be polled. When the timer fires, run a new report. This would avoid the need to create an access which skews statistics. The serialization of the report still needs to be handled in the decision thread. This might still need the conditional wait work mentioned in the previous paragraph. Then plugins can setup a notification on the report changing to trigger a re-read/parsing.

@jw3
Copy link
Contributor Author

jw3 commented Jan 31, 2024

Another possibility would be to add a configuration item for periodic reports. What this would do is setup a periodic timerfd which could be polled. When the timer fires, run a new report. This would avoid the need to create an access which skews statistics. The serialization of the report still needs to be handled in the decision thread. This might still need the conditional wait work mentioned in the previous paragraph.

Yes, this sounds right.

I made an attempt at sketching that out in this PR, it has sharp edges but is a working example. It needs some more isolated testing. I have a lot of events flowing on my system, so hard to say for sure if the pthread_cond_timedwait is working as expected. I'll test in isolation some and see what it looks like.

fapolicyd-cli --check-status remains functional.

@jw3 jw3 changed the title Add command to log cache stats Configurable metric reporting interval Jan 31, 2024
@stevegrubb
Copy link
Member

Thanks. I'll look over the changes in the coming week.

@jw3
Copy link
Contributor Author

jw3 commented Feb 3, 2024

Ok thanks. It seems to be working as expected, both when testing in isolation on a single directory and also when running it on my full system.

@radosroka
Copy link
Member

@jw3 do you have anything to add? If not we could probably merge it.

@jw3 jw3 force-pushed the cmd_dump_stats branch from a5b8cbc to 526e0d3 Compare April 22, 2024 21:12
@jw3
Copy link
Contributor Author

jw3 commented Apr 22, 2024

@radosroka I think it's ready to merge.

@radosroka
Copy link
Member

@radosroka I think it's ready to merge.

Would you mind squashing the commits into one?

@jw3 jw3 force-pushed the cmd_dump_stats branch 2 times, most recently from c102878 to 8eaa07b Compare April 23, 2024 15:01
@jw3
Copy link
Contributor Author

jw3 commented Apr 23, 2024

@radosroka it is squashed and rebased 👍

Provides recurring metric reports at a specified interval using a nonblocking timerfd and pthread_cond_timedwait. The interval is configured through the report_interval key in fapolicyd.conf. This value represents the length of interval in seconds between reports. If set to 0 reporting intervals are disabled, this is the default.

Reports are written to the fapolicyd.state file. The output format and contents of this file are unchanged by this PR. This is the same file that was written at shutdown and upon receiving SIGUSR1. The signal handling function remains unchanged.

Error handling is based around disabling interval reporting to preserve the decision thread. When an unrecoverable reporting error occurs the interval reports are disabled and the decision thread will use the non-interval report handler for the remainder of it's execution.

The structuring of the reporting loop gives priority to the fan handling loop. When a reporting interval expires, the fan handler runs prior to writing the report to prioritize the fan handling and also to ensure the cache stats are as recent as possible. Also the reporting loop is only entered when no fan events have been recorded.

This PR also restructure the initialization of the decision thread and reporting loop to ensure a fresh fapolicyd.state file when fapolicyd starts. The previous behavior was that the file from the last shutdown was still present until either a SIGUSR1 or shutdown.
@jw3 jw3 force-pushed the cmd_dump_stats branch from 8eaa07b to 7dbd8b0 Compare April 23, 2024 21:30
@jw3
Copy link
Contributor Author

jw3 commented Apr 23, 2024

rebased on @stevegrubb's latest commits

Resolved a few minor conflicts in notify.c, nothing significant.

@radosroka radosroka merged commit 5d58216 into linux-application-whitelisting:main Apr 24, 2024
24 of 30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants