-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: improve error handling, and more! #524
Conversation
fb0692b
to
496bd1f
Compare
ae6665a
to
1d81e24
Compare
1d81e24
to
d646545
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes are looking good.
Only some changes requested due to the usage of the format!
macro.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for addressing the change requests.
@MicaiahReid Just briefly reading this, but is this a typo? |
@MicaiahReid @CharlieC3 Just chiming in! |
It's in the PR description. Admittedly I haven't gone very deep in the code to verify, but though it may be best just to verify directly you or Micaiah :) |
@CharlieC3 From what I've seen, Grafana hasn't ever had a problem recognizing these as errors. The panic message referenced in #521 wasn't using this library, so Grafana wasn't picking it up. Now all logs are done through slog. I was curious if the |
haha that is because of a quick fix I put up in our log collection 😄 If there's no way around this I'm happy to keep it in, but if it's something easy to configure, I'd rather the logging be compatible with Grafana by default without needing extra configuration at the infra level. |
Quick investigation on that: Here is where it defines the log level constants: https://github.com/slog-rs/slog/blob/master/src/lib.rs#L2112 And it seems that the short naming is the default: https://github.com/slog-rs/slog/blob/master/src/lib.rs#L2291. I've never used |
Seems like it's possible to configure slog to use Grafana-compatible logging levels using Again, I'm somewhat flexible on this if it's too difficult at a service level, but I think aligning ourselves to what Grafana categorizes by default is an easy way to get compatibility across all our services and fall under a new standard for Hiro without the need for one-offs for individual services at an infra level. |
Cool @CharlieC3, I didn't realize we had custom patches that were required to make our logging work. I've created this PR in the logging library that is used by a few of our tools - this will just print the full log level prefix so that you won't need to have these patches. (granted, it will require users of this lib to upgrade + use the new feature, but Chainhook will be the first on that front!) |
Nice! Thanks so much, seems like a perfect way to handle this. |
This PR introduces a few fixes in an effort to improve reliability and debugging problems when running Chainhook as a service: - Revisits log levels throughout the tool (fixes #498, fixes #521). The general approach for the logs were: - `crit` - fatal errors that will crash mission critical component of Chainhook. In these cases, Chainhook should automatically kill all main threads (not individual scanning threads, which is tracked by #404) to crash the service. - `erro` - something went wrong the could lead to a critical error, or that could impact all users - `warn` - something went wrong that could impact an end user (usually due to user error) - `info` - control flow logging and updates on the state of _all_ registered predicates - `debug` - updates on the state of _a_ predicate - Crash the service if a mission critical thread fails (see #517 (comment) for a list of these threads). Previously, if one of these threads failed, the remaining services would keep running. For example, if the event observer handler crashed, the event observer API would keep running. This means that the stacks node is successfully emitting blocks that Chainhook is acknowledging but not ingesting. This causes gaps in our database Fixes #517 - Removes an infinite loop with bitcoin ingestion, crashing the service instead: Fixes #506 - Fixes how we delete predicates from our db when one is deregistered. This should reduce the number of logs we have on startup. Fixes #510 - Warns on all reorgs. Fixes #519
This PR introduces a few fixes in an effort to improve reliability and debugging problems when running Chainhook as a service: - Revisits log levels throughout the tool (fixes #498, fixes #521). The general approach for the logs were: - `crit` - fatal errors that will crash mission critical component of Chainhook. In these cases, Chainhook should automatically kill all main threads (not individual scanning threads, which is tracked by #404) to crash the service. - `erro` - something went wrong the could lead to a critical error, or that could impact all users - `warn` - something went wrong that could impact an end user (usually due to user error) - `info` - control flow logging and updates on the state of _all_ registered predicates - `debug` - updates on the state of _a_ predicate - Crash the service if a mission critical thread fails (see #517 (comment) for a list of these threads). Previously, if one of these threads failed, the remaining services would keep running. For example, if the event observer handler crashed, the event observer API would keep running. This means that the stacks node is successfully emitting blocks that Chainhook is acknowledging but not ingesting. This causes gaps in our database Fixes #517 - Removes an infinite loop with bitcoin ingestion, crashing the service instead: Fixes #506 - Fixes how we delete predicates from our db when one is deregistered. This should reduce the number of logs we have on startup. Fixes #510 - Warns on all reorgs. Fixes #519
## [1.4.0](v1.3.1...v1.4.0) (2024-03-27) ### Features * detect http / rpc errors as early as possible ([ad78669](ad78669)) * use stacks.rocksdb for predicate scan ([#514](#514)) ([a4f1663](a4f1663)), closes [#513](#513) [#485](#485) ### Bug Fixes * enable debug logs in release mode ([#537](#537)) ([fb49e28](fb49e28)) * improve error handling, and more! ([#524](#524)) ([86b5c78](86b5c78)), closes [#498](#498) [#521](#521) [#404](#404) [/github.com//issues/517#issuecomment-1992135101](https://github.com/hirosystems//github.com/hirosystems/chainhook/issues/517/issues/issuecomment-1992135101) [#517](#517) [#506](#506) [#510](#510) [#519](#519) * log errors on block download failure; implement max retries ([#503](#503)) ([0fc38cb](0fc38cb)) * **metrics:** update latest ingested block on reorg ([#515](#515)) ([8f728f7](8f728f7)) * order and filter blocks used to seed forking block pool ([#534](#534)) ([a11bc1c](a11bc1c)) * seed forking handler with unconfirmed blocks to improve startup stability ([#505](#505)) ([485394e](485394e)), closes [#487](#487) * skip db consolidation if no new dataset was downloaded ([#513](#513)) ([983a165](983a165)) * update scan status for non-triggering predicates ([#511](#511)) ([9073f42](9073f42)), closes [#498](#498)
## [1.4.0](hirosystems/chainhook@v1.3.1...v1.4.0) (2024-03-27) ### Features * detect http / rpc errors as early as possible ([4a3e7e8](hirosystems/chainhook@4a3e7e8)) * use stacks.rocksdb for predicate scan ([#514](hirosystems/chainhook#514)) ([bff1372](hirosystems/chainhook@bff1372)), closes [#513](hirosystems/chainhook#513) [#485](hirosystems/chainhook#485) ### Bug Fixes * enable debug logs in release mode ([#537](hirosystems/chainhook#537)) ([5ff21e0](hirosystems/chainhook@5ff21e0)) * improve error handling, and more! ([#524](hirosystems/chainhook#524)) ([2cfedd7](hirosystems/chainhook@2cfedd7)), closes [#498](hirosystems/chainhook#498) [#521](hirosystems/chainhook#521) [#404](hirosystems/chainhook#404) [/github.com/hirosystems/chainhook/issues/517#issuecomment-1992135101](https://github.com/hirosystems//github.com/hirosystems/chainhook/issues/517/issues/issuecomment-1992135101) [#517](hirosystems/chainhook#517) [#506](hirosystems/chainhook#506) [#510](hirosystems/chainhook#510) [#519](hirosystems/chainhook#519) * log errors on block download failure; implement max retries ([#503](hirosystems/chainhook#503)) ([0454a4f](hirosystems/chainhook@0454a4f)) * **metrics:** update latest ingested block on reorg ([#515](hirosystems/chainhook#515)) ([7653766](hirosystems/chainhook@7653766)) * order and filter blocks used to seed forking block pool ([#534](hirosystems/chainhook#534)) ([ff775de](hirosystems/chainhook@ff775de)) * seed forking handler with unconfirmed blocks to improve startup stability ([#505](hirosystems/chainhook#505)) ([063e0e2](hirosystems/chainhook@063e0e2)), closes [#487](hirosystems/chainhook#487) * skip db consolidation if no new dataset was downloaded ([#513](hirosystems/chainhook#513)) ([f34dee3](hirosystems/chainhook@f34dee3)) * update scan status for non-triggering predicates ([#511](hirosystems/chainhook#511)) ([9647e72](hirosystems/chainhook@9647e72)), closes [#498](hirosystems/chainhook#498)
This PR introduces a few fixes in an effort to improve reliability and debugging problems when running Chainhook as a service:
crit
- fatal errors that will crash mission critical component of Chainhook. In these cases, Chainhook should automatically kill all main threads (not individual scanning threads, which is tracked by add the ability to abort an initial scan if a predicate is deregistered #404) to crash the service.erro
- something went wrong the could lead to a critical error, or that could impact all userswarn
- something went wrong that could impact an end user (usually due to user error)info
- control flow logging and updates on the state of all registered predicatesdebg
- updates on the state of a predicate