fix: improve error handling, and more! #524

MicaiahReid · 2024-03-13T20:19:36Z

This PR introduces a few fixes in an effort to improve reliability and debugging problems when running Chainhook as a service:

Revisits log levels throughout the tool (fixes [service] Revisit log levels #498, fixes Classify & format panic logs correctly #521). The general approach for the logs were:
- crit - fatal errors that will crash mission critical component of Chainhook. In these cases, Chainhook should automatically kill all main threads (not individual scanning threads, which is tracked by add the ability to abort an initial scan if a predicate is deregistered #404) to crash the service.
- erro - something went wrong the could lead to a critical error, or that could impact all users
- warn - something went wrong that could impact an end user (usually due to user error)
- info - control flow logging and updates on the state of all registered predicates
- debg - updates on the state of a predicate
Crash the service if a mission critical thread fails (see Identify mission critical threads, and force process termination when they crash #517 (comment) for a list of these threads). Previously, if one of these threads failed, the remaining services would keep running. For example, if the event observer handler crashed, the event observer API would keep running. This means that the stacks node is successfully emitting blocks that Chainhook is acknowledging but not ingesting. This causes gaps in our database Fixes Identify mission critical threads, and force process termination when they crash #517
Removes an infinite loop with bitcoin ingestion, crashing the service instead: Fixes Potential infinite loop in bitcoin block ingestion #506
Fixes how we delete predicates from our db when one is deregistered. This should reduce the number of logs we have on startup. Fixes [service] deleting an interrupted predicate does not actually remove it from redis db #510
Warns on all reorgs. Fixes Emit warnings for re-orgs (Bitcoin + Stacks) #519

csgui

Changes are looking good.

Only some changes requested due to the usage of the format! macro.

components/chainhook-cli/src/config/mod.rs

csgui

LGTM! Thanks for addressing the change requests.

CharlieC3 · 2024-03-27T15:32:43Z

erro - something went wrong the could lead to a critical error, or that could impact all users

@MicaiahReid Just briefly reading this, but is this a typo? erro isn't a term recognized by Grafana, so it won't be properly formatted in our logs by default. I mentioned this in the referenced issue here, but I may have not called it out enough.

csgui · 2024-03-27T16:32:58Z

@MicaiahReid @CharlieC3 Just chiming in!
Where are you reading that erro? I saw (in someplace that I can't recall right now) that a logger was truncating the error word to erro.

CharlieC3 · 2024-03-27T17:02:23Z

Where are you reading that erro?

It's in the PR description. Admittedly I haven't gone very deep in the code to verify, but though it may be best just to verify directly you or Micaiah :)

MicaiahReid · 2024-03-27T17:24:00Z

@CharlieC3
The logging library (slog that Chainhook (as well as the devnet API and I believe a few of our other rust-based tools) uses truncates the log type to four characters (ERROR => ERRO, DEBUG => DEBG).

From what I've seen, Grafana hasn't ever had a problem recognizing these as errors. The panic message referenced in #521 wasn't using this library, so Grafana wasn't picking it up. Now all logs are done through slog.

I was curious if the CRIT log level would be a problem - slog doesn't have a FATAL option.

CharlieC3 · 2024-03-27T17:36:31Z

Grafana hasn't ever had a problem recognizing these as errors

haha that is because of a quick fix I put up in our log collection 😄 If there's no way around this I'm happy to keep it in, but if it's something easy to configure, I'd rather the logging be compatible with Grafana by default without needing extra configuration at the infra level.

csgui · 2024-03-27T17:51:42Z

Quick investigation on that:

Here is where it defines the log level constants: https://github.com/slog-rs/slog/blob/master/src/lib.rs#L2112

And it seems that the short naming is the default: https://github.com/slog-rs/slog/blob/master/src/lib.rs#L2291. I've never used slog, but maybe/probably, there is a formatter where that info can be defined.

CharlieC3 · 2024-03-27T18:50:05Z

Seems like it's possible to configure slog to use Grafana-compatible logging levels using slog.Leveler like so: https://betterstack.com/community/guides/logging/logging-in-go/#creating-custom-log-levels

Again, I'm somewhat flexible on this if it's too difficult at a service level, but I think aligning ourselves to what Grafana categorizes by default is an easy way to get compatibility across all our services and fall under a new standard for Hiro without the need for one-offs for individual services at an infra level.

MicaiahReid · 2024-03-27T19:27:33Z

Cool @CharlieC3, I didn't realize we had custom patches that were required to make our logging work. I've created this PR in the logging library that is used by a few of our tools - this will just print the full log level prefix so that you won't need to have these patches. (granted, it will require users of this lib to upgrade + use the new feature, but Chainhook will be the first on that front!)

CharlieC3 · 2024-03-27T19:38:47Z

Nice! Thanks so much, seems like a perfect way to handle this.

This PR introduces a few fixes in an effort to improve reliability and debugging problems when running Chainhook as a service: - Revisits log levels throughout the tool (fixes #498, fixes #521). The general approach for the logs were: - `crit` - fatal errors that will crash mission critical component of Chainhook. In these cases, Chainhook should automatically kill all main threads (not individual scanning threads, which is tracked by #404) to crash the service. - `erro` - something went wrong the could lead to a critical error, or that could impact all users - `warn` - something went wrong that could impact an end user (usually due to user error) - `info` - control flow logging and updates on the state of _all_ registered predicates - `debug` - updates on the state of _a_ predicate - Crash the service if a mission critical thread fails (see #517 (comment) for a list of these threads). Previously, if one of these threads failed, the remaining services would keep running. For example, if the event observer handler crashed, the event observer API would keep running. This means that the stacks node is successfully emitting blocks that Chainhook is acknowledging but not ingesting. This causes gaps in our database Fixes #517 - Removes an infinite loop with bitcoin ingestion, crashing the service instead: Fixes #506 - Fixes how we delete predicates from our db when one is deregistered. This should reduce the number of logs we have on startup. Fixes #510 - Warns on all reorgs. Fixes #519

## [1.4.0](v1.3.1...v1.4.0) (2024-03-27) ### Features * detect http / rpc errors as early as possible ([ad78669](ad78669)) * use stacks.rocksdb for predicate scan ([#514](#514)) ([a4f1663](a4f1663)), closes [#513](#513) [#485](#485) ### Bug Fixes * enable debug logs in release mode ([#537](#537)) ([fb49e28](fb49e28)) * improve error handling, and more! ([#524](#524)) ([86b5c78](86b5c78)), closes [#498](#498) [#521](#521) [#404](#404) [/github.com//issues/517#issuecomment-1992135101](https://github.com/hirosystems//github.com/hirosystems/chainhook/issues/517/issues/issuecomment-1992135101) [#517](#517) [#506](#506) [#510](#510) [#519](#519) * log errors on block download failure; implement max retries ([#503](#503)) ([0fc38cb](0fc38cb)) * **metrics:** update latest ingested block on reorg ([#515](#515)) ([8f728f7](8f728f7)) * order and filter blocks used to seed forking block pool ([#534](#534)) ([a11bc1c](a11bc1c)) * seed forking handler with unconfirmed blocks to improve startup stability ([#505](#505)) ([485394e](485394e)), closes [#487](#487) * skip db consolidation if no new dataset was downloaded ([#513](#513)) ([983a165](983a165)) * update scan status for non-triggering predicates ([#511](#511)) ([9073f42](9073f42)), closes [#498](#498)

MicaiahReid requested review from lgalabru and csgui March 15, 2024 19:59

MicaiahReid force-pushed the fix-handle-errors branch from fb0692b to 496bd1f Compare March 15, 2024 20:01

MicaiahReid marked this pull request as ready for review March 15, 2024 20:01

MicaiahReid mentioned this pull request Mar 15, 2024

write tests to confirm that failed mission-critical threads crash all services #526

Open

MicaiahReid had a problem deploying to Development-mainnet March 15, 2024 20:39 — with GitHub Actions Error

MicaiahReid had a problem deploying to Development-testnet March 15, 2024 20:39 — with GitHub Actions Error

MicaiahReid had a problem deploying to Development-mainnet March 18, 2024 15:14 — with GitHub Actions Error

MicaiahReid had a problem deploying to Development-testnet March 18, 2024 15:14 — with GitHub Actions Error

MicaiahReid force-pushed the fix-handle-errors branch from ae6665a to 1d81e24 Compare March 18, 2024 15:14

MicaiahReid had a problem deploying to Development-mainnet March 18, 2024 15:31 — with GitHub Actions Error

MicaiahReid had a problem deploying to Development-testnet March 18, 2024 15:31 — with GitHub Actions Error

MicaiahReid force-pushed the fix-handle-errors branch from 1d81e24 to d646545 Compare March 19, 2024 15:25

MicaiahReid had a problem deploying to Development-mainnet March 19, 2024 15:42 — with GitHub Actions Failure

MicaiahReid temporarily deployed to Development-testnet March 19, 2024 15:42 — with GitHub Actions Inactive

MicaiahReid temporarily deployed to Development-mainnet March 19, 2024 15:54 — with GitHub Actions Inactive

MicaiahReid temporarily deployed to Development-testnet March 19, 2024 15:54 — with GitHub Actions Inactive

MicaiahReid had a problem deploying to Staging-mainnet March 19, 2024 15:55 — with GitHub Actions Error

MicaiahReid had a problem deploying to Staging-testnet March 19, 2024 15:55 — with GitHub Actions Error

MicaiahReid added 11 commits March 19, 2024 16:40

spelling fixes

f1a6b4d

fix overflow subraction

8863c67

fix: continue on failed predicate registration

b7e2a02

fix: terminate service on ProcessBitcoinBlock failure

af42d42

spelling fix

23165ae

remove panic on missing btc block headers

dfa63ae

log crit error on ObserverEvent loop failure

07bdc0d

fix: crash service if scan loops crash

972979a

some cleanup

94a98b4

fix: shutdown predicate registration api on Terminate event

b0313a4

fix: terminate event observer on failure

ed04d76

MicaiahReid had a problem deploying to Development-testnet March 20, 2024 19:45 — with GitHub Actions Error

csgui requested changes Mar 27, 2024

View reviewed changes

components/chainhook-cli/src/config/mod.rs Outdated Show resolved Hide resolved

components/chainhook-cli/src/config/mod.rs Outdated Show resolved Hide resolved

fix tsv assertion formatting

4c4f944

MicaiahReid requested a review from csgui March 27, 2024 13:33

MicaiahReid temporarily deployed to Development-mainnet March 27, 2024 13:50 — with GitHub Actions Inactive

MicaiahReid temporarily deployed to Development-testnet March 27, 2024 13:50 — with GitHub Actions Inactive

MicaiahReid temporarily deployed to Staging-mainnet March 27, 2024 13:52 — with GitHub Actions Inactive

MicaiahReid had a problem deploying to Staging-testnet March 27, 2024 13:52 — with GitHub Actions Failure

MicaiahReid temporarily deployed to Development-mainnet March 27, 2024 13:53 — with GitHub Actions Inactive

MicaiahReid temporarily deployed to Development-testnet March 27, 2024 13:53 — with GitHub Actions Inactive

MicaiahReid temporarily deployed to Staging-mainnet March 27, 2024 13:53 — with GitHub Actions Inactive

MicaiahReid temporarily deployed to Staging-testnet March 27, 2024 13:53 — with GitHub Actions Inactive

csgui approved these changes Mar 27, 2024

View reviewed changes

MicaiahReid merged commit d6b8816 into develop Mar 27, 2024
10 checks passed

MicaiahReid deleted the fix-handle-errors branch March 27, 2024 15:26

MicaiahReid mentioned this pull request Mar 27, 2024

feat: add new features to hiro-system-kit hirosystems/clarinet#1396

Merged

MicaiahReid mentioned this pull request Jun 6, 2024

[service] Interrupted predicates aren't removed from chainhook store #497

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: improve error handling, and more! #524

fix: improve error handling, and more! #524

MicaiahReid commented Mar 13, 2024 •

edited

Loading

csgui left a comment

csgui left a comment

CharlieC3 commented Mar 27, 2024

csgui commented Mar 27, 2024 •

edited

Loading

CharlieC3 commented Mar 27, 2024

MicaiahReid commented Mar 27, 2024

CharlieC3 commented Mar 27, 2024

csgui commented Mar 27, 2024 •

edited

Loading

CharlieC3 commented Mar 27, 2024 •

edited

Loading

MicaiahReid commented Mar 27, 2024 •

edited

Loading

CharlieC3 commented Mar 27, 2024

fix: improve error handling, and more! #524

fix: improve error handling, and more! #524

Conversation

MicaiahReid commented Mar 13, 2024 • edited Loading

csgui left a comment

Choose a reason for hiding this comment

csgui left a comment

Choose a reason for hiding this comment

CharlieC3 commented Mar 27, 2024

csgui commented Mar 27, 2024 • edited Loading

CharlieC3 commented Mar 27, 2024

MicaiahReid commented Mar 27, 2024

CharlieC3 commented Mar 27, 2024

csgui commented Mar 27, 2024 • edited Loading

CharlieC3 commented Mar 27, 2024 • edited Loading

MicaiahReid commented Mar 27, 2024 • edited Loading

CharlieC3 commented Mar 27, 2024

MicaiahReid commented Mar 13, 2024 •

edited

Loading

csgui commented Mar 27, 2024 •

edited

Loading

csgui commented Mar 27, 2024 •

edited

Loading

CharlieC3 commented Mar 27, 2024 •

edited

Loading

MicaiahReid commented Mar 27, 2024 •

edited

Loading