
Investigate Grafana Tempo and how it can enhance our node monitoring #4322

Closed · bingyanglin opened this issue Dec 2, 2024 · 6 comments
Labels: node (Issues related to the Core Node team)

@bingyanglin (Contributor) commented Dec 2, 2024

Investigate Grafana Tempo and check how it can enhance our node monitoring.

bingyanglin added the node label Dec 2, 2024
daria305 self-assigned this Dec 3, 2024
@daria305 (Contributor) commented Dec 23, 2024

This issue is closely related to #4321.
Grafana Tempo and telemetry support are already present in the codebase. The only usage example is in docker/grafana-local, where Tempo is configured for traces and Prometheus for metrics in the local network setup (a sketch of such a stack follows the list below).

Currently encountered problems:

  • Tempo was crashing on startup; the cause was an outdated docker-compose template and a Tempo template using a restricted directory
  • Data is not visible in Grafana, even though Tempo is working
    • The problem might be mismatched ports; still being investigated
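
For reference, a minimal sketch of what such a local tracing stack could look like, assuming Grafana, Tempo, and Prometheus with their default ports; the service layout and file names are illustrative, not the actual docker/grafana-local contents:

```yaml
# Sketch of a local tracing stack (assumed layout, default ports)
services:
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml  # must point Tempo at a writable data directory
    ports:
      - "3200:3200"  # Tempo HTTP API, queried by Grafana
      - "4317:4317"  # OTLP gRPC ingest from the node

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yaml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"  # Tempo/Prometheus data sources are provisioned separately
```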

@daria305 (Contributor) commented Jan 7, 2025

Tracing with Tempo and telemetry works with a full node built from source.

  • To enable tracing, the node has to be started with the TRACE_FILTER=off environment variable set; only then can tracing be controlled with admin commands.
  • Tracing can then be enabled for a specified time period when needed: `curl -X POST 'http://127.0.0.1:1337/enable-tracing?filter=iota-node=trace,info&duration=20s'` (see the sketch below).
  • We currently do not have a Grafana setup with Tempo merged and ready; the working demo is on the branch core-node/test/fullnode-grafana.
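
Putting the two steps together, a minimal sketch of the flow described above (the fullnode.yaml path and the 20s duration are illustrative):

```bash
# 1. Start the node with tracing controllable at runtime (config path is illustrative)
TRACE_FILTER=off cargo run --bin iota-node -- --config-path fullnode.yaml

# 2. Enable tracing for 20 seconds via the admin interface on port 1337
curl -X POST 'http://127.0.0.1:1337/enable-tracing?filter=iota-node=trace,info&duration=20s'
```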

Some remarks on the current Grafana setup:

  • The Grafana setup placed in docker/grafana-local was created for docker/iota-private-network. I did not manage to run the docker local setup, as it failed on bootstrap.sh. This might need fixing, or removal, since we already have a one-liner local network setup through the iota CLI tool (iota start).
  • The Grafana local setup also did not work with the `iota start` local network; in addition, the admin port 1337 is not reachable for any of the nodes.

@daria305 (Contributor) commented Jan 9, 2025

Ongoing tasks:

  • Issue: Grafana runs in one docker compose project and the node in another; the admin port does not work, but metrics do
    • Could Traefik help with the admin port not responding?
      --> No, Traefik won't help in this case, as the admin port is exposed only bound to localhost; we will create a separate PR to discuss exposing it
    • Prometheus cannot reach the node's endpoint in the other docker compose project on Linux (host.docker.internal:9184); see the sketch after this list
  • Basic docker setup with a bash script, similar to what Hornet had
  • Test: Grafana and Tempo working with any node setup
  • We can use .env variables in the volume definition to simplify running Grafana
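
A commonly used workaround for the host.docker.internal problem on Linux is to map that name to the host gateway in the monitoring stack's compose file and scrape the node's metrics port from there; a sketch, with service and file names assumed:

```yaml
# docker-compose sketch: let Prometheus (in its own compose project) reach a node running on the host
services:
  prometheus:
    image: prom/prometheus:latest
    extra_hosts:
      - "host.docker.internal:host-gateway"  # resolves the name on Linux (Docker 20.10+)
    volumes:
      - ./prometheus.yaml:/etc/prometheus/prometheus.yml
```

```yaml
# prometheus.yaml sketch: scrape the node's metrics endpoint on port 9184
scrape_configs:
  - job_name: iota-node
    static_configs:
      - targets: ["host.docker.internal:9184"]
```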

Issues to investigate later:

  • Issue: after running OpenTelemetry for a longer time, this message floods the terminal and makes the logs unreadable: `OpenTelemetry trace error occurred. cannot send message to batch processor as the channel is full`
  • Possible problem: when tracing is enabled more than once, only the first request is visible in Grafana
  • Issue: if the node is run without TRACE_FILTER=off, enabling tracing later does not work (Tempo has no data), yet the response to the user indicates that it does

jkrvivian self-assigned this Jan 13, 2025
@muXxer (Contributor) commented Jan 15, 2025

  • How does tracing work?
  • What different types of tracing are there? (Are there different endpoints for tracing, e.g. the admin interface?)
  • How can the different types be activated?
  • What exactly is returned, and how can it be visualised?

No need to change the setup for now. We will decide how to proceed once it is clear what kinds of tracing are available. For example, debugging might require a different kind of tracing, which doesn't need to be added to the normal node monitoring setup.

@daria305 (Contributor) commented Jan 27, 2025

The collected information and guidelines are summarized here.
The instructions cover use cases for:

  • adding new spans
  • enabling tracing with OpenTelemetry, sent via OTLP (which can be explored through Grafana Tempo) and saved to a file; see the Tempo configuration sketch below
  • collecting latencies from spans with the PrometheusSpanLatencyLayer, exposed as Prometheus metrics
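
For the OTLP path, the Tempo side mainly needs an OTLP receiver enabled; a minimal sketch of such a Tempo configuration, with ports and storage paths assumed:

```yaml
# tempo.yaml sketch: accept OTLP traces from the node and store them locally
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:  # default port 4317
        http:  # default port 4318

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal     # must be writable (cf. the restricted-directory crash above)
    local:
      path: /var/tempo/traces
```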

@daria305 (Contributor) commented

As the last step, we tried out tokio-console; the steps have been documented here.
The steps are also listed below:

On the node side:

Build and run the IOTA node with the special Rust tokio_unstable cfg flag, with the tokio-console feature enabled via --features, and with the TOKIO_CONSOLE=1 environment variable set.

The whole command:

`TOKIO_CONSOLE=1 RUSTFLAGS="--cfg tokio_unstable" cargo run --bin iota-node --features tokio-console -- --config-path fullnode.yaml`

Console side:
Clone the console repo and run the console with `cargo run` (see the sketch below).
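
A minimal sketch of the console-side steps, assuming the upstream tokio-rs/console repository is the intended one:

```bash
# Clone and run the tokio console UI (repository URL is an assumption)
git clone https://github.com/tokio-rs/console.git
cd console
cargo run
```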
