
Investigate Grafana Tempo and how it can enhance our node monitoring #4322

Closed · bingyanglin opened this issue Dec 2, 2024 · 6 comments
Labels: node (Issues related to the Core Node team)

@bingyanglin (Contributor) commented Dec 2, 2024

Investigate Grafana Tempo and check how it can enhance our node monitoring.

bingyanglin added the node label Dec 2, 2024
daria305 self-assigned this Dec 3, 2024
@daria305 (Contributor) commented Dec 23, 2024

This issue is closely related to #4321.
Grafana Tempo and telemetry support are already present in the codebase. The only usage example is in docker/grafana-local, where Tempo is configured for traces and Prometheus for metrics in the local network setup (a sketch of such a stack follows the list below).

Currently encountered problems:

  • Tempo was crashing on startup; the cause was an outdated docker-compose template and a Tempo template using a restricted directory
  • Data is not visible in Grafana, even though Tempo is working
    • The problem might be mismatched ports; still being investigated
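
For reference, a minimal sketch of what such a local tracing stack could look like, assuming Grafana, Tempo, and Prometheus with their default ports; the service layout and file names are illustrative, not the actual docker/grafana-local contents:

```yaml
# Sketch of a local tracing stack (assumed layout, default ports)
services:
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml  # must point Tempo at a writable data directory
    ports:
      - "3200:3200"  # Tempo HTTP API, queried by Grafana
      - "4317:4317"  # OTLP gRPC ingest from the node

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yaml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"  # Tempo/Prometheus data sources are provisioned separately
```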

@daria305 (Contributor) commented Jan 7, 2025

Tracing with Tempo and telemetry works with a full node built from source.

  • To enable tracing, the node has to be started with the TRACE_FILTER=off environment variable set; only then can tracing be controlled with admin commands.
  • Tracing can then be enabled for a specified time period when needed: `curl -X POST 'http://127.0.0.1:1337/enable-tracing?filter=iota-node=trace,info&duration=20s'` (see the sketch below).
  • We currently do not have a Grafana setup with Tempo merged and ready; the working demo is on the branch core-node/test/fullnode-grafana.
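
Putting the two steps together, a minimal sketch of the flow described above (the fullnode.yaml path and the 20s duration are illustrative):

```bash
# 1. Start the node with tracing controllable at runtime (config path is illustrative)
TRACE_FILTER=off cargo run --bin iota-node -- --config-path fullnode.yaml

# 2. Enable tracing for 20 seconds via the admin interface on port 1337
curl -X POST 'http://127.0.0.1:1337/enable-tracing?filter=iota-node=trace,info&duration=20s'
```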

Some remarks on the current Grafana setup:

  • The Grafana setup placed in docker/grafana-local was created for docker/iota-private-network. I did not manage to run the docker local setup, as it failed on bootstrap.sh. This might need fixing, or removal, since we already have a one-liner local network setup through the iota CLI tool (iota start).
  • The Grafana local setup also did not work with the `iota start` local network; in addition, the admin port 1337 is not reachable for any of the nodes.

@daria305 (Contributor) commented Jan 9, 2025

Ongoing tasks:

  • Issue: Grafana runs in one docker compose project and the node in another; the admin port does not work, but metrics do
    • Could Traefik help with the admin port not responding?
      --> No, Traefik won't help in this case, as the admin port is exposed only bound to localhost; we will create a separate PR to discuss exposing it
    • Prometheus cannot reach the node's endpoint in the other docker compose project on Linux (host.docker.internal:9184); see the sketch after this list
  • Basic docker setup with a bash script, similar to what Hornet had
  • Test: Grafana and Tempo working with any node setup
  • We can use .env variables in the volume definition to simplify running Grafana
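
A commonly used workaround for the host.docker.internal problem on Linux is to map that name to the host gateway in the monitoring stack's compose file and scrape the node's metrics port from there; a sketch, with service and file names assumed:

```yaml
# docker-compose sketch: let Prometheus (in its own compose project) reach a node running on the host
services:
  prometheus:
    image: prom/prometheus:latest
    extra_hosts:
      - "host.docker.internal:host-gateway"  # resolves the name on Linux (Docker 20.10+)
    volumes:
      - ./prometheus.yaml:/etc/prometheus/prometheus.yml
```

```yaml
# prometheus.yaml sketch: scrape the node's metrics endpoint on port 9184
scrape_configs:
  - job_name: iota-node
    static_configs:
      - targets: ["host.docker.internal:9184"]
```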

Issues to investigate later:

  • Issue: after running OpenTelemetry for a longer time, this message floods the terminal and makes the logs unreadable: `OpenTelemetry trace error occurred. cannot send message to batch processor as the channel is full`
  • Possible problem: when tracing is enabled more than once, only the first request is visible in Grafana
  • Issue: if the node is run without TRACE_FILTER=off, enabling tracing later does not work (Tempo has no data), yet the response to the user indicates that it does

jkrvivian self-assigned this Jan 13, 2025
@muXxer (Contributor) commented Jan 15, 2025

  • How does tracing work?
  • What different types of tracing are there? (Are there different endpoints for tracing, e.g. the admin interface?)
  • How can the different types be activated?
  • What exactly is returned, and how can it be visualised?

No need to change the setup for now. We will decide how to proceed once it is clear what kinds of tracing are available. For example, debugging might require a different kind of tracing, which doesn't need to be added to the normal node monitoring setup.

@daria305 (Contributor) commented Jan 27, 2025

The collected information and guidelines are summarized here.
The instructions cover use cases for:

  • adding new spans
  • enabling tracing with OpenTelemetry, sent via OTLP (which can be explored through Grafana Tempo) and saved to a file; see the Tempo configuration sketch below
  • collecting latencies from spans with the PrometheusSpanLatencyLayer, exposed as Prometheus metrics
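
For the OTLP path, the Tempo side mainly needs an OTLP receiver enabled; a minimal sketch of such a Tempo configuration, with ports and storage paths assumed:

```yaml
# tempo.yaml sketch: accept OTLP traces from the node and store them locally
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:  # default port 4317
        http:  # default port 4318

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal     # must be writable (cf. the restricted-directory crash above)
    local:
      path: /var/tempo/traces
```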

@daria305 (Contributor) commented

As the last step, we tried out tokio-console; the steps have been documented here.
The steps are also listed below:

On the node side:

Build and run the IOTA node with the special Rust tokio_unstable cfg flag, with the tokio-console feature enabled via --features, and with the TOKIO_CONSOLE=1 environment variable set.

The whole command:

`TOKIO_CONSOLE=1 RUSTFLAGS="--cfg tokio_unstable" cargo run --bin iota-node --features tokio-console -- --config-path fullnode.yaml`

Console side:
Clone the console repo and run the console with `cargo run` (see the sketch below).
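
A minimal sketch of the console-side steps, assuming the upstream tokio-rs/console repository is the intended one:

```bash
# Clone and run the tokio console UI (repository URL is an assumption)
git clone https://github.com/tokio-rs/console.git
cd console
cargo run
```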
