Waku v2 canary tool #754

Closed · 2 tasks
jm-clius opened this issue Oct 25, 2021 · 11 comments · Fixed by #1205
@jm-clius (Contributor) commented Oct 25, 2021

Problem

Currently the only built-in monitoring of a Waku v2 node relies on either:

  1. using an external service, such as a port scanner, to check that the advertised listening address is live
  2. a basic RPC health check, which relies on an indirect response from the Waku v2 node and requires RPC to be configured on the monitored host

We want a "canary" service, similar to the one created for Status-Go, that can be used to verify the Waku v2 service at a given port.

Suggested solution

A very simple canary service would (see the sketch after this list):

  1. Attempt to dial a node at a given listening address for the relay protocol
  2. Attempt protocol negotiation at that address (a successful connection should serve as an indication that protocol negotiation succeeded)
  3. Exit successfully if the above succeeds, or with an error code if not
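
For illustration, here is a minimal sketch of that flow in Go using go-libp2p. This is not the actual implementation (the real tool would be Nim code inside nwaku); the binary name, the sleep-based identify wait, and the exit-code conventions are all assumptions:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/peer"
	ma "github.com/multiformats/go-multiaddr"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: canary <multiaddr>")
		os.Exit(2)
	}

	// Parse the target's listening address, e.g. /ip4/1.2.3.4/tcp/60000/p2p/16Uiu2...
	maddr, err := ma.NewMultiaddr(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid multiaddr:", err)
		os.Exit(1)
	}
	info, err := peer.AddrInfoFromP2pAddr(maddr)
	if err != nil {
		fmt.Fprintln(os.Stderr, "multiaddr lacks a peer ID:", err)
		os.Exit(1)
	}

	host, err := libp2p.New()
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to create host:", err)
		os.Exit(1)
	}
	defer host.Close()

	// Step 1: dial the node. A successful connection implies transport,
	// security, and muxer negotiation all succeeded.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := host.Connect(ctx, *info); err != nil {
		fmt.Fprintln(os.Stderr, "dial failed:", err)
		os.Exit(1) // Step 3: non-zero exit code on failure.
	}

	// Step 2: identify runs after connecting and populates the peerstore;
	// give it a moment, then check that the target advertises Waku relay
	// (a store check would look for /vac/waku/store/2.0.0-beta4 instead).
	time.Sleep(2 * time.Second)
	supported, err := host.Peerstore().SupportsProtocols(info.ID, "/vac/waku/relay/2.0.0")
	if err != nil || len(supported) == 0 {
		fmt.Fprintln(os.Stderr, "target does not advertise Waku relay")
		os.Exit(1)
	}

	fmt.Println("relay negotiable on", info.ID)
	os.Exit(0) // Step 3: zero exit code on success.
}
```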

As a next stage this can be extended with more in-depth tests, e.g.:

  1. Is store mounted, and does it work on the monitored node?
  2. Can we configure a filter or lightpush? etc.

In other words, this must be a separate binary (with its own make target) with a separate config. It could live under a tools subfolder within nwaku. It should be runnable with either --staticnode: or --storenode: config and indicate with an exit code whether relay or store (respectively) can be negotiated on the target node. As a next stage the tool can be extended for --filternode: and lightpush: checks.

Acceptance criteria

  • a new target binary that can be run as follows:
    ./build/waku-canary --staticnode:<multiaddr_to_test> or
    ./build/waku-canary --storenode:<multiaddr_to_test>
    and exits with an indication of whether the corresponding protocol (relay or store) could be negotiated on the target.
  • a README to explain basic usage
@jm-clius jm-clius added the good first issue Good for newcomers label Oct 25, 2021
@jakubgs (Contributor) commented Oct 25, 2021

Something nice to have would be a flag for adjusting the timeout, or possibly other connection parameters in the future.
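
For illustration, such a flag could be threaded into the dial deadline via a context. A sketch under the same assumptions as above; the flag name is hypothetical:

```go
package main

import (
	"context"
	"flag"
	"fmt"
	"time"
)

func main() {
	// Hypothetical -timeout flag; the eventual canary may name it differently.
	timeout := flag.Duration("timeout", 10*time.Second, "how long to wait for the dial")
	flag.Parse()

	// This context would be passed to host.Connect in the sketch above,
	// so a hanging dial fails cleanly once the deadline passes.
	ctx, cancel := context.WithTimeout(context.Background(), *timeout)
	defer cancel()

	<-ctx.Done() // stand-in for a dial that never completes
	fmt.Println("gave up after", *timeout, ":", ctx.Err())
}
```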

(@D4nte's comment was marked as resolved.)

@jakubgs (Contributor) commented Jun 21, 2022

I think there's some confusion here. I want a canary tool, not a canary "service".

The node-canary in status-go is a CLI tool that you can call for a given enode:// address and check if it's responding as expected, it's not a service.

This way it can be plugged into other automations like Cabot, which we have an instance of: https://canary.infra.status.im/

@jm-clius mentioned this issue Jun 21, 2022
@jm-clius (Contributor, Author) commented:

Agreed, @jakubgs. We do need something like a "monitoring" node as well, which would have the responsibility of gathering various network stats and could give an overall impression of network health. Have opened an issue here: #1010

@jakubgs jakubgs changed the title Waku v2 canary service Waku v2 canary tool Jun 21, 2022
@jm-clius jm-clius modified the milestone: Release 0.11 Jun 22, 2022
@jakubgs (Contributor) commented Aug 8, 2022

Still waiting...

@jm-clius jm-clius added this to the Release 0.12 milestone Aug 10, 2022
@jm-clius (Contributor, Author) commented:

Have added this to the next release milestone.

@jm-clius jm-clius added this to Waku Sep 2, 2022
@jm-clius jm-clius modified the milestones: Release 0.12, Release 0.13 Sep 2, 2022
@jm-clius jm-clius moved this to Todo in Waku Sep 2, 2022
@alrevuelta (Contributor) commented:

Before starting on the implementation of the suggested solution, I'd like to discuss an alternative solution that, to my limited understanding, would also solve the problem and follows a pattern I've seen before.

Alternative solution
Instead of having an external canary tool, couldn't we have that functionality built into the node and exposed as part of the 16/WAKU2-RPC? Then just curl node:port/healthz and get the status of the node. More verbose output can be added (e.g. whether store is mounted and working).

This would require:

  • Modifying 16/WAKU2-RPC by adding a new endpoint (e.g. /waku/v2/relay/v1/healthz)
  • Defining the expected output (a sketch follows this list). Perhaps just 0 if it's live, or a more detailed output if we want more granularity: resp = {'service1': 'ok|nok', 'service2': 'ok|nok', 'storage': 'mounted|null'}
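
For illustration only, the proposed endpoint and response shape could look like the sketch below. The path and field names are assumptions taken from the bullets above; 16/WAKU2-RPC does not define such an endpoint today:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// Hypothetical health report mirroring the shape suggested above.
type healthReport struct {
	Relay   string `json:"relay"`   // "ok" or "nok"
	Store   string `json:"store"`   // "ok" or "nok"
	Storage string `json:"storage"` // "mounted" or "null"
}

func main() {
	// Hypothetical path from the proposal, not part of 16/WAKU2-RPC.
	http.HandleFunc("/waku/v2/relay/v1/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(healthReport{Relay: "ok", Store: "ok", Storage: "mounted"})
	})
	http.ListenAndServe(":8545", nil)
}
```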

Some advantages:

  • Any external entity can check the status with just curl, with no external dependencies and no need to use our custom tool.
  • It can be used in Kubernetes as a readiness/liveness probe. Unsure if we currently have this feature.

One can argue that the node itself can't know whether a given port is open, live, and accessible from the outside. I'm unsure if this statement is true, but if it is, perhaps we can rely on other locally accessible metrics to evaluate this? For example, if at least 1 peer is connected, we can be sure that the port is open and live; otherwise, that peer wouldn't have been able to connect.
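
For what it's worth, a check along those lines is straightforward with a libp2p host. A Go sketch under the same assumptions as above; note that only inbound connections actually prove reachability:

```go
package health

import (
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/network"
)

// hasInboundPeer reports whether any currently connected peer dialed us.
// Only an inbound connection proves the listen port is reachable from
// outside; outbound connections say nothing about our own firewall.
func hasInboundPeer(h host.Host) bool {
	for _, c := range h.Network().Conns() {
		if c.Stat().Direction == network.DirInbound {
			return true
		}
	}
	return false
}
```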

The use of a health endpoint is something I've seen in Prysm; it seems to be some kind of convention. The Ethereum beacon-chain spec also has something similar.

Kindly let me know if you think it's worth investigating this alternative, or if I'm missing something.

@jakubgs (Contributor) commented Sep 29, 2022

No, you are missing the point of a canary.

  1. This already exists: we already use the API for Consul healthchecks here (though RPC is disabled since it was flaky).
  2. A healthcheck is not the same thing as a canary. A canary checks public availability, not internal API availability.
  3. I have no intention of exposing REST API publicly for the sake of running healthchecks.
  4. I don't want an indirect way of checking if the node is up. I want a way to check if the node is available.

To summarize: the canary is intended to check the availability and functionality of the libp2p port, not whether the node is running.

For example, the node might be running fine, and RPC responding fine, but a firewall might be blocking the libp2p port, effectively making the node unavailable to anyone on the internet. An RPC healthcheck would be useless in that case.

@alrevuelta (Contributor) commented Sep 29, 2022

Thanks @jakubgs, great explanation. I'll move forward with the suggested solution then.

@alrevuelta (Contributor) commented Oct 3, 2022

@jakubgs Mind providing some nodes that should be reachable, i.e. static and store nodes? I'm currently using wakuv2.prod from our fleets; is that correct?

Edit: I can see that all three support the following protocols. I was just having some issues with /dns4/xxx. Nevermind :)

# /ipfs/id/1.0.0,
# /vac/waku/relay/2.0.0,
# /ipfs/ping/1.0.0,
# /vac/waku/swap/2.0.0-beta1,
# /vac/waku/store/2.0.0-beta4,
# /vac/waku/lightpush/2.0.0-beta1,
# /vac/waku/filter/2.0.0-beta1

@jakubgs (Contributor) commented Oct 4, 2022

All the publicly available nodes should be listed on https://fleets.status.im/.

If you have a problem with any specific node please tell me which one so I can investigate it.

@jm-clius jm-clius moved this from In Progress to Done in Waku Oct 24, 2022