Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(networking): new metric showing if node is reachable #1206

Closed
3 tasks done
alrevuelta opened this issue Sep 30, 2022 · 8 comments · Fixed by #1472
Closed
3 tasks done

feat(networking): new metric showing if node is reachable #1206

alrevuelta opened this issue Sep 30, 2022 · 8 comments · Fixed by #1472
Assignees
Labels
enhancement New feature or request

Comments

@alrevuelta
Copy link
Contributor

alrevuelta commented Sep 30, 2022

Problem

As discussed with @jm-clius, currently there is no way for the node itself to be aware if it's reachable from the outside by other peers. #754 allows for an external party to check if it is reachable, but the node itself isn't aware of it.

Suggested solution

The libp2p feature autonat allows us to request other peers to dial us, and if the response is ok, the node itself can be sure that it is reachable. Luckily, looks like nim-libp2p support its. More info:

While the identify protocol described above lets peers inform each other about their observed network addresses, not all networks will allow incoming connections on the same port used for dialing out.

Once again, other peers can help us observe our situation, this time by attempting to dial us at our observed addresses. If this succeeds, we can rely on other peers being able to dial us as well and we can start advertising our listen address.

A libp2p protocol called AutoNAT lets peers request dial-backs from peers providing the AutoNAT service.

Once we have the value isNodeReachable updated periodically, we can use it:

  • in logs, warning if false.
  • in prometheus, to feed a dashboard like Grafana.
  • in RPC health checks. related

Alternatives considered

A less engineered approach would be to just check the number of inbound peers and if !=0 consider that the node is reachable. However one can argue that no inbound peers won't strictly mean that were are not reachable. In other words:

  • n_inbound_peers!=0 means that the node is reachable
  • n_inbound_peers=0 doesn't necessarily mean that the node is not reachable. Perhaps in practice yes, but still.

Additional context

Acceptance criteria

@alrevuelta alrevuelta added enhancement New feature or request track:production labels Sep 30, 2022
@jm-clius
Copy link
Contributor

Thanks. Very helpful and goes a long way towards addressing a common concern from node operators.

@jm-clius jm-clius moved this to Todo in Waku Oct 5, 2022
@jm-clius jm-clius added this to the Release 0.13.0 milestone Oct 5, 2022
@jm-clius jm-clius modified the milestones: Release 0.13.0, Release 0.14.0 Nov 11, 2022
@jm-clius
Copy link
Contributor

@alrevuelta any concerns if we remove this from 0.13.0 scope? We're more or less freezing risky features until the release Tuesday, so I'm going ahead and kicking this down the road. Feel free to object. :)

@alrevuelta alrevuelta self-assigned this Nov 16, 2022
@alrevuelta
Copy link
Contributor Author

The proposed feature in the alternatives considered above would be easy to add once the peer store refactor is finished #622. Add a new metric (prometheus + rpc) showing the amount of inbound/outbound peers the node has.

This would add a requirement to store the Direction Inbound|Outbound of the connection in the PeerStore and add a simple getPeersByDirection(Inbound|Outbound) proc. Refer to #622 for tracking this feature dependancies.

@alrevuelta
Copy link
Contributor Author

As raised by @Menduist, a node can have multiple connections, some being incoming and others outgoing, so perhaps I need to modify this metric. But first I would like to understand this from a practical point of view in waku side. Do we currently support multiple connections with a given peer? If not do we plan to support this in the future?

Plan b would be to:

  • Store the amount of incoming and outgoing connections.

The idea here is to know how well-connected a node is in terms of in/out connections.

@alrevuelta
Copy link
Contributor Author

alrevuelta commented Nov 25, 2022

Do we currently support multiple connections with a given peer?

Answering myself. I would say no, and this example should prove why. Let's say we have peer0 and peer1 and two different protocols relay and lightpush. -> denotes succesful dial.

  • peer0->peer1 using relay. This will create one outbound connection in peer0 and one inbound connection in peer1.
  • peer1->peer0 using lightpush. This does not creates any connections, since the connection is reused. I can confirm this by seeing Reusing existing connection in the nim-libp2p logs.

The same applies to WebSockets. Tried connecting peer0->peer1 with tcp and then with ws and can also see Reusing existing connection for the second connection. So looks like its reused.

So would say that as it is a peer can't have multiple connections, since existing ones are reused for i) different protocols and ii) different transports.

Edit:

let conn1 = await nodes[0].peerManager.dialPeer(peerInfos[1], WakuRelayCodec, 2.seconds)

# I see "Reusing existing connection" in between
let conn2 = await nodes[0].peerManager.dialPeer(peerInfos[1], WakuLightPushCodec, 2.seconds)

# But then, do I get a different connection?
echo "conn1", $conn1.get().protocol   # /vac/waku/relay/2.0.0
echo "conn2", $conn2.get().protocol  # /vac/waku/lightpush/2.0.0-beta1

@alrevuelta
Copy link
Contributor Author

As explained by @Menduist max connections per peer is currently set to one, but this might change in the future. I will make the default to 1 in our side. If at some point we want to allow multiple connections, we can revisit.

@alrevuelta
Copy link
Contributor Author

@diegomrsantos is working in an AutonatService that tells us the NetworkReachability state of our node (if we are reachable or not). So instead of using the dialMe function we could use that service, which should be easier and more complete. Since the AutonatService is under development, hopefully our input/requirements can help in shaping it.

Some requirements on our side, with some implications on the service:

  • We would like to have a metric that indicates: if the node is reachable, if the node is not reachable or if we simply can't know. Something like Reachable, Notreachable, Unknown.
  • This metric should be updated over time, with a given frequency.
  • The metric is to be used by node operators, to know if their node is properly set up. So something easy to look at, with just the 3 mentioned states.
  • If we use AutonatService, we would need to:
    • Know when the NetworkReachability state changed. Which is already addressed with a handler. Then update our metric accordingly.
    • Configure how frequently it should be checked. We don't want to be constantly checking this, every few minutes should be fine. Perhaps with an interval field.
    • We don't have strict requirements on what Reachable means. If a few peers confirm it, we're good. No need to define a fancy strategy. Perhaps instead of maxConfidence we could have a maxPeersToCheck and a minOkThreshold so that every interval we randomly select maxPeersToCheck and if minOkThreshold flag our node as Reachable, then we are. Perhaps I would use 2/10, since some peers might not support it and other may not be reachable.
    • Its important that Notreachable means that we are indeed not reachable. Otherwise node operators will be confused. If in doubt, it should fallback to Unknown.

@alrevuelta
Copy link
Contributor Author

alrevuelta commented Dec 9, 2022

vacp2p/nim-libp2p#814 is getting close to be ready, but this feature is blocked until it is merged.

@alrevuelta alrevuelta changed the title feat: new metric showing if node is reachable feat(networking): new metric showing if node is reachable Dec 13, 2022
@github-project-automation github-project-automation bot moved this from Todo to Done in Waku Jan 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants