Status MVP: Status Core Contributors use Status #7

Closed · 14 tasks
fryorcraken opened this issue Nov 22, 2022 · 23 comments

fryorcraken commented Nov 22, 2022

Deadline: Beginning of December

High-level requirement: 120-130 people use Status Communities over Waku v2.

Details

Client Diversity

  • All users use Status Desktop

Network Connectivity

  • Clients are mostly online
    • Laptops on during working hours
    • Offline during the night
    • Some app reboots
  • Assumed mostly stable internet connection (WiFi + DSL/Fibre?)

Network Topology

  • Only fleet nodes provide store, light push and filter services
  • Only TCP transport is used.
  • Users connect to the fleet and to each other; Waku Relay is used.
    • Users are behind NAT devices.
    • Connections drop semi-regularly (see Network Connectivity)
    • Low upload bandwidth for one user means low download bandwidth for another user
    • Discovery is needed to find other users.

Details

  • Confirm that 1/2/3 below are not needed and that we currently have enough inbound connectivity thanks to discv5 + UPnP to support a healthy network.
  1. AutoNat when clients come online (a minimal reachability-detection sketch in Go follows this list):
  • go-waku: Use AutoNat when the client comes online. Use the public IP address for discovery if it succeeds.
  • go-waku (conditional to outcome): if most clients in the network are reachable and can therefore successfully be discovered, the few clients that can't could discover random peers and make outgoing connections only using Waku Peer Exchange (at least in the meantime).
  2. If AutoNat fails, use AutoRelay:
  • go-waku: discover and initiate circuit-relay connections to random peers if (1) failed
  • nwaku: enable libp2p circuit relay (already supported)
  • NAT-less discovery mechanism required here to discover relay addresses:
    • Option 1: libp2p rendezvous
      • nwaku: enable/integrate libp2p rendezvous (already supported in nim-libp2p)
      • go-waku: implement libp2p rendezvous client (should not be too complicated)
    • Option 2: libp2p kad-dht
      • nim-libp2p: implement kad-dht (significant effort)
      • nwaku: integrate and enable libp2p kad-dht
      • go-waku: integrate and enable libp2p kad-dht
  3. DCUtR: hole punching to create a direct connection
  • go-waku: enable/integrate
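To make step (1) more concrete, here is a minimal sketch (not the actual go-waku implementation) of how a node could react to AutoNat results using go-libp2p's event bus; the option set and the fallback actions are illustrative assumptions.

```go
package main

import (
	"context"
	"fmt"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/event"
	"github.com/libp2p/go-libp2p/core/network"
)

func main() {
	ctx := context.Background()

	// Start a host with port mapping enabled. libp2p.NATPortMap() attempts
	// UPnP/NAT-PMP mappings; the AutoNat client (reachability probing) is
	// enabled by default on go-libp2p hosts.
	h, err := libp2p.New(libp2p.NATPortMap())
	if err != nil {
		panic(err)
	}
	defer h.Close()

	// Subscribe to reachability changes reported by AutoNat.
	sub, err := h.EventBus().Subscribe(new(event.EvtLocalReachabilityChanged))
	if err != nil {
		panic(err)
	}
	defer sub.Close()

	for {
		select {
		case <-ctx.Done():
			return
		case e := <-sub.Out():
			ev := e.(event.EvtLocalReachabilityChanged)
			switch ev.Reachability {
			case network.ReachabilityPublic:
				// Reachable: advertise the public address for discovery
				// (e.g. in the ENR used by discv5).
				fmt.Println("publicly reachable: advertise address for discovery")
			case network.ReachabilityPrivate:
				// Not reachable: make outgoing connections only (e.g. peers
				// learned via Waku Peer Exchange), or move on to steps (2)/(3).
				fmt.Println("not reachable: use peer exchange / circuit relay fallback")
			}
		}
	}
}
```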

There is no one-size-fits-all solution for NAT; it will be an iterative process based on dogfooding feedback.
Possible other workaround:

  1. Help Status CCs enable UPnP on their routers if AutoNat fails.

Roadmap

  • AutoNat in go-waku.
  • Waku Peer Exchange: go-waku <> go-waku/nwaku
  • AutoRelay (may not be needed)
    • libp2p rendezvous: nwaku + go-waku (as client)
    • libp2p kad-dht: nim-libp2p + nwaku + go-waku (may not be needed)
  • DCUtR: go-waku (may not be needed)
  • Status CCs enable UPnP (may not be needed)

Connection Numbers

  • Target 150 nodes.

  • Each node has at least one connection with a bootstrap node. Should we assume two?

  • Status Client to confirm expected usage of Status Web.

Roadmap

  • Fleet can handle the expected number of connections

Availability

  • Confirm current nwaku uptime on Status prod thanks to Canary
  • Get sign off from Status client.

Waku Store

Store Data Volume

  • Extract from Status Discord to know the expected volume of messages
    • # of messages in 30 days
    • Total size of 30 days of messages

Store Query Frequency

  • Assuming the pattern defined in Network Connectivity

    • Mostly 72-hour queries (laptop off over the weekend)
    • 30-day queries on occasional app reset
  • Peak of queries when the app starts at the beginning of the work day; Monday is highest due to the weekend overlap. ~90 CCs in Europe.

  • Need to understand total volume of queries based on # of communities, channels, messages and contacts

  • Status Client to confirm expected usage of Status Web.

Store Query Format

  • Status Client to provide a list of exact store query formats to ensure that nwaku unit tests cover all scenarios (# of content topics, cursor +/- time filter, etc.)
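To make that request concrete, here is a hedged sketch of the kind of query shape that would need to be enumerated. The struct below only mirrors the fields described in the Waku store protocol spec; it is not go-waku's or nwaku's actual API, and the pubsub topic, page size and time window are illustrative assumptions.

```go
package store

// HistoryQueryExample mirrors the fields of a Waku store history query as
// described in the store protocol spec. It is illustrative only and does not
// correspond to a concrete go-waku/nwaku type.
type HistoryQueryExample struct {
	PubsubTopic   string   // pubsub topic the messages were published on
	ContentTopics []string // one or many content topics per query
	StartTime     int64    // Unix time in nanoseconds
	EndTime       int64    // Unix time in nanoseconds
	PageSize      uint64   // paging: results per page
	Ascending     bool     // paging: direction
	Cursor        []byte   // paging: opaque cursor from the previous page, if any
}

// Example: the "laptop was off over the weekend" case, a ~72-hour query over
// a handful of content topics. The pubsub topic and page size are assumptions.
func seventyTwoHourQuery(contentTopics []string, nowNs int64) HistoryQueryExample {
	const seventyTwoHoursNs = 72 * 60 * 60 * 1_000_000_000
	return HistoryQueryExample{
		PubsubTopic:   "/waku/2/default-waku/proto",
		ContentTopics: contentTopics,
		StartTime:     nowNs - seventyTwoHoursNs,
		EndTime:       nowNs,
		PageSize:      100,
		Ascending:     false,
	}
}
```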

Roadmap

  • Confirm expected data volume/frequency/format
  • Review SQLite upper bound performance (from published benchmarks)

Issues:

Peer Behaviour

  • Peers mostly behave correctly
  • Tracking of peers with poor bandwidth/connectivity, or peers that cannot accept inbound connections, may be needed.
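One building block that could help here (not something decided in this issue) is gossipsub's built-in peer scoring, which penalises peers that misbehave or deliver poorly. A rough go-libp2p-pubsub sketch, with entirely illustrative parameter values:

```go
package main

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/peer"
)

// newScoredGossipSub builds a gossipsub router that scores peers and stops
// gossiping to / graylists badly scored ones. All numeric values below are
// placeholders; real values would have to come from dogfooding or simulation.
func newScoredGossipSub(ctx context.Context) (*pubsub.PubSub, error) {
	h, err := libp2p.New()
	if err != nil {
		return nil, err
	}

	params := &pubsub.PeerScoreParams{
		AppSpecificScore:       func(peer.ID) float64 { return 0 },
		DecayInterval:          12 * time.Second,
		DecayToZero:            0.01,
		RetainScore:            10 * time.Minute,
		BehaviourPenaltyWeight: -1,
		BehaviourPenaltyDecay:  0.99,
	}
	thresholds := &pubsub.PeerScoreThresholds{
		GossipThreshold:   -100,  // stop exchanging gossip with this peer
		PublishThreshold:  -500,  // stop publishing to this peer
		GraylistThreshold: -1000, // ignore the peer entirely
	}

	return pubsub.NewGossipSub(ctx, h, pubsub.WithPeerScore(params, thresholds))
}
```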

Bridging

  • Status Client to confirm the suggestion from the offsite to disable v1 <> v2 bridging

Peer Persistence

  • Status Client to specify whether discovered peers should be persisted across restarts (NB: remember the mandatory gossipsub backoff period here).
fryorcraken commented Nov 22, 2022

Also, the Network Topology section needs clarification.

  • (b) @jm-clius Can you please clarify whether my interpretation is correct, and confirm/deny whether (2) is feasible and what steps/dogfooding would be needed to make it happen (e.g. dogfood the upper limit on the number of nwaku connections)?

Once this is done, we can ask Status Client team to:

  • (c) Check the assumptions above for the December deadline and provide clarifications/corrections if necessary.
  • (d) Acknowledge the Network Topology proposal

Finally, edit the description to:

  • (e) track any blocking issues relevant to each topic
  • (f) track the dogfooding/sign-off needed by Status Client team for each topic (@fryorcraken can help handle that)

@jm-clius

Network Topology

Work effort for Topology 1: Waku Relay

For (1) I think the scope and order of work is (roughly):

  1. AutoNat when clients come online:
  • go-waku: Use AutoNat when the client comes online. Use the public IP address for discovery if it succeeds.
    Note: I think go-waku already supports the same NAT traversal techniques as nwaku (UPnP, NAT-PMP — @richard-ramos can confirm?). This step gives us an idea of whether the next steps are necessary/urgent (perhaps most clients are successfully reachable with existing techniques).
  • go-waku (conditional to outcome): if most clients in the network are reachable and can therefore successfully be discovered, the few clients that can't could discover random peers and make outgoing connections only using Waku Peer Exchange (at least in the meantime).
  2. If AutoNat fails, use AutoRelay (conditional to outcome of 1):
  • go-waku: discover and initiate circuit-relay connections to random peers if (1) failed
  • nwaku: enable libp2p circuit relay (already supported)
  • NAT-less discovery mechanism required here to discover relay addresses:
    • Option 1: libp2p rendezvous
      • nwaku: enable/integrate libp2p rendezvous (already supported in nim-libp2p)
      • go-waku: implement libp2p rendezvous client (should not be too complicated)
    • Option 2: libp2p kad-dht
      • nim-libp2p: implement kad-dht (significant effort)
      • nwaku: integrate and enable libp2p kad-dht
      • go-waku: integrate and enable libp2p kad-dht
  3. DCUtR: hole punching to create a direct connection
  • go-waku: enable/integrate

This solution would then need to be targeted for dogfooding under various scenarios.
Note that there's no one-size-fits-all solution for NAT and restrictive networking conditions, so data gathering (e.g. running AutoNat) to see which clients are publicly reachable and which traversal techniques work will be part of the effort. Perhaps helping contributors enable UPnP on their routers if AutoNat fails could be an intermediate, Status-internal step?
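For reference, a rough sketch of how the circuit relay (AutoRelay, step 2) and DCUtR hole-punching (step 3) pieces fit together in go-libp2p terms, assuming a recent go-libp2p. This is not the actual go-waku integration; the relay candidates would come from discovery (rendezvous, Waku Peer Exchange or kad-dht) or fleet configuration.

```go
package main

import (
	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
)

// newHostBehindNAT sketches a host that keeps the existing NAT traversal
// (UPnP/NAT-PMP), answers AutoNat dial-back probes for other peers, and can
// fall back to circuit relay and DCUtR hole punching when it is not publicly
// reachable. relayCandidates is supplied by the caller.
func newHostBehindNAT(relayCandidates []peer.AddrInfo) (host.Host, error) {
	return libp2p.New(
		libp2p.NATPortMap(),       // UPnP / NAT-PMP port mapping, as already supported
		libp2p.EnableNATService(), // serve AutoNat dial-back requests for other peers
		libp2p.EnableAutoRelayWithStaticRelays(relayCandidates), // (2) reserve slots on known relays
		libp2p.EnableHolePunching(),                             // (3) DCUtR: upgrade relayed connections to direct ones
	)
}
```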

Work effort for Topology 2: Waku Filter, Waku Lightpush

This topology is much riskier and has many more unknowns. Neither of these protocols has been target-tested through dogfooding, scalability is unknown, some client-side redundancy is required, etc. For this to get to production, I can think of at least the following outstanding items likely to be important in the protocol itself:

  • method for filter client to check state, remove, refresh or update an existing subscription
  • ACK mechanism for filter subscription requests
  • connectivity investigation (e.g. clients only able to make outbound connections), which may either require keeping the outbound connection open or NAT techniques

This implies updates to the protocols, implementation changes for go-waku and nwaku, and targeted dogfooding.

I don't think it's feasible to support this topology within a short time frame. More on this ongoing effort here and here.

@richard-ramos

AutoNat when clients come online

I confirm that go-waku supports UPnP / NAT-PMP (tested, and also confirmed by @cammellos, as he was able to reach my machine).

Regarding "the few clients that can't, could discover random peers and make outgoing connections only using Waku Peer Exchange": if we were to expose Peer Exchange to status-go, what would be the criteria to choose which peer-exchange node should be used to request nodes from? Should it be chosen randomly from the fleet nodes?

If AutoNat fails, use AutoRelay (conditional to outcome of 1) - Option 1: libp2p rendezvous

I had an implementation of a libp2p rendezvous client and server that I removed recently in waku-org/go-waku#351, which also changed it to use ENRs instead of signed peer records. If necessary, it can be added back with that change reverted so it uses libp2p signed peer records.

If AutoNat fails, use AutoRelay (conditional to outcome of 1) - Option2: libp2p kad-dht

While go-libp2p has an implementation of kad-dht available (https://pkg.go.dev/github.com/libp2p/go-libp2p-kad-dht), it needs to be integrated into go-waku.
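To give an idea of the integration effort, here is a minimal sketch of wiring go-libp2p-kad-dht into a libp2p host and using it for peer discovery, assuming a recent go-libp2p. The namespace string is a placeholder and this is not go-waku code.

```go
package main

import (
	"context"
	"fmt"

	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	"github.com/libp2p/go-libp2p/core/host"
	drouting "github.com/libp2p/go-libp2p/p2p/discovery/routing"
	dutil "github.com/libp2p/go-libp2p/p2p/discovery/util"
)

func discoverViaDHT(ctx context.Context, h host.Host) error {
	// Create and bootstrap the DHT; ModeAuto lets publicly reachable nodes
	// act as DHT servers and NATed nodes act as clients.
	kadDHT, err := dht.New(ctx, h, dht.Mode(dht.ModeAuto))
	if err != nil {
		return err
	}
	if err := kadDHT.Bootstrap(ctx); err != nil {
		return err
	}

	// Advertise and find peers under a shared namespace.
	// "waku/2/example" is a placeholder namespace, not a defined Waku value.
	disc := drouting.NewRoutingDiscovery(kadDHT)
	dutil.Advertise(ctx, disc, "waku/2/example")

	peers, err := disc.FindPeers(ctx, "waku/2/example")
	if err != nil {
		return err
	}
	for p := range peers {
		if p.ID == h.ID() || len(p.Addrs) == 0 {
			continue
		}
		fmt.Println("discovered peer:", p.ID)
		_ = h.Connect(ctx, p) // best-effort dial
	}
	return nil
}

func main() {
	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	defer h.Close()
	_ = discoverViaDHT(context.Background(), h)
}
```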

@jm-clius

If we were to expose Peer Exchange to status-go, what would be the criteria to chose which peer-exchange-node should be used to request nodes from? should it be chosen randomly from the fleet nodes?

@richard-ramos, indeed. The risk is of course that this has not been dogfooded, but it's not a resource-intensive protocol (a cached set of random peers should be available immediately upon request).

@jm-clius

@richard-ramos the idea under Topology (1) is based roughly on what was discussed before, with some additions (e.g. the proposed discovery methods, using Waku Peer Exchange, etc.). Do the work items and their order roughly make sense to you? Do we already know, maybe via AutoNat, roughly what proportion of clients are affected by unsupported NAT traversal?

@richard-ramos

The work items do make sense. Currently I'm doing a poll in #waku-e2e to get an idea of the status of NAT across clients.

@fryorcraken

If AutoNat fails, use AutoRelay (conditional to outcome of 1) - Option 1: libp2p rendezvous

Are you saying that if AutoNat fails for an individual node, then it uses AutoRelay (you mentioned peer exchange too)? Or are you saying that if the technology overall fails, then the backup plan would be to use AutoRelay?

@fryorcraken

@jm-clius considering that we aim for p2p connections between Status Desktop instances, we can imagine that some peers will provide poor connection quality (high latency, low bandwidth). How can we scope this as part of this milestone?

@fryorcraken changed the title from "Status MVP: Status CC use Status" to "Status MVP: Status Core Contributors use Status" on Nov 23, 2022
@jm-clius

Are you saying if AutoNat fails for an individual node then they use AutoRelay (or you mentioned peer exchange too). Or are you saying that if the technology overall fails, then the backup plan would be to use AutoRelay?

The former. In other words, the circuit-relay-to-hole-punching procedure can be triggered if the client determines it's not publicly reachable (via AutoNat). I'm also saying that if AutoNat shows that existing NAT traversal techniques generally work for most clients, this AutoRelay/hole-punching procedure may not be critical for this milestone (those clients can connect to others after discovering them using Waku Peer Exchange, for example).

Then we can imagine that some peers will provide poor connection quality (high latency, low bandwidth). How can we scope this as part of this milestone?

It depends: our ultimate solution for such peers is filter and lightpush, which is out of scope. These can be enabled as experimental features and dogfooded already, but with the understanding that they are beta features. An intermediate step would be to e.g. use Waku Peer Exchange as a light discovery mechanism and attempt to replenish connectivity in this way. Relay can be surprisingly resilient, as it includes "error mechanisms" such as the IWANT/IHAVE control checks. This would imply some extra latency for such clients.

@fryorcraken

Looks like AutoNat is not needed: https://docs.google.com/spreadsheets/d/1xgtSQpIUB1k1aIenSF_wpc4zfXuuykKbX5Ne4SNaOyw/edit?usp=sharing

@Menduist could you help me summarize the requirements for a healthy p2p network here? @jm-clius mentioned that 25% of nodes need to accept incoming connections; is that for a healthy gossipsub with D=6?

@fryorcraken

Another topic not discussed is the expectation of node availability.
@alrevuelta I believe you are able to pull some stats from the canary node? What do we have at the moment for the status.prod fleet, please? Let's say the past 7 days.

@fryorcraken

@richard-ramos: Yeah, yesterday I created a fix for an issue related to Discovery v5. After go-waku acquired the external address, it was not updating the ENR for Discovery v5. Hence, the nodes were not being discovered. I opened a PR fixing that, and now the number of peers that you can connect to, assuming you have open UPnP or NAT-PMP enabled, has increased by a lot. (23 Nov)

It looks like NAT traversal strategies are not needed for this milestone, as the connectivity issue was related to discv5. @richard-ramos can you please confirm and provide a reference to the PR?

@fryorcraken

@LNSD What kind of information are we able to extract from the Status Prod fleet SQLite? E.g. current DB size?
@alrevuelta is it possible, using the metric node, to get the number and size of messages (traffic volume) sent on Status' content topics?


alrevuelta commented Nov 29, 2022

Another topic not discussed is the expectation of node availability.
@alrevuelta I believe you are able to pull some stats from the canary node? What do we have at the moment for the status.prod fleet, please? Let's say the past 7 days.

I don't have the metric right now, but I'm planning to report on it (having problems discovering peers with the network monitor tool). Is this what you are referring to?

  • number of nodes we can connect to
  • total number of discovered nodes

@alrevuelta is it possible using the metric node to get the number and size of messages (traffic volume) sent on Status' content topics?

Not exactly the same but this is what I have right now:

One interesting finding regarding the traffic:

  • Traffic really decreases from 12:00 to 7am CET.
  • We can see some pattern of usage, with high usage from 8am to 3am CET (EU+USA working hours?)
  • And very low usage during weekends.
    [image]

Will continue with this; hope this helps for now.


Menduist commented Nov 29, 2022

Looks like AutoNat is not needed

AutoNat is part of the Hole-Punching stack, so it is required

could you help me here summarize the requirements for a healthy p2p network?

This spreadsheet is based around the percentage of the network that a type of node can reach.
If that is 1%, it means that 1% of the network becomes a hotspot that will bottleneck (assuming that type of node is frequent enough).
It's a simplification of reality, but it should at least give some good intuitions.

25% comes from D_out / D_high = 3 / 12. That's the absolute minimum, since peers will always keep at least D_out connections in their mesh (for sybil-protection reasons), so they need at least 25% outgoing connections. (If you reverse this, they become a bottleneck if we expect them to take >75% of their connections as incoming.)
I don't have a good figure on the required percentage to be "healthy" (that probably requires simulations); 25% is the lowest-possible lower bound.

75% is apparently the best we can hope for given network shares, UPnP popularity & Hole-Punching success rates.
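For reference, a back-of-the-envelope check of those figures, assuming the gossipsub mesh parameters quoted above (D_out = 3, D_high = 12); the values are the ones mentioned in this thread, not measurements.

```go
package main

import "fmt"

func main() {
	// Gossipsub mesh parameters quoted above.
	const dOut = 3.0   // minimum outbound connections a peer keeps in its mesh
	const dHigh = 12.0 // upper bound on mesh degree

	minOutboundShare := dOut / dHigh // 0.25: every peer needs >= 25% outbound connections
	maxInboundShare := 1 - minOutboundShare

	fmt.Printf("minimum outbound share per peer: %.0f%%\n", minOutboundShare*100)
	fmt.Printf("bottleneck expected above ~%.0f%% inbound connections\n", maxInboundShare*100)
}
```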


LNSD commented Nov 29, 2022

@LNSD What kind of information are we able to extract from Status Prod fleet sqlite? e..g current DB size?

Let's talk about the Waku archive (the Waku store message persistence backend): SQLite is just one of the possible persistence backend drivers.

The currently exposed metrics are:

  • Number of stored messages
  • Message insertion duration
  • Persistent storage query time
  • Message validation errors: At this moment, message timestamps are checked by the Waku archive implementation. Messages outside the [now-20s, now+20s] range are discarded and reported invalid.


jm-clius commented Nov 29, 2022

Another topic not discussed is the expectation of node availability. @alrevuelta I believe you are able to pull some stats from the canary node? What do we have at the moment for the status.prod fleet, please? Let's say the past 7 days.

Availability over the last 7 days has been 92.23%. See this report.

And note that for the last 5 days it's higher (~95%), likely due to more config and other improvements. Report here.


richard-ramos commented Nov 29, 2022

@fryorcraken
Reference PR for DiscV5 fix in go-waku: waku-org/go-waku#368
in status-go: status-im/status-go#2972


oskarth commented Jan 13, 2023

How is this issue different from #8? Can one be closed?

@jm-clius

This issue was for the Desktop users. #8 is for launching on Mobile too. Closing this.


oskarth commented Jan 16, 2023

Oh gotcha
