
Messages get lost when using gossipsub #197

Open
MBakhshi96 opened this issue Aug 22, 2019 · 41 comments
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@MBakhshi96

MBakhshi96 commented Aug 22, 2019

Pubsub is supposed to be reliable, but I lose messages when using gossipsub. The problem occurs when I use around 10 nodes and all of them try to broadcast messages to a single pubsub topic: not all of the messages get delivered, and the receiving nodes miss some of them. I have also tried floodsub, but the problem persists.

@aschmahmann
Contributor

@MBakhshi96 Can you be more precise about what you mean by "losing messages"? Are your nodes actually connected to each other and have they completed their initial handshakes?

@MBakhshi96
Author

MBakhshi96 commented Aug 22, 2019

@aschmahmann I mean that sent messages are not received by all of the other nodes. The nodes are connected; I also tried a fully connected configuration, but that did not help.
I wait for 2 seconds after connecting the nodes to each other and then subscribe them to a topic.

@raulk
Member

raulk commented Aug 22, 2019

@MBakhshi96 we are not aware of any issues that could cause this. Can you post a test case showing the issue in a GitHub repo? We need to be able to reproduce it in order to help you. Thanks.

@MBakhshi96
Author

@raulk I've added a test case showing the problem here.
The example works as follows:

  • First, every node broadcasts a message containing its id in round 1.
  • Then each node acknowledges every round-1 message it receives and adds its own id to it. Acknowledgements are broadcast to all of the nodes.
  • Every node receives the acknowledgements and prints them.

We start with n = 10 nodes. If everything works, every node must receive n*n + n messages and then the execution terminates, but in this example the execution never stops. You can check the number of acks for every message in the output and you'll see that not all of the acks are received by the nodes.
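Roughly, each node does something like the following (a simplified sketch against the go-libp2p-pubsub API, not the exact test-case code; the topic name, message format, and import paths are illustrative and depend on the go-libp2p version):

```go
package demo

import (
	"context"
	"fmt"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/host"
)

// runNode sketches what each node in the test case does: join gossipsub,
// subscribe to a shared topic, print everything it receives, and broadcast
// its own id as the round-1 message.
func runNode(ctx context.Context, h host.Host) error {
	ps, err := pubsub.NewGossipSub(ctx, h)
	if err != nil {
		return err
	}
	sub, err := ps.Subscribe("demo-topic")
	if err != nil {
		return err
	}

	// Receive loop: in the real test case, round-1 messages would also be
	// acknowledged here with another Publish call.
	go func() {
		for {
			msg, err := sub.Next(ctx)
			if err != nil {
				return // context cancelled or subscription closed
			}
			fmt.Printf("received %q from %s\n", msg.Data, msg.GetFrom())
		}
	}()

	// Round 1: broadcast this node's id once.
	return ps.Publish("demo-topic", []byte(h.ID().String()))
}
```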

@vyzo
Collaborator

vyzo commented Aug 23, 2019

Are there any logs about dropped messages?

@MBakhshi96
Author

@vyzo Where can I find the logs for this execution? There is no log in the output, but that may be because of the logging level used in the pubsub code.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

export IPFS_LOGGING=info

@vyzo
Collaborator

vyzo commented Aug 23, 2019

also, what is your topology?

@MBakhshi96
Author

@vyzo My topology is a simple ring, but I've also tested it with a fully connected topology.
The logs state that messages couldn't be delivered:

INFO pubsub: Can't deliver message to subscription for topic TOPIC; subscriber too slow pubsub.go:522

I don't know what causes this problem and why these messages don't get retransmitted.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

this log tells you that the pubsub subsystem is dropping messages at subscription delivery; you are simply not consuming the messages fast enough.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

note that there is no retransmission whatsoever in pubsub; also note that the messages are propagated normally, they are just dropped at delivery.

@MBakhshi96
Author

@vyzo What do you mean by not consuming fast enough? I'm receiving messages inside a for loop that simply waits for a message and then prints it. How can I consume them faster?
How can I prevent this situation? I mean, how can I get notified that the receiver can't handle more messages, so that I can stop overwhelming it?

@vyzo
Collaborator

vyzo commented Aug 23, 2019

Are you running the receiver in separate goroutines?
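For example, one way to keep the subscription drained even while processing is slow is to call Next in a tight loop and hand the messages off to a large application-side buffer. A rough sketch (the buffer size and names are illustrative, not a pubsub API):

```go
package demo

import (
	"context"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// drain pulls messages off the subscription as fast as possible and hands
// them to a large application-side channel, so slow processing (printing,
// ack bookkeeping) never backs up pubsub's internal delivery channel.
func drain(ctx context.Context, sub *pubsub.Subscription) <-chan *pubsub.Message {
	out := make(chan *pubsub.Message, 1024)
	go func() {
		defer close(out)
		for {
			msg, err := sub.Next(ctx)
			if err != nil {
				return // context cancelled or subscription closed
			}
			select {
			case out <- msg:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}
```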

@MBakhshi96
Author

MBakhshi96 commented Aug 23, 2019

@vyzo Yes. You can take a look at the code I provided for reproducing the problem in the previous comments; you can use the code here.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

what is your message rate? it may be that your computer is too slow.

@MBakhshi96
Author

@vyzo Actually, I don't know my message rate. In the provided example, every node publishes only 1+10 messages, but I don't know how long it takes to publish these messages. Also, even if my PC were too slow, which it is not, I think it's not good to lose messages. There must be a way to ensure reliable message delivery.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

there might be something else at play; are you receiving any messages?
Maybe your receiver goroutines are not running at all.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

Also, re: dropped messages: there has to be a throttle somewhere; we can't buffer an infinite number of messages.

@MBakhshi96
Author

@vyzo Most of the messages get delivered; I only lose a few.
How can I increase the buffer capacity? I know that it's not possible to keep all of the messages, but the number in this case is not really huge. Also, it might be a good idea to notify publishers when recipients can't keep up with them.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

there is currently no way to specify the subscription buffer size.

@MBakhshi96
Author

@vyzo So what do you propose? How can I work around this problem, given that I need a reliable broadcast scheme?

@vyzo
Collaborator

vyzo commented Aug 23, 2019

You could perhaps make a PR to make the buffer capacity configurable, but that is not the long-term solution. How many nodes are you running on a single computer?

@MBakhshi96
Author

@vyzo Between 10 and 20.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

that's weird, it's not a lot of nodes.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

is there any delay between message transmissions, or are you sending as fast as you can?

@MBakhshi96
Author

@vyzo There is no delay between reception and transmission.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

can you add a small delay before transmitting consecutive messages?

@MBakhshi96
Author

@vyzo I tried adding 100 milliseconds of delay before publishing to pubsub, but the problem persists and has even gotten worse!

@vyzo
Collaborator

vyzo commented Aug 23, 2019

are you blocking the receive loop with that delay? that could explain getting worse.

@MBakhshi96
Author

@vyzo I was just inspecting the pubsub.go code and discovered here that the capacity of the channel is only 32! Also, when the channel reaches its capacity, the code simply discards the message!
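The behaviour looks roughly like the standard non-blocking send pattern (a sketch of the pattern, not the library's actual code):

```go
package main

import "fmt"

// Message stands in for a pubsub message in this sketch.
type Message struct{ Data []byte }

// deliver mimics the pattern: try a non-blocking send into a fixed-capacity
// channel, and if the channel is full, drop the message instead of blocking
// the router.
func deliver(ch chan *Message, msg *Message, topic string) {
	select {
	case ch <- msg:
		// handed off to the subscriber
	default:
		// buffer full: "subscriber too slow", message silently dropped
		fmt.Printf("can't deliver message to subscription for topic %s; subscriber too slow\n", topic)
	}
}

func main() {
	ch := make(chan *Message, 32) // the capacity observed in pubsub.go
	deliver(ch, &Message{Data: []byte("hello")}, "demo-topic")
}
```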

@MBakhshi96
Author

are you blocking the receive loop with that delay? that could explain getting worse.

@vyzo No. I run it in another goroutine.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

btw, are you maxing the cpus in your computer?

@MBakhshi96
Author

@vyzo No!

@MBakhshi96
Author

@vyzo In case you want to investigate the problem further, I have pushed the example for reproducing it; I also mentioned it before.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

this is very weird, you are not maxing your cpus and yet you are too slow to receive the messages!

@MBakhshi96
Author

@vyzo It gets better when I add random delays in the range of a second, but that is too much delay for 10 nodes!

@MBakhshi96
Author

No news in this thread?

@aschmahmann
Contributor

A few things about this issue:

  1. While you "only have 10 nodes", the number of messages incident on any given node can be higher (e.g. 20-30 even under a simultaneous step-by-step broadcast). Add in some randomness, and an occasional dropped message is entirely plausible (when I ran it on my machine I mostly got all messages through, and sometimes one message was dropped). It's also worth noting that a graph with higher degree already has some redundancy built into the system.
  2. As was mentioned above, there has to be a message queue size limit at some point (although arguing for >32 isn't unreasonable), since the alternative is back pressure that slows the whole network down to the speed of the slowest nodes.
  3. If you want reliable transmission, the current story is to layer some reliability on top of pubsub (see the sketch after this list). For example, https://github.com/libp2p/go-libp2p-pubsub-router is a persistent Key-Value store on top of pubsub. If you're interested in other persistence schemes (e.g. Key-MultiValue), let me know, since I've already started some preliminary work on this.
  4. If it would help, we could let the sender know when a message was dropped using PubSub's internal event system. Right now we only have PeerJoin and PeerLeave, but we could potentially add other events, such as PeerOverflowed, that could be handled at the application layer if that would be helpful.
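To make (3) concrete, here is a minimal sketch of one way to layer retransmission on top of pubsub at the application level (the topic name, interval, and acked callback are illustrative and application-defined, not an existing API):

```go
package demo

import (
	"context"
	"time"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// publishUntilAcked republishes the same payload on a timer until the
// application decides enough acknowledgements have arrived (via the acked
// callback) or the context is cancelled.
func publishUntilAcked(ctx context.Context, ps *pubsub.PubSub, topic string, data []byte, acked func() bool) error {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		if acked() {
			return nil
		}
		if err := ps.Publish(topic, data); err != nil {
			return err
		}
		select {
		case <-ticker.C:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```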

@MBakhshi96
Author

@aschmahmann Making the queue size configurable would be good, so everybody can change the limit based on their implementation's needs.

If you want reliable transmission the current story is to layer some reliability on top of pubsub

I don't want to store anything on top of pubsub; the only thing I want is reliable broadcasts using pubsub.

If it would help we could let the sender know if a message was dropped using PubSub's internal event system.

That would be really helpful for me, since I would then be able to retransmit lost messages.

@aschmahmann
Contributor

aschmahmann commented Sep 6, 2019

I don't want to store anything on top of pubsub, the only thing that I want is to have reliable broadcasts using pubsub.

@MBakhshi96 it really sounds like there's some shared state you're trying to track. Take this example, where rebroadcasting and/or a persistence layer is the only way to help with lost messages.

A-B-C are connected in a line. A sends a message to B, and B doesn't send it to C (maybe B crashed, maybe it blacklisted C, etc.). Even though A wanted to send a message to C and successfully sent the message to B, there's no way for it to know whether C received the message (or even that C exists). Note that even if A and C connect directly afterwards, the message A initially sent will not be automatically rebroadcast.

Even your demo has this same property. All the nodes are implicitly aware of the other nodes and are trying to operate on the shared state map[message]map[messageAcks]struct{}.
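In Go terms that shared state is roughly the following (a sketch; the type and field names are illustrative):

```go
package demo

import "sync"

// ackState sketches the shared state the demo is effectively maintaining:
// for every message id, the set of peer ids that have acknowledged it.
type ackState struct {
	mu   sync.Mutex
	acks map[string]map[string]struct{}
}

// record notes that peerID has acknowledged msgID.
func (s *ackState) record(msgID, peerID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.acks == nil {
		s.acks = make(map[string]map[string]struct{})
	}
	if s.acks[msgID] == nil {
		s.acks[msgID] = make(map[string]struct{})
	}
	s.acks[msgID][peerID] = struct{}{}
}
```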

That would be really helpful for me, so I will be able to retransmit lost messages.

Recall that, as in the above example, there are some messages you won't even know were lost. Adding this new event would be an optimization that lets us retransmit the state less frequently, but it's not strictly necessary in order to layer persistence on top of pubsub.

@sincoew

sincoew commented Dec 31, 2020

I also lose messages when using gossipsub in a standalone project (numerous messages to the same peer, and the CPU is slow).
Will gossipsub have an 'at least once' or 'exactly once' delivery option in the future?
