
Messages get lost when using gossipsub #197

Open
MBakhshi96 opened this issue Aug 22, 2019 · 41 comments
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@MBakhshi96

MBakhshi96 commented Aug 22, 2019

Pubsub is supposed to be reliable, but I lose messages when using gossipsub. The problem occurs when I use around 10 nodes and all of them try to broadcast messages to a single pubsub topic: not all of the messages get delivered, and the receiving nodes miss some of them. I have also tried floodsub, but the problem persists.

@aschmahmann
Contributor

@MBakhshi96 Can you be more precise about what you mean by "losing messages"? Are your nodes actually connected to each other and have they completed their initial handshakes?

@MBakhshi96
Author

MBakhshi96 commented Aug 22, 2019

@aschmahmann I mean that sent messages are not received by all of the other nodes. The nodes are connected; I also tried a fully connected configuration, but that did not help.
I wait for 2 seconds after connecting the nodes to each other and then subscribe them to a topic.

@raulk
Member

raulk commented Aug 22, 2019

@MBakhshi96 we are not aware of any issues that could cause this. Can you post a test case showing the issue in a GitHub repo? We need to be able to reproduce it in order to help you. Thanks.

@MBakhshi96
Author

@raulk I've added a test case showing the problem here.
The example works as follows:

  • First, every node broadcasts a message containing its id in round 1.
  • Then each node acknowledges every round-1 message it receives and adds its own id to it. Acknowledgements are broadcast to all of the nodes.
  • Every node receives the acknowledgements and prints them.

We start with n = 10 nodes. If everything works, every node must receive n*n + n messages and then the execution terminates, but in this example the execution never stops. You can check the number of acks for every message in the output and you'll see that not all of the acks are received by the nodes.
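Roughly, each node does something like the following (a simplified sketch against the go-libp2p-pubsub API, not the exact test-case code; the topic name, message format, and import paths are illustrative and depend on the go-libp2p version):

```go
package demo

import (
	"context"
	"fmt"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/host"
)

// runNode sketches what each node in the test case does: join gossipsub,
// subscribe to a shared topic, print everything it receives, and broadcast
// its own id as the round-1 message.
func runNode(ctx context.Context, h host.Host) error {
	ps, err := pubsub.NewGossipSub(ctx, h)
	if err != nil {
		return err
	}
	sub, err := ps.Subscribe("demo-topic")
	if err != nil {
		return err
	}

	// Receive loop: in the real test case, round-1 messages would also be
	// acknowledged here with another Publish call.
	go func() {
		for {
			msg, err := sub.Next(ctx)
			if err != nil {
				return // context cancelled or subscription closed
			}
			fmt.Printf("received %q from %s\n", msg.Data, msg.GetFrom())
		}
	}()

	// Round 1: broadcast this node's id once.
	return ps.Publish("demo-topic", []byte(h.ID().String()))
}
```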

@vyzo
Collaborator

vyzo commented Aug 23, 2019

Are there any logs about dropped messages?

@MBakhshi96
Author

@vyzo Where can I find the logs for this execution? There is no log in the output, but that may be because of the logging level used in the pubsub code.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

export IPFS_LOGGING=info

@vyzo
Collaborator

vyzo commented Aug 23, 2019

also, what is your topology?

@MBakhshi96
Author

@vyzo My topology is a simple ring, but I've also tested it with a fully connected topology.
The logs state that messages couldn't be delivered:

INFO pubsub: Can't deliver message to subscription for topic TOPIC; subscriber too slow pubsub.go:522

I don't know what causes this problem and why these messages don't get retransmitted.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

this log tells you that the pubsub subsystem is dropping messages at subscription delivery; you are simply not consuming the messages fast enough.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

note that there is no retransmission whatsoever in pubsub; also note that the messages are propagated normally, they are just dropped at delivery.

@MBakhshi96
Author

@vyzo What do you mean by not consuming fast enough? I'm receiving messages inside a for loop that simply waits for a message and then prints it. How can I consume them faster?
How can I prevent this situation? I mean, how can I get notified that the receiver can't handle more messages, so that I can stop overwhelming it?

@vyzo
Collaborator

vyzo commented Aug 23, 2019

Are you running the receiver in separate goroutines?
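For example, one way to keep the subscription drained even while processing is slow is to call Next in a tight loop and hand the messages off to a large application-side buffer. A rough sketch (the buffer size and names are illustrative, not a pubsub API):

```go
package demo

import (
	"context"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// drain pulls messages off the subscription as fast as possible and hands
// them to a large application-side channel, so slow processing (printing,
// ack bookkeeping) never backs up pubsub's internal delivery channel.
func drain(ctx context.Context, sub *pubsub.Subscription) <-chan *pubsub.Message {
	out := make(chan *pubsub.Message, 1024)
	go func() {
		defer close(out)
		for {
			msg, err := sub.Next(ctx)
			if err != nil {
				return // context cancelled or subscription closed
			}
			select {
			case out <- msg:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}
```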

@MBakhshi96
Author

MBakhshi96 commented Aug 23, 2019

@vyzo Yes. You can take a look at the code I provided for reproducing the problem in the previous comments; you can use the code here.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

what is your message rate? it may be that your computer is too slow.

@MBakhshi96
Author

@vyzo Actually, I don't know my message rate. In the provided example, every node publishes only 1+10 messages, but I don't know how long it takes to publish these messages. Also, even if my PC were too slow, which it is not, I think it's not good to lose messages. There must be a way to ensure reliable message delivery.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

there might be something else at play; are you receiving any messages?
Maybe your receiver goroutines are not running at all.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

Also, re: dropped messages: there has to be a throttle somewhere; we can't buffer an infinite number of messages.

@MBakhshi96
Author

@vyzo Most of the messages get delivered; I only lose a few.
How can I increase the buffer capacity? I know that it's not possible to keep all of the messages, but the number in this case is not really huge. Also, it might be a good idea to notify publishers when recipients can't keep up with them.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

there is currently no way to specify the subscription buffer size.

@MBakhshi96
Author

@vyzo So what do you propose? How can I work around this problem, given that I need a reliable broadcast scheme?

@vyzo
Collaborator

vyzo commented Aug 23, 2019

You could perhaps make a PR to make the buffer capacity configurable, but that is not the long-term solution. How many nodes are you running on a single computer?

@MBakhshi96
Author

@vyzo Between 10 and 20.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

that's weird, it's not a lot of nodes.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

is there any delay between message transmissions, or are you sending as fast as you can?

@MBakhshi96
Author

@vyzo There is no delay between reception and transmission.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

can you add a small delay before transmitting consecutive messages?

@MBakhshi96
Author

@vyzo I tried adding 100 milliseconds of delay before publishing to pubsub, but the problem persists and has even gotten worse!

@vyzo
Collaborator

vyzo commented Aug 23, 2019

are you blocking the receive loop with that delay? that could explain getting worse.

@MBakhshi96
Author

@vyzo I was just inspecting the pubsub.go code and discovered here that the capacity of the channel is only 32! Also, when the channel reaches its capacity, the code simply discards the message!
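The behaviour looks roughly like the standard non-blocking send pattern (a sketch of the pattern, not the library's actual code):

```go
package main

import "fmt"

// Message stands in for a pubsub message in this sketch.
type Message struct{ Data []byte }

// deliver mimics the pattern: try a non-blocking send into a fixed-capacity
// channel, and if the channel is full, drop the message instead of blocking
// the router.
func deliver(ch chan *Message, msg *Message, topic string) {
	select {
	case ch <- msg:
		// handed off to the subscriber
	default:
		// buffer full: "subscriber too slow", message silently dropped
		fmt.Printf("can't deliver message to subscription for topic %s; subscriber too slow\n", topic)
	}
}

func main() {
	ch := make(chan *Message, 32) // the capacity observed in pubsub.go
	deliver(ch, &Message{Data: []byte("hello")}, "demo-topic")
}
```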

@MBakhshi96
Author

are you blocking the receive loop with that delay? that could explain getting worse.

@vyzo No. I run it in another goroutine.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

btw, are you maxing the cpus in your computer?

@MBakhshi96
Author

@vyzo No!

@MBakhshi96
Author

@vyzo In case you want to investigate the problem further, I have pushed the example for reproducing it; I also mentioned it before.

@vyzo
Collaborator

vyzo commented Aug 23, 2019

this is very weird, you are not maxing your cpus and yet you are too slow to receive the messages!

@MBakhshi96
Author

@vyzo It gets better when I add random delays in the range of a second, but that is too much delay for 10 nodes!

@MBakhshi96
Author

No news in this thread?

@aschmahmann
Contributor

A few things about this issue:

  1. While you "only have 10 nodes", the number of messages incident on any given node can be higher (e.g. 20-30 even under a simultaneous step-by-step broadcast). Add in some randomness, and an occasional dropped message is entirely plausible (when I ran it on my machine I mostly got all messages through, and sometimes one message was dropped). It's also worth noting that a graph with higher degree already has some redundancy built into the system.
  2. As was mentioned above, there has to be a message queue size limit at some point (although arguing for >32 isn't unreasonable), since the alternative is back pressure that slows the whole network down to the speed of the slowest nodes.
  3. If you want reliable transmission, the current story is to layer some reliability on top of pubsub (see the sketch after this list). For example, https://github.com/libp2p/go-libp2p-pubsub-router is a persistent Key-Value store on top of pubsub. If you're interested in other persistence schemes (e.g. Key-MultiValue), let me know, since I've already started some preliminary work on this.
  4. If it would help, we could let the sender know when a message was dropped using PubSub's internal event system. Right now we only have PeerJoin and PeerLeave, but we could potentially add other events, such as PeerOverflowed, that could be handled at the application layer if that would be helpful.
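To make (3) concrete, here is a minimal sketch of one way to layer retransmission on top of pubsub at the application level (the topic name, interval, and acked callback are illustrative and application-defined, not an existing API):

```go
package demo

import (
	"context"
	"time"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// publishUntilAcked republishes the same payload on a timer until the
// application decides enough acknowledgements have arrived (via the acked
// callback) or the context is cancelled.
func publishUntilAcked(ctx context.Context, ps *pubsub.PubSub, topic string, data []byte, acked func() bool) error {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		if acked() {
			return nil
		}
		if err := ps.Publish(topic, data); err != nil {
			return err
		}
		select {
		case <-ticker.C:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```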

@MBakhshi96
Author

@aschmahmann Making the queue size configurable would be good, so everybody can change the limit based on their implementation's needs.

If you want reliable transmission the current story is to layer some reliability on top of pubsub

I don't want to store anything on top of pubsub; the only thing I want is reliable broadcasts using pubsub.

If it would help we could let the sender know if a message was dropped using PubSub's internal event system.

That would be really helpful for me, since I would then be able to retransmit lost messages.

@aschmahmann
Contributor

aschmahmann commented Sep 6, 2019

I don't want to store anything on top of pubsub, the only thing that I want is to have reliable broadcasts using pubsub.

@MBakhshi96 it really sounds like there's some shared state you're trying to track. Take this example, where rebroadcasting and/or a persistence layer is the only way to help with lost messages.

A-B-C are connected in a line. A sends a message to B, and B doesn't send it to C (maybe B crashed, maybe it blacklisted C, etc.). Even though A wanted to send a message to C and successfully sent the message to B, there's no way for it to know whether C received the message (or even that C exists). Note that even if A and C connect directly afterwards, the message A initially sent will not be automatically rebroadcast.

Even your demo has this same property. All the nodes are implicitly aware of the other nodes and are trying to operate on the shared state map[message]map[messageAcks]struct{}.
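In Go terms that shared state is roughly the following (a sketch; the type and field names are illustrative):

```go
package demo

import "sync"

// ackState sketches the shared state the demo is effectively maintaining:
// for every message id, the set of peer ids that have acknowledged it.
type ackState struct {
	mu   sync.Mutex
	acks map[string]map[string]struct{}
}

// record notes that peerID has acknowledged msgID.
func (s *ackState) record(msgID, peerID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.acks == nil {
		s.acks = make(map[string]map[string]struct{})
	}
	if s.acks[msgID] == nil {
		s.acks[msgID] = make(map[string]struct{})
	}
	s.acks[msgID][peerID] = struct{}{}
}
```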

That would be really helpful for me, so I will be able to retransmit lost messages.

Recall that, as in the above example, there are some messages you won't even know were lost. Adding this new event would be an optimization that lets us retransmit the state less frequently, but it's not strictly necessary in order to layer persistence on top of pubsub.

@sincoew

sincoew commented Dec 31, 2020

I also lose messages when using gossipsub in a standalone project (numerous messages to the same peer, and the CPU is slow).
Will gossipsub have an 'at least once' or 'exactly once' delivery option in the future?
