ipq40xx: 802.11s unicast package loss while broadcast works #210
Comments
Configuration on NDS-The-TARDIS:
dmesg on NDS-The-TARDIS:
The last lines look like #188, but I have no idea what that means.
The early builds in here should be very much like (older) stock firmware images. If an earlier one works, you can bisect the problem if you wish, and if you can find the bad commit, I can probably fix it. https://www.candelatech.com/downloads/ath10k-4019-10-4b/bisect/all_builds-4019-W.tar.gz
I will try, thanks.
Why didn't you answer my question, please? I want to compile ath10-4.19 with the latest commit from OpenWrt, but the compilation fails, and there is a problem with recent versions. I posted it in the issues section.
I have been bisecting the firmware for 3 weeks now. Unfortunately, the issue is very unpredictable, and I have to wait a day or two to see whether the problem really occurs. I also have the impression that there might be more than one bug, because I observe different behaviours while bisecting. I will try to continue, but if you have any hint on how to speed up the bisect, please let me know.
I traced the error down to the first broken commit:
However, during the bisect, I observed different kinds of behavior:
Here is my current bisect.log:
Since I observed different behaviors, I suspect this is not the end of the bisect. Please advise me on how to continue.
Thanks for grinding through this; I'm sure it was not much fun to bisect. The patch in build 235 looks fairly harmless at first glance. Can you please send me the crash file/information?
Now -234- has started to crash as well. :( Here is the log.
Ok, 234 introduced a bug that was fixed ~5 commits later (back in 2016). To make bisecting less painful, I merged those commits. Please see how this one works for you. It has those 5 commits squashed on top of 234.
If this version is just -234- squashed together with the next five commits, I do not understand how it differs from -239-, but I will try it. I'll come back when I have results.
The new firmware firmware-5-full-community-qcache.bin.gz (from above) also doesn't work.
The router crashes again and again. The current log looks like this:
I can try to capture a firmware crash if that is helpful. Or should I try to get a log of the crash of the router itself? I do not even understand how the chip firmware can crash the router itself. This must be a problem in the driver, mustn't it?
'crashing' is an overloaded term. Please be more specific. If OS or firmware is crashing, then logs are helpful.
With "the router is crashing" I mean that the OS is crashing. I cannot confirm whether the firmware is also crashing with this firmware. (My logs do not contain it, since the logs were lost when the OS crashed. From now on, I am pulling the logs from the router to a remote system, so I can inspect them after the OS crashes.) In earlier tests, I observed that both the OS and the firmware were crashing. I will now try -233-.
-233- has now been running stably for two days. This means -233- is OK. What are the next steps?
I will look closely at the commit that bisects as bad and try to figure out the problem. Thanks for your detailed testing.
Are you able to connect a serial console to your device to get more info about the crashes? Since only the firmware changed, I guess it may be a firmware crash that is triggering some other bug that crashes the OS. The serial console logs should help me debug the firmware crash.
I will make sure we get serial logs, but it will take a few days.
Ok, the serial logging is now installed. This is the first log piece: Link.
I will keep this setup running and post what happens later.
Here is a log where the OS is also crashing: Link. Look at approx. 17.44h. (This is still -234- "squashed with 5 commits".) Can you figure something out from this log?
Here is another log from today that includes multiple crashes, in case you want to compare: Link.
I tried to backport some relevant fixes to make bisecting clean, but that became a big mess. The patch that starts causing trouble for you initialized the rate codes to an invalid value to catch other bugs in rate-ctrl, and the next 100+ patches deal with fallout from that patch. So, maybe we can approach the problem from a different direction. In your original bisect notes, you indicate that build 1000 may be OK. Can you re-test that and/or a few other higher builds to see if you can find a stable build there? If so, you could bisect with that as the starting point to find where other problems are introduced.
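A minimal sketch of the bisect suggested above, assuming a known-stable lower build (such as 1000, if the re-test confirms it) and a known-broken higher build. The function names and the upper build number in the usage note are hypothetical. Since the issue can take a day or two to show up, `is_bad` would in practice be a manual verdict after a soak test rather than an automated check.

```python
def first_bad_build(good, bad, is_bad):
    """Binary search for the lowest build number that misbehaves.

    Assumes `good` is stable, `bad` is broken, and every build after the
    first broken one is broken too. That assumption fails if a bug was
    introduced and fixed again inside the range.
    """
    if good >= bad:
        raise ValueError("good build must be lower than bad build")
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid   # first bad build is at or below mid
        else:
            good = mid  # first bad build is above mid
    return bad


def max_steps(good, bad):
    """Worst-case number of builds to test, i.e. ceil(log2(bad - good))."""
    return (bad - good - 1).bit_length()
```

For example, with a hypothetical range of 1000 (good) to 1100 (bad), `max_steps(1000, 1100)` gives 7 test rounds; at two days of observation per round that is roughly two weeks.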
I am currently testing build 1000. I will come back when I have definite results. (I need to double-check, since some environmental conditions have changed. The "other" router (Ubiquiti UniFi AC Mesh) was "autoupdated" to mainline ath10k, so I need to check whether this setup still triggers our issue at all.) Another idea: do you have any hardware to reproduce this in your lab? Maybe this could help to debug the issue. (We could provide you with our firmware/config if you tell us what kind of hardware you have.)
Or another idea: Can we maybe set the bitrates manually to bypass the rate-ctrl issues (or so)?
You can override rate-ctrl for larger data frames (this was designed for a specific test case, so it is probably not good for general use). You can look at the logic in the driver that deals with txo and modify that further if you want. If mucking with
Unfortunately, I do not have time to reproduce the bug locally.
[root@ct523c-0b29 ~]# cat /debug/ieee80211/wiphy1/ath10k/set_rate_override
Things got a bit quiet here. The reason is that the original setup was at the home of a colleague. Since he was annoyed by the broken wifi, I moved the setup to my home and have been trying to reproduce the issue there ever since, so far without success. Therefore, we will now start testing again at my colleague's home. I hope we will be able to reproduce the issue there. Just to avoid wrong conclusions: my colleague's home was not the only place where we observed the issue. It also appeared in other scenarios.
I was not able to reproduce it there either (within one week). So maybe I was doing something wrong, or there are other influences that we are not aware of yet. However, we decided to go back to mainline ath10k instead. Therefore, I will not investigate further here.
We have two routers:
The two routers are connected with an 802.11s mesh link. The link seems to work fine for a few hours, but then unicast packet loss starts to appear.
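To pin down when the loss sets in, the ping summary can be parsed periodically and logged. The following is a minimal sketch, assuming the summary line format printed by iputils or BusyBox `ping`; the function name is hypothetical and the sample lines in the comments are illustrative, not taken from our routers.

```python
import re

# Matches the summary line of both iputils and BusyBox ping, e.g.
#   "4 packets transmitted, 1 received, 75% packet loss, time 3004ms"
#   "4 packets transmitted, 1 packets received, 75% packet loss"
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def packet_loss_percent(ping_output):
    """Return the loss percentage from a ping summary, or None if absent."""
    match = SUMMARY_RE.search(ping_output)
    if match is None:
        return None
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        return None
    # Recompute from the counters instead of trusting the rounded percentage.
    return 100.0 * (sent - received) / sent
```

Running this against the output of a unicast and a multicast ping at fixed intervals would show the point in time at which the two start to diverge.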
We observe approx. 75% packet loss if we ping NDS-The-Sonic-Screwdriver from NDS-The-TARDIS via unicast:
If we do the same ping with multicast, we get answers for almost all packets from NDS-The-Sonic-Screwdriver (fe80::bceb:adff:fee1:5f79):
Some debug info:
Ath10k-ct firmware info:
Tx-Error-Counters on NDS-The-TARDIS:
I found out that the bug is gone for a few hours (before it appears again) when I run the following command:
This issue completely disappears when I switch from ath10k-ct to mainline ath10k on NDS-The-TARDIS. Even after days, the mesh link remains stable.
Later today, I will add some more information about the mesh configuration to this issue.