Build for Netgear R7800

Yes, unfortunately, as @hnyman said, those blk request and async read errors are normal when the usb_storage packages are enabled. Your bootlog is okay; the other errors are informational messages from the driver initialization routines.

I passed the 24-hour mark without wifi crashing!

There's 1 more commit in my tree that might have also helped indirectly


Guess we have to clarify which one solves the issue

@chunkeey hey, decreasing the ath10k buffers seems to cause issues on qca9984 under high load; see posts #167 and #172 (FS#801) here and onward.
I remember there have been issues earlier related to the coherent DMA pool size. Maybe that route could also help for ipq4019 (the reason the ath10k buffers were decreased)?
@nbd

According to the git commit, reducing the buffer size was meant to decrease the memory pressure on systems with only 128 MB RAM (which is quite common on ipq40xx routers).

Yep.
But the commit was rather sneaky: it was labeled as "enable ath10k AHB support for QCA4019" which sounds quite innocent. But in addition to that it also removed the OTP check patch breaking wifi for some QCA9xxx devices (requiring a new fix for that) and made these performance settings changes for all ath10k users.

It will be interesting to see whether the patch from @dissent1 reverting the performance settings fixes things for some R7800 users.

@chunkeey @nbd @blogic @dissent1

Unfortunately the wifi crash occurred again after 48 hours of uptime, so the patch only postpones the crash.

https://pastebin.com/Jmib8bCE

Actually, the 9984 still looks broken. And nobody with the chip ever bothered to fix/investigate it.
I'm guessing that the 9984 needs to go with the pre-cal + specific board file, just like the IPQ40xx. The 9980 might be in the same boat. But I don't have a device with it, so I can only speculate.

@dissent1, @hnyman: what's the plan for fixing this? It would be possible to make a patch to quadruple the buffers and see if it lasts a day longer. Another option is to test if the older firmwares do not crash under the same circumstances.

You may be right about the caldata file, because it seems that it's 2.4 GHz that is crashing (board.bin is taken from the 5 GHz caldata).
I experienced the same rx ring buffer corruption on 2.4 GHz yesterday, and a crash a while ago after I changed the channel to 11 and the country code to US. Channel 13 with the EU country code has been working flawlessly for me.
Considering this, it may really be that the calibration data is faulty for the 2.4 GHz chip... It seems to be an occasional crash that happens when certain environmental conditions are met (devices connected and surrounding APs?)

Looks like Tetsuo Osaka followed up on this issue and posted it on the ath10k ML.
http://lists.infradead.org/pipermail/ath10k/2017-May/009785.html

I have problems compiling your build lede-r4214-822ee54544-20170526.
This is part of the log, which contains the error: https://pastebin.com/CUwfk5Cs
I've followed your instructions for rebuilding your build environment.

Is this an error on my side or is there something wrong in the line?

By the way, thank you so much for providing these marvelous builds. I've enjoyed them for many years on my WNDR3700.

I can do troubleshooting, but I don't know how much longer I will be able to keep LEDE on the device, as the family is very unhappy with the situation and wants me to return to stock.

i got some new things in my log
[30648.618741] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode
[30648.618807] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode
[30648.626077] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode
[30648.634168] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode
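
For anyone who wants to see how often a message like this repeats, here is a minimal Python sketch that tallies identical kernel messages from saved `dmesg` or `logread` output. The regular expression only assumes the `[timestamp] driver device: message` shape seen in the lines above, and the name `count_messages` is just illustrative.

```python
import re
from collections import Counter

# Matches lines shaped like "[30648.618741] ath10k_pci 0001:01:00.0: message".
# An optional leading "> " (forum quote marker) is tolerated.
LINE_RE = re.compile(r"^(?:>\s*)?\[\s*(\d+\.\d+)\]\s+(\S+)\s+(\S+):\s+(.*)$")


def count_messages(lines):
    """Return a Counter keyed by (driver, device, message)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            _timestamp, driver, device, message = match.groups()
            counts[(driver, device, message)] += 1
    return counts


# The four lines quoted in the post above.
sample = [
    "[30648.618741] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode",
    "[30648.618807] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode",
    "[30648.626077] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode",
    "[30648.634168] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode",
]

for (driver, device, message), n in count_messages(sample).most_common():
    print(f"{n}x {driver} {device}: {message}")
```

Feeding it a full log instead of `sample` should show quickly whether one message dominates before a crash.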

[quote="burningjoe, post:192, topic:316"]
I have problems compiling your build lede-r4214-822ee54544-20170526.
This is part of the log, which contains the error: https://pastebin.com/CUwfk5Cs
I've followed your instructions for rebuilding your build environment.

Is this an error on my side or is there something wrong in the line?
[/quote]

You need to delete that file: package/network/config/firewall/patches/100-test.patch

It contained a firewall fix from Jow in early May and that code has now been mainlined to the firewall with the latest firewall3 version bump. So the patch is now unnecessary (as it tries to add source code that already exists)

Specifically, the patch matches this commit in firewall3 sources
https://git.lede-project.org/?p=project/firewall3.git;a=commitdiff;h=e5dfc8253bebb7cfed06f81f34bbe1afdf285735;hp=f62595480555f4034841cfbdec5858645528ae7d
That commit was imported into LEDE on Saturday (just after the build you are using) by this firewall upgrade:
https://git.lede-project.org/?p=source.git;a=commitdiff;h=6e46f6edc4ee8ad127658c55616bb9d32a8f2d1a

The patch will get removed from my next build. So, you can also wait a few hours and then download new firmware creation patches that will apply cleanly.

EDIT:
new version without the patch: lede-r4235-61eb18d3f7-20170529

I'm noticing a bunch of problems with the 5GHz radio on r4235. Changing any frequency options will kill the radio entirely, forcing a reboot to bring it back up. I also noticed it has some issues selecting the appropriate frequency on reboot. I had forced channel 36, and it rebooted on 40. I tried forcing 48, and it booted into 48 fine. I then tried 52 and it failed to associate on reboot. Lastly, I tried auto and it also failed to associate. All these tests were done at 80 MHz channel width in AC mode. I'm just going to leave it on 36 (40) in the meantime.

Is anybody else noticing this?

That is new. Probably visible in system log as:

Wed May 31 12:08:20 2017 daemon.notice netifd: radio0 (10144): WARNING (wireless_add_process): executable path /usr/sbin/wpad does not match process 1953 path ()
Wed May 31 12:08:20 2017 daemon.notice netifd: radio0 (10144): Device setup failed: HOSTAPD_START_FAILED

Channel selection depends quite much on the country CRDA settings etc., so it is quite possible that some channels will not work due to DFS restrictions etc.
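
To make the DFS point a bit more concrete, here is a rough Python sketch that maps a 5 GHz primary channel to its 80 MHz block and flags whether that channel falls in a range that commonly requires DFS. The DFS ranges (52-64 and 100-144) reflect typical FCC/ETSI rules and are an assumption here; the real answer comes from the regulatory database for the configured country. The function names are illustrative only.

```python
# 80 MHz blocks in the 5 GHz band: (first channel, last channel, center channel).
VHT80_BLOCKS = [
    (36, 48, 42),
    (52, 64, 58),
    (100, 112, 106),
    (116, 128, 122),
    (132, 144, 138),
    (149, 161, 155),
]


def vht80_block(channel):
    """Return the (first, last, center) 80 MHz block for a primary channel, or None."""
    for first, last, center in VHT80_BLOCKS:
        # Valid primary channels are spaced 4 apart inside each block.
        if first <= channel <= last and (channel - first) % 4 == 0:
            return first, last, center
    return None


def needs_dfs(channel):
    """Assumption: channels 52-64 and 100-144 require DFS (typical FCC/ETSI rules)."""
    return 52 <= channel <= 64 or 100 <= channel <= 144


for ch in (36, 40, 48, 52):
    print(ch, vht80_block(ch), "DFS" if needs_dfs(ch) else "no DFS")
```

Channels 36, 40 and 48 all share one 80 MHz block, while 52 sits in a block that typically needs a radar check before the AP may start transmitting, so a failure on 52 and on auto is at least consistent with a DFS restriction.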

I followed dissent1's advice to change to a fixed channel.

Changed the location to world:
For 5 GHz the AP was not visible on several channels, but I finally found one that worked.

For 2.4 GHz I chose channel 11 and the wifi crashed anyway.
Testing channel 6 now.

@chunkeey @hnyman QCA9984
I've done some extensive testing of the 2.4 GHz issue. I haven't encountered a crash like some users have, but I do see very low throughput in the router-to-client direction, tested on 2 different devices:
2.4 ghz:
router to client device - 25-50 mbits/s
client device to router - 100+ mbits/s
For reference on 5ghz I get 500/500 in both directions.
I've tested all firmware branches 10.4-3.2, 10.4-3.3 and 10.4-3.4. I have also tested CT firmware. Ath10k-CT driver constantly produces errors on both bands, so not taking it into account.
I've also extracted board.bin from board-2.bin, tried symlinking the 2.4 ghz cal file to board.bin instead of 5ghz.
Result is the same in all cases.
Also, that crap that is being produced in identification line
Sun Jun 4 03:05:22 2017 kern.err kernel: [ 26.344681] ath10k_pci 0001:01:00.0: failed to fetch board data for bus=pci,vendor=168c,device=0046,subsystem-vendor=168c,subsystem-device=cafeOh???
Sun Jun 4 03:05:22 2017 kern.err kernel: [ 26.344681] m
Sun Jun 4 03:05:22 2017 kern.err kernel: [ 26.344681] ??????,? from ath10k/QCA9984/hw1.0/board-2.bin

happens on all firmwares and board files with current compat-wireless.
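
For reference, the lookup key in that log line appears to be built from the bus type and the PCI IDs, with the garbage following the subsystem-device field. Below is a small Python sketch of what a clean key would look like. The format is inferred purely from the message above; this is not the driver's code, and `board_lookup_key` and `is_clean_key` are made-up names.

```python
import re

# Shape of a clean key, inferred from the log line above (an assumption).
KEY_RE = re.compile(
    r"bus=\w+,vendor=[0-9a-f]{4},device=[0-9a-f]{4},"
    r"subsystem-vendor=[0-9a-f]{4},subsystem-device=[0-9a-f]{4}"
)


def board_lookup_key(bus, vendor, device, subsystem_vendor, subsystem_device):
    """Format the IDs the same way they appear in the kernel message."""
    return (
        f"bus={bus},vendor={vendor:04x},device={device:04x},"
        f"subsystem-vendor={subsystem_vendor:04x},"
        f"subsystem-device={subsystem_device:04x}"
    )


def is_clean_key(text):
    """True if the text is exactly a key of the expected shape, with nothing trailing."""
    return KEY_RE.fullmatch(text) is not None


key = board_lookup_key("pci", 0x168C, 0x0046, 0x168C, 0xCAFE)
print(key)
print(is_clean_key(key))            # key built from the IDs alone
print(is_clean_key(key + "Oh???"))  # key with trailing junk, as in the log
```

Running the second check against keys pulled from a log would be one way to confirm whether a given compat-wireless build still shows the trailing junk.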

So, to conclude from the above: the stability and performance issue on the 2.4 GHz band seems to be mostly related to the ath10k driver, not the firmware.

Also, it seems that our device does have OTP to get calibration and identification from, because I've found a lot of mailing lists with various logs from Netgear R7800 on ath10k driver that have OTP working.
Adding to that, there's an upstream patch to get bmi identification working for pre-cal file on QCA99xx https://patchwork.kernel.org/patch/9748097/
But it doesn't work in our case, because what we have is not a pre-cal file (pure cal) but a cal file (pre-cal + board data). So maybe the offset is shifted, or maybe it is possible to extract the pre-cal from it.
@nbd

Just a note: The ?????,x crap was fixed

This commit will probably be in the next compat-wireless refresh. So no need to worry about it.

I can't say much about the bad performance on the 2.4G though. :worried:

I was looking for those messages. Do you have a link to the "various logs from Netgear R7800 on ath10k driver that have OTP working"? I can't seem to find anything since most of the OEM bootlogs are not that verbose when it comes to wifi.

Note: The QCA9984 cards from compex/unex/... do have an eeprom.
That's why the BMI Identification is/was working for them, since this was always supported by ath10k in the kernel.

As for pre-cal, cal and otp, Adrian Chadd explained in his post what's going on behind the scenes: https://www.mail-archive.com/ath10k@lists.infradead.org/msg06233.html

[...]
Each board data is custom for the board layout / part selection - it's
a template that is used during calibration. The data in OTP is just a
diff against the board template (board.bin / board-2.bin.) [...]

If people aren't using unique BMI IDs (which is another question we
have for QCA) then it's possible you don't have enough information to
"know" which board data to use, so it has to be overridden by a custom
package. We do this at work for our own boards as well - they're
sufficiently different to a reference board that indeed we need to "know".

Now, the reason for pre-loading the calibration data is because it's
needed early in the boot process so the firmware/driver has some idea
of what the hardware is.

So, the driver steps should be:

  • If you have a pre-calibration file, you load that in before you kick
    the firmware too hard;
  • then you read the calibration data /back/ - then the normal firmware
    process will fetch the board ID;
  • then it loads the board-2.bin matching the board/BMI ID, then
  • starts things normally.

Now, I forget if the pre-cal data (and say, data in flash versus data
in OTP) is the whole thing or a diff against the board data. I'd have
to triple-check. The OTP data is certainly just a diff against the
board data.
[...]

It seems I was using some truly magic keywords when searching for the OTP issue on qca9984 and R7800, because I can't find those logs at first glance now.
Still, here are some brief findings:
http://lists.infradead.org/pipermail/lede-dev/2016-December/004987.html - Netgear R9000 with qca9984, most likely in the same boat
https://www.spinics.net/lists/linux-wireless/msg160696.html - but there are some extras

TP-link C2600 with qca9980 is in the same boat.
OEM bootlog, post 559 https://forum.openwrt.org/viewtopic.php?id=54973&p=23