Build for Netgear R7800

Moving the discussion here

lede-r4362-5e4bb476c0-20170610

Build contains wifi firmware calibration fixes, as the ath10k wifi driver is patched to properly read the router-specific board data from flash. Details:

I have tried the changes for a few days myself and have not noticed any negative effects.

It is not clear from the discussions or the commit description what issue this change fixes. Any chance you can help and describe it in a few sentences?

It fixes the detection and utilisation of the proper device/radio-specific calibration files that are included in the art/caldata partition on flash.

Earlier, the same generic file was used for both the 2.4 and 5 GHz radios, because the ath10k wifi driver failed to read the correct board info file from the router itself. That was a kludge to get the radios working.

The patches first increase the timeout used in the ID reading (the reading seems to take slightly over 2 seconds, which was the earlier timeout limit). As the ID reading now works, the second change is the proper naming of the radio-specific calibration files read from the flash, so that the 2.4 and 5 GHz radios use separate calibration data.
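
To illustrate why the timeout matters, here is a rough sketch in Python (not driver code). It only simulates time in milliseconds. The 2100 ms read duration is my stand-in for "slightly over 2 seconds", the 3000 ms value is a hypothetical larger timeout rather than the value used in the patch, and `id_read_succeeds` is a made-up helper name.

```python
# Illustrative simulation only: no hardware access, time is counted in ms.

def id_read_succeeds(read_duration_ms, timeout_ms, poll_step_ms=10):
    """Poll until the simulated ID read completes or the timeout expires."""
    elapsed_ms = 0
    while elapsed_ms <= timeout_ms:
        if elapsed_ms >= read_duration_ms:
            return True   # the read finished within the allowed time
        elapsed_ms += poll_step_ms
    return False          # timed out: the caller would fall back to generic data


READ_DURATION_MS = 2100   # assumption standing in for "slightly over 2 seconds"
OLD_TIMEOUT_MS = 2000     # the earlier 2 second limit
NEW_TIMEOUT_MS = 3000     # hypothetical larger value, not taken from the patch

print(id_read_succeeds(READ_DURATION_MS, OLD_TIMEOUT_MS))
print(id_read_succeeds(READ_DURATION_MS, NEW_TIMEOUT_MS))
```

With the old limit the simulated read never completes in time, so the ID is never obtained; with any limit above the read duration it does.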

It is not yet certain what the net effect of the changes is, but the performance of one of the radios likely improves somewhat.

P.S. Note that the discussion started in this thread about two weeks ago, was moved to the exploration thread and finally resulted in the PR. (The whole thing was initially highlighted in March, when adding IPQ4xxx support killed all QCA9XXX radios, as the original hack was temporarily removed and a new hack was implemented. See the discussion from Netgear R7800 exploration (IPQ8065, QCA9984) onward.)

Thx for the explanation. I personally have not noticed any issues myself, but I only have a few clients mostly on 5GHz. Will this be applied to 17.01?

Possibly, but first get it to the LEDE master...

EDIT:
I added the wifi otp change to my 17.01 build in
lede1701-r3437-a6b5ddfd9b-20170611

@hnyman @dissent1 @chunkeey

Unfortunately with the build hnyman uploaded a few days ago this error is back:

ath10k_pci 0001:01:00.0: rx ring became corrupted: -5

@tetsuo55
Like yourself, I've been experiencing random ath10k crashes since I started using LEDE with the R7800. I have two of these installed in different buildings and it's usually the 2.4GHz SSID that disappears. Sometimes once a day, sometimes once a week. I've tried countless combinations of countries and channels without improvement. I've tried the last few hnyman builds.

Do you know of a particular build that you have used that does not exhibit these crashes?

Unfortunately no.

My best results are with @dissent1 build.

@dissent1 could you make a new build with the new patches and your buffer reverts?

In my case, I have been running @hnyman's r3437 build for a couple of days; the log is below. It loads the firmware correctly.

root@LEDE:~# dmesg | grep ath10k
[ 24.242114] ath10k_pci 0000:01:00.0: enabling device (0140 -> 0142)
[ 24.242239] ath10k_pci 0000:01:00.0: enabling bus mastering
[ 24.242849] ath10k_pci 0000:01:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[ 24.374453] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/pre-cal-pci-0000:01:00.0.bin failed with error -2
[ 24.374499] ath10k_pci 0000:01:00.0: Falling back to user helper
[ 31.252484] ath10k_pci 0000:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
[ 31.252520] ath10k_pci 0000:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 1
[ 31.263451] ath10k_pci 0000:01:00.0: firmware ver 10.4-3.4-00082 api 5 features no-p2p,mfp,peer-flow-ctrl,btcoex-param,allows-mesh-bcast crc32 f301de65
[ 33.542766] ath10k_pci 0000:01:00.0: board_file api 2 bmi_id 0:1 crc32 751efba1
[ 39.392884] ath10k_pci 0000:01:00.0: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 512 raw 0 hwcrypto 1
[ 39.477843] ath10k_pci 0001:01:00.0: enabling device (0140 -> 0142)
[ 39.477970] ath10k_pci 0001:01:00.0: enabling bus mastering
[ 39.478669] ath10k_pci 0001:01:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[ 39.619878] ath10k_pci 0001:01:00.0: Direct firmware load for ath10k/pre-cal-pci-0001:01:00.0.bin failed with error -2
[ 39.619909] ath10k_pci 0001:01:00.0: Falling back to user helper
[ 39.998219] ath10k_pci 0001:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
[ 39.998251] ath10k_pci 0001:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 1
[ 40.009097] ath10k_pci 0001:01:00.0: firmware ver 10.4-3.4-00082 api 5 features no-p2p,mfp,peer-flow-ctrl,btcoex-param,allows-mesh-bcast crc32 f301de65
[ 42.284021] ath10k_pci 0001:01:00.0: board_file api 2 bmi_id 0:2 crc32 751efba1
[ 48.192570] ath10k_pci 0001:01:00.0: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 512 raw 0 hwcrypto 1
[71418.792044] ath10k_pci 0001:01:00.0: peer-unmap-event: unknown peer id 1
[71418.792107] ath10k_pci 0001:01:00.0: peer-unmap-event: unknown peer id 1
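
If you want to compare what each radio loaded without reading the whole log, a small Python sketch like this pulls out the board_file lines. The two sample lines are copied from the log above; the regular expression and the `board_files` helper are my own and only assume that the line format stays as shown.

```python
import re

# Two board_file lines copied verbatim from the dmesg output above.
SAMPLE_LOG = """\
[ 33.542766] ath10k_pci 0000:01:00.0: board_file api 2 bmi_id 0:1 crc32 751efba1
[ 42.284021] ath10k_pci 0001:01:00.0: board_file api 2 bmi_id 0:2 crc32 751efba1
"""

# Assumed line format: "ath10k_pci <dev>: board_file api N bmi_id X:Y crc32 HEX"
BOARD_FILE_RE = re.compile(
    r"ath10k_pci (?P<dev>\S+): board_file api (?P<api>\d+) "
    r"bmi_id (?P<bmi_id>\S+) crc32 (?P<crc32>[0-9a-f]+)"
)


def board_files(log_text):
    """Return {pci_device: (bmi_id, crc32)} for every board_file line found."""
    found = {}
    for match in BOARD_FILE_RE.finditer(log_text):
        found[match["dev"]] = (match["bmi_id"], match["crc32"])
    return found


for dev, (bmi_id, crc32) in sorted(board_files(SAMPLE_LOG).items()):
    print(f"{dev}: bmi_id={bmi_id} crc32={crc32}")
```

For this log the two radios report different bmi_id values (0:1 and 0:2) while the board_file crc32 is identical. I am not drawing any conclusion from the crc32 here.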

That sounds more like a bug in the ath10k driver than a purely R7800 related problem. I have not encountered that myself, but I have rather modest wireless usage.

It could be due to the ath10k buffer size reduction commit by @chunkeey that @dissent1 has tested reverting in his build, but intuitively I think that the "corruption" points more to an actual bug in ath10k than just buffer exhaustion.

I think @hnyman may consider including it in his newer builds? Those buffer sizes have been set for a reason, obviously, and I'm not sure it's a good idea to decrease them.

There are not many updates concerning the R7800, so I'm not sure you need the bleeding-edge version if you don't experience any other issues, because you won't notice any changes.

The ath10k buffer size reduction was semi-hidden in the "add ath10k support for 4019" commit, so I don't think that it was ever really discussed / evaluated from the general ath10k perspective.

The original commit https://github.com/lede-project/source/commit/cc189c0b7fa015978b04bb663a75b1da726376b5 included three different things, two of which had almost nothing to do with the commit title...

"mac80211: enable ath10k AHB support for QCA4019"

This patch enables the ATH10K_AHB support for the QCA4019 devices on the AHB bus.
This patch also removes 936-ath10k_skip_otp_check.patch...
It also limits ath10k memory hunger (This is a problem with 128MiB RAM)

I might try reverting that patch, but I am too much of a layman to really have an opinion about the ath10k buffer size adjustments in the two patches created by that commit. It would be great to hear some input from real wifi gurus like @nbd or ipq806x people like @blogic.

Hey guys, I have a question: I have an R7800 with these builds and would like to use some kmod packages (specifically, dnsmasq-full) for split VPN routing, but I understand these won't work on a private build. What is my best option here, use the public releases (I have no idea if those even work vs. these builds)?

If you need separate new kmods, the best bet is to use the public release builds like 17.01.2.

I was going to answer that dnsmasq-full does not even need any additional kmods, but apparently the full variant installs libnetfilter-conntrack, which needs kmod-nf-conntrack-netlink. I compile my build with the default kernel options, so overriding the dependencies may work. I force-installed dnsmasq-full on my own router (currently running the 17.01 build) and it worked OK.

So, the basic safe advice is to use the official release build, but if you want to be adventurous, you might try:

opkg update
opkg remove dnsmasq
opkg install --force-depends dnsmasq-full

"rx ring became corrupted: -5". Let's look it up.

The "rx ring became corrupted" message is generated by:

-5 is the error code for -EIO.
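
As a quick way to double-check the errno name, this Python sketch decodes the log line reported earlier in the thread. It relies on the host's errno table, where 5 is EIO; `decode_rx_ring_error` is just a helper name I made up, and the regular expression assumes the message format shown in the log.

```python
import errno
import re

# The warning exactly as it was reported earlier in this thread.
LINE = "ath10k_pci 0001:01:00.0: rx ring became corrupted: -5"


def decode_rx_ring_error(line):
    """Return (pci_device, errno_name) for an 'rx ring became corrupted' line.

    Returns None if the line does not match the assumed message format.
    """
    m = re.search(r"ath10k_pci (\S+): rx ring became corrupted: -(\d+)", line)
    if m is None:
        return None
    code = int(m.group(2))  # the kernel logs the negative value, e.g. -5
    return m.group(1), errno.errorcode.get(code, "unknown")


print(decode_rx_ring_error(LINE))
```

Running this on a saved log makes it easy to see which radio (PCI device) hit the error and with which code.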

It's clear that the ath10k_htt_rx_amsdu_pop() function returned the error code. Based on the rx_amsdu_pop name, this function is used by the code path that deals with deaggregating A-MSDU frames from a client. (So, if you can prevent that, the error will go away... However, look at the FIXME: ath10k is leaving the device inoperable... The QSDK driver might be more advanced and could restart the device here for you. That said, I don't have access to it, so I don't know.)
Let's continue and look at where and why ath10k_htt_rx_amsdu_pop() would return -EIO. There's only one place where this can happen:

Looking at the comment, there's a problem and the HW and SW are not working correctly.
If ath10k were running out of htt rx buffers, it would print out a warning:

Now, ath10k does have a way to restart itself. This should also clear rx_confused and get the device going again. From what I can see, queue_work(ar->workqueue, &ar->restart_work) would do it.
So @tetsuo55, @dissent1:

diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
index ddd94c53d323..af16c24aeee2 100644
--- a/drivers/net/wireless/ath/ath10k/htt_rx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
@@ -1546,10 +1546,8 @@ static int ath10k_htt_rx_handle_amsdu(struct ath10k_htt *htt)
 	if (ret < 0) {
 		ath10k_warn(ar, "rx ring became corrupted: %d\n", ret);
 		__skb_queue_purge(&amsdu);
-		/* FIXME: It's probably a good idea to reboot the
-		 * device instead of leaving it inoperable.
-		 */
 		htt->rx_confused = true;
+		queue_work(ar->workqueue, &ar->restart_work);
 		return ret;
 	}

This patch schedules an automatic restart, but only for the case where the rx ring got corrupted. In other cases, the device will still be dead.

@hnyman check this out
https://github.com/dissent1/r7800/commit/e2087bb2f29a27e84c152ed924ebd999c3223ddd
This enables all bus clocks and sets them to maximum. At first glance it fixes the timer: for example, each iteration of the OpenSSL benchmark now takes exactly 2.99 to 3 seconds, not 2.91 to 2.99 as it did for me before.

@hnyman but there's a second option: make it scale in accordance with the CPU speed.
There's a driver for that:
https://source.codeaurora.org/quic/qsdk/oss/kernel/linux-msm/commit/?h=eggplant&id=7f4d9b5c8814329a66fe44d0dac55a4bd3cbcb78

I haven't yet managed to port it to kernel 4.9 and adapt it to work with the cpufreq-dt driver instead of cpufreq-krait. Maybe you could take a look if interested? See the last 13 commits in
https://github.com/dissent1/r7800/commits/sta44

I may look into it, but please, let's keep the development discussion in the R7800 exploration thread...

Sorry my bad)