Netgear R7800 exploration (IPQ8065, QCA9984)

@blogic @chunkeey

Thanks for the hint.
Restoring that mac80211 patch restores wifi functionality to R7800 with r3845.

To summarise findings so far:

I reverted the dwc3 commit and restored the 936-ath10k... patch and now R7800 boots again and has wifi.

@hnyman
The FritzBox 4040 boots with r3853-a8164bd171 just fine. There's no hang when it initializes the dwc3, and wifi works there as well. As for the hang on ipq806x: it is possibly caused by a reversed loading order of the USB PHY drivers, dwc3 and dwc3-of-simple. I've seen similar issues with IPQ40XX (fixed there by putting dwc3 in charge of the clocks).

Hey, have you tried supplying the pre-cal data through the device tree? You might need a parser to insert it during boot, though.

@chunkeey
I looked more closely at your wifi-breaking commit https://git.lede-project.org/?p=source.git;a=commit;h=cc189c0b7fa015978b04bb663a75b1da726376b5 and noticed that it also decreases the wifi driver's buffer sizes (pci buffer, htt rx ring) for all ath10k users, not just for the new chip.

I am not a wifi driver expert, so I wonder what is the expected performance impact from this change?

@blogic
Thanks for reverting the dwc3 commit. Any chance that this wifi-breaking commit could also be reverted for now, until a better solution is figured out? It seems pretty strange that a hack that makes the radios of existing devices work is deleted because a new radio chip does not like it. Yes, the existing hack may be theoretically wrong (as upstream seems to think), but it has made the radios of the current devices work.

PS: I tried to find the origins of that wifi hack. It appeared on the openwrt-devel mailing list in April 2016 as part of introducing the C2600 router, but there was no explanation or reasoning for it.
https://lists.openwrt.org/pipermail/openwrt-devel/2016-April/040880.html

Sure, there's this utility:

which I think would allow you to patch the dtb "live". This might also be interesting for the RPI and APU-folks,
since they usually need it to change GPIO assignments, provide initialization stuff/enumeration of connected
spi and i2c devices, ...

However, it doesn't solve the problem at hand, since providing the cal or pre-cal data isn't the problem.
The question is whether or not you can ignore the error from ath10k_core_get_board_id_from_otp().

I don't know where exactly the IPQ806x fails. In theory the following patch for the mac80211-package:

--- a/drivers/net/wireless/ath/ath10k/core.c       2017-03-23 12:44:41.899549793 +0100
+++ b/drivers/net/wireless/ath/ath10k/core.c    2017-03-23 12:44:11.912856777 +0100
@@ -686,7 +686,7 @@ static int ath10k_core_get_board_id_from
    if (ret) {
            ath10k_err(ar, "could not execute otp for board id check: %d\n",
                       ret);
-               return ret;
+               return -EOPNOTSUPP;
    }

    board_id = MS(result, ATH10K_BMI_BOARD_ID_FROM_OTP);

might fix it. @hnyman ?

Edit:

Yes, this patch is necessary for the RT-AC58U. The problem is that the device will panic during
operation because it runs out of memory. Note: The AHB code shares the same buffers with the
PCI implementation.

As for the impact on performance: This has to be measured. I don't have any IPQ806x. But there
was no difference for the IPQ40XX (Tested with FB4040, which has 256 MiB) and the QCA9880 in
my C7. What are your numbers?

[quote="chunkeey, post:276, topic:285"]
I don't know where exactly the IPQ806x fails. In theory the following patch for the mac80211-package:
...
might fix it. @hnyman ?
[/quote]Thanks.
I tested it, and at first glance it really does fix wifi.

Explanation why it works:
The patch makes "ath10k_core_get_board_id_from_otp" return -EOPNOTSUPP on failure, because "ath10k_core_probe_fw" accepts only that error code as a "harmless" failure of the otp board id check. When it sees that error, it uses the board file as a backup source.
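
The error handling described above can be sketched in plain C. This is only an illustration under stated assumptions: the sketch_ functions and the "patched" flag are hypothetical stand-ins, not the driver's real code, and the only behaviour modelled is "every error except -EOPNOTSUPP aborts the probe".

```c
#include <errno.h>

/* Hypothetical stand-in for ath10k_core_get_board_id_from_otp().
 * otp_ret is the result of executing the otp; patched selects the
 * behaviour proposed in the patch (always report -EOPNOTSUPP). */
static int sketch_get_board_id_from_otp(int otp_ret, int patched)
{
	if (otp_ret)
		return patched ? -EOPNOTSUPP : otp_ret;
	return 0;
}

/* Hypothetical stand-in for the check in ath10k_core_probe_fw():
 * only -EOPNOTSUPP is tolerated, any other error aborts the probe. */
static int sketch_probe_fw(int otp_ret, int patched)
{
	int ret = sketch_get_board_id_from_otp(otp_ret, patched);

	if (ret && ret != -EOPNOTSUPP)
		return ret;	/* hard failure: the radio does not come up */

	return 0;		/* carry on and take board data from the board file */
}
```

With an otp failure such as -ETIMEDOUT (which appears to be the -110 in the log below), sketch_probe_fw(-ETIMEDOUT, 0) returns the error, while sketch_probe_fw(-ETIMEDOUT, 1) returns 0 and the probe continues.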

I am currently recompiling the whole firmware to verify the result (so far I have only installed the fix manually with opkg).

It would be great if other users of ipq806x devices could verify the result (for C2600 etc.)

kernel log with this patch looks like this for one radio:

[   16.163318] ath10k_pci 0000:01:00.0: enabling device (0140 -> 0142)
[   16.163401] ath10k_pci 0000:01:00.0: enabling bus mastering
[   16.163850] ath10k_pci 0000:01:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[   16.337294] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/pre-cal-pci-0000:01:00.0.bin failed with error -2
[   16.337351] ath10k_pci 0000:01:00.0: Falling back to user helper
[   22.837360] firmware ath10k!pre-cal-pci-0000:01:00.0.bin: firmware_loading_store: map pages failed
[   23.212157] ath10k_pci 0000:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
[   23.212211] ath10k_pci 0000:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 1
[   23.226748] ath10k_pci 0000:01:00.0: firmware ver 10.4-3.4-00074 api 5 features no-p2p,mfp,peer-flow-ctrl,btcoex-param,allows-mesh-bcast crc32 fa32e88e
[   25.259266] ath10k_pci 0000:01:00.0: unable to read from the device
[   25.259288] ath10k_pci 0000:01:00.0: could not execute otp for board id check: -110
[   25.277326] ath10k_pci 0000:01:00.0: failed to fetch board data for bus=pci,vendor=168c,device=0046,subsystem-vendor=168c,subsystem-device=cafem...from ath10k/QCA9984/hw1.0/board-2.bin
[   25.277588] ath10k_pci 0000:01:00.0: board_file api 1 bmi_id N/A crc32 dd636801
[   26.800717] ath10k_pci 0000:01:00.0: htt-ver 2.2 wmi-op 6 htt-op 4 cal file max-sta 512 raw 0 hwcrypto 1
[   26.882020] ath: EEPROM regdomain: 0x0
[   26.882030] ath: EEPROM indicates default country code should be used
[   26.882036] ath: doing EEPROM country->regdmn map search
[   26.882046] ath: country maps to regdmn code: 0x3a
[   26.882055] ath: Country alpha2 being used: US
[   26.882062] ath: Regpair used: 0x3a

The solution also worked after the full build and flash, so I submitted it as a PR:
https://github.com/lede-project/source/pull/995

Ok, I think I see it now. There's a second part to this 936-ath10k_skip_otp_check.patch, in the
ath10k-firmware package:

define Package/ath10k-firmware-qca9984/install
     $(INSTALL_DIR) $(1)/lib/firmware/ath10k/QCA9984/hw1.0
     ln -s \
            ../../cal-pci-0000:01:00.0.bin \
            $(1)/lib/firmware/ath10k/QCA9984/hw1.0/board.bin
     $(INSTALL_DATA) \
            $(DL_DIR)/$(QCA9984_BOARD_FILE_DL) \
            $(1)/lib/firmware/ath10k/QCA9984/hw1.0/board-2.bin

This creates the symbolic link /lib/firmware/ath10k/QCA9984/hw1.0/board.bin, which points to /lib/firmware/ath10k/cal-pci-0000:01:00.0.bin.

So the cal-data of the 2.4GHz (I think so?) radio is being used as the board data for both.
This wouldn't work for some of the IPQ40XX. For example, the RT-AC58U uses different
PA/FE chips (the 2.4GHz radio has two RTC6649E, the 5GHz radio two SKY85728-11). So each
radio will only work correctly if the right configuration is selected from the board-2.bin file.

With just one board.bin for both radios, how is this working for the QCA9984?

Note: I tried using just board.bin with IPQ40XX. And while the device would boot, the WIFI
performance is abysmal.

Note2: board-2.bin houses a collection of different board data. There's a tool ath10k-bdencoder
in the qca-swiss-army-knife project, that can create and extract boards from board-2.bin file.

Meanwhile I've found a correct and fully functional ipq806x tsens driver. I haven't tested it yet, but it should be OK:
https://github.com/dissent1/r7800/commit/b699ec91058dbe6fddd0504d9434eaab9042d51b

@chunkeey yes, that's what I meant when I suggested symlinking board.bin back then. There ought to be a simpler solution.

Edit: it's the caldata for 5ghz that gets symlinked, not 2.4

Edit2: to summarize for qca9984, it uses:

  • board.bin symlinked to 5ghz caldata
  • board-2.bin downloaded from CAF or Kvalo's git
  • firmware itself from CAF or Kvalo's git

When anything from this list is absent, neither radio comes up.

Edit3: there's a newer board-2.bin available: https://source.codeaurora.org/quic/qsdk/oss/firmware/ath10k-firmware/commit/ath10k/QCA40XX/hw1.0/board-2.bin?id=171b9607fb8cc694ed469a4e29c91af9d92f2971
Try it along with symlinking 5ghz pre-cal to board.bin

Okay the tsens driver works, all 11 sensors are visible but it shows temp in full degrees again, sigh...

Too bad. Sounds again like a hack by somebody, maybe something similar that we discarded a few months ago.

Ok, I assumed it was the 2.4GHz radio, since it is usually the first,
so thanks for clearing this up. BTW: I wrote a mail to the mailing list
to ask about the QCA9984 oddity.
https://marc.info/?l=linux-wireless&m=149028769320374
(Next time, I'll add you to the CC: as well.)

As for the board-2.bin / board.bin:

The ath10k driver tries to locate the correct board data in the board-2.bin. If this fails,
it will fall back to the board.bin. I think you could get away with deleting the
board-2.bin in your configuration and it will still work.

Note: ath10k doesn't do any auth/id checks for the board.bin. If it's there it will
be uploaded... If it works: great.
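
The lookup order described above can be sketched as follows. This is a minimal model, not driver code: struct board_entry, find_in_board2() and select_board_data() are hypothetical names, and board-2.bin is represented as a simple array of named entries.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one board entry inside a board-2.bin collection. */
struct board_entry {
	const char *name;	/* identification string the driver looks for */
	const char *data;	/* board data belonging to that name */
};

/* Search the board-2.bin style collection for an exact name match. */
static const char *find_in_board2(const struct board_entry *entries,
				  size_t count, const char *wanted)
{
	size_t i;

	for (i = 0; i < count; i++)
		if (strcmp(entries[i].name, wanted) == 0)
			return entries[i].data;
	return NULL;
}

/* Prefer the matching board-2.bin entry; otherwise use board.bin as is.
 * The fallback is not checked against the radio in any way. */
static const char *select_board_data(const struct board_entry *entries,
				     size_t count, const char *wanted,
				     const char *board_bin)
{
	const char *data = find_in_board2(entries, count, wanted);

	return data ? data : board_bin;
}
```

In this model, deleting board-2.bin corresponds to calling select_board_data(NULL, 0, ...), which always returns the board.bin contents.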

Hm, I've backed out the "wifi-breaking patch" and added your patch, and this does not seem to produce a wifi-not-broken build. Is there more to it than that? Here's the kernel output of the driver horking the firmware load:

[ 10.111180] ath10k_pci 0000:01:00.0: enabling device (0140 -> 0142)
[ 10.111721] ath10k_pci 0000:01:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[ 10.281986] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/pre-cal-pci-0000:01:00.0.bin failed with error -2
[ 10.282037] ath10k_pci 0000:01:00.0: Falling back to user helper
[ 10.472881] firmware ath10k!pre-cal-pci-0000:01:00.0.bin: firmware_loading_store: map pages failed
[ 10.473045] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/cal-pci-0000:01:00.0.bin failed with error -2
[ 10.480835] ath10k_pci 0000:01:00.0: Falling back to user helper
[ 10.686545] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/QCA9984/hw1.0/firmware-5.bin failed with error -2
[ 10.686583] ath10k_pci 0000:01:00.0: Falling back to user helper
[ 10.717425] firmware ath10k!QCA9984!hw1.0!firmware-5.bin: firmware_loading_store: map pages failed
[ 10.717600] ath10k_pci 0000:01:00.0: could not fetch firmware file 'ath10k/QCA9984/hw1.0/firmware-5.bin': -11
[ 10.725463] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/QCA9984/hw1.0/firmware-4.bin failed with error -2
[ 10.735356] ath10k_pci 0000:01:00.0: Falling back to user helper
[ 10.786290] firmware ath10k!QCA9984!hw1.0!firmware-4.bin: firmware_loading_store: map pages failed
[ 10.786461] ath10k_pci 0000:01:00.0: could not fetch firmware file 'ath10k/QCA9984/hw1.0/firmware-4.bin': -11
[ 10.794328] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/QCA9984/hw1.0/firmware-3.bin failed with error -2
[ 10.804240] ath10k_pci 0000:01:00.0: Falling back to user helper
[ 10.859650] firmware ath10k!QCA9984!hw1.0!firmware-3.bin: firmware_loading_store: map pages failed
[ 10.859773] ath10k_pci 0000:01:00.0: could not fetch firmware file 'ath10k/QCA9984/hw1.0/firmware-3.bin': -11
[ 10.867548] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/QCA9984/hw1.0/firmware-2.bin failed with error -2
[ 10.877560] ath10k_pci 0000:01:00.0: Falling back to user helper
[ 10.917004] firmware ath10k!QCA9984!hw1.0!firmware-2.bin: firmware_loading_store: map pages failed
[ 10.917515] ath10k_pci 0000:01:00.0: could not fetch firmware file 'ath10k/QCA9984/hw1.0/firmware-2.bin': -11
[ 10.924939] ath10k_pci 0000:01:00.0: could not fetch firmware files (-11)
[ 10.934880] ath10k_pci 0000:01:00.0: could not probe fw (-11)
[ 10.942109] ath10k_pci 0001:01:00.0: enabling device (0140 -> 0142)
[ 10.947857] ath10k_pci 0001:01:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[ 11.121669] ath10k_pci 0001:01:00.0: Direct firmware load for ath10k/pre-cal-pci-0001:01:00.0.bin failed with error -2
[ 11.121712] ath10k_pci 0001:01:00.0: Falling back to user helper
[ 11.174919] firmware ath10k!pre-cal-pci-0001:01:00.0.bin: firmware_loading_store: map pages failed
[ 11.175360] ath10k_pci 0001:01:00.0: Direct firmware load for ath10k/cal-pci-0001:01:00.0.bin failed with error -2
[ 11.182914] ath10k_pci 0001:01:00.0: Falling back to user helper
[ 11.420601] ath10k_pci 0001:01:00.0: Direct firmware load for ath10k/QCA9984/hw1.0/firmware-5.bin failed with error -2
[ 11.420644] ath10k_pci 0001:01:00.0: Falling back to user helper
[ 11.467551] firmware ath10k!QCA9984!hw1.0!firmware-5.bin: firmware_loading_store: map pages failed
[ 11.467741] ath10k_pci 0001:01:00.0: could not fetch firmware file 'ath10k/QCA9984/hw1.0/firmware-5.bin': -11
[ 11.475603] ath10k_pci 0001:01:00.0: Direct firmware load for ath10k/QCA9984/hw1.0/firmware-4.bin failed with error -2
[ 11.485496] ath10k_pci 0001:01:00.0: Falling back to user helper
[ 11.517657] firmware ath10k!QCA9984!hw1.0!firmware-4.bin: firmware_loading_store: map pages failed
[ 11.517831] ath10k_pci 0001:01:00.0: could not fetch firmware file 'ath10k/QCA9984/hw1.0/firmware-4.bin': -11
[ 11.525702] ath10k_pci 0001:01:00.0: Direct firmware load for ath10k/QCA9984/hw1.0/firmware-3.bin failed with error -2
[ 11.535596] ath10k_pci 0001:01:00.0: Falling back to user helper
[ 11.585103] firmware ath10k!QCA9984!hw1.0!firmware-3.bin: firmware_loading_store: map pages failed
[ 11.585220] ath10k_pci 0001:01:00.0: could not fetch firmware file 'ath10k/QCA9984/hw1.0/firmware-3.bin': -11
[ 11.593054] ath10k_pci 0001:01:00.0: Direct firmware load for ath10k/QCA9984/hw1.0/firmware-2.bin failed with error -2
[ 11.603016] ath10k_pci 0001:01:00.0: Falling back to user helper
[ 11.633533] firmware ath10k!QCA9984!hw1.0!firmware-2.bin: firmware_loading_store: map pages failed
[ 11.633647] ath10k_pci 0001:01:00.0: could not fetch firmware file 'ath10k/QCA9984/hw1.0/firmware-2.bin': -11
[ 11.641491] ath10k_pci 0001:01:00.0: could not fetch firmware files (-11)
[ 11.651443] ath10k_pci 0001:01:00.0: could not probe fw (-11)

You should not do both. The patch in my PR is meant to restore functionality with the current git head. There is no need to back out the wifi breaking patch.

Hm, thanks. Just your patch didn't do it. I will see if my git fu has betrayed me.

@hnyman this driver is from the Qualcomm SDK.
I'm starting to think that without that rounding to full degrees we are getting an incorrect temp, off by +- 0.5C.
If you look through the code, there is a difference between the code_to_degc function in tsens-ipq8064.c in my above commit and the one currently used in the kernel's tsens-8960.c (where it's called code_to_mdegc).

If you check temp calculations you'll see that at first both drivers get similar data
(adc_code * s->slope) + s->offset;

but then the ipq8064 version adds or subtracts 500 (depending on conditions) and only then divides by 1000. So the difference is 500 millicelsius all the time.

Edit: by the way, according to the code the master sensor is sensor0, but we can't pull it cleanly with the upstream driver. The upstream driver only lets us parse sensor5-10; I guess it's a difference in the sensor address range (the ipq sensor range seems to be a bit lower).

[quote="dissent1, post:288, topic:285"]
but then ipq8064 version adds or substracts 500 (depending on conditions) and only then divides by 1000. So the difference is 500 millicelsius all the time.
[/quote]I have not checked the source yet, but that sounds like a quite logical (and expected) correction against "round-down" in integer division calculations. It enables the millivalues 500-999 to round up.

55444 / 1000 = 55,
but 55666 / 1000 = 55 (wrong)

(55444+500) / 1000 = 55,
and (55666+500) / 1000 = 56 (right)

The logic is a bit different: if temp > 0 then + 500 all the time, if temp < 0 then - 500 all the time, so that's not a rounding thing
Edit: or maybe you are right and that has been the reason

[quote="dissent1, post:290, topic:285, full:true"]
There's a bit different logic: if temp > 0 then + 500 all the time, if temp < 0 then - 500 all the time, so that's not rounding thing Edit: or maybe you are right and that has been the reason
[/quote]Sounds like I am right. That rule matches perfectly the needed rounding logic to counter the "always round toward 0" of integer divisions.

Integer division always truncates (or "rounds down toward 0"), so a value like 55900 that you would like to see as 56, will be 55 unless you pad it before the division. The needed padding is divisor/2, in case of divisor 1000 the needed correction is 500. So 55900+500 = 56400 that divides to 56.
On the negative side the same:
-44333 / 1000 = -44
-44666 / 1000 = -44 (wrong)
(-44333-500) / 1000 = -44
(-44666-500) / 1000 = -45 (right)

The truncation nature of integer division is a sneaky thing that has caused trouble for many programmers, as it is easy to overlook. Nice to see that Qualcomm has got it right.
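
The padding rule worked through above can be written as a small C helper. This is a sketch for illustration, not the Qualcomm code: div_truncate() and div_round_nearest() are hypothetical names, and the divisor is assumed to be positive.

```c
/* Plain C integer division: the result is truncated toward zero. */
static int div_truncate(int value, int divisor)
{
	return value / divisor;
}

/* Round to the nearest integer by padding with divisor/2 before dividing.
 * The sign of the padding follows the sign of the value, so that the
 * truncation toward zero ends up on the nearest integer.
 * Assumes divisor > 0. */
static int div_round_nearest(int value, int divisor)
{
	int half = divisor / 2;

	if (value >= 0)
		return (value + half) / divisor;
	return (value - half) / divisor;
}
```

For the values from the post, div_truncate(55666, 1000) gives 55 and div_round_nearest(55666, 1000) gives 56; on the negative side, div_truncate(-44666, 1000) gives -44 and div_round_nearest(-44666, 1000) gives -45.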