Netgear R7800 exploration (IPQ8065, QCA9984)

I am planning to push the ipq40xx split patch next week. This will move ipq40xx to v4.14.
I have no plans to invest any time in ipq806x in the near future, and it will remain on v4.9 for the time being.

Hi, rookie here. My ISP uses VLAN 500. I'm currently using a single VLAN (VLAN 500) with CPU(0), CPU(1), and LAN 1 2 3 4 all untagged and WAN tagged. I found my ping time is about 2 ms better than with the original two VLANs, which put CPU(1) to LAN 4 (all untagged) in VLAN 1 and CPU(0) untagged plus WAN tagged in VLAN 500.

Will it cause any problems? And I wonder why it was split into 2 VLANs in the first place.

@fantom-x
Good job on solving the latency issue!
Would you mind sharing your "recipe" on how to get the latency-spikes down?
So us noobs can benefit as well? :blush:

Well, the key is to compile your own firmware with the custom kernel boot parameter isolcpus=1. That takes CPU1 out of the pool, and the scheduler no longer uses it. Then use the script a few posts above to move the network IRQs to CPU1. I also lower the priority of things like collectd, nlbwmon, uhttpd, etc., which do not have any business running with default priority.
In the end, CPU1 is used exclusively for the network interrupts while CPU0 is running everything else. I have not seen either overloaded yet.
The most difficult step is compiling your firmware. I can share mine if that helps.

Thank you for your summary! I know how to create my own image, thanks to @hnyman and @escalade. I also know what the set_cpu_affinity script does.
Unfortunately, setting a custom kernel boot parameter and lowering the priority of collectd, nlbwmon, and uhttpd are completely new to me. Can you please provide some details on how to do this?

That will be complicated if you're new to building firmware. The first thing I'd try is to start with this thread.

hnyman made it easy to compile your own image, but you also need to set a parameter (isolcpus) with the make kernel_menuconfig command.

To change the priority you need to use the nice command. I'd advise changing the service scripts /etc/init.d/uhttpd, /etc/init.d/collectd, and /etc/init.d/nlbwmon.

From memory:

make kernel_menuconfig

Then look for Boot Parameters and then there is an item to set kernel parameters. Type isolcpus=1 there and rebuild.

Once you install your image, run cat /proc/cmdline to see this new parameter.
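
For anyone who wants to script that check, here is a minimal sketch. The function name isolated_cpus is made up for illustration; it just pulls the isolcpus= value out of a kernel command line and fails when the parameter is missing.

```shell
#!/bin/sh
# Print the value of isolcpus= from a kernel command line.
# Returns non-zero when the parameter is absent.
# $1: file holding the command line (defaults to /proc/cmdline).
# NOTE: isolated_cpus is a hypothetical helper name, not part of OpenWrt.
isolated_cpus() {
    cmdline_file="${1:-/proc/cmdline}"
    # One parameter per line, keep only the value after isolcpus=,
    # and let "grep ." set the exit status (fails on empty output).
    tr ' ' '\n' < "$cmdline_file" | sed -n 's/^isolcpus=//p' | grep .
}
```

On the router you would call it as: isolated_cpus || echo "isolcpus is not set". With the build described above it should print 1.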

Deal with the IRQs now.

Then you need to edit some startup files under /etc/init.d/ (collectd, nlbwmon, uhttpd) by adding nice -n 19 to the line that starts the process. If you get into trouble there, I will post more details in a few hours once I get to my computer.

Great, thanks for the detailed explanation! With this extra info I will manage to create a functioning build.

Some more details, as promised, are below. The procedure is manual, so extra attention is warranted. Any screw-ups are not my fault.

  1. Start with @hnyman's build and only continue once you can build and deploy it.

  2. Add a custom kernel boot parameter either via make kernel_menuconfig / Boot options / Default kernel command string : isolcpus=1 or by modifying this config file by adding one line:

grep isolcpus target/linux/ipq806x/config-4.9 
CONFIG_CMDLINE="isolcpus=1"
  3. Build and deploy the image, then check that the new config is active:

cat /proc/cmdline
isolcpus=1

  4. Move wifi0, eth0, and eth1 to CPU1 and verify that it actually worked. I leave wifi1 on CPU0 and I do not care much about the 2.4GHz clients. Verify that the numbers in the CPU1 column are increasing.
cat /proc/interrupts | egrep "eth|qcom-pcie-msi|CPU0"
           CPU0       CPU1       
 97:       8296   57872293     GIC-0  67 Edge      qcom-pcie-msi
 98:   16322413          0     GIC-0  89 Edge      qcom-pcie-msi
100:       1069   26137176     GIC-0 255 Level     eth0
101:        511    9101453     GIC-0 258 Level     eth1
  5. Add nice -n 19 to the services that should be running in the background. They will be running on CPU0, but they have absolutely no business running with default priority.
grep nice /etc/init.d/*
/etc/init.d/collectd:	procd_set_param command nice -n 19 /usr/sbin/collectd -f
/etc/init.d/nlbwmon:	procd_set_param command nice -n 19 "$PROG"
/etc/init.d/uhttpd:	procd_set_param command nice -n 19 "$UHTTPD_BIN" -f
  6. Restart these services or just reboot. Verify that the change took effect (look for SN, where N means nice, or use htop to see that they are running nicer):
ps -w | egrep "collectd|nlbwmon|uhttpd"
 1652 root      3256 SN   /usr/sbin/uhttpd
 1892 root      4104 SN   /usr/sbin/collectd -f
 2039 root      1460 SN   /usr/sbin/nlbwmon
  7. Stop all services that you do not use. Here is what I do, but you may have a use for some of them.
/etc/init.d/etherwake disable
/etc/init.d/etherwake stop
/etc/init.d/miniupnpd disable
/etc/init.d/miniupnpd stop
/etc/init.d/odhcpd disable
/etc/init.d/odhcpd stop
/etc/init.d/vsftpd disable
/etc/init.d/vsftpd stop
  8. Reboot just in case.

  9. Share your results. I, for one, am curious whether this works for others.

  10. For the super adventurous among us, run the following lines and add them to /etc/rc.local. This CPU takes 100 us to switch frequencies, which is quite long.

echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo "performance" > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
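
The two governor lines from the last step can also be written as a loop, so the same snippet covers however many cores the kernel exposes. This is only a sketch: set_governor is a made-up helper name, and the optional second argument exists purely so the function can be tried against a scratch directory instead of the real sysfs tree.

```shell
#!/bin/sh
# Write one governor name to every CPU that exposes cpufreq.
# $1: governor name (e.g. performance)
# $2: sysfs CPU directory (defaults to /sys/devices/system/cpu)
# NOTE: set_governor is a hypothetical helper name.
set_governor() {
    governor="$1"
    cpu_dir="${2:-/sys/devices/system/cpu}"
    for f in "$cpu_dir"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip the unexpanded pattern when no CPU matches.
        [ -e "$f" ] || continue
        echo "$governor" > "$f"
    done
}
```

In /etc/rc.local this would be a single call: set_governor performance. The effect should be the same as the two echo lines above.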

nice catch! i'll test it thanks

I managed to make a little more progress on the NSS cores. I think I've successfully activated one of the two NSS cores, but it doesn't seem to improve throughput or CPU utilisation. Output of /proc/interrupts is shown below:

           CPU0       CPU1       
 16:      31840     175695       GIC  18 Edge      gp_timer
 18:         33          0       GIC  51 Edge      qcom_rpm_ack
 19:          0          0       GIC  53 Edge      qcom_rpm_err
 20:          0          0       GIC  54 Edge      qcom_rpm_wakeup
 26:          0          0       GIC 241 Edge      29000000.sata
 27:         26     135106       GIC  67 Edge      qcom-pcie-msi
 28:         23     148231       GIC  89 Edge      qcom-pcie-msi
 29:     175596          0       GIC 202 Edge      adm_dma
 30:     257827          0       GIC 255 Level   
 31:          0     553977       GIC 258 Level   
 32:          0          0       GIC 130 Level     bam_dma
 33:          0          0       GIC 128 Level     bam_dma
 34:     760205          0       GIC 245 Level     nss
 41:          2          0   msmgpio   6 Edge      gpio-keys
 89:          2          0   msmgpio  54 Edge      gpio-keys
100:          2          0   msmgpio  65 Edge      gpio-keys
104:          0          0   PCI-MSI   0 Edge      aerdrv
105:         26     135106   PCI-MSI   1 Edge      ath10k_pci
137:          0          0   PCI-MSI   0 Edge      aerdrv
138:         23     148231   PCI-MSI   1 Edge      ath10k_pci
170:         12          0       GIC 184 Level     msm_serial0
171:          2          0       GIC 187 Level     1a280000.spi
172:          0          0       GIC 142 Level     xhci-hcd:usb1
173:          0          0       GIC 237 Level     xhci-hcd:usb3
IPI0:          0          0  CPU wakeup interrupts
IPI1:          0          0  Timer broadcast interrupts
IPI2:       5618       7556  Rescheduling interrupts
IPI3:          0          0  Function call interrupts
IPI4:       6190      40574  Single function call interrupts
IPI5:          0          0  CPU stop interrupts
IPI6:          2          0  IRQ work interrupts
IPI7:          0          0  completion interrupts
Err:          0

IRQ 30 & 31 used to be for the qca-nss-gmac driver, but once I loaded the qca-nss-drv driver, both IRQs stopped incrementing but a new one (IRQ34) appeared.
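
A quick way to see which IRQs are incrementing (or have newly appeared) is to compare two snapshots of /proc/interrupts taken a few seconds apart. The sketch below is one way to do that; changed_irqs is a made-up name, and it simply compares whole lines, which assumes the table layout stays stable between the two snapshots.

```shell
#!/bin/sh
# Print the labels of IRQs whose line changed between two snapshots
# of /proc/interrupts, plus any IRQ that only exists in the second one.
# $1: earlier snapshot file, $2: later snapshot file
# NOTE: changed_irqs is a hypothetical helper name.
changed_irqs() {
    awk '
        # First file: remember the full line for each IRQ label.
        NR == FNR { before[$1] = $0; next }
        # Second file: report labels that are new or whose line differs.
        !($1 in before) || before[$1] != $0 {
            label = $1
            sub(/:$/, "", label)
            print label
        }
    ' "$1" "$2"
}
```

Typical use: cat /proc/interrupts > /tmp/irq.before; sleep 10; cat /proc/interrupts > /tmp/irq.after; changed_irqs /tmp/irq.before /tmp/irq.after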

I guess the next step will probably be to get the userland program running, so that it can start to delegate traffic to the NSS cores.

The qca-nss-drv driver needed additional device tree and Linux kernel clock driver support, which I adapted from Chromium OS for kernel 3.14. Some portions of the clock driver had to be disabled because they could not be compiled. The NSS core driver support still needs work. I only found the device tree config for the first NSS core.

Does anyone know how to extract device tree information from the existing Netgear firmware? I managed to extract the contents, but the compiled device tree details do not seem to be available in the firmware. The Netgear firmware probably contains the device tree details for the second NSS core.

If anyone is interested in trying out an R7800 build with one NSS core activated, please let me know.

I'll check the source code into GitHub in a while.

Hi there,

I tried to set the affinity of the eth interrupts manually (@hnyman builds 6394 and 6420), with default_smp_affinity=1 and the ethernet IRQs set to 2, to lessen spikes. It works fine for some time, but after a few minutes the affinities are garbled for IRQ 100 and 101. Which process is resetting the IRQ affinity? I haven't found a clue yet.

BTW: With the settings above (I don't care about wifi), the latency to my preferred ping target drops from 21-22 ms to 18-19 ms, and spikes drop from 60-80 ms (max 100 ms) to about 30 ms. Spikes are also less frequent.
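
For reference, the affinity values mentioned here are hexadecimal CPU bitmasks: 1 selects CPU0 and 2 selects CPU1. Below is a small sketch of setting them by hand. The helper names cpu_mask and pin_irq are made up for illustration, and the optional third argument of pin_irq only exists so it can be pointed at a scratch directory.

```shell
#!/bin/sh
# Print the hexadecimal affinity mask that selects a single CPU
# (CPU0 -> 1, CPU1 -> 2).
# NOTE: cpu_mask and pin_irq are hypothetical helper names.
cpu_mask() {
    printf '%x\n' $((1 << $1))
}

# Pin one IRQ to one CPU by writing the mask to its smp_affinity file.
# $1: IRQ number, $2: CPU number
# $3: procfs IRQ directory (defaults to /proc/irq)
pin_irq() {
    irq_dir="${3:-/proc/irq}"
    cpu_mask "$2" > "$irq_dir/$1/smp_affinity"
}
```

With the IRQ numbers from this post, pin_irq 100 1 and pin_irq 101 1 would write 2 to the smp_affinity files of both ethernet IRQs. Anything else that rewrites those files afterwards will undo this.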

Are you running irqbalance? Run the command below to check:
ps -w | grep irqbalance
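
Note that a plain grep irqbalance can match its own command line in the ps output. A slightly more robust sketch is below; irqbalance_running is a made-up name, and it reads the process list from stdin so it can be fed from ps -w.

```shell
#!/bin/sh
# Succeed if the process list on stdin contains irqbalance.
# The [i] bracket stops grep from matching its own command line,
# because the literal text "[i]rqbalance" does not contain "irqbalance".
# NOTE: irqbalance_running is a hypothetical helper name.
irqbalance_running() {
    grep -q '[i]rqbalance'
}
```

Usage: ps -w | irqbalance_running && echo "irqbalance is active". If it is, stopping and disabling it the same way as the other services above should keep it from overriding manual affinities, assuming the package installed an /etc/init.d/irqbalance script on your build.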

Sh... the bloody obvious! Shame on me!
Yes, irqbalance was still active. A leftover from some trials several months ago.

Thanks!

There is some source for the NSS driver in U-Boot, if you want it.

Sounds like maybe this setup fantom-x found should be the default for this router.

There has not been a single independent confirmation that this setup is working for anyone else but me yet.

@fantom-x
I've now made my own build.
While running the command cat /proc/interrupts | egrep "eth|qcom-pcie-msi|CPU0", I noticed that the numbers in the CPU1 column were 0 and stayed 0.
I noticed there is a mistake in your set_cpu_affinity script: at eth0) and eth1) in the text it says: $irq_wifi1.
After changing these bits of text to $irq_eth0 respectively $irq_eth1, it seems to work.
More info after some testing.
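
A copy-paste slip like that is easier to avoid when the IRQ number is looked up by device name in one place instead of in one variable per interface. The sketch below is not the set_cpu_affinity script from this thread, just an illustration of the idea; irq_for_name is a made-up name.

```shell
#!/bin/sh
# Print the IRQ number(s) whose last column matches a device name exactly.
# $1: device name as shown in the last column (e.g. eth0)
# $2: interrupts table (defaults to /proc/interrupts)
# NOTE: irq_for_name is a hypothetical helper name.
irq_for_name() {
    awk -v name="$1" '
        $NF == name { sub(/:$/, "", $1); print $1 }
    ' "${2:-/proc/interrupts}"
}
```

Usage, moving eth0 and eth1 to CPU1 (mask 2): for dev in eth0 eth1; do for irq in $(irq_for_name "$dev"); do echo 2 > /proc/irq/$irq/smp_affinity; done; done. Both wifi interrupts show up under the same name (qcom-pcie-msi) in the table above, so for those the lookup returns two numbers and you still have to choose between them yourself.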

No, I am not the author of that script.

You can use this version of the script: Netgear R7800 exploration (IPQ8065, QCA9984)

UPDATE: Oh, I see now. I have decided to try that script and you are correct, there is an issue in that script just as you described.