Netgear R7800 exploration (IPQ8065, QCA9984)

rhyvu · March 7, 2018, 12:16am

I currently need IGMP snooping and IGMP proxy support on the R7800 for multicast, but research suggests the R7800 may have a multicast bug. Has this patch been applied upstream? If not, will it be at some point, or is something missing before it can be applied?

Patch can be found at https://bugs.lede-project.org/index.php?do=details&task_id=954

The discussion of the possible bug is in the thread "Multicast/switch/snooping/vlan problem. Bug?"

johnnysl · March 7, 2018, 9:18am

To me it seems there are two different hardware versions of the R7800 around.
I say this because I have mine running fine on an FTTH setup, where the TV signal is delivered via a second public VLAN while the TV box sits in the internal network. The IGMP proxy handles the stream neatly and it works perfectly.

About 6 months ago I recommended the R7800 to a colleague, and there we couldn't get it to work reliably. The video stream kept getting interrupted, as if the proxy didn't work. We used an identical build and even swapped devices. My device was OK, his wasn't. He ended up trading it in for a WRT3200.

And I hear more stories where IGMP didn't work well on the R7800.

fantom-x · March 7, 2018, 11:44am

That could truly be the case. I found the link below about antennas that have numbers etched on them (newer) versus just stickers (older). The very last post there also has a vague reference to 160MHz channel support being added in the newer revision.

https://community.netgear.com/t5/Nighthawk-WiFi-Routers/R7800-versions/td-p/1129374

rhyvu · March 7, 2018, 3:14pm

This is very interesting. Did you both have the same IPTV provider? It doesn't look good if the problem is different HW versions that aren't marked. @fantom-x replied with a link where someone mentioned the difference; which version do you have?

fantom-x · March 7, 2018, 4:41pm

In addition to IPTV issues, there seems to be a latency spike problem in the “newer” units. My unit, bought less than a year ago, has numbers etched on the antennas, so it is possibly a newer unit. It experiences significant latency spikes a few times a minute, and a few other people confirmed that they have the same issue. Some R7800 owners do not see the spikes.
Is there a way to get hw info without opening it up or having serial access?

avx · March 7, 2018, 7:34pm

I have the original R7800 (beta unit) and a retail one that Netgear sent recently for another test, and both work at HT160 with the Intel 9260ac in my Dell Inspiron 7577. So if any change did occur, it's not with HT160.

lesandie · March 7, 2018, 8:31pm

I see them too. At first I thought the culprits were my PLCs, but no, I get the same results connected directly to the router. I also have a new unit, and I'm running r6334. BTW another unit I have with a 17.04 snapshot has the same problems. My tests were wired, no WiFi.

fantom-x · March 7, 2018, 10:54pm

Thx for sharing this information. I was about to order another unit to see if mine is defective or not.

rhyvu · March 8, 2018, 2:33am

So I am not sure what the problem in question is, I have just gotten my R7800 and haven't had the opportunity to flash it to OpenWRT/LEDE, but if it is ping related I can tell you about my experience with the R7000 and Tomato.

When the R7000 came out, the Tomato developers added support for it and configured it exactly like the original Netgear firmware, but struggled with unstable ping. This was weird, since other routers using the same processor and WiFi chips (same platform) weren't exhibiting these issues on Tomato. After a long time trying to debug the issue and discouraging people from getting the R7000, they finally found a solution by disabling the virtual eth1 interface (http://www.linksysinfo.org/index.php?threads/tomato-for-arm-routers.69719/page-14#post-257798). Once that was done the R7000 worked great, and as far as I know nobody ever figured out why removing it had an impact or why the original Netgear firmware worked with the virtual eth1 added.

I don't know if that is helpful or not, but I really hope the multicast patch will be added and upstreamed. Let me know if I can help in any way.

lesandie · March 8, 2018, 8:22am

BTW I have irqbalance installed on both units, r6334 and r3498. I'll try to see whether irqbalance is a factor, but the first test with r3498 yielded the same results.

kyva1929 · March 8, 2018, 10:33am

Just want to add a data point -

I have the "newer" version of r7800 - the one with numbers etched on the antenna.

I tested with 4 concurrent ping sessions for ~15mins, and experienced no latency spike issues.

My R7800 is used in wireless client mode with masquerade, and most of the firewall is disabled.

I am using this build:
[ 0.000000] Linux version 4.9.82 (perus@ub1710) (gcc version 5.5.0 (OpenWrt GCC 5.5.0 r5953-d58c8f4029) ) #0 SMP Sat Mar 3 08:02:51 2018

@hnyman I am using your build
motd reports
OpenWrt SNAPSHOT, r6365-45fdb12258

dmesg 2nd line reports (gcc version 5.5.0 (OpenWrt GCC 5.5.0 r5953-d58c8f4029) ) #0 SMP Sat Mar 3 08:02:51 2018.

lesandie · March 8, 2018, 10:42am

Could that be the difference? Mine is used as the main router with NAT and firewall enabled.

I think I need to run more tests and try to isolate ...

hnyman · March 8, 2018, 10:52am

I think that @dissent1 got some weird results with irqbalance, so it might make sense to either adjust IRQs manually or run irqbalance in "oneshot mode" with option "-o", where irqbalance runs once and then exits.

fantom-x · March 8, 2018, 11:26am

I tried irqbalance and it made no difference. The only thing that did help to some degree was to isolate cpus: one for network interrupts and the other one for everything else.

lesandie · March 8, 2018, 12:10pm

Yep, with or without irqbalance the results are the same for me. One curious thing, though: if I ping the WAN interface from and to my public IP, there are no spikes, but if I ping internally, between my wired LAN clients, the spikes become noticeable. Latency can go from 0.8ms to 43ms.

About irqbalance and the oneshot option, nice point @hnyman. I've noticed that eth0 is assigned to core1 and eth1 to core2, wifi0 to core2 and wifi1 to core1. If I remember correctly, dissent pointed out that both eth interfaces should be assigned to the same core ... but I will test with this config to see if there is a noticeable performance increase, given the cache misses that @dissent1 pointed out could occur if irqbalance was always "on".

BTW if anybody wants to assign the IRQs manually, I've adapted @dissent1's script to the R7800 board. It can be executed after irqbalance --oneshot if the IRQs are not balanced as expected.

#!/bin/sh /etc/rc.common
# First start irqbalance with the --oneshot option
# Try to balance manually both eth to core2 and wifi0 to core2 if they are not balanced correctly
# Startup command for openwrt/lede
# /usr/sbin/irqbalance --oneshot --debug > /var/log/irqbalance.log

START=99

# $1 = interface name
# $2 = hex CPU bitmask written to smp_affinity (1 = first core, 2 = second core)
set_irq_affinity() {
	local name="$1"
	local val="$2"
	local irq

	case "$name" in
	wifi0)
		# IRQ number of the first qcom-pcie-msi line
		irq=`grep -E -m1 'qcom-pcie-msi' /proc/interrupts | cut -d: -f1 | tail -n1 | tr -d ' '`
		;;
	wifi1)
		# IRQ number of the second qcom-pcie-msi line
		irq=`grep -E -m2 'qcom-pcie-msi' /proc/interrupts | cut -d: -f1 | tail -n1 | tr -d ' '`
		;;
	eth0)
		# IRQ number of the last of the first three eth0 lines
		irq=`grep -E -m3 'eth0' /proc/interrupts | cut -d: -f1 | tail -n1 | tr -d ' '`
		;;
	eth1)
		# IRQ number of the last of the first three eth1 lines
		irq=`grep -E -m3 'eth1' /proc/interrupts | cut -d: -f1 | tail -n1 | tr -d ' '`
		;;
	*)
		# Any other name: IRQ number of the first matching line
		irq=`grep -m 1 "$name" /proc/interrupts | cut -d: -f1 | sed 's, *,,'`
		;;
	esac

	# Skip the write when no IRQ was found for this name
	if [ -z "$irq" ]; then
		echo "$name irq not found."
		return 1
	fi
	echo "$val" > "/proc/irq/$irq/smp_affinity"
}

start() {
	. /lib/functions.sh

	set_irq_affinity eth0 2
	set_irq_affinity eth1 2
	set_irq_affinity wifi0 2
}
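
For reference, the value written to smp_affinity is a hexadecimal CPU bitmask, so the 2 used above selects the second core (CPU1). Below is a minimal sketch of two helpers for working out a mask and reading back what is currently set. Both names, cpu_mask and irq_affinity, are hypothetical and not part of the script above.

```shell
#!/bin/sh
# cpu_mask is a hypothetical helper (not part of the original script).
# It prints the hex smp_affinity bitmask for one zero-based CPU index.
cpu_mask() {
	printf '%x\n' $((1 << $1))
}

# irq_affinity (also hypothetical) prints the current mask of an IRQ.
# $1 = IRQ number, $2 = optional base directory (default /proc/irq),
# which makes it possible to try the function against a copy.
irq_affinity() {
	cat "${2:-/proc/irq}/$1/smp_affinity"
}
```

For example, cpu_mask 1 prints 2, and irq_affinity followed by an IRQ number taken from /proc/interrupts shows whether the write took effect.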

escalade · March 8, 2018, 1:46pm

@luarane

Trying to compile your 4.14 branch for the C2600, it's failing to create the factory image with "os-image partition too big (more than 2097152 bytes): Undefined error: 0"

Any ideas?

Relevant output: https://pastebin.com/9gG97ins

fantom-x · March 8, 2018, 3:39pm

Do you want to try my recipe (Netgear R7800 exploration (IPQ8065, QCA9984) - #903 by fantom-x)? It has helped me drop the spikes from 100ms down to around 20ms.

fantom-x · March 9, 2018, 12:22am

@dissent1's idea to move the network IRQs to CPU1 was the right one, but it is not complete without making sure that nothing else can hog CPU1 and thereby delay network interrupt processing. Isolating CPU1 for the exclusive use of eth0, eth1, and wifi0, and also making collectd, nlbwmon, and uhttpd run nicer (with nice -n 19), makes things so much better. I have been testing the latency for the last several hours during the daily peak usage, and below is what I am getting now (my best ping is around 11ms and I ignore everything below 20ms). Not perfect, but much more usable. I have not tried VoIP yet, but online games got better (compared to 50..100 ms pings several times a minute before the change).

2018-03-08 18:59:14 PING 8.8.8.8 (8.8.8.8): 56 data bytes
2018-03-08 19:01:03 64 bytes from 8.8.8.8: icmp_seq=108 ttl=60 time=20.031 ms
2018-03-08 19:01:45 64 bytes from 8.8.8.8: icmp_seq=150 ttl=60 time=21.197 ms
2018-03-08 19:03:21 64 bytes from 8.8.8.8: icmp_seq=246 ttl=60 time=21.607 ms
2018-03-08 19:04:22 64 bytes from 8.8.8.8: icmp_seq=307 ttl=60 time=21.162 ms
2018-03-08 19:05:43 64 bytes from 8.8.8.8: icmp_seq=388 ttl=60 time=21.669 ms
2018-03-08 19:05:53 64 bytes from 8.8.8.8: icmp_seq=398 ttl=60 time=20.957 ms
2018-03-08 19:06:51 64 bytes from 8.8.8.8: icmp_seq=456 ttl=60 time=20.281 ms
2018-03-08 19:07:23 64 bytes from 8.8.8.8: icmp_seq=488 ttl=60 time=21.706 ms
2018-03-08 19:11:42 64 bytes from 8.8.8.8: icmp_seq=746 ttl=60 time=21.359 ms
2018-03-08 19:13:08 64 bytes from 8.8.8.8: icmp_seq=832 ttl=60 time=23.206 ms
2018-03-08 19:13:09 64 bytes from 8.8.8.8: icmp_seq=833 ttl=60 time=20.857 ms
2018-03-08 19:13:36 64 bytes from 8.8.8.8: icmp_seq=860 ttl=60 time=21.929 ms
2018-03-08 19:15:08 64 bytes from 8.8.8.8: icmp_seq=952 ttl=60 time=25.217 ms
2018-03-08 19:15:35 64 bytes from 8.8.8.8: icmp_seq=979 ttl=60 time=22.741 ms
2018-03-08 19:15:55 
2018-03-08 19:15:55 --- 8.8.8.8 ping statistics ---
2018-03-08 19:15:55 1000 packets transmitted, 1000 packets received, 0.0% packet loss
2018-03-08 19:15:55 round-trip min/avg/max/stddev = 10.824/11.858/25.217/1.462 ms
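
A filtered view like the one above can be approximated by dropping every ping reply below a threshold. This is only a sketch of one way to do it, not necessarily how the log above was produced (it does not add the timestamps). The function name filter_spikes and its 20 ms default are assumptions.

```shell
#!/bin/sh
# filter_spikes is a hypothetical helper: it reads ping output on stdin
# and prints only the reply lines whose round-trip time is at or above
# the threshold in ms ($1, default 20). Header and summary lines have no
# "time=" field and are dropped.
filter_spikes() {
	awk -v limit="${1:-20}" '
		/time=/ {
			# parts[2] starts with the RTT, e.g. "21.197 ms"
			split($0, parts, "time=")
			if (parts[2] + 0 >= limit + 0) print
		}'
}
```

Usage would be along the lines of: ping -c 1000 8.8.8.8 | filter_spikes 20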

lesandie · March 9, 2018, 10:20am

Will try this new approach and rebalance IRQs with a manual script.

luaraneda · March 9, 2018, 12:11pm

It's the same problem I've observed for the other boards (d7800, r7500, r7500v2, r7800, vr2600v).
I missed yours because the KERNEL_SIZE variable is not declared for your board, and the tplink-safeloader utility does the size check and throws the error (it's in your log).
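
The failing check boils down to comparing the size of the kernel image with the partition limit from the error message (2097152 bytes). Here is a rough sketch for checking an image by hand before the factory image step. fits_partition is a hypothetical helper, not part of tplink-safeloader, and the image path is whatever your build produces.

```shell
#!/bin/sh
# fits_partition is a hypothetical helper, not part of tplink-safeloader.
# $1 = image file, $2 = limit in bytes (default 2097152, the value from
# the error message). Returns 0 if the file fits, 1 if it is too big,
# 2 if the file does not exist.
fits_partition() {
	[ -f "$1" ] || return 2
	# Arithmetic expansion strips any padding that wc may print
	size=$(($(wc -c < "$1")))
	limit="${2:-2097152}"
	if [ "$size" -le "$limit" ]; then
		echo "ok: $size <= $limit bytes"
	else
		echo "too big: $size > $limit bytes"
		return 1
	fi
}
```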

Let's hope all those problems will go away once the target split is done (ipq40xx and ipq806x), which should reduce the size of the ipq806x kernel.

BTW, last night I rebased my ipq806x-k4.14 branch against OpenWrt master, and it is now running on my Asus RT-AC58U, but I think the size problem remains for the other boards.

I don't know if there is a timeline for the split of the target, or if it will be split before the branch of the 18.0x release, maybe @mkresin or @blogic could answer that.