Optimized build for the D-Link DIR-860L

@Bartvz @Mushoz I am going to cherry-pick this commit myself as well and see what it gives here. I am curious whether it will improve anything (although @Axl_Mas's report doesn't instill much confidence).

Applied blogic's patches to both the current snapshot and 17.01, and tested with both fq_codel and cake.
After fiddling with the limits for a bit it looked like the bug was fixed, but sadly no. About 300 Mbit ingress/egress does the trick: after ~15-20 minutes of heavy downloading, crashes and reboots happen.

[ 945.720000] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 945.730000] 1-...: (21 GPs behind) idle=a22/0/0 softirq=25119/25120 fqs=1
[ 945.740000] (detected by 3, t=6004 jiffies, g=3848, c=3847, q=313)
[ 945.750000] Task dump for CPU 1:
[ 945.760000] swapper/1 R running 0 0 1 0x00100000
[ 945.770000] Stack : 00000000 87c4b180 000000dc ffffffff 000000c2 00000000 804db2a4 80490000
[ 945.770000] 8048874c 00000001 00000001 80488540 80488724 80490000 80490000 8000c0e0
[ 945.770000] 1100fc03 00000001 87c70000 87c71ec0 80490000 8000c410 1100fc03 00000001
[ 945.770000] 804db2a4 80490000 804db2a4 8005ed68 80490000 8001b2f8 1100fc03 00000000
[ 945.770000] 00000004 804884a0 000000a0 8001b300 c939c939 c939c939 c939c939 c939c939
[ 945.770000] ...
[ 945.840000] Call Trace:
[ 945.850000] [<8000be98>] __schedule+0x574/0x758
[ 945.860000] [<8000c0e0>] schedule+0x64/0x7c
[ 945.870000] [<8000c410>] schedule_preempt_disabled+0x10/0x1c
[ 945.880000] [<8005ed68>] cpu_startup_entry+0x11c/0x1b8
[ 945.890000] [<8001b300>] start_secondary+0x440/0x470
[ 945.900000]
[ 945.900000] rcu_sched kthread starved for 6019 jiffies! g3848 c3847 f0x0 s3 ->state=0x1
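For anyone comparing builds, a rough way to quantify this is to count the stall reports and note when the first one appeared. The following is only a sketch: the function names are made up for illustration, and it assumes the log lines look like the trace above (timestamp in square brackets followed by "INFO: rcu_sched detected stalls").

```shell
#!/bin/sh
# Hypothetical helpers; both read kernel log text from stdin.

# Print the number of RCU stall reports in the log.
count_rcu_stalls() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'rcu_sched detected stalls' || true
}

# Print the kernel timestamp (seconds since boot) of the first stall report.
first_stall_time() {
    sed -n 's/^\[ *\([0-9.]*\)\] INFO: rcu_sched detected stalls.*/\1/p' | head -n 1
}

# Example usage on the router:
#   dmesg | count_rcu_stalls
#   dmesg | first_stall_time
```

Reporting both numbers together with the SQM limits in use should make the reports in this thread easier to compare.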

@Axl_Mas & @Ntalton thank you for your reports! I haven't experienced any crashes myself yet, but I only have a 10/1 Mbit connection.
@Mushoz thanks for reporting it on the mailing list. Hopefully a new version of the patch will come soon. The current patch is a step in the right direction because it takes longer for the problem to manifest.

That's true. While I was trying things out yesterday, a limit of ~15 Mbit/s on trunk seemed more stable, although I can't be sure of that; I did not do enough testing.

@All, for me r4498 has been rock solid with an uptime of 2d 2h 30m 24s without stack traces or reboots with SQM QoS (cake) on.
@Ntalton it would be interesting to see at which download/upload speed the bug pops up when using SQM QoS. Did you do some more testing?

I've had at least one reboot with 17.01.2 and blogic's patch atm.

@Bartvz Was that with the aforementioned patch applied? Or without?

I was too busy to do a lot of testing, but from what I did test, crashes/traces happen regardless of limits. It did seem that the lower the limit, the less frequent the crashes and the longer the uptime, though the time between crashes was very inconsistent. Also, cake was more unstable than fq_codel in this setup: when it crashed with cake, the device would most often just suddenly reboot, with not much information in the kernel log. fq_codel, on the other hand, crashed more consistently, with a couple of stack traces in the kernel log before the eventual reboot.

With both patches applied. Currently at 3 days uptime without stack traces and/or reboots.

Interesting! Were the other stack traces the same as the one you posted earlier (INFO: rcu_sched detected stalls on CPUs/tasks)?

Just a quick update: nothing much, unfortunately. In LEDE head:

  • Kernel for ramips got bumped to 4.9 (but still there are CPU stalls)
  • mt76 got updated to the latest version (which fixes some mt7603 problems so I doubt it will impact us much).

Hopefully, more soon!

Side note: Qualcomm fast path has been ported by gwlim for several devices. This apparently improves hardware NAT, but I'm not finding information about it.

However, he said that since the DIR860B1 is a dual-core architecture, it will not be easy for him to port it to this device.

Has anyone tried 4.9? Are the reboots less frequent than with 4.4?

I tried it a while ago. It seemed slightly less frequent, but the issue was still there.

On 4.9 it's the same as on 4.4 regarding stack traces/reboots. I can upload a build if people want to test it themselves.

Why not? :innocent: If you have time to release an updated build, I would be very happy to test the new kernel 4.9.

OP updated with a new build.
Since we still suffer from stack traces and reboots while using SQM QoS, I included the BBR TCP congestion control algorithm and the fq packet scheduler to play around with.
For the people who want to read, here are some nice articles:

How to enable and thus use them? In an ssh session:

echo "net.core.default_qdisc=fq" >> /etc/sysctl.conf
echo "net.ipv4.tcp_congestion_control=bbr" >> /etc/sysctl.conf
tc qdisc add dev eth0 root fq
tc qdisc add dev eth0.2 root fq

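Please note that the tc commands are temporary: after a reboot the default qdiscs are back. The two echo lines only append to /etc/sysctl.conf; they are not applied to the running system until that file is reloaded (sysctl -p) or the router reboots. Happy testing!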
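One caveat with the echo lines: running them a second time appends duplicate entries to /etc/sysctl.conf. A minimal sketch of an idempotent alternative follows; set_sysctl_line is a hypothetical helper, not part of LEDE, and it assumes plain key=value lines in the file.

```shell
#!/bin/sh
# set_sysctl_line FILE KEY VALUE  (hypothetical helper)
# Make sure FILE ends up with exactly one "KEY=VALUE" line for KEY.
set_sysctl_line() {
    file=$1; key=$2; value=$3
    touch "$file"
    # Copy every line whose key (text before '=', whitespace stripped)
    # differs from KEY, then append the new assignment.
    awk -F= -v k="$key" '{ name = $1; gsub(/[ \t]/, "", name); if (name != k) print }' \
        "$file" > "$file.tmp"
    printf '%s=%s\n' "$key" "$value" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# Example usage on the router:
#   set_sysctl_line /etc/sysctl.conf net.core.default_qdisc fq
#   set_sysctl_line /etc/sysctl.conf net.ipv4.tcp_congestion_control bbr
#   sysctl -p /etc/sysctl.conf   # load the file into the running kernel
```

The tc qdisc commands still have to be run separately, since sysctl.conf only covers the kernel parameters.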

Please note that the fq qdisc is really intended for TCP endpoints. On a router, Eric Dumazet, one of the principal developers of both fq and fq_codel, still recommends fq_codel over fq (if I remember correctly). Also, BBR as I understand it needs to run on a connection's endpoint, so again it might not really affect most of a router's traffic (unless you all hit issues with torrents served by your routers :wink: ).
One more thing: the recommended combination of fq and BBR is spot-on; as far as I know this is the only way to use BBR on Linux ATM.

Best Regards

Thank you very much for these interesting builds, Bart! Hopefully I will have time on Sunday to do some testing. Has anyone tried these builds with BBR enabled? Does it help with bufferbloat?

Just a quick comment on the Cake SQM issues:

For a long while, I've been on 17.01.2 stable release for this router. On that, with my ISP speeds (118/12 nominal, set to 115/11 in Cake), I've never managed to crash the router. This build was still on the 4.4 kernel.

However, today I tried the latest trunk nightly, which has been switched to the 4.9.37 kernel. This crashed my router within 10 minutes, not even under a very high workload. Same issues as previously mentioned here: the router became inaccessible and eventually lost access to the internet.
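For reference, shaper settings like the 115/11 Mbit above could be scripted roughly as follows when reproducing the crash on different builds. This is a sketch under assumptions: it assumes the sqm-scripts UCI layout (first queue section, download/upload in kbit/s), sqm_uci_cmds is a made-up name, and it only prints the commands so they can be checked before running.

```shell
#!/bin/sh
# sqm_uci_cmds DOWN_MBIT UP_MBIT  (hypothetical helper)
# Print uci commands that set cake with the given rates.
# Assumption: sqm-scripts stores rates in kbit/s in the first queue section.
sqm_uci_cmds() {
    down_kbit=$(( $1 * 1000 ))
    up_kbit=$(( $2 * 1000 ))
    echo "uci set sqm.@queue[0].qdisc='cake'"
    echo "uci set sqm.@queue[0].download='${down_kbit}'"
    echo "uci set sqm.@queue[0].upload='${up_kbit}'"
    echo "uci commit sqm"
}

# Example usage: review the output, then pipe it to sh and restart SQM.
#   sqm_uci_cmds 115 11
#   sqm_uci_cmds 115 11 | sh && /etc/init.d/sqm restart
```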

On 17.01.2 SQM crashed the router very easily at higher speeds. You are indeed correct that it was a bit more stable at lower speeds. I just emailed John with some questions about his patch that was trying to fix these stall issues. He just replied that he would release another patch later today. Let's hope this will fix it :slight_smile:
