Optimized build for the D-Link DIR-860L

Have you guys ruled out CPU overheating? (Just a guess, since SQM puts a long-term load on the CPU.)
Third person here compiling with different settings. :smiley:

Yes, I ruled that out. I ran 3 different SSH sessions with the command: cat /dev/urandom | gzip > /dev/null

I used a fourth SSH session to run top, which showed a constant 100% CPU usage. For the heck of it, I even ran long speedtests on dslreports simultaneously. All was fine. No crash at all.
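For anyone who wants to repeat that load test without juggling several SSH sessions, here is a minimal sketch. It assumes gzip, sleep and kill are available on the router (busybox normally provides them). burn_cpu is just a name made up for this example, not an existing tool.

```shell
#!/bin/sh
# Sketch: start N gzip loads in the background, let them run, then stop them.
# burn_cpu is a hypothetical helper name.
burn_cpu() {
    workers="$1"    # number of parallel gzip loads
    seconds="$2"    # how long to keep them running
    pids=""
    i=0
    while [ "$i" -lt "$workers" ]; do
        # same load as the command in the post, one background job per worker
        gzip -c < /dev/urandom > /dev/null &
        pids="$pids $!"
        i=$((i + 1))
    done
    sleep "$seconds"
    # stop all workers (word splitting of $pids is intended here)
    kill $pids 2>/dev/null
    wait
    echo "ran $workers workers for ${seconds}s"
}

# Example: 3 workers for 10 minutes, watch with top in another session
# burn_cpu 3 600
```

Running top in a second session while this runs should show the same kind of sustained load as the manual test.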

I won't be able to try a build with only SMT disabled for today. Will have to look at that tomorrow. Have you had a look already by any chance?

It's building :wink:

I also had the option to set the max number of processors. An integer from 2 to 256 was accepted. Do you think this could have anything to do with the boot failures with SMP disabled? Since you effectively have 1 core / 1 thread in that case, so even 2 cores would be too high. It was set to 4 by default. Maybe this should be set to 2 if you run SMT disabled, since you will have 2 cores / 2 threads? Or does this value not really matter, as long as it is higher than the actual number of cores, since it is a max value? I'm just grasping at straws here :yum:

Good luck with the build btw! Really curious about the result :slight_smile:

Edit: Not sure if I'll have time to test it today, since the girlfriend doesn't like me messing with our network :wink: but the image without SMT is currently compiling for me as well

I had the same train of thought :wink: I just left everything default (as per my build config) except for SMT. The build boots, nothing out of the ordinary in both the kernel and system logs. Running: "cat /proc/cpuinfo" shows 4 cpu cores.
For the people brave enough to test (there be dragons!), you can download the build here.
Now running some dslreport speedtests to see if I can make it crash.

Edit: 5-6 speedtests + a few HD YouTube videos later, no crashes or stack traces in either the kernel or the system logs. If nothing crops up after a couple of days, I think we can conclude it is indeed SMT that causes the problems.

Curious if setting the max number of processors to 2 will be able to fix it. Do you notice anywhere that SMT is disabled? 2 ksoftirqd threads instead of 4 maybe?

Edit: Just noticed your edit. That is very good news indeed! Let's hope this bug has been finally pinpointed. That will make fixing it a lot easier.

I had some time on my hands and compiled my own config/master with SMT disabled; it seems to have fixed the bug for me.
Strangely enough, I still get 4 processors in /proc/cpuinfo. Kernel bug (bugreport)?

processor               : 3
cpu model               : MIPS 1004Kc V2.15
BogoMIPS                : 586.13
wait instruction        : yes
microsecond timers      : yes
tlb_entries             : 32
extra interrupt vector  : yes
hardware watchpoint     : yes, count: 4, address/irw mask: [0x0ffc, 0x0ffc, 0x0ffb, 0x0ffb]
isa                     : mips1 mips2 mips32r1 mips32r2
ASEs implemented        : mips16 dsp mt
shadow register sets    : 1
kscratch registers      : 0
package                 : 0
core                    : 1
VCED exceptions         : not available
VCEI exceptions         : not available
VPE                     : 1

I think that that's normal because SMP is not disabled. Therefore the Linux kernel sees 4 CPU cores.
SMT != SMP. This article gives a quick and short explanation of the differences. This one if you want to know more.
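A rough way to compare logical CPUs with physical cores is to count the fields in /proc/cpuinfo. This sketch assumes the file has the "processor" and "core" fields seen in the paste above. cpu_topology is a made-up helper name.

```shell
#!/bin/sh
# Sketch: count logical CPUs ("processor" entries) and distinct "core" values.
# cpu_topology is a hypothetical helper; field names are taken from the
# MIPS 1004Kc /proc/cpuinfo paste and may differ on other platforms.
cpu_topology() {
    # $1: path to a cpuinfo-style file (defaults to /proc/cpuinfo)
    awk -F: '
        /^processor/ { cpus++ }
        /^core/      { gsub(/[ \t]/, "", $2); cores[$2] = 1 }
        END {
            n = 0
            for (c in cores) n++
            printf "logical=%d cores=%d\n", cpus, n
        }' "${1:-/proc/cpuinfo}"
}

# Example on the router:
# cpu_topology
```

If SMT were really off, I'd expect the two numbers to be equal. With SMT on, there should be more logical CPUs than cores.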

19 hours and still going strong :slight_smile:

Edit: after some more reading I am a bit confused. We should see only 2 cores but we really see 4 cores :confused:

Edit edit: disabling SMT does not fix our issues. I just had a stack trace :cry:. Back to the drawing board...
Kernel log:

[76720.940000] ------------[ cut here ]------------
[76720.950000] WARNING: CPU: 0 PID: 2988 at net/core/skbuff.c:4196 skb_try_coalesce+0x228/0x35c()
[76720.970000] Modules linked in: pppoe ppp_async pppox ppp_generic nf_conntrack_ipv6 iptable_nat ipt_REJECT ipt_MASQUERADE xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TEE xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CLASSIFY slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_dup_ipv6 nf_dup_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache iptable_mangle iptable_filter ipt_ECN ip_tables crc_ccitt sch_cake nf_conntrack act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_tbf sch_htb sch_hfsc sch_ingress mt76x2e mt7603e ledtrig_usbport mt76 mac80211 cfg80211 compat xt_set ip_set_list_set ip_set_hash_netiface ip_set_hash_netport ip_set_hash_netnet ip_set_hash_net ip_set_hash_netportnet ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables ifb leds_gpio xhci_mtk xhci_plat_hcd xhci_pci xhci_hcd gpio_button_hotplug usbcore nls_base usb_common
[76721.200000] CPU: 0 PID: 2988 Comm: dropbear Not tainted 4.4.61 #0
[76721.200000] Stack : 00000000 00000000 804c6862 00000035 00000000 00000000 80470000 804e0000
[76721.200000] 87e1606c 80465c83 803e351c 00000000 00000bac 804c367c 854a9cf7 00000010
[76721.200000] 872613f8 8006323c 80470000 804e0000 8046a168 8046a16c 803e8150 854a9ba4
[76721.200000] 00000003 80060ff8 854a9cf7 00000010 872613f8 00000125 00000000 004a9ba4
[76721.200000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[76721.200000] ...
[76721.200000] Call Trace:
[76721.200000] [<80016718>] show_stack+0x6c/0x88
[76721.200000] [<801b5624>] dump_stack+0x8c/0xc0
[76721.200000] [<8002b954>] warn_slowpath_common+0xa0/0xd0
[76721.200000] [<8002ba0c>] warn_slowpath_null+0x18/0x24
[76721.200000] [<80290678>] skb_try_coalesce+0x228/0x35c
[76721.200000] [<802eba9c>] tcp_try_coalesce+0x70/0xd4
[76721.200000]
[76721.350000] ---[ end trace 494c6962df42b073 ]---

System log:

Sun May 7 21:01:51 2017 kern.warn kernel: [76720.940000] ------------[ cut here ]------------
Sun May 7 21:01:51 2017 kern.warn kernel: [76720.950000] WARNING: CPU: 0 PID: 2988 at net/core/skbuff.c:4196 skb_try_coalesce+0x228/0x35c()
Sun May 7 21:01:51 2017 kern.warn kernel: [76720.970000] Modules linked in: pppoe ppp_async pppox ppp_generic nf_conntrack_ipv6 iptable_nat ipt_REJECT ipt_MASQUERADE xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TEE xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CLASSIFY slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv
Sun May 7 21:01:51 2017 kern.warn kernel: [76721.200000] CPU: 0 PID: 2988 Comm: dropbear Not tainted 4.4.61 #0
Sun May 7 21:01:51 2017 kern.warn kernel: [76721.200000] Stack : 00000000 00000000 804c6862 00000035 00000000 00000000 80470000 804e0000
Sun May 7 21:01:51 2017 kern.warn kernel: [76721.200000] 87e1606c 80465c83 803e351c 00000000 00000bac 804c367c 854a9cf7 00000010
Sun May 7 21:01:51 2017 kern.warn kernel: [76721.200000] 872613f8 8006323c 80470000 804e0000 8046a168 8046a16c 803e8150 854a9ba4
Sun May 7 21:01:51 2017 kern.warn kernel: [76721.200000] 00000003 80060ff8 854a9cf7 00000010 872613f8 00000125 00000000 004a9ba4
Sun May 7 21:01:51 2017 kern.warn kernel: [76721.200000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Sun May 7 21:01:51 2017 kern.warn kernel: [76721.200000] ...
Sun May 7 21:01:51 2017 kern.warn kernel: [76721.200000] Call Trace:
Sun May 7 21:01:51 2017 kern.warn kernel: [76721.200000] [<80016718>] show_stack+0x6c/0x88
Sun May 7 21:01:51 2017 kern.warn kernel: [76721.200000] [<801b5624>] dump_stack+0x8c/0xc0
Sun May 7 21:01:51 2017 kern.warn kernel: [76721.200000] [<8002b954>] warn_slowpath_common+0xa0/0xd0
Sun May 7 21:01:51 2017 kern.warn kernel: [76721.200000] [<8002ba0c>] warn_slowpath_null+0x18/0x24
Sun May 7 21:01:51 2017 kern.warn kernel: [76721.200000] [<80290678>] skb_try_coalesce+0x228/0x35c
Sun May 7 21:01:51 2017 kern.warn kernel: [76721.200000] [<802eba9c>] tcp_try_coalesce+0x70/0xd4
Sun May 7 21:01:51 2017 kern.warn kernel: [76721.200000]
Sun May 7 21:01:51 2017 kern.warn kernel: [76721.350000] ---[ end trace 494c6962df42b073 ]---

Those stack traces look way different than the ones I was getting. Maybe we are looking at different issues? You also mentioned that the 17.01 branch was not stable for you, right? I've been using 17.01.0 and 17.01.1 with no issues whatsoever, except the SQM crashes and txpower issues. No stack traces when not running SQM.

I will check your build now and see how SQM behaves for me on a 500 Mbit connection.

Have you tried setting the max number of processors to 2 yet, by the way? Maybe this will help? Since we are still seeing 4 processors with cat /proc/cpuinfo.

I wanted to try the build with SMT disabled with SQM to test as well, but the build was already too old. I couldn't install SQM due to dependency issues. I tried a force install, but that bricked the device. I am now going to compile 17.01.1 with SMT disabled. Hopefully I can provide some extra information as well.

My build has SMT disabled and SQM :wink: At my first attempt at disabling SMT, I also set the number of CPU cores to 2. Have not tried it yet but I think it won't solve the reboot problem. I can try it however :wink:
About those stack traces: those are the ones I have been battling for the last couple of weeks. They appear quite irregularly, at least once every 24 hours, but I think it has something to do with the open source mt76 driver.
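To keep track of how often these traces show up, something like the following could help. It is only a sketch: it assumes the log has been saved to a file first (for example with dmesg or logread) and that the warning lines look like the ones pasted above. warn_summary is a made-up name.

```shell
#!/bin/sh
# Sketch: print "count location" for every distinct kernel WARNING source
# found in a saved log file. warn_summary is a hypothetical helper name.
warn_summary() {
    # $1: file containing kernel log text
    grep 'WARNING: CPU:' "$1" \
        | sed 's/.* at \([^ ]*\) .*/\1/' \
        | sort | uniq -c | awk '{ print $1, $2 }'
}

# Example on the router:
# dmesg > /tmp/kernel.log
# warn_summary /tmp/kernel.log
```

Comparing the reported source locations between routers would also show quickly whether we are all hitting the same warning or different ones.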

Ah, you included it in your build. Smart. I guess you have to, since SQM needs to be compiled with the SMT-less kernel in mind as well right? I just installed my own 17.01.1 build with SMT disabled, and installing SQM fails with the same message as before. Will try out your build with SQM included next :slight_smile:

Sometimes the buildbot is a couple of commits behind (commits get committed after it starts) and thus some packages are not installable. I also remember one time when packages could not be installed because there was no disk space left on the buildbot for newer versions.

Up and running. Now going to try some speedtests with SQM enabled.

Edit: @Bartvz very weird. The top command does nothing on your build. Do you have the same issue? Maybe because of the fact the system expects 4 threads (since that is what /proc/cpuinfo is showing as well), but there are only 2?

Edit 2: The DIR-860L was bugged. I couldn't reach it via LuCI, nor did any of the commands in my SSH session work. Strangely enough, it didn't crash and internet kept working. SQM didn't result in an instant crash with speedtests like before, so there is definitely an improvement there.

Edit 3: Tried out cake instead of fq_codel and it crashed again. Back to the drawing board unfortunately :frowning:

Top works fine for me, so it's very strange that it doesn't for you. Maybe because it was "bugged"?
What's also strange is that for some people the router crashes when they run a speedtest with SQM QoS and cake enabled. Maybe because they strain the available CPU power more? I have a very slow connection (10/1 Mbit), so maybe that's why I don't see immediate crashes?

New build is cooking with SMP, SMT and MIPS Coherent Processing System disabled and amount of cpu cores set to 2.

Edit: Build finished building just fine but it didn't boot. Out of ideas at the moment.

I'm cooking for the MT7621A myself (not the DIR-860L), but having similar (the same) problems. My connection is 200/20 on fiber.

Top is only showing CPU-0. Is this a shortcoming of busybox? cat /proc/cpuinfo shows CPU 0 - 3 as it should: 2 cores, and 2 threads per core.
cat /proc/interrupts shows the IRQs for Ethernet, MT7603 and MT7612 are only on CPU-0. I tried shifting the IRQs to offload CPU-0, e.g. echo 2 > /proc/irq/31/smp_affinity (MT7612 in my case) and echo 4 > /proc/irq/10/smp_affinity (Ethernet).
After that, cat /proc/interrupts shows that the IRQs are getting more balanced. On a speedtest.net run, I am seeing my advertised speeds. Top is indicating "only" 28 - 30% on soft-irq. CPU load is almost 0.
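For reference, the values written for IRQ affinity are hex bitmasks, one bit per CPU, so 2 selects CPU 1 and 4 selects CPU 2. A small sketch of that arithmetic follows. cpu_mask and set_irq_cpu are made-up helper names, and the IRQ numbers in the example are the ones from my board and will differ on other devices.

```shell
#!/bin/sh
# Sketch: build the affinity bitmask for a single CPU and apply it to an IRQ.
# cpu_mask and set_irq_cpu are hypothetical helper names.
cpu_mask() {
    # $1: CPU index (0-based); prints the hex mask with only that bit set
    printf '%x\n' $((1 << $1))
}

set_irq_cpu() {
    # $1: IRQ number, $2: CPU index. Needs root.
    cpu_mask "$2" > "/proc/irq/$1/smp_affinity"
}

# Examples (IRQ numbers are board specific, check /proc/interrupts first):
# set_irq_cpu 31 1    # writes 2
# set_irq_cpu 10 2    # writes 4
```

The change does not survive a reboot, so it would have to go into a startup script if it turns out to help.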
When I activate SQM and set lower limits (e.g. 40.000 / 10.000), it works and doesn't crash, so I am surprised that on 10/1 it does crash the router. So maybe nothing happens until SQM hits the CPU or bandwidth limits and really has to work to keep bufferbloat within reason. I should note that I'm on a PPPoE internet connection :frowning: and my ISP-provided all-in-one wifi router is set to bridge mode, but is still providing VLANs for IPTV and VOIP. I'm considering getting a GPON-to-Ethernet media converter and letting LEDE do all the work without the additional box. Maybe even set the MTU to 1508 on eth0.2 to get a 1500 MTU over the PPPoE.
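The 1508 figure comes from the PPPoE overhead of 8 bytes per packet (6-byte PPPoE header plus 2-byte PPP protocol field). A tiny sketch of the arithmetic is below. pppoe_mtu is a made-up helper name. Whether the ISP side and the bridge actually accept the larger frames is an assumption that would need testing.

```shell
#!/bin/sh
# Sketch: MTU available inside PPPoE for a given MTU on the underlying link.
# pppoe_mtu is a hypothetical helper name.
PPPOE_OVERHEAD=8   # 6-byte PPPoE header + 2-byte PPP protocol field

pppoe_mtu() {
    # $1: MTU of the underlying Ethernet/VLAN interface
    echo $(( $1 - PPPOE_OVERHEAD ))
}

# Examples:
# pppoe_mtu 1500    # the usual case, leaves 1492 for PPPoE
# pppoe_mtu 1508    # baby jumbo frames on eth0.2, leaves 1500
```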
As for compiling: I kept it "standard", but I did add "-mtune=1004Kc -mmt -mdsp"

Why would you need a separate converter for that? I am also using Fiber, and I ditched their all-in-one wifi router the day I got it. VOIP, IPTV and Internet are all coming in over the same ethernet cable from the fiber termination unit (FTU) as three separated tagged VLANs. I am not using VOIP myself (I have it, because it was cheaper to get an all-in-one offer), but the DIR-860L is doing the routing for the internet and routed IPTV just fine.

Yes, it was bugged. A reboot fixed it. I'm really out of ideas as well. For the people who haven't yet, please upvote the issue at flyspray, so that hopefully the developers can have another look: https://bugs.lede-project.org/index.php?do=details&task_id=764

Off-topic, but fiber comes into my apartment as 2 very thin fiber cables ending in a so-called "SC" connector. This connector goes directly into a PT632 E8C all-in-one box: 1x 1000Mbps, 3x 100Mbps and 2x connector for phone, plus 2.4GHz wifi (single non-detachable antenna). Had to use a serial console to get the "telecomadmin" password to get it into bridge mode. PPPoE with a non-public IP :frowning:

More on topic: How is the wifi situation for you? I get a pretty stable 5GHz AC signal. I can use the radio as WDS AP and add an extra virtual AP as wifi master. Using the additional MT7603 (2.4GHz N) on the PCIe bus is a nightmare. I have a feeling the kernel panic / crash is related to the MT76 driver. Tomorrow I will make a wifi-less build and do wired testing only, to see if I can get it to crash using cable only. So far, most of my speedtests were from my wirelessly connected laptop or my iPad, while I was connected (wirelessly) to an SSH terminal to monitor the router.

As "interface" in SQM I use PPPoE, not the underlying eth0.2. Don't know if that should make a difference.

That is the correct way of configuring it :slight_smile:

5GHz is very stable for me as well. The only thing worth mentioning is that the tx-power is way lower than what is allowed in the regulatory domain, which causes the range to be less than stellar.

The DIR-860L uses the MT7602. No issues with it whatsoever. Very good range, and good speed. Both radios are using two SSIDs (one for ourselves, one for guests).

Ah, that setup is way different than what we have. Disregard my previous comments :slight_smile: