Optimized build for the D-Link DIR-860L

I am also surprised that the build works and that it doesn't crash for you while running speedtests with cake enabled. The LuCI/SSH bug concerns me, however. I have not run into it, but for normal router management it's quite an essential thing ;). Is there something in the logs concerning LuCI or SSH? Maybe it is because I include luci-ssl-openssl instead of the regular luci-ssl? Just guessing here.
I forgot to mention that the build is a normal build with all the regular goodies, so SMP, SMT and mt76 are enabled.

Does your router still respond to commands in an SSH session?

So something breaks, but the initial results with cake are promising. If someone is able to get a log I will be very grateful :slight_smile: Tonight around 23:00 UTC+1 I can finally test the build myself, I need to have some patience :see_no_evil:

Nothing strange in the logs when SSH/LuCI broke. The symptoms and behavior of the bug were exactly the same as with the 4.4 build with SMT disabled. Maybe it is still a bug caused by SQM, but the crashing on non-experimental builds prevents the bug from ever rearing its head?

When the router stopped working completely and required a reboot, I wasn't able to get any logs, since it was accessible neither through LuCI nor through SSH. I am now running an SSH session with logread -f to hopefully catch something before the thing crashes again.
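
For anyone else trying to catch this: since the log is gone once the router hangs, it can help to stream it to another machine instead of only watching it in the terminal. A minimal sketch, assuming the router answers as root@192.168.1.1 (adjust to your setup); the file name dir860l.log and the helper name stamp_lines are made up for illustration:

```shell
#!/bin/sh
# Stream the router's log to a local file so it survives a router crash.
# ROUTER and LOGFILE are assumptions; change them for your setup.
ROUTER="${ROUTER:-root@192.168.1.1}"
LOGFILE="${LOGFILE:-dir860l.log}"

# Prefix every line read from stdin with the local time and append it
# to the file given as the first argument.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done >> "$1"
}

# Usage (run on a PC, not on the router):
#   ssh "$ROUTER" 'logread -f' | stamp_lines "$LOGFILE"
```

The local timestamp is only there to make it easier to match the last log line against the moment the SSH session stopped responding.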

Just another thing muddying the waters, unfortunately. The only thing we can do is try to get some logs and see what is causing these bugs. My gut feeling says it has something to do with the drivers for our device, but without evidence that is just guesswork. Did you catch anything last night? Did WiFi break again?

Just flashed the 4.9 kernel build. Running it without SQM QoS for the time being (need a bit of stability due to lack of time). Let's see if I can break stuff :stuck_out_tongue:

Nothing broke so far. Luci still works and SSH commands are also working. I think it only broke under heavy load. Not sure if SQM had anything to do with the issues.

Perhaps SQM + heavy load now leaves it in a buggy state rather than causing stack traces / hard crashes? Will have to test to see if heavy traffic also causes the same issues with SQM disabled.

My current hypothesis is that there are 2 bugs, the "skb_try_coalesce" stack traces and the SQM QoS bug. Enabling SQM QoS exacerbates the first one and then the router crashes. My router running the 4.9 kernel build has an uptime of 1 day but I see the same stack traces as on 4.4.xx kernels. Do you see those as well?
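
A quick way to check for those traces, as a rough sketch: count the log lines that mention skb_try_coalesce. The helper name count_coalesce is made up, and it counts matching lines rather than whole traces, so one trace may be counted more than once:

```shell
#!/bin/sh
# Count the lines on stdin that mention skb_try_coalesce.
# Prints 0 when nothing matches.
count_coalesce() {
    grep -c 'skb_try_coalesce'
}

# Usage on the router:
#   logread | count_coalesce
#   dmesg | count_coalesce
```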

What exactly do you mean? sqm-scripts really is just a convenience script around the tc and ip binaries plus a few kernel modules. Bugs in sqm-scripts are expected to result in sub-optimal/non-functional traffic shaping, but not in stalls or crashes (such crashes might be related to the cake module and the cake-associated changes in tc, but that is a separate, though related, project from sqm-scripts).
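
To illustrate the point, a bare egress shaper can be set up with tc alone, without sqm-scripts. This is only a sketch: eth0 and the 20000 kbit/s rate are placeholders, the helper names are made up, and it is not the exact set of commands sqm-scripts runs (as far as I know it also sets up an IFB device for ingress, which is left out here). The helpers only print the commands, so nothing changes until you pipe them to sh:

```shell
#!/bin/sh
# Print (not run) the tc commands for a bare cake shaper on one interface.
# $1 = interface (placeholder, e.g. eth0), $2 = rate in kbit/s (placeholder).
cake_cmds() {
    printf 'tc qdisc replace dev %s root cake bandwidth %skbit\n' "$1" "$2"
    printf 'tc -s qdisc show dev %s\n' "$1"
}

# Print the command that removes the root qdisc again.
cake_undo() {
    printf 'tc qdisc del dev %s root\n' "$1"
}

# Usage on the router (needs root and the cake kernel module):
#   cake_cmds eth0 20000 | sh
#   cake_undo eth0 | sh
```

If the router still misbehaves under load with only this in place, that would point away from sqm-scripts itself.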

As usual you are right. While not technically correct, for the sake of convenience I used the term SQM QoS when I was actually talking about using cake.
Using cake seems to exacerbate something in the drivers for the DIR-860L which causes it to act flaky (LuCI + SSH session not responding, degraded performance) and become unstable.

I didn't see any stack traces last time I checked. But I didn't see them on the 4.4 kernel either. The only stack traces I was getting on the 4.4 kernel were the ones that were posted on the flyspray page with SQM running: https://bugs.lede-project.org/index.php?do=details&task_id=764

It's not only cake. Fq_codel did the same thing for me on the 4.4 kernel, and so did the QoS package (Not SQM). The bugs are definitely somewhere in the drivers I think.

Is it normal for the 5 GHz TX power to only go up to 18 dBm on the upper channels? My Trendnet QCA 9563 AP goes much higher.

No, it's not normal. It's a bug as far as I know. See my previous post on this issue here: Optimized build for the D-Link DIR-860L - #73 by Mushoz

I will put up a bug report of this issue once the SQM issue has been completely solved. But feel free to file your own report if you don't want to wait for that. Bugs can be reported here: https://bugs.lede-project.org/

Just a quick update:

Kernel Version 4.9.20
Uptime 3d 12h 26m 28s

So it's looking quite good. SQM has been running all that time with fq_codel + qos_simplest. Load in these three days has been quite low though. I did download a few torrents, but that was over WiFi, so it hardly stresses the connection fully. The SSH/LuCI bug would probably return if I ran some heavy traffic tests.

For the other people that are still running the build with the 4.9 kernel: How are your experiences so far?

Edit: Turns out SQM was disabled. The GUI showed it enabled, but in reality it wasn't running. I just re-enabled the SQM instance, and the SSH/LUCI bug cropped up again as soon as I applied some load. Conclusion: The kernel 4.9 build is fine for regular use (no SQM) but is still unstable with SQM enabled. The bugs that crop up are just different than the ones in the 4.4 build.
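
Since the GUI state can apparently be wrong, it is worth checking what is actually attached to the interface. A rough sketch, assuming that an active SQM instance shows up as a cake or htb qdisc in the tc output (plain fq_codel on its own does not count here, on the assumption that it is also what you get without SQM); the helper name sqm_active and the interface eth0 are placeholders:

```shell
#!/bin/sh
# Read `tc qdisc show` output on stdin and succeed if a shaper qdisc
# (cake or htb) is present. Assumption: SQM shows up as one of these.
sqm_active() {
    grep -Eq '^qdisc (cake|htb) '
}

# Usage on the router (eth0 is a placeholder for your WAN interface):
#   tc qdisc show dev eth0 | sqm_active && echo "shaper attached" || echo "no shaper"
```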

Anything you noticed between the 4.4 bugs and 4.9 bugs?? Any crash and recovery on the wifi drivers? I could live without SQM at the moment. But only if the wifi is solid.

@drbrains For me the router was completely stable on the 4.9 kernel. I had a 3.5 day uptime that was only interrupted after enabling SQM again. No wifi issues either.

I too saw stability until I activated SQM. One thing I noted was that upon a crash, all the configurations I did on the router were gone. It was as if I had just performed a clean install of LEDE each time.
Any way to diagnose the issue? I would love to have this router as primary Cake router on my network. Currently, that duty is given to my Archer C7v2, which is definitely inferior hardware these days!

[QUOTE]
Doing the same tests with fq_codel, everything was rock solid stable. What surprised me though, was that fq_codel was able to perform way better
[/QUOTE]
Should I just use fq_codel with this router instead of Cake for the time being? Would that be faster than the Archer C7v2 running as main router?

What are the advantages of a 4.9 kernel in LEDE?

Sorry, disregard that old comment about fq_codel by me. While it was more stable with fq_codel than with cake, it would still crash sooner or later. We need to wait for the developers to fix this bug, unfortunately. :frowning:

None at the moment. It is still experimental and it doesn't solve any of the issues that are present on 4.4. I returned to the 4.4 kernel and would advise others to stay on 4.4 as well.

Anything we can do to help with diagnosing the issue? I've started streaming 4k Blurays and I imagine the dual core CPU of the D-Link would work great for this!

@everyone, sorry for the delay responding to everything but I've been extremely busy lately.
Just noticed this commit, so I guess it's testing time again!

New build with a 4.4.x kernel is building.

Does this also help the DIR-860L? It uses the mt7602 instead of mt7603, right?

The mt76 driver is for the MT7602 and MT7612, plus the MT7603 on PCI, or the MT7628, which has the MT7603 in the SoC.

So it would help if it gets stable. However, I still have stack traces. It depends on the channel I select. Needs more testing.

But the 14 new commits that were added to the driver (https://github.com/openwrt/mt76/commits/master) show that all edits were done inside functions that have mt7603 in their names. I presume that these functions will never be called when the device is using the mt7602, right? That would mean there should be no differences with these commits for devices using the mt7602.
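
One way to double-check this instead of eyeballing the commit list is to pull the function names out of the diff hunk headers and see whether anything outside mt7603 shows up. A rough sketch; the helper name non_mt7603_hunks is made up, the function context in hunk headers is a git heuristic, and hunks without any function context (for example changes to struct definitions) are silently skipped, so an empty result is a hint rather than proof:

```shell
#!/bin/sh
# Read a unified diff on stdin and list the function contexts from the
# hunk headers ("@@ -a,b +c,d @@ context") that do NOT mention mt7603.
non_mt7603_hunks() {
    sed -n 's/^@@ [^@]*@@ *\(..*\)$/\1/p' | grep -v 'mt7603' | sort -u
}

# Usage inside a clone of the mt76 repository:
#   git log -p -14 | non_mt7603_hunks
```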