[Solved] WRT1900ACV1 reboots: kernel 4.9

This is bad news :-/
But yeah, the PJ4B is a custom-design CPU,
and those fixes are for the Cortex-A9.
No one knows exactly which design this CPU is based on.
It could also be Cortex-A8. Only Marvell knows for sure.

Hasn't this memory corruption issue in mwlwifi been there since the beginning? At least it seems so.
Yes, the commit fixed some issue that was introduced recently.
I'm also not quite sure the problem really is the mwlwifi driver, because other devices don't suffer from this issue. Maybe an mwlwifi firmware bug is more likely?

Could be many things, even a bad batch of power supplies x)

There is still the gcc patch left to try.

But the stock firmware is stable, yes?

//edit
I downloaded the source from Linksys (kernel 3.x?).
Their config sets the following:
CONFIG_SHEEVA_ERRATA_ARM_CPU_6124=y
CONFIG_SHEEVA_ERRATA_ARM_CPU_PMU_RESET=y

Where has this 6124 erratum fix gone in 4.x? Is it now combined with 4742?
And there is also 4611:
https://lists.infradead.org/pipermail/linux-arm-kernel/2013-May/171475.html

And for the WRT1200 it is:
CONFIG_PJ4B_ERRATA_4742
CONFIG_ARM_ERRATA_720789
No PL310 errata are selected.
But I leave mine running with
CONFIG_ARM_ERRATA_754322
CONFIG_ARM_ERRATA_754327
CONFIG_ARM_ERRATA_764369
and with the gcc patch,
because I trust the official ARM docs more than those Marvell devs x)
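
For anyone comparing configs by hand, here is a minimal sketch of a checker that reports which of the errata options named above are set in a `.config`-style text. The option list, the function names (`parse_kconfig`, `errata_status`) and the sample text are assumptions made for illustration; they are not part of the OpenWrt or Linksys sources.

```python
import re

# Options named in this thread; illustrative, not an exhaustive errata list.
ERRATA_OF_INTEREST = [
    "CONFIG_PJ4B_ERRATA_4742",
    "CONFIG_ARM_ERRATA_720789",
    "CONFIG_ARM_ERRATA_754322",
    "CONFIG_ARM_ERRATA_754327",
    "CONFIG_ARM_ERRATA_764369",
]

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text):
    """Return {option: value} for .config-style text; unset options map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def errata_status(text, wanted=ERRATA_OF_INTEREST):
    """Map each wanted option to its value, or 'missing' if never mentioned."""
    options = parse_kconfig(text)
    return {name: options.get(name, "missing") for name in wanted}


# Hypothetical sample; replace with the contents of a real config file.
sample = """\
CONFIG_ARM_ERRATA_720789=y
CONFIG_PJ4B_ERRATA_4742=y
# CONFIG_PL310_ERRATA_588369 is not set
"""

status = errata_status(sample)
for name, value in status.items():
    print(f"{name}: {value}")
```

Running the same function over two configs (for example the vendor one and the OpenWrt one) and comparing the resulting dictionaries shows at a glance which errata differ.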

With just the CPU IDLE disabled and mwlwifi dated 1/30/2018 I have had zero reboots on Mamba, Rango, Shelby or Caiman. I consider this problem solved (not necessarily in trunk, but certainly with custom build). This applies to both kernel 4.9 and 4.14.

Yes, that is understood; at least I am aware of it. But for me (currently) that is treating the symptom, and I'm one of those find-out-the-cause kind of people; but that is going to take what has been discussed above.

Regarding "not necessarily in trunk": I've been pondering whether an RFC PR might solicit some input, or even a "let's run with it" push, but I know I will not be opening one (wink, wink, nod, nod).

Arch Linux's kernel config for the SolidRun Clearfog (close enough) has these:

I just noticed that I seem to be getting spontaneous reboots on my Netgear WNDR3800 running LEDE 17.01.4.
The network went down temporarily, and when I looked at the uptime it was just a couple of minutes. I get new external IPs on reboots, and as I log these I can see that this has been happening every few days.

Should I start a new discussion thread, or is this thread the right place?

The WNDR3800 is an Atheros MIPS SoC whereas the WRT series use a Marvell ARM SoC. Furthermore, you're on a 4.4 kernel, and this thread is about Marvell's SoCs combined with a 4.9 kernel. So your issue is probably completely unrelated. I'd put my money on SQM - since I haven't heard about any stability problems on ar71xx with 4.4.

Thanks, will post as a separate discussion!

By way of a further data-point, with:

diff --git a/target/linux/mvebu/config-4.14 b/target/linux/mvebu/config-4.14
index 71123f52..9069247 100644
--- a/target/linux/mvebu/config-4.14
+++ b/target/linux/mvebu/config-4.14
@@ -44,8 +44,12 @@ CONFIG_ARM_ATAG_DTB_COMPAT=y
 CONFIG_ARM_ATAG_DTB_COMPAT_CMDLINE_FROM_BOOTLOADER=y
 CONFIG_ARM_CPU_SUSPEND=y
 CONFIG_ARM_CRYPTO=y
+CONFIG_ARM_ERRATA_643719=y
 CONFIG_ARM_ERRATA_720789=y
+CONFIG_ARM_ERRATA_754322=y
+CONFIG_ARM_ERRATA_754327=y
 CONFIG_ARM_ERRATA_764369=y
+CONFIG_ARM_ERRATA_775420=y
 CONFIG_ARM_GIC=y
 CONFIG_ARM_GLOBAL_TIMER=y
 CONFIG_ARM_HAS_SG_CHAIN=y
@@ -387,6 +391,7 @@ CONFIG_PINCTRL_ARMADA_38X=y
 CONFIG_PINCTRL_ARMADA_XP=y
 CONFIG_PINCTRL_MVEBU=y
 # CONFIG_PINCTRL_SINGLE is not set
+CONFIG_PL310_ERRATA_769419=y
 CONFIG_PJ4B_ERRATA_4742=y
 # CONFIG_PL310_ERRATA_588369 is not set
 # CONFIG_PL310_ERRATA_727915 is not set

reboot in <1 hour.

Given that this research has been going on for more than a year, that we do have a workaround with limited drawbacks, and that there is a need to move on and get other parts of v4.14 tested on mvebu (including Mamba), I think the time has come to push a patch to disable CPU_IDLE for now.

Yes, I would also prefer finding the real bug instead of papering over it. But that is obviously not easy. It would be nice if there was some way to do a kernel git bisect here. But the large number of out-of-tree patches, coupled with the big version gap between known good (v4.4.?) and first known bad (v4.9.?), makes that too hard. I made a feeble attempt last night, but had to give up quickly after messing up the switch while attempting to backport the OpenWrt patches to the commit being tested. It's just too much work for every test, and very hard to get it right every time.

So I vote for disabling CPU_IDLE for now. Reenabling it along with a proper fix is simple once the proper fix is found. No need to prevent Mamba users from testing other parts of the OpenWrt trunk.

Unclear to me as to the intent here. I am certainly not holding things up, nor advocating to hold things up. I even have a 4.14.x cpuidle patched mamba build for public consumption off the link on my avatar. I just investigate when I feel like taking a look at the issue once again.

To me, the holdup on submitting a patch as indicated in this thread is the other mvebu targets. My guess, and I think also that of @InkblotAdmirer, is that owners of other Linksys WRT devices, Turris Omnia, SolidRun... devices are going to be less than keen on this solution. My suggestion of an RFC PR was simply meant to see if we could garner further input, maybe an alternative implementation via Makefile that would target just mamba; but I for one don't see the targeted solution, if it exists. But by all means, submit a patch PR | ML and gauge the community reaction. I certainly will not be speaking against same.

Edit: and in getting around to checking today's ML I see you have done that. Thanks, should be interesting to see if there are any naysayers.

Please try the latest version from my staging tree at https://git.openwrt.org/?p=openwrt/staging/nbd.git;a=summary
This commit should hopefully fix it: https://git.openwrt.org/0142d447bc02922369c363de6510e76848943e9a

Or better yet, a targeted solution that seems to have done the trick. Thanks much.

A Mamba-specific workaround! Thanks @nbd and all others who have contributed to the potential isolation of this long-pending and frustrating issue.

Let's see how this pans out. Time for me to put together another build.

This may be a hack, but with the commit now pushed to trunk, mamba users can flash releases, snapshots, or build from trunk without needing a custom patch.

It would be great to find the underlying cause, but as bmork has mentioned (and I found a while back) the effort may not be worth it. I have not observed any negatives from disabling CPU idle, and thus, as I have hinted before, I've moved on from this issue.

This might be old news to the rest of you, but in case it is not...

I believe the assumption that the bug only affects mamba could be false. Unless there is something I don't see here, CPU_IDLE has "always" been forcibly disabled on caiman, cobra, rango and shelby:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/mach-mvebu/pmsu.c#n431

Not that this changes anything. It just explains why the regression didn't affect the other variants. Which I think is very important to keep in mind when trying to figure out the real problem. The real bug does not have to be mamba specific at all.
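
To see on a running device whether any cpuidle states are actually registered, the standard Linux sysfs layout under `/sys/devices/system/cpu/cpuN/cpuidle/` can be read. The following is a minimal sketch, assuming Python is available on the device or that the sysfs tree has been copied elsewhere; the function name `cpuidle_states` and the `sysfs_root` parameter are made up for this example.

```python
from pathlib import Path


def cpuidle_states(cpu=0, sysfs_root="/sys"):
    """Return the names of the cpuidle states registered for one CPU.

    An empty list means no per-CPU cpuidle directory (or no states) was
    found, which is what one would expect when cpuidle is disabled.
    """
    base = Path(sysfs_root) / "devices" / "system" / "cpu" / f"cpu{cpu}" / "cpuidle"
    if not base.is_dir():
        return []
    # State directories are named state0, state1, ...; sort numerically.
    states = [
        p for p in base.iterdir()
        if p.name.startswith("state") and p.name[5:].isdigit()
    ]
    states.sort(key=lambda p: int(p.name[5:]))
    names = []
    for state in states:
        name_file = state / "name"
        if name_file.is_file():
            names.append(name_file.read_text().strip())
    return names


print(cpuidle_states() or "no cpuidle states registered for cpu0")
```

If the assumption about the other variants holds, they should report no states (or only the basic WFI state), but that is something to verify on real hardware rather than take from this sketch.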

The fix/workaround for mamba is still the correct one. A big thanks to @nbd for that.

Well, first off, I don't have a WRT1900AC V1, so all I'm about to say could be wrong. But I do know of a strangely similar issue with the IPQ4019. It too would randomly reboot from time to time for no apparent reason. Luckily, the reason was found by Qualcomm and a patch was upstreamed:

https://github.com/torvalds/linux/commit/395717ee0d010a172c17c9e27a9483388d0f8e4c

So, in the IPQ4019 case, the CPU can't be freely switched between the frequencies.
For example, if the CPU is clocked at 48 MHz (lowest), it first needs to be switched to the 500 MHz clock before it can go to the highest clock (716 MHz). Could something like this also be the case for the armada-370 CPU? I.e. when entering (or leaving) cpu_idle, the CPU has to shift through the frequencies more like a gearbox, ramping gradually (up or down) and only going from one frequency to the next?
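
To make the gearbox idea concrete, here is a purely illustrative sketch of stepping through a frequency ladder one adjacent step at a time. The ladder contains only the three IPQ4019 values mentioned above, the function name `transition_path` is hypothetical, and nothing here is known to describe how the Armada clocks actually behave.

```python
def transition_path(current_mhz, target_mhz, ladder_mhz):
    """Return the frequencies to pass through, one adjacent step at a time.

    Raises ValueError if either frequency is not on the ladder.
    """
    ladder = sorted(ladder_mhz)
    start = ladder.index(current_mhz)
    end = ladder.index(target_mhz)
    step = 1 if end >= start else -1
    return [ladder[i] for i in range(start, end + step, step)]


# Ladder taken from the IPQ4019 example in this post (values in MHz).
IPQ4019_EXAMPLE_LADDER = [48, 500, 716]

print(transition_path(48, 716, IPQ4019_EXAMPLE_LADDER))
print(transition_path(716, 48, IPQ4019_EXAMPLE_LADDER))
```

The point is only that a direct jump from the lowest to the highest clock is never issued; every change goes through the neighbouring frequency first.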

Given the @bmork link above, and that CPU frequency scaling appears to function on the rango, I'm getting somewhat curious about what the real situation is and what is truly broken.
