[Solved] WRT1900ACV1 reboots: kernel 4.9

For those looking for another possible avenue to explore regarding this issue: @hauke's staging mvebu-4.14 branch builds and loads on the mamba. I have not had much of a chance to test things yet, though.

4.14 may be the last hope for the V1. I believe 4.14 pulls in all the mvebu-next stuff with a fix for a reboot issue. Not sure whether that fix is related to this mvebu reboot issue.

I updated the original post with the patch I've been using. No issues on two mamba devices over roughly two weeks -- stable enough for me!

@InkblotAdmirer @NainKult

Testing the patch provided by NainKult on LEDE master / OpenWrt master (with the merger).

uptime: 8 days!
kernel: 4.9.70

I relegated the mamba to a dumb AP, but uptimes still look good! I might give 4.14 a go, but good job guys! The device is now "stable*".

On the 4.14.x kernel front, I built a 4.14.6 and a 4.14.11 image; both suffered random reboots. These are the first images I have generated in 4 months that rebooted.

There is some chatter on IRC that the issue might be the watchdog. I may take a look at the watchdog idea now that I have an image that experiences the issue again.

For what it's worth, I have kernel 4.14 running on mamba, caiman, shelby and rango -- all are running fine and stable. Based on anomeome's prior post I kept CPU_IDLE disabled and have had no issues. There is a proposed patch to migrate the pci syntax in /etc/config/wireless but it didn't work for me on the mamba. For anyone else who wants to try, correct syntax is (at least... what's working for me):

mamba (radio0, then radio1):

soc/soc:pcie@82000000/pci0000:00/0000:00:02.0/0000:02:00.0
soc/soc:pcie@82000000/pci0000:00/0000:00:03.0/0000:03:00.0

all others (radio0, radio1):

soc/soc:pcie/pci0000:00/0000:00:01.0/0000:01:00.0
soc/soc:pcie/pci0000:00/0000:00:02.0/0000:02:00.0
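
For anyone who wants to apply the paths above by hand, here is a minimal sketch that turns them into uci commands. It assumes the radios are the usual `radio0` and `radio1` wifi-device sections with a `path` option in /etc/config/wireless; the helper name `uci_path_commands` is made up for illustration.

```python
# Paths copied from the post above; index 0 is radio0, index 1 is radio1.
MAMBA_PATHS = [
    "soc/soc:pcie@82000000/pci0000:00/0000:00:02.0/0000:02:00.0",
    "soc/soc:pcie@82000000/pci0000:00/0000:00:03.0/0000:03:00.0",
]
OTHER_PATHS = [
    "soc/soc:pcie/pci0000:00/0000:00:01.0/0000:01:00.0",
    "soc/soc:pcie/pci0000:00/0000:00:02.0/0000:02:00.0",
]


def uci_path_commands(paths):
    """Return uci commands that set the path of radio0, radio1, ... in order.

    Assumption: the wifi-device sections are named radio0, radio1, ...
    """
    cmds = [
        "uci set wireless.radio%d.path='%s'" % (index, path)
        for index, path in enumerate(paths)
    ]
    # Nothing is written to /etc/config/wireless until the commit.
    cmds.append("uci commit wireless")
    return cmds
```

Printing `"\n".join(uci_path_commands(MAMBA_PATHS))` gives the commands to paste into a shell on the router. Check the section names with `uci show wireless` first, since they may differ on a customised config.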

Hi Everyone,

I have been following this thread to try to resolve my mamba random reboot issue with kernel 4.9. I applied the patch as directed in [mamba stability patch](mamba stability patch), against a clean clone of the openwrt.git repository. I enabled /proc/config.gz so I could verify that the CPU power management feature is disabled. My kernel config from /proc/config.gz on the router can be viewed from here.

My openwrt version is below
DISTRIB_ID='OpenWrt'
DISTRIB_RELEASE='SNAPSHOT'
DISTRIB_REVISION='r6029-d0d37e89af'
DISTRIB_TARGET='mvebu/generic'
DISTRIB_ARCH='arm_cortex-a9_vfpv3'
DISTRIB_DESCRIPTION='OpenWrt SNAPSHOT r6029-d0d37e89af'
DISTRIB_TAINTS='no-all no-ipv6 busybox'

It seems that everybody has had great success with disabling CPU power management, but this is not the case for my own router. Did I miss anything here?
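
A minimal sketch of the /proc/config.gz check described above, in case it helps others compare configs. It assumes the kernel was built with CONFIG_IKCONFIG_PROC so that /proc/config.gz exists, and that Python is available (typically on a build host, using a copy of the file pulled from the router). The function names are hypothetical.

```python
import gzip


def parse_kernel_config(text):
    """Map CONFIG_* symbols to their values; unset symbols map to None."""
    symbols = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(" is not set"):
            # "# CONFIG_FOO is not set" -> "CONFIG_FOO"
            name = line[2:-len(" is not set")]
            symbols[name] = None
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            symbols[name] = value
    return symbols


def load_config(path="/proc/config.gz"):
    """Read a gzip-compressed kernel config, e.g. a copy of /proc/config.gz."""
    with gzip.open(path, "rt") as handle:
        return parse_kernel_config(handle.read())


def report(symbols, names):
    """Return 'NAME=value' or 'NAME is not set' for each requested symbol."""
    lines = []
    for name in names:
        value = symbols.get(name)
        if value is None:
            lines.append("%s is not set" % name)
        else:
            lines.append("%s=%s" % (name, value))
    return lines
```

For example, `report(load_config("config.gz"), ["CONFIG_CPU_IDLE"])` should come back with `CONFIG_CPU_IDLE is not set` if the patch took effect in the image that is actually running.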

@e88z4

It's possible your issue is not related -- the symptoms that started this thread are reboots with no syslog messages or kernel crashlogs.

I have seen other reboot issues that likely still exist (Android clients not playing nice with mwlwifi on certain models). Have you set logs to non-tmp storage and checked for messages prior to the reboot, or checked for the kernel crashlog after a reboot?

I found this in the crashlog. This may not be related to the random reboot caused by the CPU power management.

I have done some troubleshooting over the past week.
I tried compiling against kernel 4.4 as well, and I got a random reboot there too. I didn't have this issue with kernel 4.4 when I compiled the source about 6 months ago. Some recent commit in one of the packages probably broke something here. I will start by using an older version of mwlwifi.

Anyway, thanks for replying quickly.

That certainly looks like one of the open mwlwifi issues; you may want to check in on the GitHub page. I can't recall exactly, but the mwlwifi crashes I saw happened when I had clients configured to attach to an SSID on either of the 2.4G and 5G bands. When attaching or detaching, mwlwifi would get confused and crash. Configuring the client to attach to only one band solved the crash.

Hi
I don't want to open a new thread for this...
I dug around in the kernel menuconfig and found some options for Cortex-A9 errata fixes.

According to this document:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388h/BABCBFDF.html
The MIDR register contains the cpu rev and patch level.

According to this post:

/proc/cpuinfo displays this info.

So it seems that a Cortex-A9 r4p1 is doing the work in my WRT1200.
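
As a rough sketch of how the revision can be read off: the `CPU variant` field in /proc/cpuinfo is the major revision (the N in rNpM) and `CPU revision` is the patch level (the M), both taken from MIDR. The helper names below are hypothetical, and the lookup tables only cover the implementer and part codes relevant to this thread (0x41 ARM, 0x56 Marvell, 0xc08 Cortex-A8, 0xc09 Cortex-A9).

```python
IMPLEMENTERS = {0x41: "ARM", 0x56: "Marvell"}
ARM_PARTS = {0xC08: "Cortex-A8", 0xC09: "Cortex-A9"}


def parse_cpuinfo(text):
    """Return one dict of 'CPU ...' fields per processor block."""
    cpus = []
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "processor":
            cpus.append({})
        elif cpus and key.startswith("CPU "):
            cpus[-1][key] = value
    return cpus


def revision_string(cpu):
    """Build rNpM from MIDR variant (N) and revision (M)."""
    variant = int(cpu["CPU variant"], 0)
    revision = int(cpu["CPU revision"], 0)
    return "r%dp%d" % (variant, revision)


def describe(cpu):
    """Return e.g. 'ARM Cortex-A9 r4p1' for one parsed processor block."""
    implementer = int(cpu["CPU implementer"], 0)
    part = int(cpu["CPU part"], 0)
    vendor = IMPLEMENTERS.get(implementer, "implementer 0x%02x" % implementer)
    # Part numbers are only meaningful per implementer, so only map ARM's.
    name = ARM_PARTS.get(part) if implementer == 0x41 else None
    if name is None:
        name = "part 0x%03x" % part
    return "%s %s %s" % (vendor, name, revision_string(cpu))
```

Feed it the text of /proc/cpuinfo, e.g. `[describe(c) for c in parse_cpuinfo(open("/proc/cpuinfo").read())]`.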

So going through the errata document:
Link: https://silver.arm.com/download/Unspecified/BX500-DA-10004-r0p0-01rel0/UAN0009_cortex_a9_errata_r4.pdf

761319 - gcc patch needed, not sure if this is already included
-> https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00714.html

740657 - Global ARM Timer is enabled, so it's fixed?
751476 - Doesn't affect normal operation, no fix needed (debugging/watchpoint)

754323 / 754327
-> https://www.spinics.net/lists/arm-kernel/msg442709.html
**Since there is no fix for 754323 errata**
754327 maybe also needs to be checked.

754322 - not checked, needs to be enabled?
764369 - not checked, needs to be enabled?
794072 - no built-in kernel fix available (was grouped with 764369 in the past)
794073 - needs bootloader workaround, disable MMU
794074 - no built-in kernel fix available
845369 - no built-in kernel fix available
-> https://community.nxp.com/thread/371109 -> heavy performance drop

It seems the fix for erratum 720789, which is force-selected by default, is not needed.
It isn't even mentioned in the doc... why?

That leaves me with the pl310 errata.

I found one document, but it is from 2012:
https://silver.arm.com/download/Unspecified/BX500-DA-11003-r0p0-00rel0/corelink_pl310_software_developers_errata_notice_r3_UAN0013B.pdf
Most errata are fixed in > r3p1.
If someone knows how to get the rev from the pl310 controller that would be great.
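
One possible way, offered as a sketch rather than a confirmed answer: the PL310 has a cache ID register (reg0_cache_id, offset 0x000) whose low six bits encode the RTL release, and the kernel's L2C-310 driver normally logs that register at boot in a line containing `CACHE_ID 0x...`. The release codes below are reproduced from memory of the kernel's cache-l2x0.h header, so treat them as an assumption and verify them against the PL310 TRM. The function names are hypothetical.

```python
import re

# RTL release field, bits [5:0] of reg0_cache_id (assumed values, see above).
RTL_RELEASES = {
    0x00: "r0p0",
    0x02: "r1p0",
    0x04: "r2p0",
    0x05: "r3p0",
    0x06: "r3p1",
    0x07: "r3p1-50rel0",
    0x08: "r3p2",
    0x09: "r3p3",
}


def pl310_release(cache_id):
    """Decode the RTL release from a 32-bit PL310 cache ID value."""
    rtl = cache_id & 0x3F
    return RTL_RELEASES.get(rtl, "unknown (RTL 0x%02x)" % rtl)


def cache_id_from_log(text):
    """Pull the first 'CACHE_ID 0x...' value out of kernel log text."""
    match = re.search(r"CACHE_ID (0x[0-9a-fA-F]+)", text)
    return int(match.group(1), 16) if match else None
```

If the boot log has no such line, the cache controller may not be driven by the L2C-310 code at all, in which case the PL310 errata would not apply to that board.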

In the Kconfig for the mvebu platform it states:

config MACH_ARMADA_38X
	bool "Marvell Armada 380/385 boards"
	depends on ARCH_MULTI_V7
	select ARM_ERRATA_720789
	select ARM_ERRATA_753970

(same errata selection for armada 375)
But errata 753970 is not checked in menuconfig.

CONFIG_PL310_ERRATA_753970:
This option enables the workaround for the 753970 PL310 (r3p0) erratum.
Under some condition the effect of cache sync operation on
the store buffer still remains when the operation completes.
This means that the store buffer is always asked to drain and
this prevents it from merging any further writes. The workaround
is to replace the normal offset of cache sync operation (0x730)
by another offset targeting an unmapped PL310 register 0x740.
This has the same effect as the cache sync operation: store buffer
drain and waiting for all buffers empty.

The problem is, the symbol changed from
ARM_ERRATA_753970
to
PL310_ERRATA_753970

I found this patch
https://patchwork.kernel.org/patch/10199559/
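
To illustrate why the rename matters: as far as I know, a `select` of a symbol that is not defined anywhere has no effect and Kconfig does not complain about it, so the workaround is simply never enabled. A minimal sketch for spotting such leftovers, assuming you pass in the concatenated text of the relevant Kconfig files (the function name is made up):

```python
import re


def dangling_selects(kconfig_text):
    """Return symbols that are selected but never defined in the given text."""
    defined = set(
        re.findall(r"^\s*(?:menu)?config\s+(\w+)", kconfig_text, re.M)
    )
    selected = re.findall(r"^\s*select\s+(\w+)", kconfig_text, re.M)
    missing = []
    for symbol in selected:
        if symbol not in defined and symbol not in missing:
            missing.append(symbol)
    return missing
```

With a snippet that defines `PL310_ERRATA_753970` but selects `ARM_ERRATA_753970`, the function returns `["ARM_ERRATA_753970"]`. Run against a single file it will over-report, since most selected symbols are defined in other Kconfig files.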

So an erratum fix for r3p0 was force-selected,
assuming the Marvell SoCs use a PL310 r3p0.
It seems the following errata fixes are also needed:
// fixed copy paste fail :>
754322
754327
764369

I think?

Can someone with wrt1900 post the output of cat /proc/cpuinfo please?

Hey @shm0!

Here is the output you requested:

root@LEDE:~# cat /proc/cpuinfo 
processor	: 0
model name	: ARMv7 Processor rev 2 (v7l)
BogoMIPS	: 50.00
Features	: half thumb fastmult vfp edsp vfpv3 tls idiva idivt vfpd32 lpae 
CPU implementer	: 0x56
CPU architecture: 7
CPU variant	: 0x2
CPU part	: 0x584
CPU revision	: 2

processor	: 1
model name	: ARMv7 Processor rev 2 (v7l)
BogoMIPS	: 50.00
Features	: half thumb fastmult vfp edsp vfpv3 tls idiva idivt vfpd32 lpae 
CPU implementer	: 0x56
CPU architecture: 7
CPU variant	: 0x2
CPU part	: 0x584
CPU revision	: 2

Hardware	: Marvell Armada 370/XP (Device Tree)
Revision	: 0000
Serial		: 0000000000000000

Thank you.

So the Armada 370/XP uses a custom CPU design by Marvell (the PJ4B?).
I can't find any errata document for this CPU.
However, some say this CPU is based on the Cortex-A8/A9 design.
Maybe it's worth testing the Cortex-A8/A9 errata fixes on this CPU to see if there is any difference.

I found 3 errata for this CPU.
4742, which is included in the Linux kernel and already checked by default.
But there also seem to be 4611 and 6124?
Those patches are for kernel 3.x. I don't know whether these errata were fixed in kernel 4.x.
https://www.spinics.net/lists/arm-kernel/msg248078.html
https://www.spinics.net/lists/arm-kernel/msg246560.html

A 4.14.20 image with errata:

CONFIG_PL310_ERRATA_753970

rebooted after < 2 hours. Not that one.

Can you try the three errata fixes from my post above?
754322
754327
764369
Just out of curiosity.

//edit
https://github.com/kaloz/mwlwifi/commit/f4d5f12affbca285b501eb984e4dd9498740e520
https://github.com/kaloz/mwlwifi/issues/270#issuecomment-367972657

Finally got around to generating a test image with:

> diff --git a/target/linux/mvebu/config-4.14 b/target/linux/mvebu/config-4.14
> index 708c4e5..827a1b3 100644
> --- a/target/linux/mvebu/config-4.14
> +++ b/target/linux/mvebu/config-4.14
> @@ -389,8 +389,11 @@ CONFIG_PINCTRL_MVEBU=y
>  CONFIG_PJ4B_ERRATA_4742=y
>  # CONFIG_PL310_ERRATA_588369 is not set
>  # CONFIG_PL310_ERRATA_727915 is not set
> -# CONFIG_PL310_ERRATA_753970 is not set
> +CONFIG_PL310_ERRATA_753970=y
>  # CONFIG_PL310_ERRATA_769419 is not set
> +CONFIG_ARM_ERRATA_754322=y
> +CONFIG_ARM_ERRATA_754327=y
> +CONFIG_ARM_ERRATA_764369=y
>  CONFIG_PLAT_ORION=y
>  CONFIG_PM_OPP=y
>  CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=11

which rebooted after about an hour.

I'm not sure what your intent was with the mwlwifi links, but that issue was introduced to the mwlwifi code long after this issue started. Also, I test this mamba issue with the radios turned off.

That is bad news :-/
But yeah, the PJ4B is a custom CPU design,
and those fixes are for the Cortex-A9.
No one knows exactly which design this CPU is based on.
It could also be the Cortex-A8. Only Marvell knows for sure.

Hasn't this memory corruption issue in mwlwifi been there since the beginning? At least it seems so.
Yes, the commit fixed an issue that was introduced recently.
I'm also not quite sure the problem really is the mwlwifi driver, because other devices don't suffer from this issue. Maybe it is more likely a mwlwifi firmware bug?

Could be many things, even a bad batch of power supplies x)

There is still the gcc patch left to try.

But the stock firmware is stable, yes?

//edit
I downloaded the source from Linksys (kernel 3.x?).
The following is set in their config:
CONFIG_SHEEVA_ERRATA_ARM_CPU_6124=y
CONFIG_SHEEVA_ERRATA_ARM_CPU_PMU_RESET=y

Where has this 6124 erratum fix gone in 4.x? Is it now combined with 4742?
And there is also 4611:
https://lists.infradead.org/pipermail/linux-arm-kernel/2013-May/171475.html

And for the WRT1200 it is:
CONFIG_PJ4B_ERRATA_4742
CONFIG_ARM_ERRATA_720789
No PL310 errata are checked.
But I am leaving mine running with
CONFIG_ARM_ERRATA_754322
CONFIG_ARM_ERRATA_754327
CONFIG_ARM_ERRATA_764369
and with the gcc patch,
because I trust the official ARM doc more than those Marvell devs x)

With just CPU_IDLE disabled and mwlwifi dated 1/30/2018, I have had zero reboots on Mamba, Rango, Shelby or Caiman. I consider this problem solved (not necessarily in trunk, but certainly with a custom build). This applies to both kernel 4.9 and 4.14.

Yes, that is understood; at least, I know I'm aware of that. But for me (currently) that is treating the symptom, and I'm one of those find-out-the-cause kind of people. Getting to the cause is going to take the kind of digging that has been discussed above.

Regarding "not necessarily in trunk": I've been pondering whether an RFC PR might solicit some input, or even a "let's run with it" push, but I know I will not be opening one (wink, wink, nod, nod).

Arch Linux' kernel config for the Solidrun Clearfog (close enough) has these: