[Solved] WRT1900ACV1 reboots: kernel 4.9

I got confirmation kernel 4.14 will be the next kernel pushed to master. We should probably start debugging for this kernel.

My devices consistently reboot with 4.9. I have tested more recent kernels (4.9.51) and the issue is still there. I have even tested versions with CESA and BM disabled, to no avail.

Between 4.9 and 4.4 there were changes in ARM multiprocessor concurrency, which would be my next guess (barring any crash logs).

Did we find a way to mitigate the issue, or will V1 users have to stick to 4.4 and custom builds for a while?

By the way, it looks like a hard lock-up. No logs are sent to the logging server and nothing shows on the serial console (U-Boot just appears magically out of nowhere).
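Since a hard lock-up leaves no logs, one way to at least learn how long a unit lasted is to record its uptime periodically somewhere that survives the reboot. The following is only a minimal sketch: the helper names `format_uptime` and `record_uptime` and the output path are hypothetical, and it assumes a writable location that tolerates frequent writes (for example USB storage rather than the router's flash).

```shell
# Hypothetical helper: turn a seconds count into "Nd Nh Nm".
format_uptime() {
    secs="$1"
    echo "$((secs / 86400))d $((secs % 86400 / 3600))h $((secs % 3600 / 60))m"
}

# Hypothetical helper: write the current uptime to a file and flush it,
# so the last value written before a silent reboot is still there afterwards.
record_uptime() {
    out="$1"
    # /proc/uptime starts with the uptime in seconds, e.g. "194400.12 ..."
    secs=$(cut -d. -f1 /proc/uptime)
    format_uptime "$secs" > "$out"
    sync
}
```

Calling `record_uptime /mnt/usb/last-uptime` (placeholder path) from a cron job every minute would leave the approximate uptime at the moment of the crash in that file.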

We? You?

To my knowledge, the problem is so far unsolved.

It looks like there aren't any core developers with a V1 and so nobody knowledgeable enough really looks for the problem. I have seen no movement regarding that for several months.

I believe there is one core dev that has a mamba, but with no crash log facility...

I have generated quite a few images since ~Sept 3, mostly associated with a kernel push, whether it made its way into the LEDE git or was just from kernel.org, and not one of these images has crashed. Every image before that eventually crashed on me, some sooner than others, but always with the same result. I even started generating images in a different OS environment to see if that made a difference (Ubuntu 16.04 -> Arch); it still keeps on ticking. Just an update.

  • Has anyone tried gcc > 5? I have not done that on this target yet.
  • I test with radios off; I am going to guess that everyone else has those pesky things turned on. Someone on IRC reported a mamba reboot on 17.01.4 (4.4 kernel) when 2.4 GHz was brought up.

As an aside, I had assumed images would be byte identical across platforms, but things churned out by this new Arch environment were a few bytes different.
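For anyone wanting to check this on their own builds, a minimal sketch of a byte-level comparison follows. The helper names and the image file names in the usage line are hypothetical; only the standard `cmp`, `wc` and `tr` tools are used.

```shell
# Hypothetical helper: report whether two files are byte-identical.
# A missing or unreadable file is also reported as "differ".
compare_images() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "differ"
    fi
}

# Hypothetical helper: count the byte positions that differ.
# cmp -l prints one line per differing byte (up to the shorter file's length).
count_diff_bytes() {
    cmp -l "$1" "$2" 2>/dev/null | wc -l | tr -d ' '
}
```

Usage would look like `compare_images ubuntu-build.img arch-build.img`, followed by `count_diff_bytes` on the same pair if they differ.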

Please try the latest version from my staging tree at https://git.lede-project.org/?p=lede/nbd/staging.git;a=summary
I've pushed an IRQ related fix, maybe it will help with the stability issues.


Another commit might help; I wonder if nbd can bump the kernel.

commit be3390d86bc24dc1ceb38e677f8ea2a1cf78d309
Author: Yan Markman <ymarkman@marvell.com>
Date: Sun Oct 16 00:22:32 2016 +0300

ARM: dts: mvebu: pl310-cache disable double-linefill

commit cda80a82ac3e89309706c027ada6ab232be1d640 upstream.

Under heavy system stress mvebu SoC using Cortex A9 sporadically
encountered instability issues.

The "double linefill" feature of L2 cache was identified as causing
dependency between read and write which lead to the deadlock.

Especially, it was the cause of deadlock seen under heavy PCIe traffic,
as this dependency violates PCIE overtaking rule.

Fixes: c8f5a878e554 ("ARM: mvebu: use DT properties to fine-tune the L2 configuration")
Signed-off-by: Yan Markman <ymarkman@marvell.com>
Signed-off-by: Igal Liberman <igall@marvell.com>
Signed-off-by: Nadav Haklai <nadavh@marvell.com>
[gregory.clement@free-electrons.com: reformulate commit log, add Armada
375 and add Fixes tag]
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.9.61

Going to try nbd's commit on a custom -and dirty- build bumped to kernel 4.9.61. Gonna keep you posted.

The patch for bumping to 4.9.61 is already submitted to patchwork, so you can use that.

I have two units upgraded with the IRQ patch and running 4.9.61. Fingers crossed... maintaining 4.4 and 4.9 builds is suboptimal.


I'm throwing boulders at the damn thing, and it doesn't even flinch. I've been running CPU-, I/O- and network-intensive tasks for the past 3 hours and it's still going strong. I even changed the static routing table of my whole network to make a giant gigabit packet roller coaster, stressing everything in its path.

I think we finally nailed it, and if we didn't, it's definitely more stable than before and we're heading in the right direction. It didn't even last this long idle on my last attempt running 4.9.

root@net002:~# uptime
 14:47:38 up  3:13,  load average: 8.36, 8.56, 8.74
root@net002:~# btrfs scrub status /mnt/ext.btrfs/
scrub status for ecf9ffd5-ba62-4210-b5da-7be5ec094ab4
    scrub started at Wed Nov 15 12:14:14 2017, running for 02:33:40
    total bytes scrubbed: 913.27GiB with 0 errors
root@net002:~# sensors
armada_thermal-virtual-0
Adapter: Virtual device
temp1:        +68.2°C  


Edit 1: Passing 1 day of uptime, half of it under extreme stress conditions. Looking good! :)

Both of mine rebooted in under 48 hours. Running 4.9.61 with IRQ patch.

It doesn't seem to have made much difference. I've had three reboots (that I've noticed) in about the same time period.

Tested with nbd's staging at first then swapped to an official snapshot when the patch was merged+built.

It might just be coincidence, but two of the times were just after I'd walked back into range of the wifi. I heard the fan kick up at boot as I walked in the front door.

This is really beginning to smell like mwlwifi is causing the grief. I have not had a reboot on any image generated since the beginning of September, but as I stated in an earlier post, I am testing with all radios turned off. Currently on 4.9.61 with the IRQ patch, up for 2 days, 6 hours. Any relation to issue 225? Someone posted there about the same experience, but with the 88W8864 rather than the 88W8964.

Crashed an hour ago too, 2 days and 1 hour of uptime :(

The mwlwifi issue seems to produce an OOPS, whereas my reboots don't. I'm not sure if it's related. Anyway, I'm recompiling with the latest mwlwifi (+firmware) to see about that.

Not disagreeing with anyone, but my reboots seemed to happen when I was doing long, large downloads via Ethernet. However, I did have wireless clients connected at the same time. I just know that the 2 times I witnessed the reboot were while I was downloading a 4 GB ISO image over the wired link.

Passing my uptime record of 2 days and 1 hour as of now, all radios off. Cutting the significant other's iPad off the Internet isn't really WAF compliant but heh, a crash in the middle of her intense Hearthstone game isn't either.

Will keep you posted about my mamba stability with this new factor in play.

Edit 1: Crashed at 2 days 6 hours of uptime, radios off. I think mwlwifi's driver is unrelated.

Edit 2: I asked to re-open #888 as we are still experiencing reboots.

Everything works well with LEDE r5322, kernel 4.9.58. I own a WDR3600 rev. 1.5 and a WRT1900ACS v2!

https://cdn.superwrt.download/firmware/

@oli, so what? None of that has squat to do with this thread...

And things just keep on ticking:

root@bsaedgy:/etc# cat openwrt_version
r5297-bddffc5
root@bsaedgy:/etc# uname -a
Linux bsaedgy 4.9.61 #0 SMP Fri Nov 10 13:53:04 2017 armv7l GNU/Linux
root@bsaedgy:/etc# uptime
 11:04:42 up 5 days, 21:57,  load average: 0.00, 0.02, 0.00

May as well chase the kernel PR of the day...


Perhaps manually changing this kernel option would prevent the restart?

# Debug Lockups and Hangs
#
# CONFIG_LOCKUP_DETECTOR is not set
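Before changing the option, it may help to confirm how it is actually set in the config a given image was built from. This is only a rough sketch: `kconfig_state` is a hypothetical helper, and the config file path in the usage line is a placeholder for wherever your build keeps its kernel config.

```shell
# Hypothetical helper: print how a kernel option is set in a .config-style file.
kconfig_state() {
    opt="$1"
    file="$2"
    if grep -q "^${opt}=" "$file"; then
        # Option has a value (y, m, or a string/number); show the last definition.
        grep "^${opt}=" "$file" | tail -n 1
    elif grep -q "^# ${opt} is not set" "$file"; then
        echo "${opt} is not set"
    else
        echo "${opt} not mentioned"
    fi
}
```

For example, `kconfig_state CONFIG_LOCKUP_DETECTOR path/to/kernel/.config` would print `CONFIG_LOCKUP_DETECTOR is not set` for the snippet quoted above.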