[Solved] WRT1900ACV1 reboots: kernel 4.9

I believe there is one core dev that has a mamba, but with no crash log facility...

I have generated quite a few images since ~Sept 3, mostly associated with a kernel push, whether it made its way into the LEDE git or was just from kernel.org, and not one of these images has crashed. Every image previous to that eventually crashed on me, some sooner than others, but always with the same result. I even started generating images in a different OS environment to see if that made a difference (Ubuntu 16.04 -> Arch); still keeps on ticking. Just an update.

  • Has anyone tried gcc > 5? I have not done that on this target yet.
  • I test with radios off; I am going to guess that everyone else has those pesky things turned on. There was someone on IRC reporting a mamba reboot on 17.01.4 (4.4 kernel) when 2.4 GHz was brought up.

As an aside, I had assumed images would be byte identical across platforms, but things churned out by this new Arch environment were a few bytes different.
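For anyone wanting to check this on their own builds, here is a minimal sketch for comparing two images byte for byte. The function name compare_images and the file names in the usage line are made up for illustration; it only relies on cmp, wc and tr.

```shell
#!/bin/sh
# compare_images A B
# Prints "identical" if both files match byte for byte, otherwise the
# number of differing bytes within the length both files share.
compare_images() {
    a="$1"; b="$2"
    if cmp -s "$a" "$b"; then
        echo "identical"
    else
        # cmp -l prints one line per differing byte; it exits non-zero
        # when the files differ, so that status is deliberately ignored.
        n=$( { cmp -l "$a" "$b" 2>/dev/null || true; } | wc -l | tr -d ' ')
        echo "differ: $n byte(s)"
    fi
}

# Usage (hypothetical file names):
#   compare_images ubuntu-build/sysupgrade.bin arch-build/sysupgrade.bin
```

If the count is small, running cmp -l by hand gives the byte offsets, which helps to see which part of the image changed.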

Please try the latest version from my staging tree at https://git.lede-project.org/?p=lede/nbd/staging.git;a=summary
I've pushed an IRQ related fix, maybe it will help with the stability issues.


Another commit might help, I wonder if nbd can upgrade the kernel.

commit be3390d86bc24dc1ceb38e677f8ea2a1cf78d309
Author: Yan Markman <ymarkman@marvell.com>
Date: Sun Oct 16 00:22:32 2016 +0300

ARM: dts: mvebu: pl310-cache disable double-linefill

commit cda80a82ac3e89309706c027ada6ab232be1d640 upstream.

Under heavy system stress mvebu SoC using Cortex A9 sporadically
encountered instability issues.

The "double linefill" feature of L2 cache was identified as causing
dependency between read and write which lead to the deadlock.

Especially, it was the cause of deadlock seen under heavy PCIe traffic,
as this dependency violates PCIE overtaking rule.

Fixes: c8f5a878e554 ("ARM: mvebu: use DT properties to fine-tune the L2 configuration")
Signed-off-by: Yan Markman <ymarkman@marvell.com>
Signed-off-by: Igal Liberman <igall@marvell.com>
Signed-off-by: Nadav Haklai <nadavh@marvell.com>
[gregory.clement@free-electrons.com: reformulate commit log, add Armada
375 and add Fixes tag]
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.9.61
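To see whether a running image actually carries this change, one option is to look at the live device tree. A minimal sketch, assuming the fix is expressed through the arm,double-linefill* properties of the PL310 cache binding and that /proc/device-tree is available on the device; show_linefill_props is a hypothetical helper name.

```shell
#!/bin/sh
# show_linefill_props [ROOT]
# Lists every arm,double-linefill* property below ROOT (default:
# /proc/device-tree) together with its raw bytes in hex.
show_linefill_props() {
    root="${1:-/proc/device-tree}"
    find "$root" -name 'arm,double-linefill*' | sort | while read -r p; do
        # Each property is a single 32-bit big-endian cell.
        val=$(od -An -tx1 "$p" | tr -s ' ' | sed 's/ *$//')
        printf '%s:%s\n' "${p#"$root"/}" "$val"
    done
}

# Usage on the router:
#   show_linefill_props
```

Per the binding, a value of all zeros means the feature is overridden to disabled; no output at all means the properties are not present in that device tree.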

Going to try nbd's commit on a custom -and dirty- build bumped to kernel 4.9.61. Gonna keep you posted.

Patch for bumping to 4.9.61 is already submitted to patchwork, so you can use that.

I have two units upgraded with the IRQ patch and running 4.9.61. Fingers crossed... maintaining 4.4 and 4.9 builds is suboptimal.


I'm throwing boulders at the damn thing, and it doesn't even flinch. Been running CPU-, I/O- and network-intensive tasks for the past 3 hours and it's still going strong. I even changed the static routing table of my whole network to make a giant gigabit packet roller coaster, stressing everything in its path.

I think we finally nailed it, and if we didn't, it's definitely more stable than before and we're heading in the right direction. It didn't even last this long idle on my last attempt running 4.9.

root@net002:~# uptime
 14:47:38 up  3:13,  load average: 8.36, 8.56, 8.74
root@net002:~# btrfs scrub status /mnt/ext.btrfs/
scrub status for ecf9ffd5-ba62-4210-b5da-7be5ec094ab4
    scrub started at Wed Nov 15 12:14:14 2017, running for 02:33:40
    total bytes scrubbed: 913.27GiB with 0 errors
root@net002:~# sensors
armada_thermal-virtual-0
Adapter: Virtual device
temp1:        +68.2°C  


Edit 1: Passing 1 day of uptime, half of it under extreme stress conditions. Looking good! :slight_smile:

Both of mine rebooted in under 48 hours. Running 4.9.61 with IRQ patch.

It doesn't seem to have made much difference. I've had three reboots (that I've noticed) in about the same time period.

Tested with nbd's staging at first then swapped to an official snapshot when the patch was merged+built.

It might just be coincidence, but two of the times were just after I'd walked back into range of the wifi. I heard the fan kick up at boot as I walked in the front door.

This is really beginning to smell like mwlwifi is causing the grief. I have not had a reboot on any image generated since the beginning of September, but as I stated in an earlier post, I am testing with all radios turned off. Currently on 4.9.61 with the IRQ patch, up for 2 days, 6 hours. Any relation to the 225 issue? Someone posted there about the same experience, but with 88W8864 rather than 88W8964.

Crashed an hour ago too, 2 days and 1 hour of uptime :frowning:

The mwlwifi issue seems to produce an OOPS, whereas my reboots don't. I'm not sure if it's related. Anyway, I'm recompiling with the latest mwlwifi (+firmware) to see about that.

Not disagreeing with anyone, but my reboots seemed to happen when I was doing long, large downloads via Ethernet. However, I did have wireless clients connected at the same time. I just know that the 2 times I witnessed the reboot were when I was downloading a 4Gig ISO image via wired.

Passing my uptime record of 2 days and 1 hour as of now, all radios off. Cutting the significant other's iPad off the Internet isn't really WAF compliant but heh, a crash in the middle of her intense Hearthstone game isn't either.

Will keep you posted about my mamba stability with this new factor in play.

Edit 1: Crashed at 2 days 6 hours of uptime, radios off. I think mwlwifi's driver is unrelated.

Edit 2: I asked to re-open #888 as we are still experiencing reboots.

Everything works well with LEDE r5322, kernel 4.9.58. I own a wdr3600 rev.1.5 and a WRT1900ACS v.2!

https://cdn.superwrt.download/firmware/

@oli, So what, none of that has squat to do with this thread...

And things just keep on ticking:

root@bsaedgy:/etc# cat openwrt_version
r5297-bddffc5
root@bsaedgy:/etc# uname -a
Linux bsaedgy 4.9.61 #0 SMP Fri Nov 10 13:53:04 2017 armv7l GNU/Linux
root@bsaedgy:/etc# uptime
 11:04:42 up 5 days, 21:57,  load average: 0.00, 0.02, 0.00

May as well chase the kernel PR of the day...


Perhaps manually modifying this kernel option to prevent the restart?

# Debug Lockups and Hangs
#
# CONFIG_LOCKUP_DETECTOR is not set
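Before changing it, it may be worth confirming what the option is set to in the config that is actually being built or booted. A rough sketch; kconfig_state is a hypothetical helper and the file names in the usage lines are placeholders.

```shell
#!/bin/sh
# kconfig_state FILE OPTION
# Prints the value of OPTION (e.g. "y" or "m"), "not set" if the file
# carries the usual "# OPTION is not set" comment, or "absent" otherwise.
kconfig_state() {
    file="$1"; opt="$2"
    if grep -q "^${opt}=" "$file"; then
        grep "^${opt}=" "$file" | head -n 1 | cut -d= -f2
    elif grep -q "^# ${opt} is not set" "$file"; then
        echo "not set"
    else
        echo "absent"
    fi
}

# Usage (placeholder paths):
#   kconfig_state .config CONFIG_LOCKUP_DETECTOR
# On a running device this only works if the kernel was built with
# CONFIG_IKCONFIG_PROC, which exposes the config as /proc/config.gz:
#   zcat /proc/config.gz > /tmp/running.config
#   kconfig_state /tmp/running.config CONFIG_LOCKUP_DETECTOR
```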

I'm starting to think the same, as each time there is a crash the Ethernet interfaces are under load. My unit is mainly an access point / external backup storage NAS, and nothing goes through the interfaces when idle. Routing is done elsewhere. However, two nearly successive crashes occurred last night while sending my backup to an outside location, generating traffic on Ethernet. Earlier I thought it was a CPU/load-bound bug.

We should look at McDebian and DDWRT to see if they have mvneta implemented and, if so, how they implemented it. They seem unaffected by these reboots under the same kernel versions.

Edit 1:

Nobody's home, radios off, kernel module unloaded and the router just crashed with IRQ fix and 4.9.61.

I did not have the radios enabled back when I was getting the reboots, so I agree it probably is not mwlwifi. The @ListerWRT post above gave no indication as to whether any logging was seen.

In an earlier post @InkblotAdmirer had backed out some patches around mvneta, but still experienced reboots. I have not taken any kind of look into the McDebian build to see what specifics are included around this area, but I do remember some chatter that the image was not experiencing this issue. I am just using the device on a second pipe in my shack, so the throughput load is not great. Pulling it and testing with iperf using a computer on the WAN and LAN ports may provide a hint, I suppose. But I am not doing anything different now than when I was getting the reboots.
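For the iperf idea, something along these lines could be used to push sustained traffic through the router. This is only a sketch: run_iperf_client is a made-up wrapper, the address in the usage lines is a placeholder, and the duration and stream count are arbitrary defaults. It sticks to the -c, -t, -P and -i options.

```shell
#!/bin/sh
# run_iperf_client HOST [SECONDS] [STREAMS]
# Runs an iperf client against HOST. With DRY_RUN=1 the command line is
# printed instead of executed, which is handy for checking the arguments.
run_iperf_client() {
    host="$1"; secs="${2:-600}"; streams="${3:-4}"
    # -t: test duration in seconds, -P: parallel streams,
    # -i: seconds between progress reports.
    set -- iperf -c "$host" -t "$secs" -P "$streams" -i 10
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: start "iperf -s" on the machine behind the LAN port, then from
# the machine on the WAN side (placeholder address):
#   run_iperf_client 192.168.1.100 1800 4
```

Noting the start time of each run makes it easier to line up a reboot with the load afterwards.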

And on it goes:

root@bsaedgy:/etc# cat openwrt_version
r5394-9247864
root@bsaedgy:/etc# uname -a
Linux bsaedgy 4.9.63 #0 SMP Mon Nov 20 16:52:51 2017 armv7l GNU/Linux
root@bsaedgy:/etc# uptime
 12:27:25 up 21:42,  load average: 0.00, 0.00, 0.00

I am now certain that it is either CPU related or network related. The router has been fairly stable the last couple of days, with only one reboot every 48-72 hours or so. But today, I tried 4 times (once this morning, three times just now) to do an rsync of 300+ gigs from my server to the hard drive plugged into the router and, damn... ...a minute after starting the rsync... ...BLAM! Each time. Always, after a minute or so, the router crashes. And if I stop sending it mountains of data, like now for example, it works like a charm.

Edit 1: Plugged in the serial console hoping to catch something, but I have little to no faith in catching it without debugfs.
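Even without a trace on the console, the reboot times themselves can be recorded so they can be matched against load like the rsync runs. A minimal sketch, assuming a writable log on persistent storage (for example the attached hard drive, since /tmp is a tmpfs and is lost on reboot); uptime_secs, heartbeat and the log path are made up for illustration.

```shell
#!/bin/sh
# uptime_secs [FILE]
# Prints whole seconds of uptime read from FILE (default /proc/uptime),
# whose first field looks like "12345.67".
uptime_secs() {
    cut -d. -f1 "${1:-/proc/uptime}"
}

# heartbeat LOG [FILE]
# Appends "<epoch> <uptime>" to LOG. If the uptime is lower than the
# previously logged one, the box rebooted in between and the line is
# tagged with REBOOT.
heartbeat() {
    log="$1"; src="${2:-/proc/uptime}"
    now=$(uptime_secs "$src")
    last=""
    if [ -f "$log" ]; then
        last=$(tail -n 1 "$log" | cut -d' ' -f2)
    fi
    if [ -n "$last" ] && [ "$now" -lt "$last" ]; then
        echo "$(date +%s) $now REBOOT" >> "$log"
    else
        echo "$(date +%s) $now" >> "$log"
    fi
}

# Usage, e.g. from cron once a minute (placeholder path):
#   heartbeat /mnt/ext.btrfs/heartbeat.log
```

The last line before a REBOOT entry narrows the crash down to the logging interval.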

Just throwing this out there... we could try testing DSA on the WRT1900AC v1 from OpenWrt, thus ruling out swconfig or related files, so we can isolate the CPU as the culprit. As to mvneta being the source of the error, I doubt that is the case, as it works fine on all other versions of the same chipset on 4.9 and was rock solid on 4.4.