[Solved] WRT1900ACV1 reboots: kernel 4.9

I doubt it, but...

and putting load on CPU did not yield reboot.

Mine has been fairly stable too for the last couple of days.

root@net002:~# uptime
 11:25:03 up 2 days, 21:54,  load average: 1.63, 1.27, 0.62
root@net002:~# cat /etc/openwrt_version 
r5436+2-18cc8d520c
root@net002:~# uname -a
Linux net002.ncnet.local 4.9.65 #0 SMP Wed Nov 29 16:01:13 2017 armv7l GNU/Linux

Been running the latest mwlwifi-20171129, the latest kernel, and some CPU-related kernel hacks, idle during the day and under heavy load at night. Rock solid so far. Hope I nailed something.

Edit 1: up 4 days now. :slight_smile:

@NainKult
Care to share the kernel hacks or are you going to generate a patch/pull request?

I will as soon as I do it properly. I need to make a "real" 4.9 patch file instead of writing over the kernel config with kernel_menuconfig. :stuck_out_tongue: I also need to pinpoint and separate each disabled function so we can re-enable them one after the other to find the culprit. I was waiting to make sure I didn't go all in, betting on an image that would crash several hours later...

I know I have increased verbosity and enabled some lockup/irq/workqueue/timer debug stuff to try to catch something. The rest is just disabled kernel functions like cpuidle, CPU frequency scaling, and the real time clock (my prime suspects).
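To double-check which of these features a running image actually exposes, something like the following could be run on the router. It is only a sketch: it assumes the usual sysfs layout under /sys/devices/system/cpu, and the `pm_status` name and its optional directory argument are made up for illustration.

```shell
#!/bin/sh
# Report which CPU power-management features the running kernel exposes.
# The optional first argument is a hypothetical override for the sysfs CPU
# directory, added only so the function can be tried against a test tree.
pm_status() {
    cpu_root="${1:-/sys/devices/system/cpu}"

    # cpuidle: this directory is normally only present when the kernel was
    # built with cpuidle support. A driver of "none" means no idle driver
    # is registered.
    if [ -r "$cpu_root/cpuidle/current_driver" ]; then
        echo "cpuidle: driver $(cat "$cpu_root/cpuidle/current_driver")"
    else
        echo "cpuidle: not available"
    fi

    # cpufreq: each CPU gets a cpufreq directory once a scaling driver is active.
    if [ -r "$cpu_root/cpu0/cpufreq/scaling_governor" ]; then
        echo "cpufreq: governor $(cat "$cpu_root/cpu0/cpufreq/scaling_governor")"
    else
        echo "cpufreq: not available"
    fi
}
```

Calling `pm_status` with no argument reads the live sysfs tree. On an image with both features compiled out, both lines should read "not available".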

Maybe it's not one of these functions; maybe it's the added latency from the slight overhead of the numerous debug flags that keeps the chip from going crazy. Maybe this image was stable because of cosmic rays or whatever.

If you want the ugly version for the kernel, it's right here: less_ugly_4_9_kernel_roulette.diff
And for mwlwifi: mvebu_bump_mwlwifi-20171129.diff

Edit 1: Reworked the kernel patch so it can be merged on a clean tree, shortened the URLs, and added some markdown thingy so it doesn't hurt the eyes too much.
Update on my last build, up 5 days, 3 hours, 22 minutes :smiley:

Yeah, I've been up for 3 days now, and 2 days before that, with a manual reboot in between. I just disabled a few debugging interfaces (not like they help with this issue anyway) and disabled cpuidle; cpufreq is enabled but switched to the schedutil governor (I've had random reboot issues with different governor/kernel combinations on Android phones, so I'm applying that here).
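For anyone who wants to try a different governor on a running system before rebuilding, a rough sketch could look like this. It relies on the standard cpufreq sysfs files (`scaling_governor`, `scaling_available_governors`); the `set_governor` name and its second argument are hypothetical, and the change does not survive a reboot.

```shell
#!/bin/sh
# Switch every CPU to the given cpufreq governor at runtime.
# $1: governor name, e.g. schedutil
# $2: hypothetical override for the sysfs CPU directory (for testing only)
set_governor() {
    gov="$1"
    cpu_root="${2:-/sys/devices/system/cpu}"

    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip CPUs without cpufreq support (or an unmatched glob).
        [ -w "$f" ] || continue

        # Only switch if the kernel lists the governor as available.
        avail="$(dirname "$f")/scaling_available_governors"
        if grep -qw "$gov" "$avail" 2>/dev/null; then
            echo "$gov" > "$f"
        else
            echo "governor $gov is not offered in $avail" >&2
            return 1
        fi
    done
    return 0
}
```

On the router this would be called as `set_governor schedutil`. If the governor was not built into the kernel, the function refuses to write and returns non-zero.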

I also enabled the fastpath patch in the hope that it might bypass whatever is causing the issue, and "compiled for performance".

edit: Nearly 4 days for me :slight_smile: I suspect disabling cpuidle has been the main contributor to stability
https://git.realms.tech/JTRealms/lede-wrt1900v1.git
build: https://git.realms.tech/JTRealms/lede-wrt1900v1/src/master/bin/targets/mvebu/generic

btw, using default mwlwifi driver (kmod-mwlwifi_4.9.65+10.3.4.0-20171011-1_arm_cortex-a9_vfpv3.ipk)

I am using a WRT1200AC (v2, I think) with David's build on it. I have had 3 reboots over the last two days, since I updated to r5422. I was running r5297 before, and it was rock stable: no reboots since I bought it a month ago and put r5297 directly on it. r5297 had wifi version kmod-mwlwifi - 4.9.58+10.3.4.0-20170810-1, and right now I am using kmod-mwlwifi_4.9.65+10.3.4.0-20171129-1. One of the three crashes happened at the exact moment I opened LuCI in the web browser over wifi ( https://10.0.0.199/cgi-bin/luci/admin/system/packages ), so maybe it is CPU load related? The others happened while I was just browsing the web or at some other random point.

I wasn't even aware that the WRT1200AC had this problem. I can do a build for you with the above config if you would like to test. I'm nearing 6 days of uptime, so it would be interesting to see if it helps the 1200 as well.

What are your changes, @JTRealms? Do I understand correctly that you disabled cpuidle? I actually read before that it is bugged on this CPU, but that it was already disabled by default.

My dmesg also mentions this during boot:

[ 0.029886] cpuidle: using governor ladder
[ 0.030062] mvebu-pmsu: CPU hotplug support is currently broken on Armada 38x: disabling
[ 0.030070] mvebu-pmsu: CPU idle is currently broken on Armada 38x: disabling

So isn't it already disabled?
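A quick way to see whether a given device prints that message is to filter the kernel log for it. The sketch below matches the exact mvebu-pmsu line quoted above; the `pmsu_idle_disabled` function name is made up for illustration.

```shell
#!/bin/sh
# Reads kernel log text on stdin and succeeds (exit 0) only if mvebu-pmsu
# reported that it disabled CPU idle, as in the boot messages quoted above.
pmsu_idle_disabled() {
    grep -q 'mvebu-pmsu: CPU idle is currently broken.*disabling'
}
```

Typical use would be `dmesg | pmsu_idle_disabled && echo "cpuidle disabled by the PMSU driver"`. A device whose boot log lacks the line falls through silently, which says nothing about whether cpuidle is safe there.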

Do you get any crash logs generated? I assumed the WRT1200 was using the Armada XP, but it actually uses the same SoC found in the WRT1900ACv2, which doesn't have this issue. The reboot issue on the 1900ACv1 specifically doesn't print any debugging or crash info. If yours does, it's probably a different issue?

@JTRealms How would I see/save crash dumps/logs if the router reboots? Is there a way of saving them somewhere before the reboot?

Mine is logging to a USB drive and it's never recorded anything of use. I've just kept it that way on the off chance it ever does.

@ListerWRT That is what I thought too. If a kernel panic happens, how would it still be able to log anything through a driver, to a USB disk or the network or whatever? How does Windows do it, though, when it writes a kernel dump to disk on a bluescreen? I guess it has its own little space in RAM with a BSOD kernel plus a separate disk driver that can do it? Does Linux have something like this too? But not on a small device like a router, I guess?

debugfs dump, but not on this target.

Edit: @mfka8, No, see discussion in posts back around the time-frame of the post I linked.

What do you mean "but not on this target" @anomeome ? So is there a way to get the logs, or not? No Howto on how to set this up somewhere for Lede? There must be a way to debug crash reboots, no?

Not implemented or not functional.

Your patch is for the 3.18 kernel. A lot has changed since then in the ARM ecosystem. I doubt it is still disabled, as I don't have this mvebu-pmsu debug message appearing. I also don't have a file with that name in my patchwork directory.

However, I am now almost certain our grief is caused either by CPU frequency scaling or by the CPU Idle driver. With both of these functions disabled, my router is now stable and has not experienced a spontaneous reboot since. It just runs a little hotter.

I also have the RTC kernel support disabled as per @anomeome thread: mvebu-rtc-can-be-wrong

And why do I have those messages then with my device @NainKult and kernel 4.9.65 during boot if it is just for 3.18 like you claim? Also, who cares if some work was done since then? You see, it is still broken.

Not for Armada XP.

Because, as stated earlier by you and JTRealms, you have a WRT1200AC, codename cobra, running on the Marvell Armada 385 88F6820 chip, in a thread about an issue impacting the WRT1900ACv1, codename mamba, running on the Marvell Armada XP MV78230 chip. So the information will not be accurate or applicable for your device. Stop being condescending.

Also, your patch for 3.18 is merged upstream (that's why there is no patch in my patchwork dir).

But thank you for being ~~wrong~~ misguided (really), as it caused me to look closer at the Power Management Service Unit of the device, which may be broken for marvell,armadaxp as well. Currently, CPU Idle disables itself only if it detects marvell,armada380. Will dig into that.
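Which side of that check a device falls on can be read from the device tree the kernel booted with. The sketch below assumes the usual /proc/device-tree/compatible file (a list of NUL-separated strings); the two function names and the optional path argument are hypothetical.

```shell
#!/bin/sh
# Print the machine's device-tree "compatible" strings, one per line.
# $1: hypothetical override for the compatible file (for testing only)
dt_compatible() {
    tr '\0' '\n' < "${1:-/proc/device-tree/compatible}"
}

# Succeeds only if the machine claims compatibility with marvell,armada380,
# the string the post above says the CPU Idle disable is keyed on.
is_armada38x() {
    dt_compatible "$1" | grep -qx 'marvell,armada380'
}
```

On a router, `dt_compatible` lists the strings and `is_armada38x || echo "no automatic cpuidle disable here"` shows whether the workaround would trigger. Going by the posts above, a mamba should not match, but that is worth confirming on the device itself.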

I know that the WRT1200AC V2 has a different CPU... the same as the WRT3200AC and 1900AC v2, I think. But I thought my reboots might be related to this issue, because I don't see another thread for reboot issues. The following actually happened today and the router is still running, but I guess it is part of the problem and would sooner or later lead to a catastrophic state which the router can't handle anymore. Any thoughts on what could cause these processes to crash?

[177170.508736] BUG: Bad page map in process tinyproxy  pte:1b06d7dd pmd:1cbe3831
[177170.516009] page:dff59da0 count:0 mapcount:-1 mapping:  (null) index:0x0
[177170.522837] flags: 0x10(dirty)
[177170.526052] page dumped because: bad pte
[177170.530081] addr:01421000 vm_flags:00100073 anon_vma:dcbb2540 mapping:  (null) index:1421
[177170.538390] file:  (null) fault:  (null) mmap:  (null) readpage:  (null)
[177170.545209] CPU: 1 PID: 6474 Comm: tinyproxy Not tainted 4.9.65 #0
[177170.551501] Hardware name: Marvell Armada 380/385 (Device Tree)
[177170.557549] [<c0016010>] (unwind_backtrace) from [<c0012220>] (show_stack+0x10/0x14)
[177170.565420] [<c0012220>] (show_stack) from [<c0218580>] (dump_stack+0x7c/0x9c)
[177170.572769] [<c0218580>] (dump_stack) from [<c00b5140>] (print_bad_pte+0x154/0x18c)
[177170.580549] [<c00b5140>] (print_bad_pte) from [<c00b7394>] (unmap_page_range+0x4fc/0x554)
[177170.588851] [<c00b7394>] (unmap_page_range) from [<c00b781c>] (zap_page_range+0xd0/0x174)
[177170.597156] [<c00b781c>] (zap_page_range) from [<c00c468c>] (SyS_madvise+0x58c/0x7e8)
[177170.605111] [<c00c468c>] (SyS_madvise) from [<c000ed40>] (ret_fast_syscall+0x0/0x3c)
[177170.612984] Disabling lock debugging due to kernel taint
[177170.618830] BUG: Bad rss-counter state mm:dc174700 idx:0 val:-1
[177170.624899] BUG: Bad rss-counter state mm:dc174700 idx:1 val:1
[178472.320439] BUG: Bad page state in process swapper/0  pfn:1b06d
[178472.326478] page:dff59da0 count:-1 mapcount:-1 mapping:  (null) index:0x0
[178472.333393] flags: 0x10(dirty)
[178472.336545] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
[178472.342837] bad because of flags: 0x10(dirty)
[178472.347296] Modules linked in: pppoe ppp_async pppox ppp_generic nf_nat_pptp nf_conntrack_pptp nf_conntrack_ipv6 iptable_nat ipt_REJECT ipt_MASQUERADE xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_quota xt_policy xt_pkttype xt_physdev xt_owner xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_esp xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_addrtype xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY usblp ums_usbat ums_sddr55 ums_sddr09 ums_karma ums_jumpshot ums_isd200 ums_freecom ums_datafab ums_cypress ums_alauda ts_fsm ts_bm slhc rfcomm nf_reject_ipv4 nf_nat_tftp nf_nat_snmp_basic nf_nat_sip nf_nat_redirect nf_nat_proto_gre nf_nat_masquerade_ipv4 nf_nat_irc nf_conntrack_ipv4 nf_nat_ipv4 nf_nat_h323 nf_nat_amanda
[178472.419066]  nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_tftp nf_conntrack_snmp nf_conntrack_sip nf_conntrack_rtcache nf_conntrack_proto_gre nf_conntrack_irc nf_conntrack_h323 nf_conntrack_broadcast ts_kmp nf_conntrack_amanda iptable_mangle iptable_filter ipt_ah ipt_ECN ip_tables hidp hci_uart crc_ccitt btusb btmrvl_sdio btmrvl btintel br_netfilter bnep bluetooth fuse sch_cake em_nbyte cls_basic sch_dsmark sch_pie sch_gred sch_teql act_ipt em_text em_meta sch_codel sch_sfq sch_fq act_police sch_prio em_cmp sch_red act_connmark nf_conntrack act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_tbf sch_htb sch_hfsc sch_ingress hid evdev input_core mwlwifi mac80211 cfg80211 compat cryptodev xt_set ip_set_list_set ip_set_hash_netiface ip_set_hash_netport ip_set_hash_netnet
[178472.490579]  ip_set_hash_net ip_set_hash_netportnet ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables msdos bonding ifb tun vfat fat ntfs nls_utf8 nls_iso8859_1 nls_cp437 regmap_mmio sha512_generic sha256_generic seqiv jitterentropy_rng drbg md5 hmac ghash_generic gf128mul gcm ecb ctr cmac cbc authenc ohci_pci uhci_hcd ohci_platform ohci_hcd gpio_button_hotplug
[178472.542862] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G    B           4.9.65 #0
[178472.550114] Hardware name: Marvell Armada 380/385 (Device Tree)
[178472.556162] [<c0016010>] (unwind_backtrace) from [<c0012220>] (show_stack+0x10/0x14)
[178472.564034] [<c0012220>] (show_stack) from [<c0218580>] (dump_stack+0x7c/0x9c)
[178472.571383] [<c0218580>] (dump_stack) from [<c009c000>] (bad_page+0x100/0x138)
[178472.578727] [<c009c000>] (bad_page) from [<c009df98>] (get_page_from_freelist+0x638/0x658)
[178472.587115] [<c009df98>] (get_page_from_freelist) from [<c009e3f8>] (__alloc_pages_nodemask+0xe8/0xa10)
[178472.596635] [<c009e3f8>] (__alloc_pages_nodemask) from [<c009edb8>] (__alloc_page_frag+0x34/0x14c)
[178472.605725] [<c009edb8>] (__alloc_page_frag) from [<c0399c80>] (netdev_alloc_frag+0x24/0x34)
[178472.614291] [<c0399c80>] (netdev_alloc_frag) from [<c03c6868>] (hwbm_pool_refill+0x18/0x68)
[178472.622771] [<c03c6868>] (hwbm_pool_refill) from [<c031e3a0>] (mvneta_poll+0x470/0x970)
[178472.630903] [<c031e3a0>] (mvneta_poll) from [<c03a76f4>] (net_rx_action+0xe8/0x2ac)
[178472.638684] [<c03a76f4>] (net_rx_action) from [<c002d364>] (__do_softirq+0xd0/0x204)
[178472.646550] [<c002d364>] (__do_softirq) from [<c002d71c>] (irq_exit+0x94/0xb8)
[178472.653887] [<c002d71c>] (irq_exit) from [<c0062154>] (__handle_domain_irq+0x90/0xb4)
[178472.661840] [<c0062154>] (__handle_domain_irq) from [<c0009428>] (gic_handle_irq+0x50/0x94)
[178472.670316] [<c0009428>] (gic_handle_irq) from [<c0012c8c>] (__irq_svc+0x6c/0x90)
[178472.677916] Exception stack(0xc062ff60 to 0xc062ffa8)
[178472.683077] ff60: 00000001 00000000 00000000 c001b1a0 00000000 c062e000 c0630fe4 00000001
[178472.691377] ff80: c062c168 00000000 c062ffb8 00000001 00000000 c062ffb0 c000f808 c000f80c
[178472.699675] ffa0: 60000013 ffffffff
[178472.703268] [<c0012c8c>] (__irq_svc) from [<c000f80c>] (arch_cpu_idle+0x2c/0x38)
[178472.710791] [<c000f80c>] (arch_cpu_idle) from [<c005b4a4>] (cpu_startup_entry+0xf0/0x19c)
[178472.719096] [<c005b4a4>] (cpu_startup_entry) from [<c05e8c54>] (start_kernel+0x39c/0x420)
[187564.266647] swap_free: Bad swap file entry 00000c00
[187564.271683] BUG: Bad page map in process grep  pte:00060000 pmd:1cbe3831
[187564.278525] addr:00021000 vm_flags:00000875 anon_vma:  (null) mapping:df068b94 index:11
[187564.286663] file:busybox fault:filemap_fault mmap:generic_file_readonly_mmap readpage:squashfs_readpage
[187564.296199] CPU: 0 PID: 4974 Comm: grep Tainted: G    B           4.9.65 #0
[187564.303276] Hardware name: Marvell Armada 380/385 (Device Tree)
[187564.309321] [<c0016010>] (unwind_backtrace) from [<c0012220>] (show_stack+0x10/0x14)
[187564.317193] [<c0012220>] (show_stack) from [<c0218580>] (dump_stack+0x7c/0x9c)
[187564.324539] [<c0218580>] (dump_stack) from [<c00b5140>] (print_bad_pte+0x154/0x18c)
[187564.332319] [<c00b5140>] (print_bad_pte) from [<c00b718c>] (unmap_page_range+0x2f4/0x554)
[187564.340621] [<c00b718c>] (unmap_page_range) from [<c00b773c>] (unmap_vmas+0x44/0x54)
[187564.348487] [<c00b773c>] (unmap_vmas) from [<c00bc0fc>] (exit_mmap+0xc0/0x1bc)
[187564.355829] [<c00bc0fc>] (exit_mmap) from [<c0026ee8>] (mmput+0x38/0xf4)
[187564.362649] [<c0026ee8>] (mmput) from [<c002b650>] (do_exit+0x354/0x838)
[187564.369468] [<c002b650>] (do_exit) from [<c002cc64>] (do_group_exit+0x48/0xd0)
[187564.376810] [<c002cc64>] (do_group_exit) from [<c002ccfc>] (__wake_up_parent+0x0/0x18)
[187564.385112] BUG: Bad rss-counter state mm:ddb41880 idx:2 val:-1
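A long dump like this is easier to compare between boots when it is boiled down to a tally. The following is a minimal sketch that counts `BUG:` lines in a kernel log and groups the ones that name a process; the `bug_summary` name is made up, and it only looks at the line formats visible in the dump above.

```shell
#!/bin/sh
# Reads kernel log text on stdin, counts "BUG:" lines, and tallies those
# of the form "... in process <name> ..." per process name.
bug_summary() {
    awk '
        /BUG: / {
            total++
            # "in process " is 11 characters; the name runs up to the next space.
            if (match($0, /in process [^ ]+/)) {
                proc = substr($0, RSTART + 11, RLENGTH - 11)
                count[proc]++
            }
        }
        END {
            printf "total BUG lines: %d\n", total
            for (p in count) printf "%s: %d\n", p, count[p]
        }'
}
```

Usage would be `dmesg | bug_summary`, or feeding it a saved log file. Lines such as the rss-counter ones count toward the total but not toward any process, since they do not name one.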

Went from 10 minute reboots to hour + uptime. I didn't enable all the debug fluff though.

If anyone lurking in this thread is still experiencing reboots with their mamba, please try the patch less_ugly_4_9_kernel_roulette.diff or this build I made (last patched snapshot + LuCI) and come back with some feedback. I think we found the culprit.

If you don't want to use either one of them, you can just disable these functions in kernel_menuconfig.
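After going through kernel_menuconfig, it can be worth confirming that the options really ended up off in the resulting kernel .config. The sketch below is one way to do that. The `kconfig_disabled` helper is hypothetical, and the symbol names used in the usage note (CPU_IDLE, CPU_FREQ, RTC_CLASS) are the generic Kconfig switches for those subsystems, which is an assumption about what was disabled and not taken from the patch.

```shell
#!/bin/sh
# Check that the given Kconfig symbols are not enabled in a kernel .config.
# $1: path to the .config file; remaining args: symbol names without CONFIG_
# Returns non-zero if any symbol is still set to y or m.
kconfig_disabled() {
    cfg="$1"
    shift
    rc=0
    for sym in "$@"; do
        # A disabled symbol is either absent or "# CONFIG_X is not set".
        if grep -q "^CONFIG_${sym}=[ym]" "$cfg"; then
            echo "CONFIG_${sym} is still enabled"
            rc=1
        else
            echo "CONFIG_${sym} is disabled"
        fi
    done
    return $rc
}
```

For example: `kconfig_disabled path/to/linux-4.9.65/.config CPU_IDLE CPU_FREQ RTC_CLASS`, where the path is a placeholder for wherever the build tree keeps the kernel config.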

Edit 1: Link refreshed with updated build.


Disclaimer: This will disable all your power management features. Your router will draw more power and inherently heat up much more. Make sure your device is well ventilated. I am providing this build without any guarantee; use it at your own risk. The link to the build will expire in 10 days.