[Solved] WRT1900ACV1 reboots: kernel 4.9

Hi and thanks to everyone for trying to sort this out.
Here are some additional data points from my side:

Environment: WRT-1900AC V1 (mamba)

  1. Have been attempting to run various kernel 4.9 releases for many months. All releases ultimately fail with a reboot within 48 hours. No error logging occurs at the time of the reboot. The router is not heavily loaded.
  2. Kernel version 4.4 has run rock solid on same WRT1900ACv1 for many months without errors.
  3. I compiled a new kernel 4.9.65 last week with "CPU IDLE" and "CPUFREQ SCALING" both disabled in the kernel config. The router has now been running for over 8 days without reboots or errors.
  4. Today I have recompiled a new kernel with "CPUFREQ SCALING" re-enabled and left CPU_IDLE still disabled in kernel config. Will test for one week to see the results. I will report back.

Based on some anecdotal evidence, I'm speculating that the problem may arise when the CPU comes out of an idle state. On a couple of occasions my router was pretty much idle when I logged into LuCI. As soon as LuCI started, the router rebooted.
The Marvell doc shows the Armada XP CPU has 3 possible states (IDLE, DEEP IDLE, and SLEEP). Each progressive state is more power efficient but takes longer to restart. I'm not sure which state we are actually using in this 4.9 kernel. Maybe someone could give us some insight into how this all works.
Thanks much.
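For what it's worth, when a cpuidle driver is active the registered states are visible in sysfs, so you can check which states the running 4.9 kernel actually uses. A hedged sketch using the standard Linux sysfs paths; as far as I know, the mvebu v7 driver names its states "MV CPU IDLE" and "MV CPU DEEP IDLE", but verify on your own box:

```shell
# List the idle states the running kernel registered for cpu0.
# The cpuidle directory is absent when CONFIG_CPU_IDLE is disabled
# or no driver has bound.
d=/sys/devices/system/cpu/cpu0/cpuidle
if [ -d "$d" ]; then
    for s in "$d"/state*; do
        printf '%s: %s (%s)\n' "${s##*/}" "$(cat "$s/name")" "$(cat "$s/desc")"
    done
else
    echo "no cpuidle states registered"
fi
```

On a build with CPU_IDLE disabled you should see the "no cpuidle states registered" case, which is a quick way to confirm the config change actually took.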


@anomeome Total nonsense. I asked something different a few posts above. He also offered to build an image for the wrt1200ac, but he no longer answers about it. The kernel log still has a line like this: "[178472.710791] [] (arch_cpu_idle) from [] (cpu_startup_entry+0xf0/0x19c)" -- why, if cpu_idle is disabled? It seems to me the issue is related to it; apparently all Armada CPUs have this issue, not just the wrt1900v1 with the XP CPU. The reboots I have seen also point that way: they happened under heavy CPU load or on a change from idle to load. For example, one happened the second I opened LuCI in the web browser => CPU load.

@mfka8 I really suggest you start a new thread. You obviously have the knowledge and came to us with the right bits of information (backtrace, information about your env, steps to reproduce), but hijacking an existing thread is a bad way to do it. Most of the "forum guys" don't like that. Frankly, I don't care; your issue is somewhat related, but you'll get little to no feedback from this particular thread, as most of us don't even own a WRT1200AC. I'm not deliberately ignoring you and I hold no grudge against you. I don't know what you expect of me, so I can't answer.

I am pretty sure I never said that, and I'm too lazy to dig up my old posts. Feel free to quote me if I did. I will make my next build for all platforms, but you should try to ask nicely; it works most of the time (heck, it even works when you're not).

I have good news for you :slight_smile:

root@net002:~# uptime
 15:14:56 up 5 days, 17:05,  load average: 0.00, 0.00, 0.00
root@net002:~# cat /etc/openwrt_version
r5493+3-b8220883fd

Edit 2: @mfka8 Previous link has been updated with a build for all mvebu targets. This build has not been tested at all; use at your own risk.

It was JTRealms who offered that, so I apologize; I thought it was you because of your patch file. If you find the time, @NainKult, please post a patch file for the wrt1200 too. Or is your "kernel roulette" patch file not limited to Armada XP CPUs? That is, are the settings you changed in the patch file global, and would they work out of the box for the other models (Armada 385) too?

Patch is common to all mvebu targets

I am actually starting to believe, here. I removed just the 3 CPU_IDLE configs and the 5 RTC configs, and one mamba has been up for 3 1/2 days.

I just now flashed a 2nd mamba using a config with just the CPU_IDLE config flags unset.

I can't tell the difference in CPU temps with and without the patches. I flashed a Shelby and Caiman device as well, and the same -- these CPUs run hot whether CPU_IDLE is enabled or not.

This will be nice -- the wireless seems to play better with 4.9 than with 4.4 -- transferring large files I see peaks of ~60MB/s, and iperf approaches 600 Mb/s. Not to mention being able to build with just one config.

This, assuming that community build got it right.

@mfka8, you continue to conflate what are, imo, two very different issues that manifest in very different ways. As far as I can tell, you are the only one seeing whatever the issue is that you are reporting. I have not seen anything untoward occurring from an image running on a rango.

Is there a way to get the CPU and, more importantly, RAM clock values? I am wondering whether the RAM clock and/or voltage values are, for whatever reason, incorrect for the model, maybe since a specific kernel version. That may also be why disabling cpuidle and cpufreq seems to help some people.
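Partly, yes: with cpufreq enabled, the CPU clock (though not the RAM clock) is exposed through sysfs. A sketch assuming the standard Linux cpufreq paths; as far as I know the DDR clock on these SoCs is not exposed there and would have to be read from the SDRAM controller registers instead:

```shell
# Print the current CPU clock and governor from sysfs. The cpufreq
# directory only exists when CPU_FREQ is enabled in the kernel and a
# driver such as cpufreq-dt has bound.
f=/sys/devices/system/cpu/cpu0/cpufreq
if [ -d "$f" ]; then
    echo "current:  $(cat "$f/scaling_cur_freq") kHz"
    echo "min/max:  $(cat "$f/scaling_min_freq")/$(cat "$f/scaling_max_freq") kHz"
    echo "governor: $(cat "$f/scaling_governor")"
else
    echo "cpufreq not available on this kernel"
fi
```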

I'm testing the patch NainKult provided for a v1 build.

Are these the correct parameters or answers to the questions? Thanks ----

  • CPU Frequency scaling

CPU Frequency scaling (CPU_FREQ) [Y/n/?] y
CPU frequency transition statistics (CPU_FREQ_STAT) [Y/n/?] y
CPU frequency transition statistics details (CPU_FREQ_STAT_DETAILS) [N/y/?] n
Default CPUFreq governor
  1. performance (CPU_FREQ_DEFAULT_GOV_PERFORMANCE)
  2. powersave (CPU_FREQ_DEFAULT_GOV_POWERSAVE)
  3. userspace (CPU_FREQ_DEFAULT_GOV_USERSPACE)
  4. ondemand (CPU_FREQ_DEFAULT_GOV_ONDEMAND)
  5. conservative (CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
  6. schedutil (CPU_FREQ_DEFAULT_GOV_SCHEDUTIL)
choice[1-6?]: 4
'performance' governor (CPU_FREQ_GOV_PERFORMANCE) [Y/?] y
'powersave' governor (CPU_FREQ_GOV_POWERSAVE) [N/m/y/?] n
'userspace' governor for userspace frequency scaling (CPU_FREQ_GOV_USERSPACE) [N/m/y/?] n
'ondemand' cpufreq policy governor (CPU_FREQ_GOV_ONDEMAND) [Y/?] y
'conservative' cpufreq governor (CPU_FREQ_GOV_CONSERVATIVE) [N/m/y/?] n
'schedutil' cpufreq policy governor (CPU_FREQ_GOV_SCHEDUTIL) [N/y/?] n
  • CPU frequency scaling drivers

Generic DT based cpufreq driver (CPUFREQ_DT) [N/m/y/?] (NEW) y
Generic ARM big LITTLE CPUfreq driver (ARM_BIG_LITTLE_CPUFREQ) [N/m/y/?] n
CPU frequency scaling driver for Freescale QorIQ SoCs (QORIQ_CPUFREQ) [N/m/y/?] n
*

  • ARM CPU Idle Drivers

Generic ARM/ARM64 CPU idle Driver (ARM_CPUIDLE) [N/y/?] n
CPU Idle Driver for mvebu v7 family processors (ARM_MVEBU_V7_CPUIDLE) [N/y/?] (NEW) n
*

@davidc502
Hi David:
Just wanted to relate that I've been running now for almost 4 days without reboots on my mamba.
In my particular configuration only CPU IDLE is disabled. I left CPU FREQ (the 4.9 default) enabled. So far so good.
Here are the lines that were "deleted" by the kernel menuconfig program in the "config-4.9" file:
CONFIG_ARM_MVEBU_V7_CPUIDLE=y
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_PM=y

I believe you can either just delete those lines or change to:
# CONFIG_ARM_MVEBU_V7_CPUIDLE is not set
# CONFIG_CPU_IDLE is not set
# CONFIG_CPU_IDLE_GOV_LADDER is not set
# CONFIG_CPU_PM is not set
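If you'd rather script that edit than go through menuconfig, the same four symbols can be flipped with sed. A sketch run here against a scratch copy; in a real tree the file would be something like target/linux/mvebu/config-4.9 (path assumed), so point sed there instead:

```shell
#!/bin/sh
# Demo: flip the four CPU_IDLE-related symbols from "=y" to
# "is not set" the way menuconfig would. Uses a scratch file so it
# is safe to run anywhere; substitute your real config path.
cfg=/tmp/config-4.9.demo
printf '%s\n' \
    'CONFIG_ARM_MVEBU_V7_CPUIDLE=y' \
    'CONFIG_CPU_IDLE=y' \
    'CONFIG_CPU_IDLE_GOV_LADDER=y' \
    'CONFIG_CPU_PM=y' > "$cfg"

for sym in CONFIG_ARM_MVEBU_V7_CPUIDLE CONFIG_CPU_IDLE \
           CONFIG_CPU_IDLE_GOV_LADDER CONFIG_CPU_PM; do
    # anchored on "=y" at end of line so CONFIG_CPU_IDLE does not
    # also clobber CONFIG_CPU_IDLE_GOV_LADDER
    sed -i "s/^${sym}=y\$/# ${sym} is not set/" "$cfg"
done
cat "$cfg"
```

Either form (deleting the lines or the "is not set" comments) should survive the build system's config merge the same way.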

Sorry I can't help with the CPU FREQ changes, but I'm sure @NainKult or someone else can help with this.
In addition, my temperatures do not appear to be excessive (however, this is a lightly loaded router):

root@linksys-router:~# sensors
armada_thermal-virtual-0
Adapter: Virtual device
temp1: +63.9°C

tmp421-i2c-0-4c
Adapter: mv64xxx_i2c adapter
temp1: +50.8°C
temp2: +52.9°C

Hope this helps and thanks for all you do for the community.

@davidc502 On which LEDE tag are you trying to apply the patch? I remember I tried to apply it to 17.01 earlier and had similar issues. I can rework the patch if you tell me more about your tree.

Btw, sorry for the link instability today; I had to rework my whole frontend architecture, and one of the downsides was that short URL redirects were down with a 503.

Edit 1: I really need to ditch the useless debug overhead anyway...

I'm building from lede trunk or daily "snapshot", so I wouldn't expect the results of the patch to be any different from 17.01. In this case it is r5572.

Appreciate the effort to rework this patch.

@davidc502 I can confirm my patch is applying correctly on a clean tree.
You may have forgotten to remove your tmp/ directory and/or do a make clean prior to make world.

However, I reworked the patch anyway because of the (now useless) debug flags:
[LEDE-DEV] kernel: mvebu: remove CPU power management features

Be careful, it applies to all mvebu targets.
And if it doesn't work, nuke your tree with make distclean and start again (back up your files first!).

Edit 1: Gonna bump my kernel with this new patch so I want to share some sweet, sweet uptime before it's gone.

root@net002:~# uptime
 18:20:03 up 9 days, 20:10,  load average: 0.12, 0.20, 0.09
root@net002:~# cat /etc/openwrt_version 
r5493+3-b8220883fd

Can you link to the wan6 issue on the openwrt forum please?

post124 has a link to what I think was the first reference I saw to this as a possible resolution, followed by others. I think most have retracted since then, though.

Mamba up now for 13 days with only the 4 CPU_IDLE entries disabled in 4.9 kernel config.
(this router typically rebooted within 48 hours without this change)

root@linksys-router:/dev# uptime
20:19:56 up 13 days, 4:14, load average: 0.00, 0.00, 0.00
root@linksys-router:/dev# cat /etc/openwrt_version
r5523-7f029c3924

root@flowernet ~ # cat /etc/openwrt_release && uptime
DISTRIB_ID='LEDE'
DISTRIB_RELEASE='SNAPSHOT'
DISTRIB_REVISION='r5506-a8d3d517d0'
DISTRIB_TARGET='mvebu/generic'
DISTRIB_ARCH='arm_cortex-a9_vfpv3'
DISTRIB_DESCRIPTION='LEDE SNAPSHOT r5506-a8d3d517d0'
DISTRIB_TAINTS='no-all'
 21:38:27 up 16 days, 15:07,  load average: 0.00, 0.00, 0.00

Fixed IPv6 dying with option keepalive '1000 5'.
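For anyone else hitting the wan6 drop: keepalive is a PPP option in /etc/config/network, formatted as "<allowed echo failures> <LCP echo interval in seconds>". A hedged sketch of where it goes, assuming a pppoe WAN (the post doesn't say which proto was in use; the username/password values are obviously placeholders):

```
config interface 'wan'
	option proto 'pppoe'
	option username 'yourisp-user'
	option password 'yourisp-pass'
	# tolerate up to 1000 missed LCP echoes, sent every 5 seconds,
	# so netifd effectively never tears the session down itself
	option keepalive '1000 5'
```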

For those looking for another possible avenue to explore regarding this issue. @hauke staging mvebu-4.14 branch builds and loads on the mamba. Have not had much of a chance to test things yet though.

4.14 may be the last hope for the v1. I believe 4.14 pulls in all the mvebu-next stuff, with a fix for a reboot issue. Not sure if that fix is related to this mvebu reboot issue.

I updated the original post with the patch I've been using. No issues on two mamba devices over roughly two weeks -- stable enough for me!