[Solved] WRT1900ACV1 reboots: kernel 4.9

Kernels 4.5 down to 2.6 are vulnerable to remote code execution within the kernel as root. Maybe this has already been patched? The information about this vulnerability was just released last month.

I'm pretty sure it has. Thing is, the kernel devs don't explicitly refer to the CVEs in the changelogs, so you really need to track them down; you can't just grep the changelog for CVEs...

Just check the NIST entry and you'll see the Linux kernel was patched in January 2016 (!). On top of that, the page clearly states the vulnerability is in 4.4.60 and older kernels ;).

You are running the DIR-860L as well, right? How's the 4.4.67 kernel treating you? Are you using SQM by any chance? The current master branch and 17.01 branch both have issues with SQM on the mt7621 devices. It can cause stack traces and crashes that result in a reboot. Kernel 4.9 seems to have fixed it for me, but it is causing other issues for me. Was wondering whether kernel 4.4.67 is any good on mt7621 devices with SQM enabled.

For further discussion on the aforementioned issues, please see the end of this thread:
https://forum.openwrt.org/t/optimized-build-for-the-d-link-dir-860l/

And this bug report (please vote for it if you would like the developers to focus on this bug):

I am, yes. But there's no difference between the 17.01.x 4.4 versions and the .67 one. I'm not switching to trunk until 4.9 is stable enough on ramips.

I already voted for your bug report as well :slight_smile:

Just noticed that crash dump is enabled for ARM via https://github.com/lede-project/source/commit/48d71ab5021e5238623bab2f87b6425b2609c60a. Can anyone give it a try?

From the look of that it's enabled by default? I just tested r4228-43e4e1f and there was no crash log created at /sys/kernel/debug/ (unless it's supposed to be somewhere else now?). My uptime was ~2hrs before the lockup/reboot.

Yes, it is enabled by default. I have no idea then.
nbd, can you help to find the root cause?

@nbd Do you or anyone else have time to look into this now? I don't think we're making any progress on our own. Sorry to hassle you with a direct mention...

@Mushoz, thought this might be of interest to you. It might be a weird coincidence, but running on a recent 17.01 (pre 17.01.2, so to speak) branch with a GCC 6.3 build (instead of a GCC 5.4 build), I am looking at almost 4 days and 12 hours uptime. This is with kernel 4.4.70.

Fingers crossed of course, but remarkable (usually I have one or more reboots a day).

Sorry for the offtopic :slight_smile:

That sounds very good! Thank you very much for letting us know :slight_smile: Is this with Cake enabled? And if so, have you tried heavy traffic to see if it doesn't crash?

Yeah maybe move the DIR-860L chat elsewhere. We're already struggling to be noticed without being buried in our own topic.

Anyone know if this WRT1900AC V1 Reboot on 4.9 Kernel build problem is being worked on by somebody?

I would like to go to the 4.9 Kernel builds but have experienced the reboots and reverted back to 4.4.70.

DISTRIB_DESCRIPTION='LEDE Reboot SNAPSHOT r4512-f3ae0f8'
Kernel = 4.9.34
SQM enabled.

Spontaneous reboot during long, large downloads. I have remote logging enabled but nothing was collected.

I use Linksys wrt1900acs v2 with LEDE firmware version compiled by Daniel named SuperWRT, everything is working fine!

http://s.go.ro/8qfvfhf1

https://superwrt.download/firmware/

EDITED: sorry, prematurely posted from fat-finger

Even though I said I was dropping this... I can't let something like this go and I think I may be on to something.

Poring through changes from 4.4 to 4.9, the device tree changes appear straightforward enough but I decided to decompile the compiled device tree and voila... the mamba dts (decompiled) is very different on 4.9 than 4.4. Here is one excerpt:

		crypto@90000 {  // kernel 4.4
			compatible = "marvell,armada-xp-crypto";
			reg = <0x90000 0x10000>;
			reg-names = "regs";
			interrupts = <0x30 0x31>;
			clocks = <0x8 0x17 0x8 0x17>;
			clock-names = "cesa0", "cesa1";
			marvell,crypto-srams = <0xf 0x10>;
			marvell,crypto-sram-size = <0x800>;
		};

		crypto@90000 {  // kernel 4.9
			compatible = "marvell,armada-xp-crypto";
			reg = <0x90000 0x10000>;
			reg-names = "regs";
			interrupts = <0x30 0x31>;
			clocks = <0x7 0x17 0x7 0x17>;
			clock-names = "cesa0", "cesa1";
			marvell,crypto-srams = <0xe 0xf>;
			marvell,crypto-sram-size = <0x800>;
		};

Clocks, interrupts, etc. are all varying from 4.4 to 4.9. Looking further, I think it's this commit:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/arch/arm/boot/dts/armada-370-xp.dtsi?h=linux-4.9.y&id=55877f58b0be83a4ffb4639a83f99c28df418e3e
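For anyone who wants to repeat the comparison: a dtb can be decompiled with `dtc -I dtb -O dts`, and the two outputs compared node by node. Below is a minimal Python sketch that diffs the properties of the two crypto@90000 excerpts above. The parser is an assumption of mine and only handles flat nodes with one `name = value;` per line, which is enough for this excerpt. Keep in mind that values such as `clocks` and `marvell,crypto-srams` are phandle numbers in a decompiled tree, so a shift by one may just be renumbering by the compiler rather than a real hardware change.

```python
import re

# The two excerpts from the post, as decompiled from the 4.4 and 4.9 dtb.
DTS_44 = """
crypto@90000 {
	compatible = "marvell,armada-xp-crypto";
	reg = <0x90000 0x10000>;
	reg-names = "regs";
	interrupts = <0x30 0x31>;
	clocks = <0x8 0x17 0x8 0x17>;
	clock-names = "cesa0", "cesa1";
	marvell,crypto-srams = <0xf 0x10>;
	marvell,crypto-sram-size = <0x800>;
};
"""

DTS_49 = """
crypto@90000 {
	compatible = "marvell,armada-xp-crypto";
	reg = <0x90000 0x10000>;
	reg-names = "regs";
	interrupts = <0x30 0x31>;
	clocks = <0x7 0x17 0x7 0x17>;
	clock-names = "cesa0", "cesa1";
	marvell,crypto-srams = <0xe 0xf>;
	marvell,crypto-sram-size = <0x800>;
};
"""

# Matches a single "name = value;" line (hypothetical helper, flat nodes only).
PROP_RE = re.compile(r"^\s*([\w,#-]+)\s*=\s*(.+?);\s*$")


def parse_props(node_text):
    """Collect the 'name = value;' lines of one flat node into a dict."""
    props = {}
    for line in node_text.splitlines():
        match = PROP_RE.match(line)
        if match:
            props[match.group(1)] = match.group(2)
    return props


def diff_props(old, new):
    """Return {name: (old_value, new_value)} for properties that differ."""
    names = sorted(set(old) | set(new))
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}


changed = diff_props(parse_props(DTS_44), parse_props(DTS_49))
for name, (before, after) in changed.items():
    print(f"{name}: {before} -> {after}")
```

For this one node the script reports only `clocks` and `marvell,crypto-srams` as different; other nodes in the full tree would need the same treatment.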

I created a patch to undo a couple changes and the device tree is now nearly identical. Without completely reverting a few changes I'm not sure it can be made identical but I've got a mamba running on 4.9 now and I've been beating up the wifi with iperf for a few hours now and everything is looking OK. Keep your fingers crossed...

In case you want to build it yourself and rack up test hours more quickly (since the reboots can be pretty fickle), here is the patch:

--- /arch/arm/boot/dts/armada-xp.dtsi	2017-07-17 21:13:53.372001000 -0500
+++ /arch/arm/boot/dts/armada-xp.dtsi	2017-07-18 20:08:18.874801712 -0500
@@ -372,6 +372,4 @@
 
 &spi1 {
 	compatible = "marvell,armada-xp-spi", "marvell,orion-spi";
-	pinctrl-0 = <&spi1_pins>;
-	pinctrl-names = "default";
 };
--- /arch/arm/boot/dts/armada-370-xp.dtsi	2017-07-15 05:17:55.000000000 -0500
+++ /arch/arm/boot/dts/armada-370-xp.dtsi	2017-07-15 04:58:03.000000000 -0500
@@ -148,6 +148,26 @@
 				interrupts = <50>;
 			};
 
+			spi0: spi@10600 {
+				reg = <0x10600 0x28>;
+				#address-cells = <1>;
+				#size-cells = <0>;
+				cell-index = <0>;
+				interrupts = <30>;
+				clocks = <&coreclk 0>;
+				status = "disabled";
+			};
+
+			spi1: spi@10680 {
+				reg = <0x10680 0x28>;
+				#address-cells = <1>;
+				#size-cells = <0>;
+				cell-index = <1>;
+				interrupts = <92>;
+				clocks = <&coreclk 0>;
+				status = "disabled";
+			};
+
 			i2c0: i2c@11000 {
 				compatible = "marvell,mv64xxx-i2c";
 				#address-cells = <1>;
@@ -300,42 +320,6 @@
 				status = "disabled";
 			};
 		};
-
-		spi0: spi@10600 {
-			reg = <MBUS_ID(0xf0, 0x01) 0x10600 0x28>, /* control */
-			      <MBUS_ID(0x01, 0x1e) 0 0xffffffff>, /* CS0 */
-			      <MBUS_ID(0x01, 0x5e) 0 0xffffffff>, /* CS1 */
-			      <MBUS_ID(0x01, 0x9e) 0 0xffffffff>, /* CS2 */
-			      <MBUS_ID(0x01, 0xde) 0 0xffffffff>, /* CS3 */
-			      <MBUS_ID(0x01, 0x1f) 0 0xffffffff>, /* CS4 */
-			      <MBUS_ID(0x01, 0x5f) 0 0xffffffff>, /* CS5 */
-			      <MBUS_ID(0x01, 0x9f) 0 0xffffffff>, /* CS6 */
-			      <MBUS_ID(0x01, 0xdf) 0 0xffffffff>; /* CS7 */
-			#address-cells = <1>;
-			#size-cells = <0>;
-			cell-index = <0>;
-			interrupts = <30>;
-			clocks = <&coreclk 0>;
-			status = "disabled";
-		};
-
-		spi1: spi@10680 {
-			reg = <MBUS_ID(0xf0, 0x01) 0x10680 0x28>, /* control */
-			      <MBUS_ID(0x01, 0x1a) 0 0xffffffff>, /* CS0 */
-			      <MBUS_ID(0x01, 0x5a) 0 0xffffffff>, /* CS1 */
-			      <MBUS_ID(0x01, 0x9a) 0 0xffffffff>, /* CS2 */
-			      <MBUS_ID(0x01, 0xda) 0 0xffffffff>, /* CS3 */
-			      <MBUS_ID(0x01, 0x1b) 0 0xffffffff>, /* CS4 */
-			      <MBUS_ID(0x01, 0x5b) 0 0xffffffff>, /* CS5 */
-			      <MBUS_ID(0x01, 0x9b) 0 0xffffffff>, /* CS6 */
-			      <MBUS_ID(0x01, 0xdb) 0 0xffffffff>; /* CS7 */
-			#address-cells = <1>;
-			#size-cells = <0>;
-			cell-index = <1>;
-			interrupts = <92>;
-			clocks = <&coreclk 0>;
-			status = "disabled";
-		};
 	};
 
 	clocks {

Here is the remaining diff in the compiled dts:

> 						linux,default-trigger = "disk-activity";
...
> 				};
> 
> 				spi1-pins {
> 					marvell,pins = "mpp13", "mpp14", "mpp16", "mpp17";
> 					marvell,function = "spi1";

Both are artifacts of further changes that I haven't undone yet, TBD whether they matter.

Best of luck...

Thanks for the work on this, @InkblotAdmirer. Will try a new build tonight.

Edit: Does /arch/arm/boot/dts/armada/xp-mamba-linksys-mamba.dts have anything to do with this? It is also different between 4.4 and 4.9. Or does the *.dtsi override the *.dts?
Pardon my ignorance, just trying to learn.
And thanks for pointing me to the boot area; I was aiming for arch/arm/mach-mvebu but saw no real issues there.

@northbound

The linksys-mamba.dts is the configuration specific to the device; you have to follow the includes as well. The *.dtsi files are intended to hold "platform" information, with configuration applicable to anything using that platform, and then device-specific additions and changes are layered on top in the *.dts.
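To make the layering a bit more concrete, here is a rough Python model of it (not how dtc is actually implemented, just the merge behaviour as I understand it): when the same node shows up again later, its properties are merged and the later value wins. The node contents are hypothetical stand-ins. The `reg` and `status = "disabled"` values are borrowed from the patch above, and the board setting `status = "okay"` is my own assumption for illustration.

```python
import copy


def overlay(base, extra):
    """Merge 'extra' onto 'base': nested nodes merge, later properties win."""
    merged = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Same node defined again: merge its contents recursively.
            merged[key] = overlay(merged[key], value)
        else:
            # New property or node, or an overridden property value.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical stand-in for a platform *.dtsi: the node exists but is disabled.
platform_dtsi = {
    "spi@10600": {
        "reg": "<0x10600 0x28>",
        "status": "disabled",
    },
}

# Hypothetical stand-in for a board *.dts: only the differing property is set.
board_dts = {
    "spi@10600": {
        "status": "okay",
    },
}

final = overlay(platform_dtsi, board_dts)
print(final)
```

So the *.dts sits on top of the *.dtsi, not the other way around: the board file only has to state what differs, and everything else is inherited from the includes.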

From a couple of posts elsewhere, my understanding is that other release images are not experiencing the reboot; McDebian comes to mind, and I don't remember seeing anything regarding dd-wrt. My assumption would be that everything based on this would behave the same. So unless the issue is being patched out on another image, we must be patching the issue into our image. But maybe this is based on the false premise that only LEDE is experiencing the reboot issue. Perhaps someone who has run another release can offer an opinion on their experience.

@anomeone In that case it should just be a matter of going through the dozens of patches added when kernel 4.9 was added and when mvebu was added to 4.9. I'm not particularly into that drudgery.

With the patch above (virtually no change to dtb) mamba rebooted after just under 48 hours.

Yep, I hear ya. I have experienced reboot time variance from less than an hour, to over 12 days, on the same image. So, even if the intersection of patches that involve the mamba in any manner reduces the count somewhat, without a hint as to the cause, it is still a long painful road; and that is assuming the premise of my previous post is valid.

I had hoped we might have had another avenue to debug this when I saw the kexec commits being made, but last I looked it did not appear to be ready for use on this target.