This is bad news :-/
But yeah, the PJ4B is a custom CPU design,
and those fixes are for Cortex-A9.
No one knows exactly which design this CPU is based on.
Could also be Cortex-A8. Only Marvell knows for sure.
Hasn't this memory corruption issue in mwlwifi been there since the beginning? At least it seems so.
Yes, the commit fixed some issue that was introduced recently.
I'm also not quite sure the problem really is the mwlwifi driver, because other devices don't suffer from this issue. Maybe it's more likely an mwlwifi firmware bug?
Could be many things, even a bad batch of power supplies x)
There is still the gcc patch left to try.
But stock firmware is stable, yes?
//edit
I downloaded the source from Linksys (kernel 3.x?).
In their config the following is set:
CONFIG_SHEEVA_ERRATA_ARM_CPU_6124=y
CONFIG_SHEEVA_ERRATA_ARM_CPU_PMU_RESET=y
and for the WRT1200 it is
CONFIG_PJ4B_ERRATA_4742
CONFIG_ARM_ERRATA_720789
No PL310 errata are selected.
But I'm leaving mine running with
CONFIG_ARM_ERRATA_754322
CONFIG_ARM_ERRATA_754327
CONFIG_ARM_ERRATA_764369
and with the gcc patch.
because I trust the official ARM docs more than those Marvell devs x)
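For anyone comparing configs, here is a rough sketch of how one could check which of the errata options named above are actually enabled in a given kernel config. The option names come from the posts above; the function names (`parse_kconfig`, `errata_status`) are made up for this example, and it assumes the usual `.config` line format (`CONFIG_FOO=y` or `# CONFIG_FOO is not set`).

```python
import re

# Errata options mentioned in this thread. Whether each one is appropriate
# for the PJ4B core is exactly what is under discussion, so treat this list
# as a checklist, not a recommendation.
WANTED = [
    "CONFIG_PJ4B_ERRATA_4742",
    "CONFIG_ARM_ERRATA_720789",
    "CONFIG_ARM_ERRATA_754322",
    "CONFIG_ARM_ERRATA_754327",
    "CONFIG_ARM_ERRATA_764369",
]


def parse_kconfig(text):
    """Return {option: value}; options marked 'is not set' map to None."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"^(CONFIG_\w+)=(.*)$", line)
        if m:
            opts[m.group(1)] = m.group(2)
            continue
        m = re.match(r"^# (CONFIG_\w+) is not set$", line)
        if m:
            opts[m.group(1)] = None
    return opts


def errata_status(text, wanted=WANTED):
    """Map each wanted option to True if it is built in (=y), else False."""
    opts = parse_kconfig(text)
    return {name: opts.get(name) == "y" for name in wanted}
```

To use it, read the config file of your build into a string and pass it to `errata_status`; options that are missing from the file are reported the same way as options that are explicitly not set.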
With just the CPU IDLE disabled and mwlwifi dated 1/30/2018 I have had zero reboots on Mamba, Rango, Shelby or Caiman. I consider this problem solved (not necessarily in trunk, but certainly with custom build). This applies to both kernel 4.9 and 4.14.
Yes, that is understood; at least I know I'm aware of that. But for me (currently) that is treating the symptom, and I'm one of those find-out-the-cause kind of people; but that is going to take what has been discussed above.
Regarding "not necessarily in trunk": I've been pondering whether an RFC PR might solicit some input, or even a "let's run with it" push, but I know I will not be opening one, wink, wink, nod, nod.
I just noticed that I seem to be getting spontaneous reboots on my Netgear WNDR3800 running LEDE 17.01.4.
Network went down temporarily and when looking at uptime it was just a couple of minutes old. I get new external IPs on reboots and as I log these I can see that this has been happening every few days.
Should I start a new discussion thread or is this thread the right place?
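As a side note on spotting this kind of silent reboot: a minimal sketch, assuming a Linux system where `/proc/uptime` is readable and a script that runs periodically and remembers the last boot time it computed. The function names and the 120-second slack are assumptions for illustration only.

```python
def parse_uptime(text):
    """Return uptime in seconds from the contents of /proc/uptime."""
    # The first field is seconds since boot, the second is idle time.
    return float(text.split()[0])


def rebooted_since(prev_boot, uptime_s, now, slack_s=120.0):
    """Return (boot_time, rebooted).

    boot_time is 'now - uptime'. A reboot is assumed when the computed
    boot time moved forward by more than slack_s compared to the
    previously recorded one (the slack absorbs clock adjustments).
    """
    boot = now - uptime_s
    rebooted = prev_boot is not None and (boot - prev_boot) > slack_s
    return boot, rebooted
```

In practice `now` would come from `time.time()` and the previous boot time from a small state file on storage that survives reboots.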
The WNDR3800 is an Atheros MIPS SoC whereas the WRT series use a Marvell ARM SoC. Furthermore, you're on a 4.4 kernel, and this thread is about Marvell's SoCs combined with a 4.9 kernel. So your issue is probably completely unrelated. I'd put my money on SQM - since I haven't heard about any stability problems on ar71xx with 4.4.
Given that this research has been going on for more than a year, and that we do have a workaround with limited drawbacks, and that there is a need to move on and get other parts of v4.14 tested on mvebu (including Mamba), I think the time has come to push a patch to disable CPU_IDLE for now.
Yes, I would also prefer finding the real bug instead of papering over it. But that is obviously not easy. It would be nice if there was some way to do a kernel git bisect here. But the large number of out-of-tree patches coupled with the big version gap between known good (v4.4.?) and first known bad (v4.9.?) makes that too hard. I made a feeble attempt last night, but had to give up quickly after messing up the switch while attempting to backport the OpenWrt patches to the commit being tested. It's just too much work for every test, and very hard to get it right every time.
So I vote for disabling CPU_IDLE for now. Reenabling it along with a proper fix is simple once the proper fix is found. No need to prevent Mamba users from testing other parts of the OpenWrt trunk.
Unclear to me as to the intent here. I am certainly not holding things up, nor advocating to hold things up. I even have a 4.14.x cpuidle patched mamba build for public consumption off the link on my avatar. I just investigate when I feel like taking a look at the issue once again.
To me the holdup on submitting a patch as indicated in this thread is the other mvebu targets. My guess, and I think also @InkblotAdmirer's, is that owners of other Linksys WRT devices, Turris Omnia, SolidRun... devices are going to be less than keen on this solution. My suggestion of an RFC PR was simply meant to see if we could maybe garner further input, maybe an alternative implementation via Makefile that would target just mamba; but I for one don't see the targeted solution, if it exists. But by all means, submit a patch PR | ML and gauge the community reaction. I certainly will not be speaking against same.
Edit: and in getting around to checking today's ML I see you have done that. Thanks, should be interesting to see if there are any naysayers.
This may be a hack, but with the commit now pushed to trunk, mamba users can flash releases and snapshots, or build from trunk, without needing a custom patch.
It would be great to find the underlying cause but as bmork has mentioned (and I found a while back) the effort may not be worth it. I have been able to observe no negatives from disabling cpu idle and thus as I have hinted before, I've moved on from this issue.
Not that this changes anything. It just explains why the regression didn't affect the other variants. Which I think is very important to keep in mind when trying to figure out the real problem. The real bug does not have to be mamba specific at all.
The fix/workaround for mamba is still the correct one. A big thanks to @nbd for that.
Well. First off, I don't have a WRT1900AC V1, so all I'm about to say could be wrong. But I do know of a strangely similar issue with IPQ4019. It too would randomly reboot from time to time for no apparent reason. Luckily, the reason was found by Qcom and a patch was upstreamed:
So, in the IPQ4019 case, the CPU can't be freely switched between the frequencies.
For example, if the CPU was clocked at 48 MHz (lowest), it first needs to be switched to the 500 MHz clock before it can go to the highest clock (716 MHz). Could something like this also be the case for the armada-370 CPU? I.e. when entering (or leaving) cpu_idle, the CPU has to switch through the frequencies more like a gearbox, ramping up or down gradually and only going from one frequency to the next?
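To make the "gearbox" idea concrete, here is a small sketch of what such a constraint would look like: instead of jumping straight to the target, the clock is moved through every intermediate operating point in order. The table only contains the three IPQ4019 values quoted above (in MHz); `ramp_path` is a made-up name, and whether anything like this applies to the Marvell SoC is pure speculation.

```python
# Operating points in MHz, limited to the three values quoted in the post.
IPQ4019_FREQS_MHZ = [48, 500, 716]


def ramp_path(table, current, target):
    """Return the frequencies to switch through, one neighbour at a time.

    The result excludes the current frequency and ends with the target.
    Both current and target must be entries of the table.
    """
    freqs = sorted(table)
    i, j = freqs.index(current), freqs.index(target)
    step = 1 if j > i else -1
    path = []
    while i != j:
        i += step
        path.append(freqs[i])
    return path
```

With this table, going from 48 to 716 yields the intermediate stop at 500 first, which matches the behaviour described for the IPQ4019 fix; a real driver would additionally have to wait for the clock to settle between steps.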
Getting somewhat curious, given the @bmork link above, and that the CPU frequency scaling appears to function on the rango, just what the real situation is regarding what is truly borken.