And I think the reason they show up in the boot log is that ubusd uses /dev/urandom for random id generation in ubus_alloc_id().
I don't think the problem is in ubusd, since otherwise we would simply see a timeout after 50 seconds. Something must be blocking in procd, and because of that the procd daemon no longer pings the watchdog. Hence the MBL Duo reboots.
One more strange thing (two disks, untouched openwrt):
[ 8.780355] eth0: link is up, 1000 FDX, pause enabled
[ 8.785520] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 8.791385] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1024.543024] random: crng init done
It seems that the kernel's random number generator (crng) is not initialized properly during boot; this message always appears a few minutes later. I haven't seen such a thing on any other device.
edit:
There is delay on other platforms as well (x86 virtual machine with 18.06-rc1), but much less:
[ 5.533512] kmodloader: done loading kernel modules from /etc/modules.d/*
[ 15.060086] random: crng init done
[ 15.070062] random: 1 urandom warning(s) missed due to ratelimiting
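For what it's worth, you can watch the kernel's entropy pool from userspace to see why `crng init done` comes so late. These are standard Linux procfs paths:

```shell
# Standard Linux procfs entries for the kernel entropy pool; low values
# early in boot explain a late "crng init done" message.
cat /proc/sys/kernel/random/entropy_avail   # bits gathered so far
cat /proc/sys/kernel/random/poolsize        # pool capacity in bits
```

On a headless NAS there is no keyboard or mouse, so there are very few interrupt sources to harvest entropy from, which would explain why the pool fills so slowly compared to a desktop.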
I played around with an x86 virtual machine with two disks. I created two identical disks and did the initial setup (network, dhcp, new partitions, etc.), like I would on the MBL Duo. I know it's not necessarily the same error, but after a few successful reboots, the machine suddenly started to boot off the second disk. Weird... could there be a problem with disk enumeration?
Peeking into the detailed kernel logs, I found that during device creation the /sbin/hotplug-call script runs several times. In a successful boot these log entries are in pairs: launched/finished.
hm, kinda. The x86 image uses either root=UUID (for ext4) or rootfs=squashfs to discover the rootfs on all partitions and disks (and this can be racy). On the MBLs this is currently fixed to root=/dev/sda2. So there's a slight difference there.
hm, could it be that the problem is "concurrent access"? I.e. access to both disks at the same time? If so, you should be able to reproduce the bug in the failsafe by loading both drives at the same time, i.e.:
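Something along these lines; this is only a sketch, and the device paths are assumptions for a two-disk MBL Duo:

```shell
# Read from both disks at once to provoke concurrent SATA access.
# stress_disks takes the two block devices as arguments.
stress_disks() {
    dd if="$1" of=/dev/null bs=1M count=256 2>/dev/null &
    dd if="$2" of=/dev/null bs=1M count=256 2>/dev/null &
    wait                       # block until both background reads finish
    echo "both reads finished"
}

# In the MBL Duo failsafe shell this would be (assumed device names):
# stress_disks /dev/sda /dev/sdb
```

If the box hard-freezes or the watchdog reboots it before "both reads finished" appears, that would point at the concurrent-access theory.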
apm821xx: attempt to fix sata access freezes. It would be somewhat comical if all that was necessary was setting a few bits. But oh well...
Anyway, once the patch gets picked up by master, the buildbots will automatically create images.
Thank you very much for all your efforts. Can I try your patch somehow? I am willing to test it. Do I need to go through the image generation process?
I am currently on 17.01.4 and experiencing the same issues as the other forum members.
So, it will depend on whether it magically fixes the problem or not. If it does, it will get ported to all the active releases [17.01.6 (if there is such a release) and 18.06].
I can report that your fix worked! Boots up fine with two drives even with more partitions, and no more reboot on concurrent disk access!
Thanks again!
Edit: One issue I found, though I think it's unrelated to your fix: when I edit mount points through LuCI, OpenWrt freezes and needs a hard restart.
To replicate the issue with the mount points, do the following:
install luci
go to System->Mount points
add
select /dev/sda3 by uuid
mount to /mnt/data
save and apply
The "applying configuration" timer times out and I am thrown out of ssh. I can still ping the device, but neither ssh nor http works!
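For reference, the LuCI steps above boil down to a UCI section like the following; the UUID here is only a placeholder for whatever `block info` reports for /dev/sda3:

```
# /etc/config/fstab -- sketch of what LuCI writes; the UUID is a placeholder
config 'mount'
        option  uuid    '0123-4567-89ab-cdef'   # placeholder, check "block info"
        option  target  '/mnt/data'
        option  enabled '1'
```

Saving this section by hand and then applying it is equivalent to the "save and apply" click, so it should trigger the same hang.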
More testing shows that this issue existed on 17.01.4 as well
Still testing your fix by the way, everything is working smoothly
Well, the good news is: I quickly found this PR for LuCI. It talks about this very same issue:
Default behaviour of changes to fstab (Mount Points) was
to use /etc/init.d/fstab restart, however this unmounts
filesystems via block umount which can cause the device
to fail, so replace the initscript call with an exec
of 'block mount'.
The bad news: the PR has been abandoned. You can still apply the patch manually
and try again. At least on the MBL Single it did the trick.
As silly as it sounds, at the moment, I don't have the two unused disks to test on my MBL Duo.
What I do have is a single spare disk and nothing short of awesome news: It seems your patch unleashed the full power of the SATA port. Where I was previously hitting a really hard limit at around 82 MB/s for reading and 27 MB/s for writing, I am now getting this:
root@OpenWrt:/mnt# time dd if=/dev/zero of=tempfile bs=1M count=1024
1024+0 records in
1024+0 records out
real 0m 13.65s
user 0m 0.01s
sys 0m 11.89s
root@OpenWrt:/mnt# time dd if=tempfile of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
real 0m 8.41s
user 0m 0.01s
sys 0m 4.70s
This means: 121 MB/s reading and 75 MB/s writing!
Seems like our collective assumption -- that the original MBL firmware's significantly better read/write performance was owed to the nonstandard 64 kB block size -- was wrong. It was the SATA driver all along.
Edit: The drive is a WD Green WD10EARX taken from an older MBL Single. I repeated the test a few times with even larger files to rule out any caching, I'm still seeing the same great performance. OpenWrt is now completely on par with the original MBL firmware's performance.
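If anyone wants to repeat this, here is roughly the test I ran, with `conv=fsync` added so the write timing includes the flush to disk. The directory and size are my own choices; use a file larger than the box's RAM so the read-back can't be served from the page cache:

```shell
# Write-then-read throughput check. conv=fsync forces the written data
# to disk before dd exits, so the write timing is honest; a test file
# larger than RAM keeps the read-back from hitting the page cache.
disk_bench() {
    dir="$1"; mb="$2"
    time dd if=/dev/zero of="$dir/tempfile" bs=1M count="$mb" conv=fsync
    time dd if="$dir/tempfile" of=/dev/null bs=1M count="$mb"
    rm -f "$dir/tempfile"
}

# e.g. on the MBL (256 MB RAM, so 1 GiB is comfortably larger):
# disk_bench /mnt/data 1024
```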
root@RED:/mnt/data# time dd if=/dev/zero of=tempfile bs=1M count=1024
1024+0 records in
1024+0 records out
real 0m 15.38s
user 0m 0.01s
sys 0m 11.19s
root@RED:/mnt/data# time dd if=tempfile of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
real 0m 6.71s
user 0m 0.00s
sys 0m 5.36s
Thank you for confirming my findings. Your results correspond to 152 MB/s reading, 66 MB/s writing (which seems a tad low, I wonder if your drive is still lazy init'ing the file system).
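(The numbers are just the 1024 MiB transferred divided by dd's wall-clock time:)

```shell
# 1024 MiB moved in 6.71 s (read) and 15.38 s (write):
awk 'BEGIN {
    printf "read:  %.1f MB/s\n", 1024 / 6.71
    printf "write: %.1f MB/s\n", 1024 / 15.38
}'
# read:  152.6 MB/s
# write: 66.6 MB/s
```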