Limitations on connections in ISP equipment (was LEDE packet per second performance versus other firmware)

Hi,

Have a Tomato router that is overloaded in terms of packets per second (CPU and bandwidth utilization are okay). Am working on the assumption that LEDE would experience the same packet per second bottleneck as Tomato does on the same hardware. Is this sentiment mostly accurate?

How is it that CPU utilization is ok but it can't push more pps?

Most likely you'll want to increase the time available for softirqs. Check out the netdev_budget sysctl (see https://www.kernel.org/doc/Documentation/sysctl/net.txt). The _usecs version is only available on newer kernels, but it would help a bit too.
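
To see where a box currently stands, something like the following could work. It is only a rough sketch, assuming a Linux system with Python available (likely a desktop or an x86 router, not the Tomato router itself). The helper names `read_sysctl` and `parse_softnet_stat` are made up for illustration. It reads the budget sysctls from /proc/sys/net/core and the first three hex columns of /proc/net/softnet_stat (packets processed, packets dropped, and how often the softirq ran out of budget, usually called time_squeeze).

```python
from pathlib import Path


def read_sysctl(name, root="/proc/sys/net/core"):
    """Return a net.core sysctl as an int, or None if this kernel lacks it."""
    try:
        return int((Path(root) / name).read_text().strip())
    except (FileNotFoundError, PermissionError):
        return None


def parse_softnet_stat(text):
    """Parse /proc/net/softnet_stat: one line per CPU, columns in hex."""
    rows = []
    for line in text.splitlines():
        cols = line.split()
        if len(cols) < 3:
            continue  # skip blank or unexpected lines
        processed, dropped, time_squeeze = (int(c, 16) for c in cols[:3])
        rows.append({
            "processed": processed,
            "dropped": dropped,
            "time_squeeze": time_squeeze,
        })
    return rows


if __name__ == "__main__":
    # netdev_budget_usecs prints as None on kernels that do not have it.
    for name in ("netdev_budget", "netdev_budget_usecs"):
        print(name, "=", read_sysctl(name))
    try:
        stats = parse_softnet_stat(Path("/proc/net/softnet_stat").read_text())
    except OSError:
        stats = []
    for cpu, row in enumerate(stats):
        print("cpu", cpu, row)
```

If time_squeeze keeps climbing between runs while the link is busy, that would suggest the softirq is regularly running out of budget, which is the case where raising netdev_budget might help.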

Wish I knew why CPU utilization is okay but the connection is clogged. Tomato is stuck on the 2.6 kernel. More in-depth information is in my post on the Tomato forum: https://linksysinfo.org/index.php?threads/packets-per-second-probable-overload.73982/

@dlakelan are you of the opinion that using a somewhat powerful computer as a hardware router using something like IPFire would solve the issue without needing to take additional action?

For routing more than say 50Mbit I think x86 is the way to go. Even the low end of x86 is going to nearly saturate a gigabit connection while doing sophisticated QoS and also acting as a nice NAS, running a Squid proxy, and a telephony server etc etc.

@schnappi, having looked at the question on the Tomato thread, and seeing that you have older x86 hardware "lying around", I'd strongly recommend you set up an x86-based router. Your 100/50 connection will fly. I do, however, believe that basic QoS is really needed for good network performance. It looks like IPFire has some kind of QoS, probably based on HTB and fq_codel, which is good.

EDIT: I also recommend you use a device with a minimum of two Intel NICs. If needed, buy a dual-port Intel PCIe NIC to plug into one of your existing boxes.

Further edit: it does sound like your problem is probably related to bufferbloat. DNS lookups and similar small requests should really go to the head of the queue, but because your ISP does a poor job of buffering, they sit in a long queue in some other piece of kit. The solution is a reasonably well set up QoS scheme. It's really not optional on modern high-speed connections. IPFire seems to have such a thing, so I think it's probably a good starting point.

Do you use VOIP in this environment?

Tried VOIP a few years back and it was a disaster. Took a few months to get the numbers ported back to old copper lines. Still, just got a pink slip saying someone from ATT is nagging me yet again to switch to their VOIP. Am considering switching a few light-use lines to VOIP to test again.

Will build and install a hardware router this weekend. If this alleviates the problem, then the issue was with the RT-N66U. If the problem still occurs, agree that the issue is upstream. Deploying QOS is on the to-do list and agree am getting to the point where it is becoming necessary. Tried out the Tomato QOS yesterday and, although the classifications were probably not perfect, this did not alleviate the issue. Thank you for your help. Have saved countless thousands (and am more secure than if paid others) thanks to the open source community and people like you.

Only if you define 'dated' as "at least Haswell or younger" (or at least one of the various Atom flavours), otherwise you'd be paying dearly in the form of electricity. For always-on devices it really makes sense to look into their power consumption - and that was seriously bad before Sandy Bridge (on the order of 130 watts idle, easily) and dropped significantly between Sandy Bridge and Haswell (since then it is reasonable to run a full x86_64 system at 10-15 watts idle), less with Atom (but even there you want to avoid the first generation, N270/330).

While I agree with looking at x86_64 for high end WAN connections, I'd put the cut-off more around >>250 MBit/s, maybe >>400 MBit/s; 100 MBit/s is easily handled (including sqm) by modern upper end routers, like mvebu, mt7621 or ipq806x/ipq40xx, with quite some margin to spare. Even top end ar71xx should cope with 100/50 MBit/s (sqm might be a bit much though); lantiq would be a bit more congested (definitely no sqm, and only in smp mode := FXS support disabled) but should still play along.

For a 100/50 connection x86 is way overkill, unless you have rather demanding tasks planned on top - or if there's a very, very big chance (read certainty) that your WAN speed will double or quadruple within the next 2-3 years.

Great information. Do not pay for electricity as part of leases so in this situation am not worried about power consumption.

Will probably use a 2005 era HP Pavilion with a new power supply and Intel network cards. These are the only things that have not yet donated. Wish had kept some of the newer Dell OptiPlexes; just gave a bunch away to a school. Interested to see if can get 8 GB of RAM (not that it really matters, I think) to work in a 32-bit Linux hardware router. Will post if the hardware router alleviates the issue.

Even 130 watts continuous is something like $140/yr of electricity cost. So, sure, try to find one of the ones lying around that has low power consumption.
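
For anyone who wants to check that figure against their own rate, here is a quick sketch of the arithmetic. The 0.12 USD/kWh price is purely my assumption (it is not stated anywhere in this thread), and `annual_cost_usd` is just an illustrative helper name.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def annual_cost_usd(watts, usd_per_kwh):
    """Yearly electricity cost of a device drawing `watts` continuously."""
    kwh_per_year = watts * HOURS_PER_YEAR / 1000
    return kwh_per_year * usd_per_kwh


if __name__ == "__main__":
    price = 0.12  # assumed price in USD per kWh; substitute your own tariff
    for watts in (130, 15, 10):
        print(f"{watts:>4} W continuous: about ${annual_cost_usd(watts, price):.0f} per year")
```

At that assumed rate, 130 watts works out to roughly $137 per year, which is in line with the figure above, while a box idling at 10-15 watts costs on the order of $10-16 per year.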

I consider substantial QoS capacity to be an absolute requirement on a router. And I think there are very good reasons, particularly in a small business setting, to use a proxy like Squid. This can help you reduce the bandwidth consumed by less important traffic (say Facebook or YouTube etc.) because identifying that traffic at the HTTP layer is easy.

So, I think in a small business setting, where having your connection not work well means costing you lost productivity each day, an x86 is a good and extremely flexible solution.

@schnappi private message me on this board if you want to buy some consulting time on this project.

Ok thanks. Will keep that in mind if have issues with an effective QOS setup over next few weeks.

Great. Also, another idea: see if you can get this thing to run off a USB stick, an SSD you have lying around, or an SD card rather than having a spinning disk in it.

2005 era PC... that's pretty old. You might wind up more effective buying something like this (note I don't have personal experience with these but am considering them in a project that may come up soon)

Around 140 USD is an amount of money that can give you a pretty decent higher end router of the aforementioned architectures (much less for mt7621 or ipq40xx, about that or a bit more for mvebu or ipq806x), so following your calculation, a pretty decent router (which needs 6-10 watts idle, but includes 4+1 GBit/s switch ports and two wlan cards) pays for itself in somewhere between 8 and 18-26 months.

Just as a sidenote, the Celeron J1900, while being a pretty amazing CPU (pretty snappy, 6 watts idle at the outlet for a complete system), is not totally stable under linux (there are known hardware issues with its ACPI interaction and deep sleep states), which means these systems freeze from time to time - not quite ideal for a router.

Question, how did/do you assess CPU utilization?

Through the Tomato GUI and then through SSH and the "top" command to see if there was a discrepancy.

Routers shouldn't be in deep sleep ever. And worrying about power saving on a J1900 which is idling at 10 watts is pointless. So just disable all cpufreq controls and have it run full speed all the time. I personally am running an ASRock Rack J1900-based mobo and it's never frozen in the 2 years I've had it running continuously as a router + RAID NFS server + Squid proxy + OpenVPN server. So, I could imagine it might be an issue for something like a media box that goes to sleep regularly... but it's not a problem I've ever experienced at all.

Agreed about the money and payback time. But at this point it seems switching to an x86 costs him zero up front and he doesn't pay for the electricity anyway... if an x86 works well for him, buying one that uses less power is probably a good choice environmentally but maybe he will upgrade a desktop machine and wind up with another x86 box that uses 15 or 20 watts idle he can swap into service...

Unless you're running a core router for a larger business (at least employee numbers in the 3 figure range), your router will always have idle time - while you can reduce C-state usage, you can't completely prevent C-states or CPU scaling from the OS side. The freezes associated with this on baytrail-d are documented and still unfixed. I'm affected by this myself with an ASRock Q1900DC-ITX board - while it can go weeks without a hitch, there are times when it freezes 3 times a day (and no, intel_idle.max_cstate=1 doesn't prevent this completely).

But I really don't want to derail the original topic much further, even less so as I agree with you in principle (I just put the cut-off for x86 being worthwhile at a significantly higher range than just 100 MBit/s and caution specifically against baytrail-d).

Ah, I am sure you are on top of this, but when I started using "top -d 1" to monitor CPU utilisation I initially thought everything was okay, until I realized that the number to really look at is "idle": if idle hits 0 for a noticeable share of the top samples, chances are that you are bottlenecked on CPU... But again, you probably know that.

Oh sure, it'll have idle time, but going into low power mode on an already extremely power efficient processor that needs to have low latency when packets do arrive seems pointless. Actually, thanks for pointing out the issue. It's never hit me at all, but it's a good thing to know about. I've just been reading about it, and it sounds like it can depend on BIOS settings, so perhaps there's just something really good about the ASRock J1900D2Y mobo.

Are those freezes seen in more recent N3xxx series? Such as this device uses: https://www.amazon.com/Firewall-Appliance-Gigabit-AES-NI-Barebone/dp/B072ZTCNLK

Did not know this. Am not a computer person. Could you elaborate on what this means? The only thing I have ever really looked at (in terms of CPU usage) is CPU percentage and load average, taking into account the number of cores.

CPU usage is divided into various types of usage. Here's top on my Linux desktop box:

%Cpu(s): 0.9 us, 0.3 sy, 0.0 ni, 98.8 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st

You can see it's 98.8 percent idle (id) right now. If it were having problems with the network, the "si" (softirq) usage would go through the roof and id (idle) would drop to zero.

You might have a different output for your "top", but GUIs often report only the "us" value (user), and that can be small while the kernel spends LOTS of time processing packets (in "si").
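
If your "top" doesn't show that breakdown, the same numbers can be derived from /proc/stat, which is where top gets them. Below is a rough sketch, assuming a Linux system with Python on it (so probably not the Tomato router itself). The function names are made up for illustration. It takes two samples of the aggregate "cpu" line one second apart and turns the differences into percentages, where idle and softirq correspond to "id" and "si" in top.

```python
import time

# Order of the first eight counters on the "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # Very old kernels report fewer columns; treat missing ones as 0.
            values += [0] * (len(FIELDS) - len(values))
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate cpu line found")


def cpu_percentages(before, after):
    """Percentage of time spent in each state between two samples."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * d / total for k, d in deltas.items()}


def sample(path="/proc/stat"):
    with open(path) as f:
        return parse_cpu_line(f.read())


if __name__ == "__main__":
    try:
        first = sample()
        time.sleep(1)
        second = sample()
    except OSError:
        print("/proc/stat is not available on this system")
    else:
        pct = cpu_percentages(first, second)
        print(f"idle {pct['idle']:.1f}%  softirq {pct['softirq']:.1f}%  user {pct['user']:.1f}%")
```

A router that is bottlenecked on packet processing would be expected to show idle near zero and a large softirq share here, even when the user percentage looks small.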