Shaping performance

I've actually used m4 in the past for crunching large text files into more useful formats. I haven't used ANTLR (though I have skimmed the book and was considering it for another project), and I don't do Java. ANTLR will work with several other languages, of which JavaScript is probably the one I could manage, but... eh, I'm inclined to try the m4 approach: massage it all into JSON and let people use that.

That being said, it's probably been over a decade since I wrote anything in m4...

I've been thinking about this a bit too and feel like a synthetic benchmark with a PC acting as a server might be more reproducible and better for building a database. It wouldn't be as good for crowdsourcing perhaps and might not be as "real-world" but I think it would be more scientific and potentially more reliable.

Something simple like setting the SQM bandwidth to a really big number (e.g. 1 Gbps) and then running iperf locally with a PC acting as a server. The actual bandwidth achieved should be limited by the CPU's ability to deal with SQM/Cake/Codel. Since the SQM bandwidth is set too high, the actual latency might be high, but the bandwidth should be close to the maximum the router can handle. At that point, you could set an SQM bandwidth slightly lower (e.g. 10%) than the previous "max" value and verify with a simple ping that latency actually remains low. Not sure if Flent or something else could simplify this.

As an alternative approach, maybe inquire on one of the bloat/cake/make-wifi-fast/codel mailing lists to see if someone has an idea on how to benchmark this, or would consider writing a script to do so. The script could ramp up bandwidth while monitoring latency and CPU utilization, producing a final approximate max bandwidth a router can handle.
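The ramp-up loop described above could be sketched roughly like this. Everything here is hypothetical: the two measurement functions are stubs standing in for real `tc`/`ping`/iperf invocations (a real version would run something like `tc qdisc change ... cake bandwidth ${rate}mbit` and parse `ping -c 10` output), and the fake latency numbers just make the loop logic visible.

```shell
#!/bin/sh
# Sketch of the ramp idea: step the shaper rate up, watch latency
# under load, and stop when latency degrades.  Stubs below simulate
# a router that falls over somewhere past 80 Mbps.

set_shaper_rate() { :; }                               # stub: would call tc here
measure_ping_ms() { echo "$(( $1 > 80 ? 60 : 5 ))"; }  # stub: fake latency jump past 80 Mbps

baseline=5      # assumed unloaded latency in ms
max_ok=0
rate=10
while [ "$rate" -le 200 ]; do
    set_shaper_rate "$rate"
    lat=$(measure_ping_ms "$rate")
    # stop once latency under load exceeds baseline by more than 10 ms
    if [ "$lat" -gt $((baseline + 10)) ]; then
        break
    fi
    max_ok=$rate
    rate=$((rate + 10))
done
echo "approx max shaping rate: ${max_ok} Mbps"
```

With the stubs in place this prints an approximate max of 80 Mbps; swapping in real measurement commands is the part that needs doing.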

Just an idea. I have no skills to write any of this, but think the data would be meaningful to help people select a router. It could also be helpful to understand the characteristics of CPU performance that influence SQM performance, which could in theory guide future CPU or router designs (ok, now I'm really dreaming).

A synthetic benchmark will be more reproducible yes, but I don't think better for two reasons:

  1. Real-world performance is what people care about; synthetic benchmarks give lower-noise (variance) but more biased (different from what's wanted) information relative to something like the speed-test results.
  2. Synthetic benchmarks won't have the ISP equipment in the picture, in particular the range of behaviors possible with DOCSIS or DSL.

In the end the main thing is to provide a predictive estimate, from a statistical model, of the real-world performance that you can reliably achieve, so that some speed rating can be assigned. I personally imagine this speed rating being rounded to the nearest 25 Mbps, so you'd see something like 75 for some ancient WRT54GL, 125 for some 2007 hardware, and maybe 250 or 325 for a WRT3200ACM (based on results in a different thread), or whatever... This lets people say things like "Hey, I can get 100 Mbps now and might go to 200 or 250 in a few years, but I doubt I'll get more than 300 Mbps in the next 7 years, so I can go with any router on this list with more than a 275 rating..."

or alternatively, "Hey, I might get 500 Mbps in a year or two; I should really move up to an x86 box now, even though a WRT3200ACM would handle the 200 Mbps I have today..."

As long as it facilitates that kind of decision-making effectively, I don't think we need dramatically more precision than that. People who want to tinker and eke out a few extra tens of Mbps will need to do their own testing on their particular link, at their particular load level, and with their particular set of packages installed anyway.

Ok, here's how to get a really basic JSON output. First grab the stat_output.txt file and put it in a directory.

Then, install GNU m4 on your Linux machine and put this m4 macro script into a file called jsonify.m4 in the same directory:

divert(-1)dnl
define(`delfirstbrack',`define(`delfirstbrack',`')dnl')
define(`delcomma',`')
pushdef(`delcomma',`dnl')
define(`cpuent0',["$1"`,translit('$2`,` ',`,')]')
define(`cpuentn',`,'["$1"`,translit('$2`,` ',`,')]')
define(`date',`delfirstbrack]}
delcomma,
{
"time" : popdef(`delcomma')')
define(`procstat',`,
"cpuvals" :[')

define(`ctxt',`] dnl')
define(`intr',`dnl')
define(btime,`dnl')
define(processes,`dnl')
define(procs_running,`dnl')
define(procs_blocked,`dnl')
define(softirq,`dnl')
define(procnetdev,`,
"interfaces" :[pushdef(`delcomma',`dnl')dnl')
define(Inter,`
dnl')
define(face,`dnl')

define(interfacedata,`delcomma,
define(`delcomma',`')dnl
[patsubst("'$1`",` +',`')`,' patsubst('$2`,` +',`,')]')
divert dnl
m4wrap(]}])dnl
[

Then put this script, called jsonify.sh, into the same directory:

#!/bin/sh
cat jsonify.m4 $1 | sed -E -e "s:#*::g"\
			-e "s:/proc/stat:procstat:"\
			-e "s:/proc/net/dev:procnetdev:"\
			-e "s:(cpu) (.*):cpuent0(\1,\2):"\
			-e "s:(cpu[0-9]+) (.*):cpuentn(\1,\2):"\
			-e "/^ face/,/date/ s/(.*):(.*)/interfacedata(\1,\2)/" | m4


Then chmod u+x jsonify.sh and run:

./jsonify.sh stat_output.txt

It should give you a sequence of JSON objects.

EDIT: It should but I haven't actually tried to parse it, so if you find errors let me know. I'll get a chance to test reading it in probably some time next week.

EDIT: ok, I admit I screwed a few things up so fixing them...

With that script I was able to parse the output using "jsonlite" in R, so apparently I didn't screw up too badly. It also gives a single array of observation objects, so the whole file can be parsed as one object.

For now, I'm hosting donated data in this Google Drive:

https://drive.google.com/drive/folders/1v_S3oFhLEIq49ShKMxjZkgvBQK8IP9ko?usp=sharing

It's read-only, but if people want to donate, message me here and I'll turn on write permissions and let you upload.

@dlakelan I wonder, even though this is a bit preliminary, how about creating a GitHub repository for all the scripts (all three ATM), just to have a better way to collect and develop things? Since you did all the thinking and coding so far, I believe that honor should belong to you ;) I would then clone your repository and create pull requests for potential changes/additions.
Again, many thanks for your expertise and time!

Sure, I am planning to do that and will post when it's available. I'll put up some basic R scripts to read the JSON and make a few basic plots as well.


The date output, as an integer in nanoseconds, has a LOT of irrelevant digits, making it impossible to calculate with accurately in floating-point or 32-bit integer arithmetic. I'm thinking of stripping the first 4 digits off the timestamp during the m4 transformation. Sure, it means there are specific small windows of time every day when the truncated clock rolls over, but... this is probably not a real concern, and generally detectable anyway. It should be done before the parse, because JSON parsers will try to read the number and screw everything up.

thoughts?

EDIT: I could perhaps, in m4, split the number into the first 6 digits and the rest, and do arithmetic on the first 6 digits to subtract the initial value, thereby fixing this problem using actual arithmetic.
EDIT2: Actually that turned out to be pretty easy; it basically eliminates 6 of the digits, letting the rest fit in a double float. It won't work well for languages that use 32-bit integers, but will be fine for things like R.
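To illustrate the rebasing idea (this is not the actual m4 code, just an awk sketch with made-up timestamps): a 19-digit nanosecond timestamp can't be held exactly in a 64-bit double, but if you split off the leading digits as a string and subtract the run's initial value, the remaining offset fits exactly.

```shell
# Split each 19-digit timestamp into leading 9 digits and trailing 10,
# subtract the first sample's leading digits, and recombine.  The
# trailing 10 digits (< 1e10) fit a double exactly, so the arithmetic
# is exact.  Timestamps are invented for the demo.
offsets=$(printf '%s\n' 1545671234567890123 1545671234667890123 |
awk '{hi = substr($1, 1, 9) + 0      # leading digits, handled separately
      lo = substr($1, 10) + 0        # trailing 10 digits, exact in a double
      if (NR == 1) base = hi
      printf "%.0f\n", (hi - base) * 1e10 + lo}')
echo "$offsets"
```

The two output offsets differ by exactly 100000000 ns (100 ms), i.e. one sample interval, with no roundoff.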

Proof of concept is working. I have parsed the JSON output in R and plotted idle time vs bandwidth. It needs some thought how to extract useful low-noise stats, but at its core the basic idea seems to work. I'll package it up into GitHub soon.

Great, that would have been my approach as well; keeping nanosecond-resolution precision seems like a good idea, and using a 64-bit floating-point representation also seems acceptable.

Yep, finding the best statistics seems like the real challenge :wink:

Here is a sample plot from @moeller0's data:

ExampPlot

As you can see, the slope of the cumulative cpu-idle curve on this plot corresponds to the percentage of available cpu idle. When the curve goes very flat, it's because we're very nearly out of CPU. During that period we've got a certain receive transfer speed, related to the slope of this graph:

ExampPlot2

Unfortunately, taking adjacent differences on these time series produces very poor data, due to the high noise associated with roundoff error and the like (on this router, there are only 10 cpu time ticks per 100 ms interval). Using the cumulative graphs like this, then fitting smooth curves and extracting the slopes of those smoothed curves, should produce a better result.
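The smooth-before-differencing point can be seen even with a crude toy example (this is not the GAM fit used above, just a 3-point moving average in awk over an invented cumulative counter that grows by ~10 per sample):

```shell
# Toy cumulative series with tick-quantization noise; raw adjacent
# differences would be 10, 9, 12, 9, 11.  Smoothing the cumulative
# values first, then differencing, pulls the estimates toward the
# true slope of 10.
rates=$(printf '%s\n' 0 10 19 31 40 51 |
awk '{v[NR] = $1}
     END {
       for (i = 2; i < NR; i++) s[i] = (v[i-1] + v[i] + v[i+1]) / 3  # 3-pt moving average
       for (i = 3; i < NR; i++) printf "%.2f\n", s[i] - s[i-1]       # smoothed derivative
     }')
echo "$rates"
```

A GAM (or any penalized smoother) does the same thing in a more principled way, but the mechanism is identical: estimate the slope of the smoothed cumulative curve rather than differencing raw ticks.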

I imagine in the upload direction the transfer will look less linear, as tcp streams ramp-up.

I think we have a proof of concept here, the data collection works.

Well, it seems this build's HZ value was set to 100 (tested with awk '{print$22/'$(tail -n 1 /proc/uptime|cut -d. -f1)"}" /proc/self/stat, taken from Stack Overflow), so that seems like a good confirmation for the data collection.
I also like the simple cumulative plots you show. For the CPU(s), I believe a more traditional display of percentage over time might be easier to read: simply show a curve for each of the columns in the /proc/stat per-cpu line(s), or at least idle and sirq, since for kernel shapers we expect sirq to show the cost of the shaper, while idle will show whether other tasks (user or kernel) eat too much of the left-over cycles and hence might stall the shaper*. For the interfaces it might also be easier to read if we translate back to instantaneous bandwidth use. That will most likely require some low-pass filtering/smoothing, since at any one instant an interface is either idle or 100% busy, so the filter would probably need to take min(interface_speed, shaper_speed) into account. But I am getting ahead of myself; I should try to find some time and actually play with what we have right now :wink:

Again many THANKS (sorry but I needed to shout that)

*) This would still indicate that the shaper is too costly for the given circumstances, but it might be nice to know whether it is the pure shaper that chokes, or whether at a given shaping rate there still is some reserve for other (variable-load) duties like operating the wlan radios...

My general thoughts are along the lines of yours. Here's a scenario where I first smooth all the cumulative data with a generalized additive model, then extract predictions once per second, and take differences of the predictions (a smoothed derivative). The plot is estimated cpu sirq vs estimated bandwidth:

ExampPlot2

From this I can infer that you've set SQM at around 45/10 Mbps and that your system is pretty well tuned, so that you're using around 88% sirq at full download. A similar plot for percent idle shows you're pretty well out of idle time at 45 Mbps:

ExampPlotIdle

The biggest hurdle to automating this is really determining in an automated way which interface is the WAN :slight_smile:

In terms of analyzing the overall router landscape, my thought is that we figure out how to analyze individual routers, as we're doing now, and extract a max-speed prediction for each one (for example, a simple way would be to take from the above graph the intersection of the line with the cpu-idle = 0 point, or 1 - 1/N for N-core devices). We then build up a table of these predictions, together with info on the CPUs (core type, core frequency, etc.).
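The "intersection with cpu-idle = 0" idea above is just a linear extrapolation, which can be sketched with two measured (bandwidth, %idle) points. The numbers here are invented for illustration:

```shell
# Hypothetical two-point extrapolation: extend the idle-vs-bandwidth
# line down to idle = 0 to estimate the maximum shaping rate.
est=$(awk 'BEGIN {
  b1 = 20; i1 = 60;              # assumed: 60% idle at 20 Mbps
  b2 = 40; i2 = 15;              # assumed: 15% idle at 40 Mbps
  slope = (i2 - i1) / (b2 - b1)  # idle percentage lost per Mbps
  printf "est. max: %.1f Mbps\n", b1 - i1 / slope
}')
echo "$est"
```

For a multi-core device the same line would be extended to idle = (1 - 1/N) * 100 instead of 0, per the note above; and in practice you'd fit the slope from the whole smoothed curve rather than two points.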

Then, using this database across multiple routers, multiple settings, multiple people, etc., we build a predictive model for the max bandwidth supported, and for each device in the hardware table we output a predicted "reliable shaping speed". For your router, for example, it'd probably be about 40 Mbps.

In addition, I'd like a wiki page that plots predicted shaping speed in Mbps vs cpu core frequency, with labeled points for different routers, so that somewhere on the wiki you could look at this one plot, see where various routers lie, and make a selection that fits your criteria just visually.

Ideally I'd like to see x86 boxes on that plot too.


Ok, I pushed the current stuff to GitHub with a GPL3 license.


Since we look at this in the context of using sqm-scripts or qos-scripts, we should just cat /etc/config/qos and/or /etc/config/sqm into our output file, as these will usually be instantiated on the WAN interface (if there are more enabled configurations we can still decide to error out).

Also, I wonder whether we should include an ICMP latency probe in our data collection to assess how well the shaper keeps latency under load in check; one diagnostic of an overloaded shaper is not only less bandwidth than expected but also more bufferbloat. If we add this to our collector script we will not have to rely on the (fine) dslreports speedtest as a load generator; anything will work just fine. Heck, even better, we will be able to measure a "bufferbloat score" independent of the speedtest used. I will have a look at the cost involved (the question is mostly the size of the log file for the RTT probe relative to the free space in /tmp, so this might need to be adaptive and only start the probe when there is sufficient space; also, this might require the full ping binary).
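A compact RTT log along those lines could be just "timestamp rtt_ms" pairs extracted from ping output. Sketch below, with a canned sample line so it runs without network access; a real collector would pipe something like `ping -D -i 0.2 <host>` (iputils' `-D` prepends a bracketed epoch timestamp) through the same awk:

```shell
# Canned sample of an iputils "ping -D" output line:
line='[1545671234.567890] 64 bytes from 8.8.8.8: icmp_seq=1 ttl=117 time=12.3 ms'

# Pull out the epoch timestamp and the RTT in ms.
entry=$(echo "$line" |
awk -F'time=' '{
  split($1, t, "[][]")       # t[2] = epoch timestamp between the brackets
  split($2, a, " ")          # a[1] = RTT value before " ms"
  print t[2], a[1]
}')
echo "$entry"
```

Two numbers per probe keeps the log small, which matters for the /tmp space concern; log size is then just (probe rate) x (test duration) x (bytes per line), easy to bound in advance.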

Sidenote: ingress shaping via ifb is more expensive than direct shaping on egress, so I am not sure whether it makes sense to linearly extrapolate from ingress and egress data. (Initially that will be okay, but for more precise measurements we might need to take two measurements with different ingress shaper settings and extrapolate from those; after all, I guess we want a somewhat conservative estimate of shaping capability, no?)