Limitations on connections in ISP equipment (was LEDE packet per second performance versus other firmware)

Putting this machine on a VLAN is an excellent idea. Already have some large managed switches (running in unmanaged mode). Might not be as safe, but will probably segment it via a VLAN at router level to keep it simple and avoid managing yet another device, while still vastly improving security if something were to go wrong.

Am glad this all occurred now. If this is a connection limit issue, bear in mind am already at >1500 connections for regular usage, and this is not the largest office in terms of personnel.

if you have managed switches then just add the vlan to one of the ports of the existing switch, plug your onion server into that port, and add that vlan interface to the router. existing managed switches can probably bandwidth limit and low-prioritize this server's traffic already.

you could actually hang another 100Mbit ethernet card off your new dedicated router and just plug your onion server into that port too. I wasn't clear on whether that's what you meant. If you do that though, you'll have to set up QoS / bw limiting in your router software. Might be easier to just click some button in your managed switch and turn on 30 mbit max bw in and out of the port.

Yes I suspect this is a connection tracking issue. typical use case for Tomato doesn't involve 18,000 Chinese people trying to access the internet from your LAN :slight_smile: so connection count is more likely to be set at 1000 than say 150,000

Your x86 solution will handle this better I'm sure in part due to having more RAM and so defaulting to much higher connection counts.

Agreed. Theoretically Tomato software has a default connection count of: 16384

It is quite possible, however, that this is not reachable on all hardware (and not on the RT-N66U specifically), as this case suggests. But wouldn’t one expect RAM usage to be elevated if the issue is in fact a “connection tracking” problem? Have not seen high RAM usage, but maybe high RAM usage would not result from this, or possibly the GUI is not tracking RAM usage correctly.

See what you mean. Forgot that need a dedicated port for each VLAN no matter what. Am leaning towards creating a VLAN within the router using the original fast ethernet card that came with the old PC. Has been on list but never bothered to learn how the managed Cisco switches work (assume there is just a GUI login and it is self-explanatory) but this creates another device that needs to be managed. Still it is another option.

Yes, if you don't know much about the Cisco switch config then you probably just want to hang your server off the existing fast ethernet port and configure that, then put your real LAN off the new Intel GigE port. This solution is technically a separate physical LAN, whereas a VLAN is virtual (multiple virtual lans can travel over the same wire so to speak by tagging the packets as to which VLAN they belong to).
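
To make the "tagging" part concrete, here is a minimal Python sketch of how an 802.1Q tag marks which VLAN an Ethernet frame belongs to. The function names and the VLAN ID 20 are made up for illustration; only the tag layout (TPID 0x8100 followed by a 16-bit field holding priority and VLAN ID) comes from the standard.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q-tagged frame


def build_vlan_tag(vlan_id: int, priority: int = 0) -> bytes:
    """Return the 4-byte 802.1Q tag for the given VLAN ID and priority."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be in 1..4094")
    if not 0 <= priority <= 7:
        raise ValueError("priority must be in 0..7")
    # Layout: priority (3 bits), drop-eligible bit (left at 0), VLAN ID (12 bits)
    tci = (priority << 13) | vlan_id
    return struct.pack("!HH", TPID_8021Q, tci)


def tag_frame(frame: bytes, vlan_id: int, priority: int = 0) -> bytes:
    """Insert the tag after the destination and source MACs (first 12 bytes)."""
    return frame[:12] + build_vlan_tag(vlan_id, priority) + frame[12:]


def vlan_of(frame: bytes):
    """Return the VLAN ID of a tagged frame, or None if it is untagged."""
    tpid, tci = struct.unpack("!HH", frame[12:16])
    return (tci & 0x0FFF) if tpid == TPID_8021Q else None


# Hypothetical untagged frame: zeroed MACs, IPv4 EtherType, dummy payload
untagged = bytes(12) + b"\x08\x00" + b"payload"
tagged = tag_frame(untagged, vlan_id=20)
```

The switch and the router do this in hardware or in the kernel; the sketch only shows why two VLANs can share one cable without mixing traffic.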

default connection count of 16k is a lot, so it's surprising if the connection tracking is the issue... however you wouldn't necessarily see RAM usage grow, the 16k table is pre-allocated.
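
If you can get a shell on the router, a rough way to check how full the table is would be something like the Python sketch below. It assumes a reasonably recent Linux kernel that exposes `nf_conntrack_count` and `nf_conntrack_max` under `/proc/sys/net/netfilter/`; older firmware may put them elsewhere, and the function names here are my own.

```python
from pathlib import Path

# Locations on recent Linux kernels; an assumption for any given firmware.
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def parse_usage(count_text: str, max_text: str):
    """Turn the two file contents into (count, limit, fraction in use)."""
    count = int(count_text.strip())
    limit = int(max_text.strip())
    return count, limit, count / limit


def conntrack_usage(count_path=COUNT_PATH, max_path=MAX_PATH):
    """Read the current and maximum number of tracked connections."""
    return parse_usage(Path(count_path).read_text(), Path(max_path).read_text())
```

Calling `conntrack_usage()` on the router returns the current count, the limit, and the fraction in use. If the count sits near the limit while things are failing, conntrack is a likely suspect; if it stays far below, look elsewhere.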

Are you in a dual NAT situation? Does tomato get a public IP on its WAN or is the upstream router having its own NAT? That might be where the connection limit is.

you have an interesting issue here. some of it might be bufferbloat related, some conntrack related, some might even be related to a third piece of kit on your LAN, like a switch or something that doesn't like all the traffic from this server. You mentioned a WRT54GL somewhere in this equation, that could be maxing out its CPU for example.

Also, having many virtual chinese people all trying to access resources through your network is basically the same as some of the worst case conditions being tested in that router testing link above:

Look at the entry on this graph for, say, 100k-10000: that's 10000 connections all trying to download a 100k file over and over again. The Netgear Nighthawk is doing something like 200 Mbit/s under these conditions. I assume your existing router is probably closer in performance to the linksys-n600 thing he tests, which is basically getting zero bandwidth under those conditions, so that's where you are.

Just using the switches as unmanaged right now, from patch panel to router. Tomato actually does nice, easy VLANs, which are used here for wireless (i.e. hooking up the access points that provide the WiFi).

Dual NAT yes but tested on the ATT NVG 595 as direct router instead of RT-N66U and there were no issues. ATT Fiber requires use of NVG 595 or variant (it seems to not actually be an issue though).

Old WRT54GL’s are just for offices with cubicles where there are not enough wall ports for each.

Well, you've got a bunch of avenues to look at, at this point. I think best to go work on it some and see how it goes.

I'm pretty sure you can put that NVG 595 into an ip-passthrough mode and disable all of its firewall settings (I have ATT Fiber and an NVG 599 and it certainly can do that). It may be detecting a "flood" because with dual-nat ALL of your bandwidth is coming from one router IP and it sees say thousands of connections being opened and closed from that one IP... it wouldn't necessarily respond the same as if you were using it as a router and each device on your LAN had a different internal IP handed out by the 595.

That is a good point. All the advanced firewall features (especially flood control) have been off on the NVG595, and the setup is the same passthrough mode as yours. The new hardware router will tell if the issue continues to occur. Will figure out the wireless VLANs too without Tomato on the RT-N66U. Thanks for the great assistance. Next step is definitely to check what happens with a more powerful router.

Final thought for the day: whatever software you're running to provide the access on this problematic server, it may have resource limiting settings, you might do well to just tell it only allow say 100-300 connections and use no more than 30 Mbits of bandwidth or something like that.

To be fair, this could also be caused by a poor firmware implementation in these consumer routers. I'd love to see a similar test, but with Lede instead.

Definitely true, with well tuned firmware it's very possible that you'd get more graceful degradation, but I doubt you'd get good performance, just something short of a total failure.

I'm not so sure. If the code is efficient, and the router has muscle left (in RAM and CPU power), you shouldn't be hitting any bottlenecks, just like the x86 isn't hitting bottlenecks.

It could be possible that the hardware simply cannot cope, but you would be able to see that by running out of memory or capping out cpu utilization. I'm seeing neither on a 500 mbit connection with a mt7621 device. I am able to max my connection through a single file download, but also when using hundreds of torrent connections. I can max out my connection before I max out my router.

Well I'm guessing the mt7621 has a lot more cpu ability than the linksys-n600 thing he was testing, but your point is well taken. This kind of test would be a great thing to see on an array of current LEDE devices. I'd also really like to see it on ipv6 connections, where NAT is not an issue.

Agreed. But the main question is whether, given efficient firmware, a device such as the mt7621 is enough, or whether x86 is always needed for a large number of connections at high bandwidth. I'd wager that Lede would be sufficiently fast, as long as the device isn't bottlenecking.

The test he ran is a bit more stringent than hundreds of torrent connections actually. He runs N threads and each one opens an http connection, downloads a file of size X and then closes the connection and repeats. So each time a new connection is made a new NAT table entry has to be created etc etc. It's an extreme test, he had to use extremely beefy machines as the test client and test server.
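
For anyone who wants a feel for the shape of that test, here is a small Python sketch of the same pattern: N threads, each repeatedly opening a fresh connection, downloading a fixed-size file, and closing. This is not the tool from the article, just my approximation; all names are made up, and the local server helper is only there for a dry run, which never crosses a router and so exercises no NAT at all.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def churn_worker(url, repeats, results, lock):
    """Open a new connection, download the body, close, and repeat."""
    for _ in range(repeats):
        # urllib does not reuse connections, so every call is a new TCP
        # connection (and, through a NAT router, a new NAT table entry).
        with urllib.request.urlopen(url, timeout=10) as resp:
            size = len(resp.read())
        with lock:
            results.append(size)


def run_churn_test(url, threads, repeats):
    """Run `threads` workers in parallel; return the list of download sizes."""
    results, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=churn_worker, args=(url, repeats, results, lock))
        for _ in range(threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


def start_local_server(payload_size):
    """Serve a fixed-size payload on a free loopback port (dry run only)."""
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = b"x" * payload_size
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the dry run quiet

    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

To load a router for real, the server has to sit on the WAN side and the client on the LAN side, and Python threads will run out of steam long before the numbers in the article, which is part of why he needed such beefy machines.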

Extreme tests are nice to compare devices when pushed to their absolute maximum. But if they don't have any comparison to real-world usage, then I would say the test isn't that interesting in determining whether a device is needed for a particular connection speed.

@Mushoz There is no need to always fullquote the previous posting.

This goes without saying, the question is whether the devices do bottleneck. Many commercial devices do run firmware based on linux kernels, and they often use proprietary drivers, so LEDE isn't necessarily going to be better. However, you can tune LEDE a lot more, so it might be possible to enlarge conntrack tables and do things that gave better graceful degradation.

Evidently the OP has a TOR node or some similar device that gives potentially thousands of people access to the internet through his LAN, and so in this case the test with 1000 threads is actually very relevant to the issue.

Even in a small business of say 10 or 20 employees, there are times when you might see this kind of behavior. For example, updating a Debian desktop means you'll download thousands of .deb packages ranging from a few tens of KB to a few tens of MB. Similarly synchronizing a laptop directory to a cloud storage like Google Drive. Suppose you go somewhere with a camera and take 1000 or 2000 photos, come back to the office, and download the photos to a synced directory. Depending on the software it might be making and breaking thousands of connections in a short time.

Similarly I am working on a project where I'm trying to help a YMCA that has potentially hundreds of people trying to stream music or surf the web on their phone while they work out. Here again hundreds or thousands of make-and-then-break connections are very possible. And in a separate thread here we're talking about similar issues, and several people suggested that they're providing public wifi at events like flea markets or through "Freifunk" in Germany. The multi-user performance seems to be very different from the small household use-case.

Installed x86 router. Tried OPNSense, PFSense, and IPFire (would have tried LEDE but it didn't seem geared towards x86 devices). All installed very easily. Both BSD variants had too many features and options that did not understand/need. IPFire had a good deal of quirks which won't go into here, but reporting was excellent. The connection bottleneck issue, however, still persisted after the x86 router installation. Thanks to the help of everyone here, figured out that the issue is the NAT table session limit of 2048 on the ATT NVG595. Previously, when tested solely with the ATT router, did not test scientifically. Nothing else was connected, probably did not wait long enough to fill the session limit, and prior to yesterday did not know how to view the session connection table on the ATT router. If purchase a static IP, ATT says that the NAT connection table limit of 2048 will be bypassed (since passthrough will not be used), so will test this. However, have received conflicting answers from ATT today to this question.

The Asus RT-N66U hardware with Tomato software previously used was certainly not what was bottlenecking the connection. Scientific testing would be needed but believe the RT-N66U could handle more than 2,000 sessions. As to how many, do not know. It has a 600 MHz processor and 256 MB of RAM. The Linksys N600 EA2750 mentioned above as a comparison has a 580 MHz CPU and 64 MB of RAM, so the test above from the ArsTechnica graphic/article is somewhat relevant.

To conclude: the bottleneck here was the internet service provider's router, in a double NAT setup, limiting active sessions to no more than 2048.

Thanks for the update.

Here's the thread that lays out this issue. http://www.dslreports.com/forum/r29898675-U-Verse-Business-NVG585-NAT-limit

I haven't run into it myself, but my connection is not serving hundreds of users. Also, I checked on my NVG599 and it shows under Diagnostics > NAT Table that the table size is 4096, so you might ask ATT to switch out your 595 for a 599 with a larger NAT table.

Furthermore, I do various things that might reduce my connection table size. I use a squid proxy running on my router, I use ipv6, and I use firewall rules to force all my LAN machines to do DNS through my router which caches the results. The DNS redirect might itself do a lot for you, after all if each "user" is hitting DNS off the internet, and you're providing hundreds of people access through your special bridge node, you could have thousands of connections just for DNS and NATting that to your router could drop those thousands to just tens.

You might try something like that, which may reduce the load on the NAT table of your NVG device.

Finally, the cost of the connection is so low compared to going to a "real" business fiber connection for multi $K per month, that you might look into tunneling certain traffic to a cheap VPS. For example a $10/mo Linode will handle 2TB of transfer out per month. Just set up a GRE tunnel (doesn't need to be OpenVPN/encrypted) and let all your "donated" connections go over the GRE tunnel to your Linode. This will collapse all of the donated connections down to 1 connection as far as the NVG device is concerned.
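
Here is a toy Python model of why the tunnel helps, assuming the NVG keeps one NAT entry per distinct flow and passes GRE through (I have not verified either on the 595). The addresses are documentation ranges and the function names are made up; only the 2048 limit comes from this thread.

```python
NAT_LIMIT = 2048  # session limit reported for the NVG595 in this thread


def nat_entries(flows):
    """Count distinct NAT entries for (proto, src_ip, src_port, dst_ip, dst_port) flows."""
    return len(set(flows))


def tunnel_all(flows, local_ip, vps_ip):
    """Wrap every flow in one GRE tunnel, so upstream sees only the outer flow."""
    # GRE has no ports: the outer flow is just protocol plus the two endpoints.
    return [("gre", local_ip, None, vps_ip, None) for _ in flows]


# 3000 hypothetical connections from the server to assorted destinations
flows = [
    ("tcp", "192.168.1.50", 10000 + i, "203.0.113.%d" % (i % 250 + 1), 443)
    for i in range(3000)
]

direct = nat_entries(flows)
tunneled = nat_entries(tunnel_all(flows, "192.168.1.50", "198.51.100.10"))
```

In this model `direct` is 3000, well over the 2048 limit, while `tunneled` is 1. The per-connection state does not disappear, it moves to the VPS, which has to do the NAT for those connections instead.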