Bug in the DHCP mechanism?

Willy · May 23, 2018, 11:38pm

I think I have found a bug in the DHCP mechanism, or at least it does not behave as I expect. I have a main router with IP 192.168.1.1 and DHCP enabled, and a second router with IP 192.168.2.1, also with DHCP enabled. A LAN port of the main router is connected to the WAN port of the secondary router. The computers connected to the second router normally get 192.168.2.x addresses, but if I turn off the secondary router, wait a moment, and turn it on again, the computers get an IP from the main router and can even surf the Internet for a few seconds before the secondary router is fully operational. The problem is that once it is operational, the computers that were already on do not get a new IP from the secondary router; they keep the IP that the main router gave them and have no network or Internet access. I have to disable the network card and enable it again to obtain an IP from the secondary router.

TIA

slh · May 23, 2018, 11:50pm

What you're seeing is not an issue with DHCP itself, but with the internal switch of your device. You probably have a device with only a single CPU port to the switch, so the switch needs to isolate WAN and LAN via a VLAN setup. The default configuration immediately after power-on (before OpenWrt gets a chance to upload the desired VLAN setup into the switch from netifd/userspace) probably comes up in an unmanaged mode, bridging all ports together. This window, from power-on until netifd uploads a correct switch setup, is probably long enough for your clients to successfully acquire a DHCP lease from the upstream router.

This behaviour is effectively ingrained in hardware and 'unfixable'. The only thing that could be done is to minimize the time frame of the unconfigured switch state, either by changing the bootloader or by inserting a secondary loader between bootloader and kernel whose only purpose is to isolate all switch ports as quickly as possible. Neither will completely fix the problem, but it might help enough to prevent a successful DHCP handshake. Such a fix is highly device specific and requires intimate knowledge of the hardware (down to the switch registers needed to enable port isolation).

jeff · May 23, 2018, 11:53pm

There is no "push" mechanism in the DHCP protocol to "require" a client to get a new lease. Once a client has a lease, there are two timers that it uses to determine when to look for a renewal (or a new lease, if the renewal is declined or the original DHCP server is not available). The first (T1) is when it starts trying to renew the lease with the server that issued it; the second (T2) is when it starts asking any available server to extend the lease. Neither is "binding" on the client. Often these times are set (by the DHCP server) to be hours or more.

I personally set them much shorter, but they still are in the minutes, not seconds. From my kea.conf with a reference to the RFC describing the timers:

    // https://tools.ietf.org/html/rfc2131
    // Times T1 and T2 are configurable by the server through options.
    // T1 defaults to (0.5 * duration_of_lease).
    // T2 defaults to (0.875 * duration_of_lease).

    // Global timers specified here apply to all subnets, unless there are
    // subnet specific values defined in particular subnets.
    "renew-timer": 300,
    "rebind-timer": 600,
    "valid-lifetime": 900,
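
To make the timer arithmetic concrete, here is a minimal Python sketch (not part of Kea or of any DHCP client) that computes when a client would start renewing and rebinding. The function and class names are made up for illustration, and the 12-hour lease is just an assumed example of a long lease.

```python
from typing import NamedTuple, Optional


class LeaseTimers(NamedTuple):
    t1_renew: float   # client starts renewing with the issuing server
    t2_rebind: float  # client starts asking any server to extend the lease
    expires: float    # lease is no longer valid


def lease_timers(valid_lifetime: float,
                 renew_timer: Optional[float] = None,
                 rebind_timer: Optional[float] = None) -> LeaseTimers:
    """Return (T1, T2, expiry) in seconds after the lease was granted.

    When no explicit values are given, the RFC 2131 defaults apply:
    T1 = 0.5 * lease duration, T2 = 0.875 * lease duration.
    """
    t1 = renew_timer if renew_timer is not None else 0.5 * valid_lifetime
    t2 = rebind_timer if rebind_timer is not None else 0.875 * valid_lifetime
    if not (0 <= t1 <= t2 <= valid_lifetime):
        raise ValueError("expected 0 <= T1 <= T2 <= valid lifetime")
    return LeaseTimers(t1, t2, valid_lifetime)


# Values from the kea.conf excerpt above.
short = lease_timers(900, renew_timer=300, rebind_timer=600)

# Hypothetical 12-hour lease with no explicit timers (RFC defaults).
long_default = lease_timers(12 * 3600)

print(short)
print(long_default)
```

With the short timers a client first tries to renew after 300 seconds; with an assumed 12-hour lease and default timers it would not even try for 21600 seconds (6 hours). A client that picked up a lease from the "wrong" server keeps it at least that long unless its link is reset.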

Willy · May 26, 2018, 4:29pm

@slh I think you're right about everything. In fact the secondary router is an ADSL router without a WAN port; I created the WAN port with a vlan2, so I imagine that for a short period of time the four ports form a single switch, and when the configuration is fully loaded that port is separated from the other three. What I have to do is test whether the same thing happens with a router that has a separate WAN port. If it does, it would be a big problem, since we could only use this type of device as an access point, not as a real router. Could the developers do something about it? What do you think?
TIA

jeff · May 26, 2018, 9:58pm

I don't know that it could be prevented as I didn't see anything in the data sheet that suggested that the AR8327 switch I looked at could be brought up in a pre-configured state.

The standard OpenWrt switch stanza starts with

config switch
        option name 'switch0'
        option reset '1'

which would also give at least a brief period where the switch was "leaky".

Edit: Clarify VLAN guidance

One way to mitigate the issue might be to use VLAN tagging between the two routers. Let's say that you tag your main router with VLAN 100 for the 192.168.1.0/24 net and run DHCP on that VLAN alone on that router.

Similarly, tag the secondary router with VLAN 200 for the 192.168.2.0/24 and run its DHCP on that VLAN alone.

Now, on the connection between the two, run it with VLAN 100 tagged.

If the switch on router two is not yet configured, it will be broadcasting untagged packets, which won't be seen by the DHCP on router one due to VLAN filtering. I can't confirm that the switch will reject untagged packets, or can be configured to do so, but at least the VLAN-aware interface on router 1 should only accept those on VLAN 100.

If the first router doesn't support VLAN tagging on the port, connect it to a cheap, managed switch where the port is set to PVID of 100, untagged, and connect another port tagged VLAN 100 to the second router. This "moves" the tagging out of the first router.
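
The filtering idea above can be sketched in a few lines of Python. This is only a toy model of what "tagged" means on the wire (an 802.1Q tag is the value 0x8100 followed by a 16-bit field whose low 12 bits carry the VLAN ID); it is not how OpenWrt or any switch driver implements it, and the helper names and the MAC address are made up for illustration.

```python
import struct
from typing import Optional

TPID_8021Q = 0x8100  # type value that marks an 802.1Q-tagged frame


def make_frame(vlan_id: Optional[int], ethertype: int = 0x0800,
               payload: bytes = b"") -> bytes:
    """Build a bare Ethernet frame, tagged when vlan_id is given."""
    dst = b"\xff" * 6                    # broadcast, as a DHCP discover uses
    src = bytes.fromhex("020000000001")  # made-up locally administered MAC
    header = dst + src
    if vlan_id is not None:
        if not 1 <= vlan_id <= 4094:
            raise ValueError("VLAN ID must be in 1..4094")
        # Tag control field: priority 0, DEI 0, 12-bit VLAN ID.
        header += struct.pack("!HH", TPID_8021Q, vlan_id)
    return header + struct.pack("!H", ethertype) + payload


def vlan_of(frame: bytes) -> Optional[int]:
    """Return the 802.1Q VLAN ID, or None for an untagged frame."""
    (type_field,) = struct.unpack_from("!H", frame, 12)
    if type_field != TPID_8021Q:
        return None
    (tci,) = struct.unpack_from("!H", frame, 14)
    return tci & 0x0FFF


def accepted_on_vlan_interface(frame: bytes, vlan_id: int) -> bool:
    """Model a VLAN interface that only sees frames tagged with vlan_id."""
    return vlan_of(frame) == vlan_id


# Untagged broadcast leaking out of a not-yet-configured switch:
leaked = make_frame(None)
# Properly tagged traffic once router two is configured:
tagged = make_frame(100)

print(accepted_on_vlan_interface(leaked, 100))
print(accepted_on_vlan_interface(tagged, 100))
```

In this model the untagged frame never reaches the DHCP server listening on VLAN 100, which is the effect the tagging scheme relies on. Whether a given router or switch really drops untagged frames on that port still has to be checked on the actual hardware.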

Willy · May 27, 2018, 3:56pm

Thanks @jeff, I guess with LEDE this can be done without problems, I still do not know how to, but... what happens if the main router (ISP router) does not support tagging? The configuration options of ISP routers are usually very limited. Anyway it's a good idea, thanks.

Willy · May 27, 2018, 11:28pm

I can confirm that when the router has a separate WAN port, DHCP works as expected, and that it does not work properly when one of the LAN ports is converted into a WAN port. So it is very risky to use an ADSL router without a WAN port: once one of the LAN ports is converted into WAN, the devices connected to the LAN ports can obtain an IP from the main router, which is not what we want, especially if the goal is to separate a network from the main network.
I also observed a peculiar behavior. When I turn on the ADSL router with the converted port, the computer connects and obtains an IP from the main router, but when I turn on the conventional router, the link on the Ethernet card connects and disconnects up to four times, and then the computer gets the correct IP from the secondary router. Would it be possible to prevent the switch from working until the secondary router configuration is fully loaded? Or, failing that, would it be possible to cause these disconnections on purpose, to allow time for the router configuration to load and thus obtain the correct IP?

Willy · May 29, 2018, 11:57pm

Nobody?
Maybe... Developers?

jeff · May 30, 2018, 12:50am

You could come up with some rather ugly, timing-dependent, and fragile ways of bringing up your interfaces after a delay. It isn't a course of action I'd recommend. Using only tagged VLANs between the devices will likely resolve these issues.

Plug a cheap, managed switch that does support tagging in between the two. It looks like they can be purchased for ~US$30.

lleachii · May 30, 2018, 2:28pm

I think @slh already told you:

On devices I own with the same or similar switch, the bootup isolation is done by the bootloader...so I'm not sure how the developers could help you.

Willy · May 30, 2018, 2:42pm

First, thanks to all...

If the LEDE developers do not cover the bootloader issues... the xDSL routers can only be configured as dumb access points but not as real routers with guarantees.
My problem... I have many, many of them.
Again... thanks to all.

lleachii · May 30, 2018, 2:45pm

Your manufacturer would cover that. It was included with the router.

You may be able to reconfigure your bootloader to isolate the ports; or there may be an upgrade for it; but this is very advanced and could brick your router.

jeff · May 30, 2018, 3:01pm

That's likely correct, as you've defined "guarantees". It's not a fault of the run-time software, including the boot loader, no matter who supplies the firmware.

The "problem" is the switch chips themselves. Enterprise-grade switches generally disable the Ethernet ports until the switch fabric is configured. These comparatively cheap chipsets do not appear to support that kind of behavior. The moment you supply power to the switch chip, it "leaks". Since you can't turn on the switch chip without having it leak across ports, there is nothing that any firmware can do other than reduce the time during which a cheap, consumer-grade switch misbehaves, and most home and SOHO users would never notice it anyway.

Willy · May 31, 2018, 2:19pm

@jeff you are right. But I have shown that a conventional router with a separate WAN port does not have the same behavior as an xDSL router, and I know that both have only one chip to control the switch, so ... How do they do it? And as a last resort, once the configuration has been fully loaded, would it be possible to send a reset command ONLY to the switch to disconnect all the devices and renew the IP with the correct configuration?
TIA

jeff · May 31, 2018, 2:41pm

Could be most anything:

- You just haven't tested in a way that reveals the other router's true behavior
- Different switch chip that comes up configured differently by default
- Mask-programmable switch chip
- Switch chip has a phy-enable line that the manufacturer uses
- Switch chip has separate power lines for the phys that the manufacturer switches with a GPIO or the like
- ...

Find out what chip is in your unit and download the data sheet if you are still curious.

Bottom line is that no sane consumer-product manufacturer is going to spend a penny on this in their design as it has no consumer-market value, only increased design and potentially fabrication costs. They can't sell their routers for any more, or sell any more of them by advertising "Our router works in really uncommon situations in ways that I can't explain to you."

Resetting the switch won't force the clients to renew their leases.

Resetting the switch will also wipe all the configuration you just did.

Willy · May 31, 2018, 3:12pm

I see that there is no way out...

Thanks