F5 - VLAN Failsafe configured with Route Domains results in Standby-Standby
You may observe both devices in an F5 HA pair going into a standby-standby state when:
- VLAN Failsafe is enabled on a segment
- Route Domains are configured
- There is no server present on the given segment
- The F5 version is lower than 11.2.0 HF3
The reason for this lies in bug ID 388270 and in how VLAN Failsafe operates in versions lower than 11.4.0.
Bug ID 388270
When using route domains with VLAN failsafe, after 3/4 of the failsafe timeout, LTM sends out a multicast ping using TMM's real IP address instead of the self IP address. Fixed in 11.2.0 HF3.
From 11.4.0 onwards, once aggressive mode is initiated (at 3/4 of the timeout), the DB key failover.vlanfailsafe.resettimeronanyframe is ignored and the reception of any frame resets the VLAN Failsafe timer. Full details of this behavior change can be found in SOL16568.
Consider the following scenario:
- You have a single server on a segment. VLAN Failsafe is also configured on the segment.
- The server is rebooted.
- No traffic is seen on the segment by the F5s.
- The VLAN Failsafe timer starts.
- The ARP entry for the server expires from the ARP cache on both devices.
- The VLAN Failsafe 1/2 timeout expires, as there are no ARP/NDP cache entries to generate traffic from.
- During the VLAN Failsafe 3/4 timeout (aka aggressive mode) a multicast ICMP Echo request is sent from each F5.
- Neither F5 responds to its peer's ICMP ping due to bug 388270 (fixed in 11.2.0 HF3), as the ping is sent from the loopback address rather than the self IP.
- The timer expires on both devices, resulting in VLAN Failsafe triggering and both boxes going to standby-standby.
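The timing in the scenario above can be sketched as a small Python model. This is an illustration only, not F5 code: the function names `failsafe_milestones` and `stage_at` and the stage labels are hypothetical, and the 30-second timeout is just an example value. It assumes only what is described here, namely that known neighbours are probed at 1/2 of the timeout, aggressive mode begins at 3/4, and failsafe triggers when the timeout is reached without the timer being reset.

```python
def failsafe_milestones(timeout_s):
    """Return the elapsed time (seconds) at which each stage begins."""
    return {
        "probing": timeout_s * 0.5,      # ARP/NDP cache entries are probed
        "aggressive": timeout_s * 0.75,  # multicast ICMP echo request is sent
        "triggered": float(timeout_s),   # failsafe action is taken
    }


def stage_at(elapsed_s, timeout_s):
    """Name the stage a unit is in after elapsed_s seconds with no timer reset."""
    m = failsafe_milestones(timeout_s)
    if elapsed_s >= m["triggered"]:
        return "triggered"
    if elapsed_s >= m["aggressive"]:
        return "aggressive"
    if elapsed_s >= m["probing"]:
        return "probing"
    return "waiting"


if __name__ == "__main__":
    # Walk through an example 30-second timeout.
    for t in (0, 15, 22.5, 30):
        print(t, stage_at(t, 30))
```

In the scenario, nothing resets the timer on either unit, so both units walk through every stage and reach "triggered" at roughly the same time.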
There are 3 methods for resolving this issue:
- Upgrade to 11.2.0 HF3 or higher.
- Create a dummy node on the segment using the floating self IP, in turn generating the necessary traffic to reset the VLAN Failsafe timer.
- Modify the db key failover.vlanfailsafe.resettimeronanyframe to true. Further details around this db key can be found below.
The DB key failover.vlanfailsafe.resettimeronanyframe is set to false by default.
Changing this DB key to true ensures that the VLAN Failsafe timer is reset on reception of any frame, i.e. not just ARP and Neighbor Discovery. As a result, when the F5 sees the ICMP multicast echo from its peer, the VLAN Failsafe timer is reset.
With this db key set to true, BPDUs will also reset the VLAN Failsafe timer if they are sent over the relevant VLAN. In the case of RPVSTP, BPDUs are sent across each VLAN. For MST, however, this is not the case.
When BPDUs are sent across the VLAN this results in the VLAN Failsafe feature acting as a mechanism to detect issues ONLY up to the neighbouring switchport. This in turn prevents the ability to detect issues further downstream.
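The reset rules described in this article can be summarised as a small decision function. This is a minimal sketch in Python, not F5 code: `timer_resets`, its parameters and the frame labels are hypothetical names chosen for illustration. It models only the behaviour stated here (ARP and Neighbor Discovery always reset the timer; other frames reset it when the db key is true, or on 11.4.0 and later once aggressive mode has started) and does not attempt to model replies to the unit's own probes.

```python
def timer_resets(frame_kind, reset_on_any_frame=False,
                 version=(11, 2, 0), aggressive=False):
    """Return True if a received frame would reset the VLAN Failsafe timer.

    frame_kind         -- illustrative label, e.g. "arp", "ndp", "bpdu"
    reset_on_any_frame -- value of failover.vlanfailsafe.resettimeronanyframe
    version            -- software version as a (major, minor, patch) tuple
    aggressive         -- True once 3/4 of the timeout has elapsed
    """
    # ARP and Neighbor Discovery reset the timer regardless of the db key.
    if frame_kind in ("arp", "ndp"):
        return True
    # From 11.4.0 onwards the db key is ignored in aggressive mode.
    if version >= (11, 4, 0) and aggressive:
        return True
    # Otherwise other frames (ICMP echo, BPDU, ...) count only if the key is true.
    return reset_on_any_frame


if __name__ == "__main__":
    print(timer_resets("icmp_echo_request"))                           # key false
    print(timer_resets("icmp_echo_request", reset_on_any_frame=True))  # key true
    print(timer_resets("bpdu", reset_on_any_frame=True))               # BPDU, key true
```

The BPDU case shows the trade-off noted above: with the key set to true, a switch that keeps sending BPDUs on the VLAN keeps the timer from expiring even when nothing further downstream is reachable.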
- Action - It is recommended to set the action to failover only.
- Timeout - A timeout of 30 seconds is recommended. This is based on the two common spanning tree protocols, RPVSTP and MST, and allows any ports that are not configured with port-fast to converge.
- SOL13297 : Overview of VLAN failsafe (10.x - 11.x)
- SOL16568 : The BIG-IP system resets the VLAN failsafe timer upon receiving any frame in aggressive mode
- SOL13668 : BIG-IP 11.2.0 cumulative hotfix - includes details of bug ID 388270. This is also shown below:
|Fixed 11.2.0 HF3|388267|vlan_failsafe now correctly sends multicast icmp6 ping requests when in aggressive mode.|
|Fixed 11.2.0 HF3|388270|vlan_failsafe enabled on a vlan where the only available self-ip is on a non-default route domain now sends out multicast pings from a correct source address. In earlier versions, the source address would be from the loopback range (127.0.0.1 through 127.255.255.255).|