F5 VLAN Failsafe with Standby-Standby Route Domains

Problem

You may observe both devices in an F5 HA pair going into a standby-standby state when:

  • VLAN Failsafe is enabled on a segment
  • Route Domains are configured
  • There is no server present on the given segment
  • The F5 version is lower than 11.2.0 HF3

Reason

The cause is a combination of bug ID 388270 and the way VLAN Failsafe operates in versions lower than 11.4.0.

Bug ID 388270

When using route domains with VLAN Failsafe, after 3/4 of the failsafe timeout LTM sends out a multicast ping using the TMM's real IP address instead of the self IP address. Fixed in 11.2.0 HF3.

11.4.0 Behaviour

From 11.4.0 onwards, once aggressive mode is initiated (at 3/4 of the timeout), the DB key failover.vlanfailsafe.resettimeronanyframe is ignored and the reception of any frame resets the VLAN Failsafe timer. Full details of this behaviour change can be found in SOL16568.

Scenario

Consider the following scenario:

  1. You have a single server on a segment. VLAN Failsafe is also configured on the segment.
  2. The server is rebooted.
  3. No traffic is seen on the segment by the F5s.
  4. The VLAN Failsafe timer starts.
  5. The ARP entry for the server expires from the ARP cache on both devices.
  6. The VLAN Failsafe 1/2 timeout passes without any traffic being generated, as there are no ARP/NDP cache entries to generate traffic from.
  7. At the VLAN Failsafe 3/4 timeout (aka aggressive mode), a multicast ICMP echo request is sent from each F5.
  8. Each F5 fails to respond to its peer's ICMP ping due to bug 388270 (fixed in 11.2.0 HF3), as the ping is sent from a loopback address rather than the self IP.
  9. The timer expires on both devices, resulting in VLAN Failsafe triggering and both boxes going standby-standby.
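
The sequence above can be sketched as a small timeline model. This is only an illustration of the steps as described in this article, not F5 code: the function name, its parameters and the event wording are hypothetical, and it assumes the segment goes silent at t=0 and that any ARP/NDP probes sent at the 1/2 mark go unanswered.

```python
def simulate_silent_segment(timeout_s, arp_cache_entries, ping_answered):
    """Return (events, triggered) for a segment that goes silent at t=0.

    Simplified model of the scenario in this article (hypothetical helper):
    - at 1/2 of the timeout, traffic is generated only if ARP/NDP entries exist
    - at 3/4 of the timeout (aggressive mode), a multicast ICMP echo is sent
    - if nothing resets the timer, the failsafe action fires at the full timeout
    """
    half = timeout_s / 2
    three_quarters = timeout_s * 3 / 4
    events = [(0.0, "last frame seen, timer running")]

    if arp_cache_entries > 0:
        events.append((half, "probes sent to cached ARP/NDP entries (unanswered)"))
    else:
        events.append((half, "no cached ARP/NDP entries, nothing to probe"))

    events.append((three_quarters, "aggressive mode: multicast ICMP echo sent"))

    if ping_answered:
        events.append((three_quarters, "echo reply received, timer reset"))
        return events, False

    events.append((float(timeout_s), "timer expired, VLAN Failsafe triggered"))
    return events, True


# Steps 5 to 9: empty ARP cache, and the peer never answers the ping.
events, triggered = simulate_silent_segment(30, arp_cache_entries=0, ping_answered=False)
for when, what in events:
    print(f"{when:>5.1f}s  {what}")
print("failsafe triggered:", triggered)
```

Running the same model on both units of the pair gives the same result on each, which is the standby-standby outcome described in step 9.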

Work Around

There are 3 methods for resolving this issue:

  1. Upgrade to 11.2.0 HF3 or higher.
  2. Create a dummy node on the segment using the floating self IP, which in turn generates the necessary traffic to reset the VLAN Failsafe timer.
  3. Modify the DB key failover.vlanfailsafe.resettimeronanyframe to true. Further details around this DB key can be found below.

DB Key

The DB key failover.vlanfailsafe.resettimeronanyframe is set to false by default.

Changing this DB key to true ensures that the VLAN Failsafe timer is reset on reception of any frame, i.e. not just ARP and Neighbor Discovery. This means that when the F5 sees the multicast ICMP echo from its peer, the VLAN Failsafe timer is reset.
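
The reset rule described here, together with the 11.4.0 behaviour noted earlier, can be expressed as a small decision function. This is a minimal sketch of the logic as this article describes it, not F5's implementation: the function names, the frame-type labels and the version parsing are all assumptions, and the parser only handles plain dotted versions such as "11.4.0" (no hotfix suffixes).

```python
# Frame types that reset the timer regardless of the DB key, per this article.
ALWAYS_RESET = {"arp", "ndp"}


def parse_version(version):
    """Turn '11.4.0' into (11, 4, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def frame_resets_timer(frame_type, version, aggressive, reset_on_any_frame=False):
    """Decide whether a received frame resets the VLAN Failsafe timer.

    frame_type         -- hypothetical label such as 'arp', 'ndp', 'icmp', 'bpdu'
    version            -- dotted software version, e.g. '11.2.0'
    aggressive         -- True once 3/4 of the timeout has elapsed
    reset_on_any_frame -- value of failover.vlanfailsafe.resettimeronanyframe
    """
    if frame_type in ALWAYS_RESET:
        return True
    # From 11.4.0 onwards the DB key is ignored in aggressive mode:
    # any frame resets the timer.
    if aggressive and parse_version(version) >= (11, 4, 0):
        return True
    # Otherwise other frame types only count when the DB key is true.
    return reset_on_any_frame


# The peer's multicast ICMP echo on a pre-11.4.0 unit, key at its default:
print(frame_resets_timer("icmp", "11.2.0", aggressive=True))
# The same frame with the DB key set to true:
print(frame_resets_timer("icmp", "11.2.0", aggressive=True, reset_on_any_frame=True))
```

With the key left at false on a pre-11.4.0 unit, the peer's ICMP echo does not count, which is why the third workaround changes the outcome of the scenario.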

BPDUs

With this DB key set to true, BPDUs will also reset the VLAN Failsafe timer if they are sent over the VLAN in question. With RPVSTP, BPDUs are sent across each VLAN; with MST this is not the case.

When BPDUs are sent across the VLAN, VLAN Failsafe only detects issues up to the neighbouring switchport and can no longer detect issues further downstream.

Additional

Recommendations

  • Action – It is recommended to set the action to failover only.
  • Timeout – Based on the 2 common spanning tree protocols, RPVSTP and MST, a timeout of 30 seconds is recommended. This allows for the convergence of any ports that are not configured with port-fast.

Reference

  • 388267 (fixed in 11.2.0 HF3) – vlan_failsafe now correctly sends multicast icmp6 ping requests when in aggressive mode.
  • 388270 (fixed in 11.2.0 HF3) – vlan_failsafe enabled on a vlan where the only available self-ip is on a non-default route domain now sends out multicast pings from a correct source address. In earlier versions, the source address would be from the loopback range (127.0.0.1 through 127.255.255.255).
Rick Donato
