I've recently had some strange behaviour with a couple of HA failover setups. I haven't changed the network configuration that I've been running for a few years; I only updated from 8.305 to 9.003-16. The production example is the main office, which also had a hardware upgrade from 120 to 220. The second example is my home network, which has nearly the same setup. It has only been running in HA failover for the last 5 days, in hopes of replicating the problem or ruling it out, and the problem has occurred twice since. Here's the overview.
PROD HA config is done via isolated VLANs (OUTSIDE doesn't see INSIDE)
- VLAN-OUTSIDE: port1 to node1, port2 to node2, port3 to ISP-WAN
- VLAN-INSIDE: port4 to node1, port5 to node2, port6 connected to downstream L3 routing switch via untagged VLAN INSIDE (switch is a Procurve 2824 on I.10.73, one version behind; the latest version is I.10.77 but has no documented fix for this, plus it's worked for years with this config)
- ISSUE: One of the nodes will fall off the network, followed by the 2nd node falling off as well, or the nodes will switch roles. On investigation, the switch shows no MAC address associated with the ports that are connected to the nodes. The switch has been swapped for a spare and the problem continues. It appears the nodes no longer present their MAC address for eth0 or eth1. (Next time this happens I will plug the node into an alternate switch to see if that switch learns the MAC, and console into the node to see if it still lists a MAC via ifconfig.) Clearing ARP on the switch or rebooting the switch doesn't help; the only fix is to reboot the UTM.
HOME HA config is done via isolated VLANs (OUTSIDE doesn't see INSIDE)
- VLAN-OUTSIDE: port1 to node1, port2 to node2, port3 to ISP-WAN
- VLAN-INSIDE: port4 to node1, port5 to node2 (home network uses tagged VLANs 98 and 99 as the VLAN INSIDE, no routing; switch is a Procurve 2810-24G on N.11.52, the current version)
- ISSUE: Same as in production. Nodes fall off the network or switch roles, and the switch shows no MAC address associated with the ports connected to the nodes. Home doesn't have the luxury of a spare switch to try with. It appears the nodes no longer present their MAC address for eth0 or eth1. (Next time this happens I will console into the node to see if it still lists a MAC via ifconfig.) Clearing ARP on the switch or rebooting the switch doesn't help; the only fix is to reboot the UTM.
I've reviewed the system and high-availability logs and don't see any errors that would help with troubleshooting.
(I haven't escalated to Sophos Support yet; I'm waiting until I have additional info from the next failures. I'm also hesitant to make things fail again, as it's unpredictable when it will fail, other than to say it will be at the most inconvenient time.)
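For the next failure, something like the following could record what each node itself reports for eth0 and eth1, alongside the ifconfig check. It is only a rough sketch: it assumes a Python 3 interpreter is available on the node (not verified on the UTM) and that the kernel exposes the standard Linux /sys/class/net tree. The function names read_iface_state and snapshot are made up for illustration.

```python
import os
import time


def read_iface_state(iface, sysfs_root="/sys/class/net"):
    """Return (mac, operstate) for iface; each is None if it cannot be read."""
    base = os.path.join(sysfs_root, iface)

    def _read(name):
        # Each attribute is a one-line text file under /sys/class/net/<iface>/
        try:
            with open(os.path.join(base, name)) as fh:
                return fh.read().strip()
        except OSError:
            return None

    return _read("address"), _read("operstate")


def snapshot(ifaces=("eth0", "eth1"), sysfs_root="/sys/class/net"):
    """Build one timestamped log line per interface."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    lines = []
    for iface in ifaces:
        mac, state = read_iface_state(iface, sysfs_root)
        lines.append(f"{stamp} {iface} mac={mac} operstate={state}")
    return lines


if __name__ == "__main__":
    # Run from the console of the affected node and save the output.
    print("\n".join(snapshot()))
```

If the node still lists a MAC and an "up" state while the switch shows nothing learned on that port, that would point away from the interface losing its address and towards frames not leaving the node.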
A further point of clarification: I have 3 other networks in production and none are experiencing issues.
- ASG 120 HA Failover (8.305) with Procurve 2824
- UTM 50 IP HA Failover (9.003-16) with PowerConnect 6248 in stacking mode
- UTM 50 IP HA Failover (9.003-16) with PowerConnect 6248 in stacking mode
Is anyone else having issues with HA failover since moving to 9.003-16?
Is there something wrong with my setup at the network connection level?
Any suggestions?