Hi All,
I have an installation on VMware using HA (hot standby). The systems are stable for the most part, but about once per day things briefly go sideways. It corrects itself, but a short outage results, which is especially troublesome because tunnels may take a few minutes to come back.
The basic setup is two ESX servers, each running an instance of ASG (the VMware image downloaded from ftp.astaro.com). eth5 (the sync NIC) is a direct hardware link from one ESX system to the other; it doesn't use a vswitch. I have also tried a backup interface, but this does not help, regardless of which interface I choose.
So far, I know that some heartbeats are indeed getting lost, but I don't know why. Sometimes only a few go missing, which is enough to cause a failover and then a master/master conflict, which is resolved in favor of the preferred master. The one change that has helped is increasing dead_time$ (cc > ha > times > dead_time$) from 3 to 6. I am hesitant to increase it further.
So after that long message, my main question: does anybody know of a way to enable debugging in order to track down the missing heartbeats? I can't see why a heartbeat sent by one system wouldn't reach the other, since it's a direct hardware link between the two. Historically, I've seen high load cause this type of problem, but that is not the case here.
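One generic way to look for missing heartbeats, independent of any ASG debug option, is to capture traffic on the sync interface and scan the arrival times for gaps. Below is a minimal Python sketch of that check. It is not an ASG feature: the function name, the 1-second heartbeat interval, and the tolerance are all assumptions for illustration, and the sample timestamps are made up.

```python
from typing import List, Tuple


def find_heartbeat_gaps(
    arrivals: List[float],
    interval: float = 1.0,   # assumed heartbeat period in seconds (hypothetical)
    tolerance: float = 0.5,  # assumed allowed jitter in seconds (hypothetical)
) -> List[Tuple[float, float, int]]:
    """Return (last_seen, next_seen, estimated_missed) for each gap.

    `arrivals` is a sorted list of heartbeat arrival times in seconds,
    for example extracted from a packet capture on the sync interface.
    """
    gaps = []
    for prev, curr in zip(arrivals, arrivals[1:]):
        delta = curr - prev
        if delta > interval + tolerance:
            # Estimate how many heartbeats should have arrived in between.
            missed = round(delta / interval) - 1
            gaps.append((prev, curr, missed))
    return gaps


if __name__ == "__main__":
    # Made-up sample: heartbeats at 3 s and 4 s never arrived.
    sample = [0.0, 1.0, 2.0, 5.0, 6.0, 7.0]
    for last_seen, next_seen, missed in find_heartbeat_gaps(sample):
        print(f"gap from {last_seen}s to {next_seen}s, about {missed} missed")
```

Running the same check on captures from both nodes would show whether the heartbeats were never sent or were sent and not received.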