VMware vSphere High Availability Q/A

Looking for interview questions on VMware vSphere High Availability? I have gathered the important ones, which you may find useful for your preparation. Here we go.

1. How will you define VMware HA?

As per VMware’s definition, VMware® High Availability (HA) provides easy-to-use, cost-effective high availability for applications running in virtual machines. In the event of server failure, affected virtual machines are automatically restarted on other production servers with spare capacity.

The High Availability (HA) feature in vSphere 4.1 allows a group of ESX/ESXi hosts in a cluster to identify individual host failures and thereby provide for higher availability of hosted VMs. HA will restart VMs which were running on a failed host; it is a high-availability solution, not a zero-downtime solution such as application clustering or VMware Fault Tolerance. There will be a period of time when VMs are offline following a physical host failure. This is important to understand, and you should ensure that your customers and management are aware of it. HA is a complex topic, but setting it up and using it are fairly straightforward.

2. List out key features of VMware HA?

Ans: Key features of VMware HA include:

Proactive monitoring of all physical servers and virtual machines
Automatic detection of server failure
Rapid restart of virtual machines affected by server failure
Optimal placement of virtual machines after server failure
Scalable availability up to 32 nodes across multiple servers

2. We have a lot of features like vMotion, DRS, SMP, etc., so why do we need HA?

We need HA because we need our services to run without interruption. Suppose one of the ESX servers in the cluster suddenly goes down for some reason. What happens to the virtual machines running on that server? Do they continue to run, or do they go down? They go down as well. With the help of VMware HA, these VMs can be restarted immediately on the other ESX servers in the same cluster. You will still see a downtime of 5 to 10 minutes, because a server crash is an unexpected event.

3. Does HA use vMotion?

No. VMware HA doesn’t use vMotion. In fact, the VM stops and is restarted on another ESX host.

4. What architecture changes were seen between ESXi 4.1 and 5.0?

vSphere 5.0 comes with a new HA architecture. HA has been rewritten from the ground up to shed some of the constraints that were enforced by AAM. HA in 5.0, also referred to as FDM (Fault Domain Manager), introduces less complexity and higher resiliency. From a UI perspective not a lot has changed, but a lot has changed under the covers: the primary/secondary node concept is gone, replaced by a master/slave concept with an automated election process.

5. Can you briefly describe the differences you found between ESXi 4.1 and 5.0?

VMware vSphere 4.1 HA

– The HA agent is called Automated Availability Manager (AAM) in this version.

– When HA is configured on a vSphere 4.1 cluster, the first 5 hosts are designated as primary nodes. One of these 5 acts as the “Master Primary” and handles restarts of VMs in the event of a host failure. All remaining hosts join as secondary nodes.

– Primary nodes maintain information about cluster settings and secondary node states.

– All nodes exchange heartbeats to learn the health status of the other nodes. Primary nodes send their heartbeats to all other primary and secondary nodes; secondary nodes send their heartbeats to primaries only. Heartbeats are exchanged every second.

– If a primary fails, another primary node takes over responsibility for restarts. If all primaries go down at the same time, no restarts are initiated; in other words, at least one primary is required to initiate restarts.

– Election of primaries happens only in the following scenarios: when a host is disconnected, when a host enters maintenance mode, when a host is not responding, and when the cluster is reconfigured for HA.

VMware vSphere 5.0 HA

– The HA agent is called Fault Domain Manager (FDM) in this version.

– When HA is configured on a vSphere 5.0 cluster, a master node is elected and all other nodes are configured as slaves. The master is elected based on the number of datastores it is connected to. If all hosts in the cluster are connected to the same number of datastores, the host’s managed ID is taken into consideration, and the host with the highest managed ID is elected as master.

– All hosts exchange heartbeats with each other to learn about their health states.

– Host isolation response has been enhanced in this version by the introduction of datastore heartbeating. Every host creates a hostname-hb file on the configured datastores and keeps it updated at a specific interval. Two datastores are selected for this purpose.

– To find out which host is the master and which are slaves, go to vCenter and click Cluster Status in the HA area.
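The vSphere 5.0 election rule described above can be illustrated with a minimal Python sketch. The `Host` class, its field names, and the sample IDs are hypothetical, and the sketch assumes managed IDs are compared as plain strings, which the text above does not specify. It is an illustration of the rule, not the FDM implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str        # hypothetical display name
    datastores: int  # number of datastores the host is connected to
    managed_id: str  # hypothetical managed ID, compared as a plain string


def elect_master(hosts):
    """Pick the host with the most datastores; break ties on the highest managed ID."""
    if not hosts:
        raise ValueError("cluster has no hosts")
    # Tuples compare element by element, so the managed ID only matters on a tie.
    return max(hosts, key=lambda h: (h.datastores, h.managed_id))


if __name__ == "__main__":
    cluster = [
        Host("esx01", 3, "host-10"),
        Host("esx02", 4, "host-20"),
        Host("esx03", 4, "host-30"),
    ]
    print(elect_master(cluster).name)
```

All hosts that lose the election would be the slaves in this model.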

6. Is HA dependent on VMware vCenter Server?
Ans: Yes. But only during the initial installation and configuration.

7. Does HA work without vCenter Server?

Ans: Yes. HA works as a master and slave relationship among the hosts in the cluster.

8. Does HA work with DRS?

Ans: In vSphere 4.1, HA can work with and utilize Distributed Resource Scheduler (DRS) if it is also enabled on the cluster, so it is important to understand what DRS is, though a full description of DRS is outside the scope of this article. DRS continuously monitors the resource usage of hosts within a cluster and can suggest or automatically migrate (vMotion) a VM from one host to another to balance out the resource usage across the cluster as a whole and prevent any single host from becoming over-utilized. HA is based on Legato’s Automated Availability Manager, and as such you will see some HA-related files and logs on an ESX host labeled with “AAM”. HA requires vCenter for initial configuration, but unlike DRS it does not require vCenter to function after it is up and running.

In short, using VMware HA with Distributed Resource Scheduler (DRS) combines automatic failover with load balancing. This combination can result in faster rebalancing of virtual machines after VMware HA has moved virtual machines to different hosts. When VMware HA performs failover and restarts virtual machines on different hosts, its first priority is the immediate availability of all virtual machines. After the virtual machines have been restarted, those hosts on which they were powered on might be heavily loaded, while other hosts are comparatively lightly loaded. VMware HA uses the virtual machine’s CPU and memory reservation to determine if a host has enough spare capacity to accommodate the virtual machine.

9. How do primary and secondary nodes work under HA?

Ans: HA will elect up to five hosts to become primary HA nodes; all other nodes in the cluster are secondary nodes, up to a maximum of 32 nodes in total. (Note: “host” and “node” are used interchangeably.)

By default, the first 5 nodes to join the HA cluster will be the primary nodes. If a primary node fails, is removed from the cluster, is placed in Maintenance Mode, or an administrator initiates the “Reconfigure for HA” command, HA will initiate the re-election process to randomly elect five primary nodes. The purpose of a primary node is to maintain node state data, which is sent by all nodes every 10 seconds by default in vSphere 4.1.

One primary node must be online at all times for HA to function. As such, it is recommended to have primary nodes physically separated across multiple racks or enclosures if possible, to ensure at least one remains online in the event that a rack or enclosure goes down. With a limit of five primary nodes, the maximum allowable host failures for a single HA cluster is four. One of the five primary nodes will automatically be designated as the active primary (also called Failover Coordinator), and it will be responsible for keeping track of restart attempts and deciding where to restart VMs.

10. Is it possible to determine which nodes are currently the primary nodes from the ESX console?

Ans. It is possible to find it by launching the AAM CLI using the following syntax:

/aam-installation/opt/vmware/aam/bin # ./Cli
From the AAM CLI, enter the ln command:
AAM> ln
From the AAM CLI you can also promote and demote primary nodes manually using the promoteNode and demoteNode commands, respectively, though this is not generally recommended.

11. Any idea how HA 4.1 determines a host has failed?
Ans: This happens in two ways. A host can determine that it is isolated from all other hosts and initiate its configured isolation response, and other nodes can determine that a host has failed and attempt to restart the VMs hosted on the failed host elsewhere. By default, all nodes send heartbeats to other nodes every second across the management network. Primary nodes send heartbeats to all other nodes, and secondary nodes send heartbeats to primary nodes only.

12. Can you explain in detail what Isolation Response is?

The isolation response setting determines what action a host will take when it determines that it is isolated from all other nodes in the HA cluster. When configuring HA for a cluster, you have three options for the isolation response: Power Off, Leave Powered On, and Shutdown. The options are pretty self-explanatory; the main thing to know is that the Power Off setting is equivalent to pulling the power on a physical server. It is not a clean shutdown. In vSphere 4.1, the default isolation response is Shutdown.

When a host determines that it is no longer receiving heartbeats from any other hosts, it will attempt to ping its isolation address, which by default is the default gateway of the management network. If this fails, the isolation response is triggered. Additional isolation addresses can be configured using the advanced setting das.isolationaddressX, where X is a number starting with 2 and incrementing upwards for each additional address. This is useful to detect a situation where the management network may have failed while the VM networks are still operational. The isolation detection timeline is 16 seconds, with an additional second added for each additional isolation address. The timeline breaks down as follows: the failure occurs at 0 seconds; at 13 seconds without receiving a heartbeat, the isolation address is pinged; if this fails, the isolation response is triggered by the host at 14 seconds. At 15 seconds the host is declared failed by other hosts in the cluster, and finally at 16 seconds with no heartbeats received the failover coordinator attempts to restart the failed host’s VMs on other nodes. Should the initial restart fail, HA will attempt to restart the VM 5 more times before abandoning the restart attempt.

There is some planning to be done when configuring the isolation response. If you use the default isolation address and isolation response settings (management default gateway and Shutdown, respectively), it is possible for the management network of the host to become disconnected while the VM networks are still online. In this situation, the isolation response would be triggered and your VMs would be shut down even though they are still online and functioning normally. Alternatively, setting the isolation response to Leave Powered On while suffering a complete network failure on a node will prevent your VMs from being restarted on a functioning host, effectively taking them offline until an administrator intervenes.
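The timeline above can be written down as a small Python sketch. The constant and function names are hypothetical. The text only says that each additional isolation address adds one second to the 16-second total, so the sketch models the total and does not guess how the intermediate events shift.

```python
# Default events from the timeline above: (seconds after failure, event).
DEFAULT_TIMELINE = [
    (0, "failure occurs"),
    (13, "host pings its isolation address"),
    (14, "host triggers its isolation response"),
    (15, "other hosts declare the host failed"),
    (16, "failover coordinator attempts to restart the VMs"),
]

BASE_DETECTION_SECONDS = 16
MAX_RESTART_RETRIES = 5  # retries after the initial restart attempt


def detection_seconds(additional_isolation_addresses=0):
    """Total detection time: 16 seconds plus 1 second per additional address."""
    if additional_isolation_addresses < 0:
        raise ValueError("number of additional addresses cannot be negative")
    return BASE_DETECTION_SECONDS + additional_isolation_addresses


def total_restart_attempts():
    """The initial restart attempt plus the retries before HA gives up."""
    return 1 + MAX_RESTART_RETRIES


if __name__ == "__main__":
    for second, event in DEFAULT_TIMELINE:
        print(f"t={second:>2}s  {event}")
    print("with 2 extra isolation addresses:", detection_seconds(2), "seconds")
```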

13. What is HA Admission Control?

vCenter Server uses admission control to ensure that sufficient resources are available in a cluster to provide failover protection and to ensure that virtual machine resource reservations are respected. Three types of admission control are available.

1. Host: Ensures that a host has sufficient resources to satisfy the reservations of all virtual machines running on it.

2. Resource Pool: Ensures that a resource pool has sufficient resources to satisfy the reservations, shares, and limits of all virtual machines associated with it.

3. VMware HA: Ensures that sufficient resources in the cluster are reserved for virtual machine recovery in the event of host failure.

Admission control imposes constraints on resource usage and any action that would violate these constraints is not permitted. Examples of actions that could be disallowed include the following:

– Powering on a virtual machine.

– Migrating a virtual machine onto a host or into a cluster or resource pool.

– Increasing the CPU or memory reservation of a virtual machine.

Of the three types of admission control, only VMware HA admission control can be disabled. However, without it there is no assurance that all virtual machines in the cluster can be restarted after a host failure. VMware recommends that you do not disable admission control, but you might need to do so temporarily, for the following reasons:

– If you need to violate the failover constraints when there are not enough resources to support them (for example, if you are placing hosts in standby mode to test them for use with DPM).

– If an automated process needs to take actions that might temporarily violate the failover constraints (for example, as part of an upgrade directed by VMware Update Manager).

– If you need to perform testing or maintenance operations.

14. Is it possible to configure VMware HA to tolerate a specified number of host failures?

Ans: Yes.

You can configure VMware HA to tolerate a specified number of host failures. With the Host Failures Cluster Tolerates admission control policy, VMware HA ensures that a specified number of hosts can fail and sufficient resources remain in the cluster to fail over all the virtual machines from those hosts. With the Host Failures Cluster Tolerates policy, VMware HA performs admission control in the following way:

1. Calculates the slot size. A slot is a logical representation of memory and CPU resources. By default, it is sized to satisfy the requirements for any powered-on virtual machine in the cluster.

2. Determines how many slots each host in the cluster can hold.

3. Determines the Current Failover Capacity of the cluster. This is the number of hosts that can fail and still leave enough slots to satisfy all of the powered-on virtual machines.

4. Determines whether the Current Failover Capacity is less than the Configured Failover Capacity (provided by the user). If it is, admission control disallows the operation.
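The four steps above can be sketched in Python as follows. The function names are hypothetical. The sketch assumes that a host holds as many slots as its smaller resource dimension allows, and that failures are counted worst case first (largest hosts removed first); the text above does not spell out either detail, so treat both as assumptions.

```python
def slots_per_host(cpu_mhz, mem_mb, slot_cpu_mhz, slot_mem_mb):
    """Step 2: a host holds as many slots as its scarcer resource allows."""
    if slot_cpu_mhz <= 0 or slot_mem_mb <= 0:
        raise ValueError("slot size components must be positive")
    return min(cpu_mhz // slot_cpu_mhz, mem_mb // slot_mem_mb)


def current_failover_capacity(host_slots, powered_on_vms):
    """Step 3: count how many hosts can fail (largest first, an assumption)
    while the remaining slots still cover every powered-on VM."""
    remaining = sorted(host_slots)  # ascending, so pop() removes the largest
    failures = 0
    while remaining:
        remaining.pop()
        if sum(remaining) < powered_on_vms:
            break
        failures += 1
    return failures


def operation_allowed(host_slots, powered_on_vms, configured_failover_capacity):
    """Step 4: disallow when current capacity is below the configured value."""
    current = current_failover_capacity(host_slots, powered_on_vms)
    return current >= configured_failover_capacity


if __name__ == "__main__":
    # Three hypothetical hosts and a slot size of 2000 MHz / 2048 MB.
    hosts = [(9000, 9216), (9000, 6144), (6000, 6144)]
    slots = [slots_per_host(cpu, mem, 2000, 2048) for cpu, mem in hosts]
    print(slots, current_failover_capacity(slots, powered_on_vms=5))
```

Counting the largest hosts first gives a conservative answer, because it checks the failure pattern that removes the most slots.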

The maximum Configured Failover Capacity that you can set is four. Each cluster has up to five primary hosts and if all fail simultaneously, failover of all virtual machines might not be successful.

15. How is slot size calculated?

Slot size is comprised of two components, CPU and memory.

1. VMware HA calculates the CPU component by obtaining the CPU reservation of each powered-on virtual machine and selecting the largest value. If you have not specified a CPU reservation for a virtual machine, it is assigned a default value of 256 MHz. (You can change this value by using the das.vmCpuMinMHz advanced attribute.)

2. VMware HA calculates the memory component by obtaining the memory reservation, plus memory overhead, of each powered-on virtual machine and selecting the largest value. There is no default value for the memory reservation. If your cluster contains any virtual machines that have much larger reservations than the others, they will distort slot size calculation. To avoid this, you can specify an upper bound for the CPU or memory component of the slot size by using the das.slotCpuInMHz or das.slotMemInMB advanced attributes, respectively.
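As a rough illustration of the two rules above, here is a minimal Python sketch. The function name, the tuple layout, and the sample values are hypothetical; the optional `max_cpu_mhz` and `max_mem_mb` arguments stand in for the das.slotCpuInMHz and das.slotMemInMB upper bounds.

```python
DEFAULT_CPU_MHZ = 256  # default used when a VM has no CPU reservation


def slot_size(vms, max_cpu_mhz=None, max_mem_mb=None):
    """Return (cpu_mhz, mem_mb) for the slot.

    vms is a list of (cpu_reservation_mhz, mem_reservation_mb, mem_overhead_mb)
    tuples, one per powered-on VM. A CPU reservation of 0 means "not set".
    """
    if not vms:
        raise ValueError("need at least one powered-on VM")
    # CPU component: largest reservation, using the default where none is set.
    cpu = max(cpu_res or DEFAULT_CPU_MHZ for cpu_res, _, _ in vms)
    # Memory component: largest reservation plus overhead (no default value).
    mem = max(mem_res + overhead for _, mem_res, overhead in vms)
    # Optional upper bounds keep one large VM from distorting the slot size.
    if max_cpu_mhz is not None:
        cpu = min(cpu, max_cpu_mhz)
    if max_mem_mb is not None:
        mem = min(mem, max_mem_mb)
    return cpu, mem


if __name__ == "__main__":
    powered_on = [(0, 0, 100), (500, 1024, 120), (0, 2048, 150)]
    print(slot_size(powered_on))
    print(slot_size(powered_on, max_cpu_mhz=400, max_mem_mb=1024))
```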

16. What are the prerequisites for HA to work?

1. Shared storage for the VMs running in the HA cluster
2. Essentials Plus, Standard, Advanced, Enterprise, or Enterprise Plus licensing
3. A cluster with VMware HA enabled
4. Management network redundancy, to avoid frequent isolation responses in case of temporary network issues (preferred, not a requirement)

17. What is the maximum number of primary HA hosts in vSphere 4.1?

The maximum number of primary HA hosts is 5. A VMware HA cluster chooses the first 5 hosts that join the cluster as primary nodes, and all other hosts are automatically selected as secondary nodes.

18. What is AAM in HA?

AAM is Legato’s Automated Availability Manager. Up to vSphere 4.1, VMware HA was engineered to work with VMs with the help of the AAM software. VMware’s vCenter agent (vpxa) interfaces with the VMware HA agent, which acts as an intermediary to the AAM software. From vSphere 5.0 onwards, HA uses an agent called “FDM” (Fault Domain Manager).

20. How to see the list of primary nodes in an HA cluster?

View the log file named “aam_config_util_listnodes.log” under /var/log/vmware/aam using the command below:

cat /var/log/vmware/aam/aam_config_util_listnodes.log

21. What are the commands to restart, stop, and start the HA agent on an ESX host?

service vmware-aam restart

service vmware-aam stop

service vmware-aam start

22. Where are the HA-related logs located for troubleshooting?

/var/log/vmware/aam

23. What are the basic troubleshooting steps when the HA agent install fails on hosts in an HA cluster?

The steps below are taken from the blog post “Troubleshooting HA”.

1. Check for network issues.

2. Check that DNS is configured properly.

3. Check the VMware HA agent status on the ESX host by using the command below:

service vmware-aam status

4. Check that the networks are properly configured and named exactly as on the other hosts in the cluster. Otherwise, you will get errors while installing or reconfiguring the HA agent.

5. Check that the HA-related ports are open in the firewall to allow for the communication:

Incoming port: TCP/UDP 8042-8045
Outgoing port: TCP/UDP 2050-2250

6. First, try to restart, or stop and start, the VMware HA agent on the affected host using the commands below. In addition, you can also try restarting vpxa and the management agent on the host.

service vmware-aam restart

service vmware-aam stop

service vmware-aam start

7. Right-click the affected host and click “Reconfigure for VMware HA” to re-install the HA agent on that particular host.

8. Remove the affected host from the cluster. Removing an ESX host from the cluster is not allowed until that host is put into maintenance mode.

9. An alternative solution for step 3 is to go to the cluster settings, uncheck VMware HA to turn off HA in that cluster, and then re-enable VMware HA to get the agent installed.

10. For further troubleshooting, review the HA logs under the /var/log/vmware/aam directory.

24. What is the maximum number of hosts per HA cluster?

The maximum number of hosts in an HA cluster is 32.

25. What is Host Isolation?

VMware HA has a mechanism to detect that a host is isolated from the rest of the hosts in the cluster. When an ESX host loses its ability to exchange heartbeats via the management network with the other hosts in the HA cluster, that ESX host is considered isolated.

26. How is Host Isolation detected?

In an HA cluster, ESX hosts use heartbeats to communicate with the other hosts in the cluster. By default, a heartbeat is sent every 1 second.

If an ESX host in the cluster has not received a heartbeat from any other host in the cluster for 13 seconds, the host considers itself isolated and pings the configured isolation address (the default gateway by default). If the ping fails, VMware HA executes the host isolation response.

27. What are the different types of isolation response available in HA?

Power off – All the VMs are powered off when HA detects that network isolation has occurred.

Shut down – All VMs running on that host are shut down with the help of VMware Tools when HA detects that network isolation has occurred. If the shutdown via VMware Tools does not complete within 5 minutes, a power-off operation is executed on the VM. This behavior can be changed with the help of HA advanced options. Please refer to http://www.vmwarearena.com/2012/07/vmware-ha-advanced-options.html

Leave powered on – The VMs remain powered on, that is, their state is unchanged, when HA detects that network isolation has occurred.

27. How to add an additional isolation address for redundancy?

By default, VMware HA pings the default gateway as the isolation address if it stops receiving heartbeats. We can add additional values if we are using redundant service consoles that belong to different subnets. For example, we can add the default gateway of SC1 as the first value and the gateway of SC2 as the additional one, using the steps below:

1. Right Click your HA cluster

2. Go to the advanced options of HA

3. Add the line “das.isolationaddress1 = 192.168.0.1”

4. Add the line “das.isolationaddress2 = 192.168.1.1” as the additional isolation address

To know more about the HA advanced options, refer to http://www.vmwarearena.com/2012/07/vmware-ha-advanced-options.html

28. What is HA Admission control?

As per “VMware Availability Guide”,

vCenter Server uses admission control to ensure that sufficient resources are available in a cluster to provide failover protection and to ensure that virtual machine resource reservations are respected.

29. What are the 2 types of settings available for admission control?

Enable: Do not power on VMs that violate availability constraints

Disable: Power on VMs that violate availability constraints

30. What are the different types of Admission control policy available with VMware HA?

There are 3 different types of Admission control policy available.

Host failures cluster tolerates
Percentage of cluster resources reserved as fail over spare capacity
Specify a fail over host

31. How does the Host Failures cluster tolerates admission control policy work?

Select the maximum number of host failures that you can afford, that is, for which you want to guarantee failover. Up to vSphere 4.1, the minimum is 1 and the maximum is 4.

In the Host Failures cluster tolerates admission control policy, we can define the specific number of hosts that can fail in the cluster, and the policy ensures that sufficient resources remain to fail over all the virtual machines from those failed hosts to the other hosts in the cluster. VMware High Availability (HA) uses a mechanism called slots to calculate both the available and the required resources in the cluster for failing over virtual machines from a failed host to other hosts in the cluster.

32. What is SLOT?

As per VMware’s definition,

“A slot is a logical representation of the memory and CPU resources that satisfy the requirements for any powered-on virtual machine in the cluster.”

If you have configured reservations at the VM level, they influence the HA slot calculation. The highest memory reservation and the highest CPU reservation among the VMs in your cluster determine the slot size for the cluster.

33. How the HA Slots are Calculated?

Refer http://www.vmwarearena.com/2012/07/ha-slots-calculation.html.

34. How to Check the HA Slot information from vSphere Client?

Click on the cluster’s Summary tab and click on “Advanced Runtime Info” to see the detailed HA slot information.

35. What is the use of the Host Monitoring status in an HA cluster?

Let’s take an example: you are performing a network maintenance activity on the switches that connect one of the ESX hosts in your HA cluster.

What will happen if the switch connected to that ESX host goes down?

The host will not receive heartbeats, and its ping to the isolation address will also fail. So the host will consider itself isolated, and HA will initiate the restart of the virtual machines from that host on other hosts in the cluster. This is an unwanted situation during a scheduled maintenance window.

To avoid this situation when performing a scheduled activity that may cause an ESX host to become isolated, clear the “Enable Host Monitoring” check box until you are done with the network maintenance activity.

36. How to manually define the HA slot size?

By default, the HA slot size is determined by the highest CPU reservation and the highest memory reservation among the virtual machines. If no reservation is specified at the VM level, a default slot size of 256 MHz for CPU and 0 MB plus memory overhead for RAM is used. We can control the HA slot size manually by using the following values.

There are 4 options related to slot size that we can configure in the HA advanced options:

das.slotMemInMB – Maximum Bound value for HA memory slot size
das.slotCpuInMHz – Maximum Bound value for HA CPU slot Size
das.vmMemoryMinMB – Minimum Bound value for HA memory slot size
das.vmCpuMinMHz – Minimum Bound value for HA CPU slot size

For more HA-related advanced options, please refer to the link: http://www.vmwarearena.com/2012/07/vmware-ha-advanced-options.html

37. How does the “Percentage of cluster resources reserved as failover spare capacity” admission control policy work?

In the Percentage of cluster resources reserved as failover spare capacity admission control policy, we can define a specific percentage of total cluster resources to be reserved for failover. In contrast to the “Host Failures cluster tolerates” admission control policy, it does not use slots. Instead, this policy performs admission control in the following way:

1. It calculates the total resource requirement for all powered-on virtual machines in the cluster and also calculates the total host resources available for virtual machines.
2. It calculates the current CPU and memory failover capacity for the cluster.
3. It checks whether the current CPU or memory failover capacity for the cluster is less than the configured failover capacity (for example, 25%).
4. If it is, admission control does not allow powering on a virtual machine that violates the availability constraints.
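The steps above can be sketched in Python. The function names and sample figures are hypothetical. The sketch assumes the current failover capacity is computed as (total host resources minus total VM requirements) divided by total host resources, which is how I understand the vSphere Availability Guide; the text above does not state the formula, so verify it against the guide for your version.

```python
def failover_capacity_pct(total_host, total_required):
    """Share of host resources left over after the VM requirements, in percent."""
    if total_host <= 0:
        raise ValueError("total host resources must be positive")
    return (total_host - total_required) / total_host * 100.0


def power_on_allowed(host_cpu_mhz, host_mem_mb, req_cpu_mhz, req_mem_mb, configured_pct):
    """Allow the power-on only if both CPU and memory capacity stay at or
    above the configured failover percentage."""
    cpu_ok = failover_capacity_pct(host_cpu_mhz, req_cpu_mhz) >= configured_pct
    mem_ok = failover_capacity_pct(host_mem_mb, req_mem_mb) >= configured_pct
    return cpu_ok and mem_ok


if __name__ == "__main__":
    # Hypothetical cluster: 24000 MHz / 21504 MB of host resources,
    # with powered-on VMs requiring 7000 MHz / 6144 MB in total.
    print(round(failover_capacity_pct(24000, 7000), 1))
    print(power_on_allowed(24000, 21504, 7000, 6144, configured_pct=25))
```

To test a pending power-on, add the new VM's requirements to the totals before calling `power_on_allowed`.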

38. How does the “Specify a failover host” admission control policy work?

In the “Specify a failover host” admission control policy, we can define a specific host as a dedicated failover host. When a host failure is detected, HA attempts to restart the virtual machines on the specified failover host. In this approach, the dedicated failover host sits idle and does not participate in DRS load balancing. DRS will not migrate virtual machines to the defined failover host or place virtual machines on it at power-on.

39. What is VM Monitoring status?

HA usually monitors ESX hosts and, in case of host failure or isolation, restarts the virtual machines from the failed host on other hosts in the cluster. But what if we need HA to monitor for virtual machine failures as well? That is what the VM Monitoring feature in the HA settings is for. VM Monitoring restarts a virtual machine if its VMware Tools heartbeat is not received within the time specified by the monitoring sensitivity.