Remote Services HA Using Pacemaker

Introduction

The Abiquo API can be set up in a load balanced distributed configuration so it can handle heavy network traffic and stay available if a node goes down. However, if the remote services still run on a single machine, a crash of that machine can bring down operations for a whole datacenter.

To avoid this situation, this document describes how to set up 2 remote services servers in an active-passive configuration, so that a crash of the main server does not cause downtime. Note that because this is an active-passive setup, no load balancing is done. It only provides a safety net against a system crash.

Sample Configuration and Essential Guidelines

This document contains a sample configuration but as the sysadmin, it is your responsibility to choose your own HA software. However, you must follow these guidelines:
- Active / Passive
- Floating IP (which is assigned to the Active node and is the IP configured in Abiquo DC)
- Same Redis / Rabbit and abiquo.properties configuration for each RS node in the cluster

 

Scenario

This document assumes the scenario described in the table below. Note that you will need to adjust the commands in the rest of this document to suit your environment (mostly host names and IPs).

Machine             Host name     IP Address
Abiquo API          abiquo        10.60.13.28
Remote services 1   rsha1         10.60.13.57
Remote services 2   rsha2         10.60.13.56
Cluster IP          rscluster     10.60.13.58
Redis server        redis         10.60.13.59
NFS repository      repository    10.60.1.72
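
If you resolve the short names through a hosts file rather than DNS, the entries for this scenario could look like the output of the sketch below. The function name print_rs_hosts is only illustrative, and the names and addresses are the sample values from the table above, so replace them with your own.

```shell
# Print hosts-file entries for the sample scenario in this document.
# The names and IP addresses are sample values; adjust them to your environment.
print_rs_hosts() {
    cat <<'EOF'
10.60.13.28  abiquo
10.60.13.57  rsha1
10.60.13.56  rsha2
10.60.13.58  rscluster
10.60.13.59  redis
10.60.1.72   repository
EOF
}
```

Review the output first, then add the lines to the hosts file of each machine.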

Remote services 1 and 2 will be the nodes forming our cluster, and will serve the remote services webapps on the cluster-wide IP address. This IP will switch from one server to the other in the case of a failure. Redis needs to be on a separate host, because its data must be shared by both remote services machines. It is out of the scope of this document to describe a highly available setup for Redis. If you want to run Redis in such a setup, please refer to the Redis documentation.

It is assumed in this guide that all the machines can reach each other using either the IP or the short name (via hosts file or DNS). Also, it is assumed that both rsha1 and rsha2 have Abiquo remote services installed on them. Follow the remote services installation guide to install them. You also need to keep the abiquo.properties configuration file synced on both RS nodes, which in our case is:

[remote-services]
abiquo.appliancemanager.localRepositoryPath = /opt/vm_repository
abiquo.appliancemanager.repositoryLocation = 10.60.1.72:/opt/vm_repository
abiquo.datacenter.id = rsha
abiquo.rabbitmq.host = 10.60.13.28
abiquo.rabbitmq.password = guest
abiquo.rabbitmq.port = 5672
abiquo.rabbitmq.username = guest
abiquo.redis.host = 10.60.13.59
abiquo.redis.port = 6379
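
One possible way to check that the two copies of abiquo.properties are still in sync is sketched below. The helper names effective_props and props_in_sync are hypothetical, and the comparison is deliberately strict: apart from blank lines and lines starting with #, any difference (including spacing or ordering) is reported as out of sync.

```shell
# Print the effective settings of a properties file:
# blank lines and '#' comment lines are removed.
effective_props() {
    sed -e '/^[[:space:]]*$/d' -e '/^[[:space:]]*#/d' "$1"
}

# Return 0 when two copies of abiquo.properties contain the same settings,
# and a non-zero status otherwise.
# Usage: props_in_sync LOCAL_COPY OTHER_COPY
props_in_sync() {
    [ "$(effective_props "$1")" = "$(effective_props "$2")" ]
}
```

To use it, fetch the copy from the other RS node to a temporary file (for example with scp) and pass both files to props_in_sync.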

Installation of the cluster stack

Abiquo recommends that you use ClusterLabs' Pacemaker as the cluster resource manager. As described on the ClusterLabs site, in Red Hat based distributions the Pacemaker stack uses CMAN for cluster communication. The steps below are taken from Pacemaker's quick start guide for Red Hat systems. We use the same conventions as that guide: [ALL] # denotes a command that needs to be run on all cluster machines, and [ONE] # denotes a command that only needs to be run on one cluster host.

So, start by installing pacemaker and all the needed tools:

[ALL] # yum install pacemaker cman pcs ccs resource-agents

Next, create a CMAN cluster and populate it with your nodes:

[ONE] # ccs -f /etc/cluster/cluster.conf --createcluster pacemaker1
[ONE] # ccs -f /etc/cluster/cluster.conf --addnode rsha1
[ONE] # ccs -f /etc/cluster/cluster.conf --addnode rsha2

Then you need to configure cluster fencing, even if you don't use it:

[ONE] # ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
[ONE] # ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect rsha1
[ONE] # ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect rsha2
[ONE] # ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk rsha1 pcmk-redirect port=rsha1
[ONE] # ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk rsha2 pcmk-redirect port=rsha2
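
Because you will need to adapt the host names to your environment, the following sketch prints the same ccs commands as above for any cluster name and list of nodes. The function name print_ccs_commands is hypothetical, and the function only prints the commands, it does not run them.

```shell
# Print the ccs commands that create a CMAN cluster and redirect fencing
# to Pacemaker for the given nodes. Nothing is executed.
# Usage: print_ccs_commands CLUSTER_NAME NODE...
print_ccs_commands() {
    conf=/etc/cluster/cluster.conf
    cluster=$1
    shift
    echo "ccs -f $conf --createcluster $cluster"
    for node in "$@"; do
        echo "ccs -f $conf --addnode $node"
    done
    echo "ccs -f $conf --addfencedev pcmk agent=fence_pcmk"
    for node in "$@"; do
        echo "ccs -f $conf --addmethod pcmk-redirect $node"
    done
    for node in "$@"; do
        echo "ccs -f $conf --addfenceinst pcmk $node pcmk-redirect port=$node"
    done
}
```

For the sample scenario, print_ccs_commands pacemaker1 rsha1 rsha2 prints the eight commands listed above. Review the output before running the commands on one cluster host.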

When setting up a 2-node CMAN cluster, you need to set some parameters in the cluster.conf file to allow the cluster to bring the hosts online. Edit cluster.conf and change the following tag:

<cman/>

to:

<cman two_node="1" expected_votes="1"/>
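
If you prefer to script this change instead of editing the file by hand, a minimal sketch could look like the following. It assumes the tag appears in the file exactly as <cman/>, as shown above, and the function name set_two_node is hypothetical. A backup copy with a .bak suffix is kept next to the original file.

```shell
# Replace the empty <cman/> tag with the two-node settings.
# A backup of the original file is kept as PATH.bak.
# Usage: set_two_node PATH_TO_CLUSTER_CONF
set_two_node() {
    conf=$1
    cp -p "$conf" "$conf.bak" &&
        sed 's|<cman/>|<cman two_node="1" expected_votes="1"/>|' "$conf.bak" > "$conf"
}
```

On a cluster host you would call it as set_two_node /etc/cluster/cluster.conf and then check the result before copying the file to the other nodes.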

If you have problems with hosts disconnecting from the cluster, you may need to force cluster communication over unicast instead of multicast. To do so, add the transport attribute to the same tag:

<cman two_node="1" expected_votes="1" transport="udpu"/>

At this point, copy the /etc/cluster/cluster.conf file to every node that will form your cluster.

CMAN was originally written for rgmanager and assumes the cluster should not start until the node has quorum, so before trying to start the cluster, disable this behavior:

[ALL] # echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/sysconfig/cman 
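
Note that the echo command above appends the setting again every time it is run. If you script this step, a variant that only adds the line when it is missing could look like this sketch (the function name ensure_line is hypothetical):

```shell
# Append LINE to FILE only if FILE does not already contain that exact line.
# Usage: ensure_line LINE FILE
ensure_line() {
    line=$1
    file=$2
    grep -qxF "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

On each cluster node you would call it as ensure_line "CMAN_QUORUM_TIMEOUT=0" /etc/sysconfig/cman.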

Now you are ready to start up your cluster:

[ALL] # service cman start 
[ALL] # service pacemaker start
[ALL] # chkconfig cman on
[ALL] # chkconfig pacemaker on

Setting basic cluster options

With so many devices and possible topologies, it is nearly impossible to include Fencing in a document like this. For now, disable it.

[ONE] # pcs property set stonith-enabled=false

As we are using a 2-node setup, the concept of quorum does not make sense: if one node fails, the remaining node can never have more than half of the nodes available. Tell the cluster to ignore the loss of quorum:

[ONE] # pcs property set no-quorum-policy=ignore

Also, set a resource stickiness value, which will prevent the resources from being moved back to the original host when the cluster recovers from a failure:

[ONE] # pcs resource defaults resource-stickiness=100

Adding resources

At this point you have a functional cluster, but it is not managing any resources. We will add resources to the cluster to manage every component needed to run the Abiquo services. Because the cluster will be in charge of starting the abiquo-tomcat service, first stop the service and disable it at boot time:

[ALL] # service abiquo-tomcat stop
[ALL] # chkconfig abiquo-tomcat off

Now, start adding resources to the cluster:

[ONE] # pcs resource create vm_repository ocf:heartbeat:Filesystem device=10.60.1.72:/opt/vm_repository directory=/opt/vm_repository fstype=nfs options=defaults
[ONE] # pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=10.60.13.58 cidr_netmask=24
[ONE] # pcs resource create tomcat-service lsb:abiquo-tomcat

We've added 3 resources here:

  1. A filesystem mount, responsible for mounting the NFS template repository on the active node.
  2. An IP address (cluster IP) that will switch to the remaining node in case of failure.
  3. Finally, the abiquo-tomcat init script that needs to be run to start or stop the Abiquo tomcat service.

Then set up a group of resources so that these resources always run on the same machine:

[ONE] # pcs resource group add abiquo_rs vm_repository ClusterIP tomcat-service

Note that the order of the names in the command also determines the startup and shutdown order in the group. In this example, the cluster will first mount the NFS share, then bring up the cluster IP, and then start the tomcat service. Shutdown happens in the reverse order.

This alone is enough for the cluster to switch the resource group from node to node in case of a crash. In the case of a network failure, however, the nodes have no way to contact each other to determine their status, so the services might end up running on both machines. This is known as a "split brain" situation. To avoid it, add an extra resource to ping your gateway IP address, so that services are shut down in case of a network failure:

[ONE] # pcs resource create ping ocf:pacemaker:ping host_list=10.60.13.1 timeout=5 attempts=3
[ONE] # pcs resource clone ping connection

And lastly, add a colocation constraint so the tomcat service, cluster IP and NFS mount are located on the node with a successful ping:

[ONE] # pcs constraint colocation add abiquo_rs with ping-clone score=INFINITY

And that's it! You can check the current status of the cluster with the crm_mon command:

[root@rsha1 ~]# crm_mon -1
Last updated: Thu Sep 18 01:35:00 2014
Last change: Wed Sep 17 04:53:07 2014 via cibadmin on rsha1
Stack: cman
Current DC: rsha2 - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
2 Nodes configured
5 Resources configured

Online: [ rsha1 rsha2 ]
 Resource Group: abiquo_rs
     vm_repository	(ocf::heartbeat:Filesystem):	Started rsha1
     ClusterIP	(ocf::heartbeat:IPaddr2):	Started rsha1
     tomcat-service	(lsb:abiquo-tomcat):	Started rsha1
 Clone Set: ping-clone [ping]
     Started: [ rsha1 rsha2 ]
[root@rsha1 ~]#

Running crm_mon -1 prints the info and quits. Running crm_mon with no arguments will enter a "watch" mode that will periodically refresh the info.
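
If you want to use this status in a script, for example to find out which node is currently active, the output of crm_mon -1 can be filtered as in the sketch below. It assumes the output format shown in the sample above, which may differ in other Pacemaker versions, and the function name active_node is hypothetical.

```shell
# Read `crm_mon -1` output on stdin and print the node on which
# the named resource is reported as Started.
# Usage: crm_mon -1 | active_node RESOURCE_NAME
active_node() {
    awk -v res="$1" 'NF >= 3 && $1 == res && $(NF - 1) == "Started" { print $NF }'
}
```

For the sample output above, crm_mon -1 | active_node ClusterIP would print rsha1.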

You can now add the RS cluster to Abiquo, remembering to use the cluster IP to ensure the failover is available.

 
