Introduction
The Abiquo API can be set up in a load-balanced, distributed configuration so that it can handle heavy network traffic and stay available if a node goes down. However, if the remote services still run on a single machine, a crash of that machine can bring down operations for the whole datacenter.
To avoid this situation, this document describes how to set up two remote services servers in an active-passive configuration, so that a crash of the main server does not cause downtime. Because this is an active-passive setup, no load balancing is done; it only provides a safety net against a system crash.
Sample Configuration and Essential Guidelines
Scenario
We are going to describe this setup assuming the scenario described below. Note that you will need to adjust the rest of the commands in this document to suit your environment (mostly host names and IPs).
Machine | Host name | IP Address |
---|---|---|
Abiquo API | abiquo | 10.60.13.28 |
Remote services 1 | rsha1 | 10.60.13.57 |
Remote services 2 | rsha2 | 10.60.13.56 |
Cluster IP | rscluster | 10.60.13.58 |
Redis server | redis | 10.60.13.59 |
NFS repository | repository | 10.60.1.72 |
Remote services 1 and 2 will be the nodes forming our cluster, serving the remote services webapps on the cluster-wide IP address. This IP will switch from one server to the other in case of a failure. Redis needs to be on a separate host, because its data must be shared by both remote services machines. Describing a highly available setup for Redis is outside the scope of this document. If you want to run Redis in such a setup, please refer to the Redis documentation.
It is assumed in this guide that all the machines can reach each other using either the IP or the short name (via hosts file or DNS). Also, it is assumed that both rsha1 and rsha2 have Abiquo remote services installed on them. Follow the remote services installation guide to install them. You also need to keep the abiquo.properties configuration file synced on both RS nodes, which in our case is:
[remote-services]
abiquo.appliancemanager.localRepositoryPath = /opt/vm_repository
abiquo.appliancemanager.repositoryLocation = 10.60.1.72:/opt/vm_repository
abiquo.datacenter.id = rsha
abiquo.rabbitmq.host = 10.60.13.28
abiquo.rabbitmq.password = guest
abiquo.rabbitmq.port = 5672
abiquo.rabbitmq.username = guest
abiquo.redis.host = 10.60.13.59
abiquo.redis.port = 6379
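Since both nodes must carry the same abiquo.properties, a quick comparison can catch configuration drift before a failover exposes it. The following is only a minimal sketch and not part of the Abiquo tooling: the helper name `props_in_sync` is hypothetical, and it compares two local copies, so it assumes you have first fetched the other node's file (for example with scp).

```shell
# Hypothetical helper: succeed only if two copies of abiquo.properties
# have identical contents. Both arguments are local file paths.
props_in_sync() {
    # cmp -s prints nothing and returns 0 only for byte-identical files
    cmp -s "$1" "$2"
}

# Self-contained demo: temporary files stand in for the two real copies
a=$(mktemp) && b=$(mktemp)
printf 'abiquo.redis.host = 10.60.13.59\n' > "$a"
printf 'abiquo.redis.host = 10.60.13.59\n' > "$b"
if props_in_sync "$a" "$b"; then echo "in sync"; else echo "out of sync"; fi

# Make the second copy differ and compare again
printf 'abiquo.redis.port = 6379\n' >> "$b"
if props_in_sync "$a" "$b"; then echo "in sync"; else echo "out of sync"; fi
rm -f "$a" "$b"
```

The demo reports the first pair as in sync and the second as out of sync. A byte-level comparison is deliberately strict: even a whitespace difference is flagged, which may or may not be what you want.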
Installation of the cluster stack
Abiquo recommends that you use ClusterLabs' Pacemaker as the cluster resource manager. As described on the ClusterLabs site, on Red Hat based distributions the Pacemaker stack uses CMAN for cluster communication. The steps below are taken from Pacemaker's quick start guide for Red Hat systems. We follow the conventions of that guide: [ALL] # denotes a command that must be run on all cluster machines, and [ONE] # denotes a command that only needs to be run on one cluster host.
So, start by installing pacemaker and all the needed tools:
[ALL] # yum install pacemaker cman pcs ccs resource-agents
Next, create a CMAN cluster and populate it with your nodes:
[ONE] # ccs -f /etc/cluster/cluster.conf --createcluster pacemaker1
[ONE] # ccs -f /etc/cluster/cluster.conf --addnode rsha1
[ONE] # ccs -f /etc/cluster/cluster.conf --addnode rsha2
Then you need to configure cluster fencing, even if you don't use it:
[ONE] # ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
[ONE] # ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect rsha1
[ONE] # ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect rsha2
[ONE] # ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk rsha1 pcmk-redirect port=rsha1
[ONE] # ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk rsha2 pcmk-redirect port=rsha2
CMAN was originally written for rgmanager and assumes the cluster should not start until the node has quorum, so before trying to start the cluster, disable this behavior:
[ALL] # echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/sysconfig/cman
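Note that appending with >> adds a duplicate line every time the step is repeated. If you script this step, an idempotent variant may be preferable. The sketch below is one way to do it; the helper name `ensure_setting` is hypothetical, and the demo writes to a temporary file rather than the real /etc/sysconfig/cman.

```shell
# Hypothetical helper: append KEY=VALUE to a sysconfig-style file only if
# no line already sets KEY. An existing value for KEY is left untouched.
ensure_setting() {
    file=$1 key=$2 value=$3
    if ! grep -q "^${key}=" "$file" 2>/dev/null; then
        echo "${key}=${value}" >> "$file"
    fi
}

# Demo against a temporary file standing in for /etc/sysconfig/cman
conf=$(mktemp)
ensure_setting "$conf" CMAN_QUORUM_TIMEOUT 0
ensure_setting "$conf" CMAN_QUORUM_TIMEOUT 0   # second call is a no-op
cat "$conf"
rm -f "$conf"
```

On a real node the equivalent call would be ensure_setting /etc/sysconfig/cman CMAN_QUORUM_TIMEOUT 0, run as root on every cluster machine.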
Now you are ready to start up your cluster:
[ALL] # service cman start
[ALL] # service pacemaker start
[ALL] # chkconfig cman on
[ALL] # chkconfig pacemaker on
Setting basic cluster options
With so many devices and possible topologies, it is nearly impossible to cover fencing in a document like this. For now, disable it:
[ONE] # pcs property set stonith-enabled=false
In a 2-node setup the concept of quorum does not make sense: quorum requires more than half of the nodes, so the surviving node can never have it after a failure. Tell the cluster to ignore loss of quorum:
[ONE] # pcs property set no-quorum-policy=ignore
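The quorum arithmetic behind this choice can be checked with a few lines of shell. This is only an illustrative sketch, assuming one vote per node and the usual majority rule; the function name `quorum_votes` is made up for the example.

```shell
# Votes needed for quorum: strictly more than half of the nodes
# (assumes one vote per node)
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

for nodes in 2 3; do
    need=$(quorum_votes "$nodes")
    left=$(( nodes - 1 ))      # nodes remaining after a single failure
    if [ "$left" -ge "$need" ]; then
        verdict="keeps quorum"
    else
        verdict="loses quorum"
    fi
    echo "$nodes nodes: need $need, $left left after one failure -> $verdict"
done
```

With two nodes the survivor holds one vote out of the two required, which is why the cluster would otherwise stop all resources after a failure. With three nodes a single failure still leaves a majority.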
Also, set a resource stickiness value, which prevents the resources from being moved back to the original host when the cluster recovers from a failure:
[ONE] # pcs resource defaults resource-stickiness=100
Adding resources
At this point you have a functional cluster, but it is not managing any resources. We will add resources to the cluster to manage every component needed to run the Abiquo services. Because the cluster will be in charge of starting the abiquo-tomcat service, first stop it and disable it at boot time:
[ALL] # service abiquo-tomcat stop
[ALL] # chkconfig abiquo-tomcat off
Now, start adding resources to the cluster:
[ONE] # pcs resource create vm_repository ocf:heartbeat:Filesystem device=10.60.1.72:/opt/vm_repository directory=/opt/vm_repository fstype=nfs options=defaults
[ONE] # pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=10.60.13.58 cidr_netmask=24
[ONE] # pcs resource create tomcat-service lsb:abiquo-tomcat
We've added 3 resources here:
- A filesystem mount, responsible for mounting the NFS template repository on the active node.
- An IP address (cluster IP) that will switch to the remaining node in case of failure.
- Finally, the abiquo-tomcat init script that needs to be run to start or stop the Abiquo tomcat service.
Then set up a group of resources so that these resources always run on the same machine:
[ONE] # pcs resource group create abiquo_rs vm_repository ClusterIP tomcat-service
Note that the order of the names in the command also determines the startup and shutdown order within the group. In this example, the cluster will first mount the NFS share, then bring up the cluster IP, and then start the tomcat service. Shutdown happens in the reverse order.
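To make the ordering concrete, the short sketch below derives both sequences from the member list. It does not talk to the cluster at all; it only illustrates the rule stated above, and the variable names are arbitrary.

```shell
# Group members in the order they were passed to "pcs resource group create"
members="vm_repository ClusterIP tomcat-service"

# Start order follows the list as written
start_order=$members

# Stop order is the same list reversed: prepend each member in turn
stop_order=""
for r in $members; do
    stop_order="$r${stop_order:+ $stop_order}"
done

echo "start: $start_order"
echo "stop:  $stop_order"
```

The stop sequence comes out as tomcat-service, ClusterIP, vm_repository, so the service is stopped before its IP address and NFS mount are taken away.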
This alone will suffice for the cluster to switch the resource group from node to node in case of a crash. In the case of a network failure, though, services might end up running on both machines, because the nodes have no way to contact each other to determine their status. This is known as a "split brain" situation. To avoid it, add an extra resource that pings your gateway IP address, so that services are shut down on a node that loses network connectivity:
[ONE] # pcs resource create ping ocf:pacemaker:ping host_list=10.60.13.1 timeout=5 attempts=3
[ONE] # pcs resource clone ping connection
And lastly, add a colocation constraint so the tomcat service, cluster IP and NFS mount are located on the node with a successful ping:
[ONE] # pcs constraint colocation add abiquo_rs with ping score=INFINITY
And that's it! You can check the current status of the cluster with the crm_mon command:
[root@rsha1 ~]# crm_mon -1
Last updated: Thu Sep 18 01:35:00 2014
Last change: Wed Sep 17 04:53:07 2014 via cibadmin on rsha1
Stack: cman
Current DC: rsha2 - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
2 Nodes configured
5 Resources configured

Online: [ rsha1 rsha2 ]

 Resource Group: abiquo_rs
     vm_repository (ocf::heartbeat:Filesystem): Started rsha1
     ClusterIP (ocf::heartbeat:IPaddr2): Started rsha1
     tomcat-service (lsb:abiquo-tomcat): Started rsha1
 Clone Set: ping-clone [ping]
     Started: [ rsha1 rsha2 ]
[root@rsha1 ~]#
Running crm_mon -1 prints the info and quits. Running crm_mon with no arguments will enter a "watch" mode that will periodically refresh the info.
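If you want to check from a script which node currently holds a resource, the one-shot output can be filtered with awk. The helper below is a rough sketch: the name `resource_node` is hypothetical, and it assumes the line layout shown in the sample output above, which may differ between Pacemaker versions. The demo feeds it lines copied from that sample instead of calling crm_mon.

```shell
# Hypothetical helper: read "crm_mon -1" output on stdin and print the
# node on which the named resource is reported as Started.
resource_node() {
    awk -v res="$1" '$1 == res && $(NF-1) == "Started" { print $NF }'
}

# Demo input: resource lines taken from the sample output in this document
sample=' Resource Group: abiquo_rs
     vm_repository (ocf::heartbeat:Filesystem): Started rsha1
     ClusterIP (ocf::heartbeat:IPaddr2): Started rsha1
     tomcat-service (lsb:abiquo-tomcat): Started rsha1
 Clone Set: ping-clone [ping]
     Started: [ rsha1 rsha2 ]'

printf '%s\n' "$sample" | resource_node ClusterIP
```

On a cluster node the equivalent would be crm_mon -1 | resource_node ClusterIP. Running it before and after a test failover shows whether the cluster IP actually moved.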
You can now add the RS cluster to Abiquo, remembering to use the cluster IP to ensure the failover is available.