Troubleshooting monitoring

Monitoring services

The monitoring system has the following services:

cassandra
kairosdb
abiquo-emmett
abiquo-delorean

Restarting the monitoring system

When you restart the monitoring server or cluster, the Cassandra server may take some time to start up, which may cause KairosDB to crash.

Check that all services are up with the following command, replacing service_name with the name of each service in turn:
```
ps aux | grep service_name
```
Manually start any services that are not running, such as KairosDB.
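
To check all four services in one pass, a small helper can scan a process listing for each name. This is a minimal sketch, not an official Abiquo tool: the function name missing_services is hypothetical, and it assumes that each service name appears somewhere in its process command line, as the grep above also assumes.

```shell
#!/bin/sh
# Service names as listed in this document.
SERVICES="cassandra kairosdb abiquo-emmett abiquo-delorean"

# Read a process listing on stdin and print every service name
# that does not appear anywhere in it (one name per line).
missing_services() {
    listing=$(cat)
    for svc in $SERVICES; do
        printf '%s\n' "$listing" | grep -q -- "$svc" || printf '%s\n' "$svc"
    done
}

# Example use (grep -v grep drops any stray grep processes from the listing):
#   ps aux | grep -v grep | missing_services
```

Any name that the helper prints is a candidate for a manual start.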

Troubleshoot the 500 MONITORING-30 error

If Abiquo cannot connect to the monitoring server, it usually returns the following error:

500. MONITORING-30 - Something has gone wrong on the watchtower server

To resolve this error, follow these steps in order.

Check the monitoring queue in RabbitMQ (on the API server or Datanode server):

# rabbitmqctl list_queues messages consumers name

Listing queues
0	0	abiquo.vmactionplan.execution
0	1	abiquo.scheduler.slow.requests
0	1	abiquo.scheduler.fast.requests
0	1	abiquo.vsm.eventsynk
0	1	abiquo.actionplan.execution
0	1	abiquo.nars.requests.mothership7
0	1	abiquo.nodecollector.notifications
0	0	abiquo.vappspec.parking-expect-no-consumers
0	1	watchtower.alarm.notificacion
0	1	abiquo.bpm.notifications
0	1	abiquo.tracer.traces.tenantevents.mothership7
0	1	abiquo.nars.requests.mothership-pcr
0	1	watchtower.events.event
0	1	abiquo.virtualfactory.notifications
0	1	abiquo.datacenter.requests.mothership7.virtualfactory
0	1	abiquo.api.synchrs.requests
0	1	abiquo.tracer.traces.userevents.mothership7
0	0	abiquo.pcrsync.parking-expect-no-consumers
0	1	abiquo.nars.responses
0	0	abiquo.vmactionplan.schedule
0	1	abiquo.scheduler.requests
0	1	abiquo.am.notifications
0	0	abiquo.datacenter.requests.mothership7-pcr.virtualfactory
0	1	abiquo.vappspec.messages
0	1	abiquo.pcrsync.messages
0	1	abiquo.tracer.traces.allevents.mothership7
0	1	abiquo.ha.tasks
0	1	abiquo.datacenter.requests.mothership7.bpm
204	1	watchtower.alarm.evaluation
0	1	abiquo.actionplan.schedule
0	1	abiquo.datacenter.requests.mothership-pcr.virtualfactory
0	1	abiquo.virtualmachines.definitionsyncs
0	1	abiquo.tracer.traces.eventpersister.mothership7

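If messages are accumulating in the watchtower.alarm.evaluation queue (204 messages in the example output above), the monitoring server is not consuming them, so check the watchtower host.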
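
With many queues, the backlog is easier to spot if the listing is filtered. The following sketch reads the three-column output of the rabbitmqctl command used above (messages, consumers, name) and prints only the queues that hold messages. The function name backlogged_queues is hypothetical.

```shell
#!/bin/sh
# Read "messages consumers name" rows on stdin and print the queues
# with at least one pending message. Non-numeric lines such as
# "Listing queues" are skipped.
backlogged_queues() {
    awk '$1 ~ /^[0-9]+$/ && $1 + 0 > 0 {
        print $3 ": " $1 " messages, " $2 " consumers"
    }'
}

# Example use:
#   rabbitmqctl list_queues messages consumers name | backlogged_queues
```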

Check that the location of the watchtower host is correctly configured in the abiquo.properties file:

# grep watchtower.host /opt/abiquo/config/abiquo.properties
abiquo.watchtower.host = monitoring.bcn.abiquo.com
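
Once the property is confirmed, it can help to extract the value and test whether the host is reachable from the Abiquo server. The sketch below only parses the properties file. The function name watchtower_host is hypothetical, and the connectivity check in the comment assumes that nc is installed and that the monitoring service listens on port 36638 (the port shown in the netstat sample later in this section); adjust both to your environment.

```shell
#!/bin/sh
# Print the value of abiquo.watchtower.host from a properties file,
# with surrounding whitespace removed.
watchtower_host() {
    sed -n 's/^[[:space:]]*abiquo\.watchtower\.host[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*$/\1/p' "$1"
}

# Example use (port 36638 is an assumption, see above):
#   host=$(watchtower_host /opt/abiquo/config/abiquo.properties)
#   nc -z "$host" 36638 && echo "reachable" || echo "not reachable"
```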

Check the available storage space on the watchtower host.

If the disk is full or nearly full, remove data files that are older than 30 days:

find /var/lib/cassandra/data/kairosdb/data_points -mtime +30 -print -delete
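
Before deleting anything, it may be worth checking how full the disk is and previewing which files the cleanup would touch. The following is a sketch under assumptions: the function names old_files and usage_percent are hypothetical, and unlike the command above, the preview is limited to regular files (-type f) and never deletes.

```shell
#!/bin/sh
# List regular files under directory $1 that were last modified
# more than $2 days ago. This only prints; nothing is deleted.
old_files() {
    find "$1" -type f -mtime +"$2" -print
}

# Print the usage percentage (without the % sign) of the filesystem
# that holds path $1, using the portable df -P output format.
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example use:
#   usage_percent /var/lib/cassandra
#   old_files /var/lib/cassandra/data/kairosdb/data_points 30
```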

Check that the services are listening on their ports:

# netstat -tlpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:9160            0.0.0.0:*               LISTEN      3565/java           
tcp        0      0 127.0.0.1:3306          0.0.0.0:*               LISTEN      3930/mysqld         
tcp        0      0 127.0.0.1:43605         0.0.0.0:*               LISTEN      3565/java           
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      3559/sshd           
tcp        0      0 10.60.20.32:7000        0.0.0.0:*               LISTEN      3565/java           
tcp        0      0 127.0.0.1:7199          0.0.0.0:*               LISTEN      3565/java           
tcp6       0      0 :::9100                 :::*                    LISTEN      3562/node_exporter  
tcp6       0      0 0.0.0.0:9042            :::*                    LISTEN      3565/java           
tcp6       0      0 :::22                   :::*                    LISTEN      3559/sshd           
tcp6       0      0 :::36638                :::*                    LISTEN      3561/java
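
Reading the listing by eye is error-prone, so a helper can report which expected ports have no listener. This is a sketch: the function name missing_ports is hypothetical, and the port list in the example is an assumption. Ports 9160, 9042 and 36638 are taken from the sample output above, while 8080 is the KairosDB default HTTP port and may differ in your installation.

```shell
#!/bin/sh
# Read netstat -tln style output on stdin and print each port given
# as an argument that has no socket in LISTEN state.
missing_ports() {
    listing=$(cat)
    for port in "$@"; do
        printf '%s\n' "$listing" |
            awk -v p="$port" '
                # Column 4 is the local address; the port follows the last colon.
                $6 == "LISTEN" { n = split($4, a, ":"); if (a[n] == p) found = 1 }
                END { exit !found }' ||
            printf '%s\n' "$port"
    done
}

# Example use (port list is an assumption, see above):
#   netstat -tln | missing_ports 9160 9042 8080 36638
```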

Reboot the monitoring server, because restarting the services alone does not bring the Cassandra service up correctly. After the reboot, make sure that all services are up and running. If KairosDB starts before Cassandra is ready, it will fail, and Abiquo will then report exceptions from Emmett about refused connections to localhost. To resolve this issue, start KairosDB manually on the monitoring server.
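
The manual start can be made more reliable by waiting until Cassandra accepts connections first. The sketch below is a generic retry helper, not part of Abiquo. The function name wait_until is hypothetical, and the example in the comment assumes that nc is installed, that Cassandra listens on port 9042 (or 9160, depending on the version), and that KairosDB is managed as a systemd unit named kairosdb.

```shell
#!/bin/sh
# Run a command up to $1 times, once per second, until it succeeds.
# Returns 0 on the first success, or 1 if every attempt fails.
wait_until() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        "$@" && return 0
        attempts=$((attempts - 1))
        [ "$attempts" -gt 0 ] && sleep 1
    done
    return 1
}

# Example use (port and unit name are assumptions, see above):
#   wait_until 120 nc -z localhost 9042 && systemctl start kairosdb
```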