Managing and Monitoring VxRail
Recently I was tasked with setting up management and monitoring of a Dell EMC VxRail Hyper-Converged Infrastructure Appliance. Setting this up has not been a smooth ride. In this blog post I share some of the details to help you with your own setup.
Set up monitoring (EMC does not want you to)
After VxRail had been set up, the first thing I wanted to do was configure monitoring. After consulting with support we were told that no external monitoring is supported. The system does not support SNMP, nor can it send out e-mail alerts when a critical event is found. The only option is a dial-home to Dell EMC using EMC Secure Remote Services (ESRS). When critical events occur we don't want to wait for EMC to contact us; we want to be aware right away.
The VxRail does not integrate with an existing ESRS cluster. We already had one in place for other EMC products and we are not keen on adding more administrative load. The ESRS that ships with VxRail doesn't support a proxy, so if you don't have direct internet access you are out of luck. Update: after we provided feedback, EMC fixed this in VxRail Appliance software 4.0.300, which adds support for an external ESRS.
Monitoring over vCenter
Monitoring of vSAN and host hardware sensors is available through the vCenter SDK. Unfortunately this does not cover any of the VxRail appliance's internal events, so it is not a satisfying solution to me.
So what can I do? Monitor the PostgreSQL DB!
As there is no direct support from EMC for setting this up, we decided to take matters into our own hands. The VxRail Appliance (VXRA) ships with an internal PostgreSQL database that holds the events.
Let’s start exploring the database.
# List databases:
psql -U postgres -l

# List tables:
psql -U postgres -d mysticmanager -c "\dt"

# List columns:
psql -U postgres -d mysticmanager -c "select * from event_code where false"
psql -U postgres -d mysticmanager -c "select * from mystic_event where false"

# Find criticalities:
psql -U postgres -d mysticmanager -c "SELECT DISTINCT severity from event_code"
 severity
-----------
 3Info
 2Warning
 1Error
 0Critical
Listing the tables shows two of interest: mystic_event and event_code. We are not interested in events of severity '3Info'.
We can put this all together as follows:
SELECT count(*) FROM event_code AS ec INNER JOIN mystic_event AS me ON ec.code = me.code WHERE ec.severity ~ '[0-2].*' AND me.unread = 't';
This SQL query uses a regular expression on the severity column to return a count of all unread events with severity Critical, Error, or Warning.
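To turn this into an actionable check, the query can be wrapped in a small shell script. Below is a minimal sketch: the `count_to_status` helper, its name, and its thresholds are my own invention (not part of VxRail), and the psql invocation assumes it runs locally on the appliance.

```shell
#!/bin/sh
# Count unread Warning/Error/Critical events (query from this post).
QUERY="SELECT count(*) FROM event_code AS ec \
INNER JOIN mystic_event AS me ON ec.code = me.code \
WHERE ec.severity ~ '[0-2].*' AND me.unread = 't';"

# Map an event count to a Nagios-style message and exit code
# (helper name and thresholds are my own choice).
count_to_status() {
    if [ "$1" -gt 0 ]; then
        echo "CRITICAL - $1 unread VxRail events"
        return 2
    fi
    echo "OK - no unread VxRail events"
    return 0
}

# On the appliance itself (uncomment to use):
# COUNT=$(psql -U postgres -d mysticmanager -t -A -c "$QUERY")
# count_to_status "$COUNT"
# exit $?
```

The exit codes (0 = OK, 2 = CRITICAL) follow the usual Nagios plugin convention, so the script can be dropped in as a check command.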
Integrating this into your monitoring solution
Most monitoring tools support running a SQL query against the most popular databases. For example, in Nagios you can use check_sql. However, this brings some additional requirements. First of all, PostgreSQL only listens on the loopback interface by default.
Change the postgresql.conf file like so:
#listen_addresses='127.0.0.1'   # Listen on local Unix domain and TCP/IP socket
listen_addresses='*'            # Listen on all interfaces
logging_collector = on          # Redirect output to pg_log directory
# log timestamp, process-id, session log line#, user and database
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
log_min_messages=warning
log_min_error_statement=warning
log_min_duration_statement=3000 # log SQL which takes longer than 3000ms
log_lock_waits = on
log_temp_files=1024             # log temp files with size >= 1024kB
log_filename='postgresql%Z.log' # name without date/time so it can be rotated using Linux logrotate
log_rotation_age='0'            # disable log rotation as it is handled by Linux logrotate
log_rotation_size='0'           # disable log rotation
client_min_messages=warning     # do not report debug and notice level messages
max_connections = 100
effective_cache_size = 128MB
shared_buffers = 8MB
work_mem = 1MB
maintenance_work_mem = 16MB
wal_buffers = 64kB
Next, we need to set up client authentication. The example below 'trusts' the monitoring system to log in as any user, without supplying a password. Obviously, don't do this in a production environment; set up a monitoring account with read-only access instead.
Change the pg_hba.conf file like so:
# /var/lib/pgsql/data/pg_hba.conf
local   all   all                        trust   # Trust connections over the Unix domain socket
host    all   all   127.0.0.1/32         trust
host    all   all   <monitoring-ip>/32   trust   # fill in your monitoring host's IP
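If you do want to follow the advice above and use a dedicated read-only account instead of blanket trust, a sketch could look like the following. The role name 'monitor' and the password are placeholders of my own choosing; VxRail does not ship such an account.

```shell
# Hypothetical read-only role for monitoring (name/password are placeholders).
psql -U postgres -d mysticmanager <<'SQL'
CREATE ROLE monitor LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE mysticmanager TO monitor;
GRANT SELECT ON mystic_event, event_code TO monitor;
SQL
```

With this in place, the pg_hba.conf line for the monitoring host can use md5 authentication for the monitor user instead of trust.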
Next, restart the postgresql service (or the full appliance) and you are in business!
When you are in trouble you should engage EMC support immediately; EMC does not allow customers to self-help when there are issues. However, that hasn't stopped us from trying ourselves. After all, no one likes waiting for support when a serious issue is bringing down production. Here is a quick overview of useful commands:
| Task | Command |
|------|---------|
| Restart Loudmouth | systemctl restart vmware-loudmouth |
| Restart Marvin | systemctl restart vmware-marvin |
| Restart mystic | systemctl restart runjars |
| Query Loudmouth | /usr/lib/vmware-loudmouth/bin/loudmouthc query |
| Create database dump | pg_dump -U postgres > |
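Using the service names from the table above, a quick health sweep can be scripted. The `check_services` wrapper is my own sketch, not a VxRail-provided tool:

```shell
#!/bin/sh
# Report whether each given systemd unit is active.
# (Wrapper is my own sketch; the service names come from the table above.)
check_services() {
    for svc in "$@"; do
        if systemctl is-active --quiet "$svc" 2>/dev/null; then
            echo "$svc: running"
        else
            echo "$svc: NOT running"
        fi
    done
}

# On the appliance:
# check_services vmware-loudmouth vmware-marvin runjars
```

Any service reported as NOT running is a candidate for the restart commands listed above.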
You can find the most recently updated log files like this:
find / -name '*.log' -printf "%T+\t%p\n" | sort
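Wrapped in a small helper (my own addition, not an appliance tool), this gives an easy way to pull up only the N freshest log files. Note that -printf is a GNU find extension, which is fine on the SLES-based appliance:

```shell
#!/bin/sh
# Print the N most recently modified *.log files under a directory, newest first.
# (Helper is my own sketch; relies on GNU find's -printf.)
recent_logs() {
    dir=$1
    n=$2
    find "$dir" -name '*.log' -printf '%T+\t%p\n' 2>/dev/null \
        | sort -r | head -n "$n" | cut -f2
}

# Example: the five freshest logs under /var/log
# recent_logs /var/log 5
```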
Overall, these are the most important logs to watch:
Quick summary on my experience
Draw your own conclusions from this…
- Every appliance needs its own ESRS.
  - Does not integrate with existing ESRS for VPLEX etc.
  - Fixed in 4.0.300.
- No third-party monitoring supported.
- Does not easily integrate with existing vSphere environments.
  - Using VUM is not supported.
  - Integrating with an existing DvSwitch is not supported.
  - No live migrations from other clusters when using a DvSwitch.
  - Renaming the appliance VM is not supported.
    - Fixed in 4.0.300.
- Behind on software updates (ESXi, vCenter, vSAN).
  - vCenter 6.5 not supported.
    - Fixed for external vCenter in 4.0.310 (Sept 2017).
    - Native support in 4.5.
- No CLI.
- VxRail analysis is done only by the Dell EMC Engineering team; for any issue you face on VxRail, you need to contact them for help.