So, in summary, they had: (1) no accurate inventory of what should be running or what actually was running; (2) fundamental reliability problems in their cluster configuration, not least the lack of basic self-integrity checks to confirm that services and clusters were actually running and in a valid state; (3) a wholly unimplemented mechanism for mapping high-availability clusters across physical infrastructure; (4) no network-layer automation to ensure accurate provisioning and segmentation of services and of access to service data; and (5) unknown internal parties randomly accessing and altering cluster member configuration.
Yes, this describes a management problem. Shooting the messenger sounds like the easy default choice for middle management, given how many years and how much money it must have taken to create that situation...
For a quick check next time, you can run lsof with various arguments to list the services listening on UDP and TCP ports. For cluster health and service status checks, I have had good experience with Pacemaker (https://clusterlabs.org/pacemaker/) plus Corosync (https://github.com/corosync/corosync).
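As a rough illustration of the lsof suggestion, here is a minimal Python sketch that shells out to lsof and collects listening TCP sockets and open UDP sockets into a simple inventory. The function names (parse_lsof, listening_sockets) are made up for this example, and it assumes lsof's default column layout (COMMAND, PID, USER, FD, TYPE, DEVICE, SIZE/OFF, NODE, NAME), which may differ on some platforms.

```python
import shutil
import subprocess


def parse_lsof(output):
    """Turn default `lsof -i` column output into (command, pid, protocol, address) tuples.

    Assumes the usual nine-column layout; lines that do not fit are skipped.
    """
    rows = []
    for line in output.splitlines():
        parts = line.split()
        # Skip the header row and anything too short to be a socket entry.
        if len(parts) < 9 or parts[0] == "COMMAND":
            continue
        command, pid, proto, address = parts[0], parts[1], parts[7], parts[8]
        if proto in ("TCP", "UDP") and pid.isdigit():
            rows.append((command, int(pid), proto, address))
    return rows


def listening_sockets():
    """Run lsof for listening TCP sockets and open UDP sockets.

    Returns an empty list if lsof is not installed.
    """
    if shutil.which("lsof") is None:
        return []
    rows = []
    # -n and -P disable host and port name lookups, so output is numeric and fast.
    for selector in (["-iTCP", "-sTCP:LISTEN"], ["-iUDP"]):
        # lsof exits non-zero when nothing matches, so the return code is not checked.
        result = subprocess.run(
            ["lsof", "-nP", *selector], capture_output=True, text=True
        )
        rows.extend(parse_lsof(result.stdout))
    return rows
```

Calling listening_sockets() on each cluster member and diffing the result against the list of services that are supposed to be there gives a crude first inventory. Run it as root, since an unprivileged lsof generally only shows the caller's own processes.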