MongoDB Failures and Resolution

Over the last 24 hours we performed maintenance on our backend systems. This may have caused brief gaps in the VPSMon historical graphs, as well as downtime at the VPSMon system portal.

The maintenance has been completed and has increased the stability and failover capabilities of our systems.

During this maintenance we identified an issue with our MongoDB replica set that appeared when a single secondary member lost its network connectivity, and we implemented a resolution for it. A combination of bugs in the MongoDB PHP and Python drivers and their configuration defaults led to a situation where all MongoDB activity halted when a single secondary in the replica set degraded. We've introduced better mechanisms for handling these failure situations, which will prevent future downtime caused by our system's database.
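
For illustration, one common guard against this kind of stall is to set explicit connection and socket timeouts instead of relying on driver defaults. The sketch below is a generic example, not the exact change described above. The helper name `build_replica_set_uri`, the host names, the replica set name `rs0`, and the timeout values are all assumptions chosen for the example. `replicaSet`, `connectTimeoutMS`, and `socketTimeoutMS` are standard MongoDB connection string options.

```python
from urllib.parse import urlencode


def build_replica_set_uri(hosts, replica_set,
                          connect_timeout_ms=2000, socket_timeout_ms=5000):
    """Build a MongoDB connection URI with explicit timeouts.

    Hypothetical helper: the default timeout values here are
    illustrative assumptions, not recommendations from the post.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    options = {
        # Name of the replica set the client should discover.
        "replicaSet": replica_set,
        # Give up on opening a connection to an unreachable member.
        "connectTimeoutMS": connect_timeout_ms,
        # Give up on a read or write that a member never answers.
        "socketTimeoutMS": socket_timeout_ms,
    }
    return "mongodb://{}/?{}".format(",".join(hosts), urlencode(options))


# Hypothetical three-member replica set.
uri = build_replica_set_uri(
    ["db1.example.com:27017", "db2.example.com:27017", "db3.example.com:27017"],
    "rs0",
)
print(uri)
```

A URI built this way can be handed to a driver's client constructor, so that an unresponsive member produces a timeout error the application can handle rather than a request that hangs indefinitely.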

Written by Bryon Elston on January 20, 2013.