Root cause analysis:
The main cause of the incident was the degradation of one of the database servers, which degraded the entire product and affected login, signup, opening the dashboard, and loading boards. Users who were already working on boards could continue their work with a slight decrease in performance. Analysis showed that the degradation was caused by a rare case in which complicated queries were executed on the server. Unfortunately, we did not have triggers in place that could have notified us about the problem in advance.
What has been done:
The database servers have been replaced with more powerful ones. The configuration of the application servers has been optimized to reduce the load on the database servers. We’ve also added monitoring of the relevant parameters so that such problems are detected in time. This case has been added to the load test scripts. A system-wide refactoring of the application to optimize its database operations is still in progress.
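As an illustration only, the kind of trigger mentioned above could be sketched as a threshold check that fires when a metric stays above its limit for several consecutive samples. This is a minimal sketch, not our actual monitoring setup: the metric names (`db_query_p95_ms`, `db_cpu_percent`), the thresholds, and the `Trigger` and `fired_triggers` names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    """A hypothetical alert rule for one metric."""
    metric: str        # metric name (illustrative, not a real metric of ours)
    threshold: float   # value the metric must exceed
    consecutive: int   # how many most recent samples must exceed it (>= 1)


def fired_triggers(triggers, samples):
    """Return the metrics whose last `consecutive` samples all exceed the threshold.

    `samples` maps a metric name to its list of readings, oldest first.
    """
    fired = []
    for t in triggers:
        recent = samples.get(t.metric, [])[-t.consecutive:]
        # Require a full window so a single spike does not fire the trigger.
        if len(recent) == t.consecutive and all(v > t.threshold for v in recent):
            fired.append(t.metric)
    return fired


# Example with made-up numbers.
triggers = [
    Trigger("db_query_p95_ms", threshold=500.0, consecutive=3),
    Trigger("db_cpu_percent", threshold=85.0, consecutive=2),
]
samples = {
    "db_query_p95_ms": [120, 640, 710, 805],
    "db_cpu_percent": [70, 90, 80],
}
result = fired_triggers(triggers, samples)
print(result)
```

Requiring several consecutive samples trades a short detection delay for fewer false alarms from brief spikes.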