On July 4 at 21:00 UTC, a rare technical failure in a microservice affected the WalkMe Editor’s loading mechanism. The failure did not trigger the automatic alerting that would normally have identified the issue and escalated it to WalkMe development teams for faster resolution.
The database continually tracks issues, logs them, and escalates them as required. Settings determine how many of these issues the database can log, and error logs are typically cleared manually after a period of time. The thresholds for this microservice had been set at a level that is no longer appropriate. When the microservice receives too many requests, the database automatically blocks hosts/IPs, which is what left the WalkMe Editor unusable.
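
The blocking behavior described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the actual database or WalkMe implementation: it assumes the block is triggered when the number of logged errors for a host reaches a configured limit, and the names `HostErrorTracker`, `max_errors`, `record_error`, `is_blocked`, and `clear` are hypothetical.

```python
class HostErrorTracker:
    """Toy model: count logged errors per host and block a host at a limit.

    Hypothetical illustration only; the real database logic is not
    described in this report.
    """

    def __init__(self, max_errors: int) -> None:
        if max_errors <= 0:
            raise ValueError("max_errors must be positive")
        self.max_errors = max_errors  # assumed threshold setting
        self.error_counts: dict[str, int] = {}  # host -> errors logged so far

    def record_error(self, host: str) -> None:
        """Log one more error for the given host."""
        self.error_counts[host] = self.error_counts.get(host, 0) + 1

    def is_blocked(self, host: str) -> bool:
        """A host is blocked once its logged errors reach the threshold."""
        return self.error_counts.get(host, 0) >= self.max_errors

    def clear(self, host: str) -> None:
        """Manual clean-up: drop the host's error records, which unblocks it."""
        self.error_counts.pop(host, None)
```

In this model, a threshold that is too low for the current error volume blocks the host quickly, and the host stays blocked until someone clears the records by hand.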
To resolve this issue permanently, we cleared the error records for this microservice and raised the threshold for the number of errors that can be logged. We introduced a new alert that notifies development teams that the error log needs to be cleared manually before the threshold is reached. Finally, we reviewed the thresholds across all microservices to ensure that development teams are alerted appropriately to any similar issue in the future.
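
An alert that fires before a threshold is reached can be sketched as a simple ratio check. This is a minimal illustration, not the alert that was actually deployed: the function name `should_alert` and the 80% warning ratio are assumptions chosen for the example.

```python
def should_alert(error_count: int, threshold: int, warn_ratio: float = 0.8) -> bool:
    """Return True once the error count reaches warn_ratio of the threshold.

    Hypothetical early-warning check: warn_ratio=0.8 is an assumed value,
    meaning the alert fires when the error log is 80% full.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if not 0 < warn_ratio <= 1:
        raise ValueError("warn_ratio must be in (0, 1]")
    return error_count >= warn_ratio * threshold
```

Firing below the hard limit gives the team time to clear the error log before any host is blocked.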