~Engineers solve problems, I solve engineers' problems 🤘
Alerting is a journey, not a destination, and many organizations follow a familiar path.
The natural progression leads to a two-tiered alerting setup: critical alerts (those that require immediate action) and non-critical alerts (handled asynchronously via email or dashboards). While this structure reduces noise, it introduces a new challenge: non-critical alerts are often ignored, leading to a pileup of unresolved issues.
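To make the two tiers concrete, here is a minimal Python sketch of the routing decision. The alert names and the `pager` and `email` channel labels are made up for illustration; a real setup would express this in whatever alerting tool is already in place.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    NON_CRITICAL = "non_critical"


@dataclass(frozen=True)
class Alert:
    name: str
    severity: Severity


def route(alert: Alert) -> str:
    """Return the (hypothetical) channel an alert is delivered to."""
    if alert.severity is Severity.CRITICAL:
        return "pager"  # requires immediate action
    return "email"  # handled asynchronously


# Illustrative alerts only; the names are assumptions, not real rules.
alerts = [
    Alert("database_down", Severity.CRITICAL),
    Alert("disk_usage_above_70_percent", Severity.NON_CRITICAL),
]
routed = {alert.name: route(alert) for alert in alerts}
```

The point of keeping the rule this simple is that every alert lands in exactly one tier, which makes it easy to ask later whether it is in the right one.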
To tackle this, we introduced regular alerting review meetings. These biweekly sessions are our opportunity to refine the alerting system. For critical alerts, we evaluate whether they’re truly urgent enough to justify waking someone up. For non-critical ones, we brainstorm ways to reduce noise: tweaking thresholds, automating the response, or deciding whether the alert is still relevant at all.
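One way to prepare for such a review is to turn alert history into a short agenda. The sketch below is only an illustration of that idea, not the tooling we use: the `AlertStats` fields, the `noisy_threshold` default, and the sample numbers are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertStats:
    name: str
    critical: bool
    fired: int     # times the alert fired since the last review
    acted_on: int  # times someone actually took action


def review_agenda(stats, noisy_threshold=10):
    """Pick the alerts worth discussing, with the question to ask."""
    agenda = []
    for s in stats:
        if s.fired == 0:
            agenda.append((s.name, "never fired: is it still relevant?"))
        elif s.critical and s.acted_on < s.fired:
            agenda.append((s.name, "paged without action: is it truly urgent?"))
        elif not s.critical and s.fired >= noisy_threshold:
            agenda.append((s.name, "noisy: tune the threshold or automate?"))
    return agenda


# Made-up history for illustration.
stats = [
    AlertStats("database_down", critical=True, fired=2, acted_on=2),
    AlertStats("high_cpu", critical=True, fired=8, acted_on=1),
    AlertStats("disk_usage_above_70_percent", critical=False, fired=40, acted_on=3),
    AlertStats("legacy_queue_depth", critical=False, fired=0, acted_on=0),
]
agenda = review_agenda(stats)
```

With this sample data, `database_down` stays off the agenda because every page led to action, while the other three each come with a question for the meeting.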
This iterative process keeps alerting actionable and aligned with our evolving priorities, striking a balance between responsiveness and sanity.