A practical approach to alerting
I started working in tech ten years ago at a three-person shop. Ten years later the team is 100+ people and, unusually for tech, I stuck around the whole time. I watched the team evolve and professionalize, and the stack grow more complex as clients grew.
This is a chronicle of how monitoring evolved as the team grew, the challenges we faced along the way, and the things that worked.
The culture of the team was to limit dependencies on big tech, so we started out deploying on Linode VPSes and ended up managing a full-fledged Kubernetes cluster in which every account gets its own namespace, which means its own database, Redis, RabbitMQ, and so on.
When monitoring was simple
Ten years ago, monitoring felt simple.
It boiled down to an ELK stack, plus a custom logging setup where critical logs were sent over email. This was very indie, but at that scale it worked really well.
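To give a flavor of how low-tech this was, the whole thing amounted to something in the spirit of the Python sketch below (a reconstruction for illustration, not our actual code; hosts, addresses, and credentials are placeholders): attach the standard library’s SMTPHandler to the logger so anything at ERROR level or above lands directly in an inbox.

```python
import logging
import logging.handlers

# Minimal sketch of "critical logs go to email": an SMTPHandler that only
# forwards ERROR-and-above records. Host, addresses and credentials are
# placeholders, not a real setup.
logger = logging.getLogger("app")
logger.setLevel(logging.INFO)

mail_handler = logging.handlers.SMTPHandler(
    mailhost=("smtp.example.com", 587),
    fromaddr="alerts@example.com",
    toaddrs=["oncall@example.com"],
    subject="[ALERT] critical log from app",
    credentials=("alerts@example.com", "app-password"),
    secure=(),  # empty tuple: upgrade the connection with STARTTLS
)
mail_handler.setLevel(logging.ERROR)  # only ERROR and CRITICAL get emailed
logger.addHandler(mail_handler)

logger.error("Nightly booking import failed")   # this one goes out by email
logger.info("Nightly booking import started")   # this one does not
```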
As time went on, the team and the accounts grew and we had to professionalize the setup: Kubernetes, Prometheus, and Grafana changed the way we worked. Prometheus was fast becoming the de facto standard in the industry, and every open source tool shipped with its own exporter or OpenMetrics endpoint. This felt great: you would install Postgres and get Grafana dashboards and built-in alerts about WAL, hot replication, and so on, for free.
Observability was the word. We could see how each service was behaving at the deepest level, and our alerting evolved organically around these prebuilt alerts. When something fired, it would trigger a Slack message in a dedicated channel, a pattern that is common among startups.
The paradox
Though we were highly equipped and engineers built Grafana dashboards for nearly everything, here’s what I kept seeing: smart engineers glued to Grafana, dozens of alerts firing in Slack, everyone half-watching dashboards.
Meanwhile, a major feature is broken and nobody notices for hours.
How? Because when everything is alerting, nothing is anyone’s job. Responsibility gets diluted. The signal-to-noise ratio is so bad that people stop trusting the system. They start filtering out alerts mentally, waiting for someone else to look.
The tricky thing was that these alerts were not false positives per se. They were justified from the perspective of the components they came bundled with. The answer to “what should I do about this alert?” was not “this is a false positive”; it was “yeah, we know that happens sometimes, but it’s at the bottom of the priority list”.
This made it hard to unwire these alerts. Deep down, turning them off meant admitting we could not fully manage the system, and engineers are unconsciously programmed not to cut corners when it comes to monitoring.
The situation was paradoxical: we were equipped with state-of-the-art tech, yet less reactive than when we received critical logs via email.
In most engineers’ minds the pyramid of alerts goes: built-in alerts → health checks → business alerts. We never reached the “business alerts” stage, the one most valuable to me, because we were drowning in stages 1 and 2.
I decided to flip the pyramid around: start with business alerts, and move down the funnel.
The question that changed everything
I started asking a different question. Not “is this service healthy?” but: what does a working app actually look like? What things, if they were to break, would make you stop everything you’re doing to investigate?
For the systems I worked on — revenue management for train companies — the answer was concrete:
- Are we receiving fresh booking data?
- Are we sending optimized prices?
- Are customers’ main workflows working?
That’s it. Not “is Redis responding in under 5ms” or “is the message queue consumer lagging slightly.” Those are symptoms. I wanted to alert on outcomes.
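To make the first question concrete, here is a minimal sketch of what a business alert on booking freshness could look like (the DSN, table, column names, and the 30-minute threshold are all illustrative, not our production values): check when the last booking actually arrived, and page only when that is older than what users would tolerate.

```python
from datetime import datetime, timedelta, timezone

import psycopg2  # assumed driver; any database client works the same way

# Hypothetical threshold: if no booking has landed in the last 30 minutes,
# something user-visible is wrong upstream (feed, import job, or API).
MAX_BOOKING_AGE = timedelta(minutes=30)

def latest_booking_at(dsn: str) -> datetime | None:
    """Return the timestamp of the most recent ingested booking (or None)."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # `bookings.ingested_at` is a placeholder, assumed timezone-aware.
        cur.execute("SELECT max(ingested_at) FROM bookings")
        (ts,) = cur.fetchone()
        return ts

def booking_feed_is_healthy(dsn: str) -> bool:
    ts = latest_booking_at(dsn)
    if ts is None:
        return False  # empty table: definitely not receiving data
    return datetime.now(timezone.utc) - ts < MAX_BOOKING_AGE

if __name__ == "__main__":
    if not booking_feed_is_healthy("postgresql://app@db/bookings"):
        # Page here: Slack webhook, PagerDuty, email... whatever wakes someone up.
        print("ALERT: no fresh booking data in the last 30 minutes")
```

Note that the check says nothing about Redis latency or queue lag; it only answers “are we receiving fresh booking data?”, which is exactly the question on the list above.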
The framework
Here’s what I implemented with my teams:
0. Shut down all existing alerts.
This was radical and very controversial at the time. But I felt we had to start from a clean sheet. Keeping the prebuilt alerts felt like a weight we would have to carry.
1. Define “working” together, in writing.
Get the team in a room. Write down precise bullet points describing what a functioning app looks like from the end user’s perspective. No infrastructure metrics, just outcomes. Each bullet point should map to one or more alerts.
2. Draft alerts that answer only that question.
Each alert should map to something a user would notice. “Bookings aren’t being processed” is an alert. “CPU at 80%” is not.
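Since Prometheus was already in place, one way to wire such an alert is to publish the business signal itself as a metric and let the alert rule read in user terms. The sketch below uses the prometheus_client library; the metric name, port, and check function are hypothetical, not our actual code.

```python
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical gauge: seconds since the last booking was processed.
# The alert rule fires only when this crosses a user-visible threshold,
# e.g. booking_last_processed_age_seconds > 1800 ("no bookings for 30 min"),
# which maps directly to "bookings aren't being processed".
BOOKING_AGE = Gauge(
    "booking_last_processed_age_seconds",
    "Seconds since the last booking was successfully processed",
)

def seconds_since_last_booking() -> float:
    """Placeholder: in reality this would query the bookings store."""
    return 42.0

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus to scrape
    while True:
        BOOKING_AGE.set(seconds_since_last_booking())
        time.sleep(30)
```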
3. Be ruthless on review.
Every time an alert fires, classify it:
- Was this a real problem for the end user?
- If yes, good. Keep it.
- If no, you have two choices: delete the alert or relax the threshold.
No middle ground. No “well, it could have caused issues in an hour.” That’s not what we’re optimizing for.
The goal is to detect problems when they happen, not when they might happen.
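The classification in step 3 is deliberately mechanical. As a sketch (the names below are mine, not a tool we used), every firing reduces to exactly one of three actions:

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    KEEP = "keep"
    RELAX_THRESHOLD = "relax threshold"
    DELETE = "delete"

@dataclass
class Firing:
    alert: str
    real_user_impact: bool     # did an end user actually feel this?
    threshold_too_tight: bool  # would a looser threshold have kept it quiet?

def review(firing: Firing) -> Action:
    # The whole rule in three lines: real problem -> keep;
    # otherwise relax the threshold or delete. No "keep, just in case".
    if firing.real_user_impact:
        return Action.KEEP
    return Action.RELAX_THRESHOLD if firing.threshold_too_tight else Action.DELETE

assert review(Firing("bookings stale > 30m", True, False)) is Action.KEEP
assert review(Firing("CPU > 80%", False, False)) is Action.DELETE
```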
Why ruthless matters
This is the hard part. Engineers want to be proactive. We’re trained to anticipate problems, to catch things early. That instinct is good in code review. It’s poison for alerting.
Every “just in case” alert you keep is one more notification someone will learn to ignore. It’s alert creep. And once you’re back in the noise, you’ve lost.
So you have to be ruthless. If an alert fired and it wasn’t a real problem, it goes. Every time. No exceptions.
The uncomfortable truth
The sophisticated setup (50 dashboards, alerts on every metric, Slack channels full of warnings) felt like good engineering. It felt like we were on top of things.
The simple setup (a handful of alerts that only fire when users are actually affected) felt almost embarrassingly basic.
But the simple one worked. We caught real outages faster. On-call became sustainable. People trusted the alerts again because when one fired, it meant something.
What about proactive monitoring?
We saw this as a first step. The plan was to stabilize on business alerts, then layer back in proactive monitoring — the “CPU is trending high, you might have a problem in an hour” kind of stuff.
We never did.
Not because we forgot, but because the simple setup turned out to be good enough. With the 80/20 rule in mind, proactive alerts felt like 80% effort for 20% output. We were catching real incidents fast, on-call was sustainable, and the team had bandwidth to actually fix root causes instead of chasing warnings.