Error Budgets Are a Management Tool, Not an Engineering One

June 09, 2026 / Mikael Danielian

I have watched a lot of error budgets die. Not loudly. They die quietly, in a dashboard nobody opens, with a number that goes red and changes nothing.

The pattern is always the same. A good engineer reads the Google SRE book, gets excited, defines an SLO, wires up the burn tracking, and presents it in a sprint review. Everyone nods. The dashboard goes live. For two months it works because nothing is on fire. Then the budget gets exhausted in the same week product wants to ship the big launch. And the launch ships anyway.

That is the moment the error budget dies. Not because the engineering was wrong. Because the budget never had any authority behind it.

Here is the thing nobody tells you when you start this journey: an error budget is not an engineering artefact. It is a management decision rule. The maths is the easy part. The hard part is that the number has to be able to stop a launch, reassign a team, and override a roadmap that an executive already promised the board. If it cannot do those things, you do not have an error budget. You have a chart.

What an error budget actually is

Quick version for anyone who needs it, then we move on.

You pick a target for how reliable a service should be. Say 99.9% of requests succeed over 30 days. That is your SLO. The flip side — the 0.1% you are allowed to fail — is your error budget. It is a permission to be imperfect, expressed as a hard number.

Over a 30-day window, 99.9% leaves you about 43 minutes of full downtime, or the equivalent spread across a much longer period of slightly degraded service. That 43 minutes is real budget. You can spend it on risky deploys, on experiments, on shipping fast. Or you can burn it on incidents and bad releases you did not plan for.

The whole point is the tradeoff. Reliability and feature velocity are in direct competition, and the error budget is the currency you trade between them. If you have budget left, you should be shipping aggressively — sitting on a full budget means you are being too cautious. If you have blown the budget, you stop shipping and you fix reliability. That is the deal.

It is a genuinely elegant idea. And on its own it does almost nothing.

Why bottom-up adoption dies

If you have read the earlier posts in this series, you know I am a fan of letting engineering teams own their reliability work. SLIs, runbooks, on-call rotation, blameless postmortems — that stuff lives best with the people closest to the code. So it feels natural to assume error budgets belong there too.

They do not. And here is why bottom-up adoption keeps failing.

An error budget only does its job at the exact moment it is most inconvenient. It has to say "no" when product most wants a "yes". And an engineering team does not have the authority to say that no and make it stick.

Picture the real conversation. The budget is exhausted. A senior engineer goes to the product lead and says we should freeze features and spend the next two weeks on reliability. The product lead has a launch date that the VP of Sales committed to a customer. Who wins that argument? Every single time, in every company I have worked with, the launch wins. The engineer gets overruled, the budget gets ignored, and everyone learns a quiet lesson: the error budget is advisory.

Once it is advisory, it is dead. People are not stupid. They stop wiring up SLOs because they have seen that the output changes nothing. Why invest effort in a number that gets overridden the first time it is inconvenient?

There are three specific failure modes I see again and again:

  • No teeth. The budget has no policy attached. Going red triggers a Slack message and nothing else. There is no rule that says what must happen.
  • Wrong owner. The person who can declare the budget blown is the same person whose deadline gets hit by it. That is a conflict of interest baked into the org chart.
  • No executive air cover. When the budget says freeze and a director says ship, there is nobody senior enough who has pre-committed to backing the budget. So it folds.

All three are organisational problems. None of them are fixed by better dashboards.

The budget has to change what leadership does

This is the whole argument, so I will state it plainly.

An error budget works if and only if burning it changes a decision that leadership owns. Roadmap. Headcount allocation. Launch go/no-go. If the budget cannot reach into those decisions, it is decoration.

That means the budget cannot be introduced by engineering and merely blessed by management. It has to be owned by management. The CTO, the VP of Engineering, the head of product — someone at that level has to stand up and say: this number governs what we ship, and I will personally enforce it, including against my own roadmap.

I have only ever seen error budgets work when a senior leader made that commitment out loud, in front of the people who would later want to override it. The commitment is the product. The dashboard is just how you measure compliance with it.

So the real work is not technical. It is getting an executive to agree, in advance and in writing, that they will let a number override their own plans on a bad month. That is a big ask. Let me show you how to actually get there.

The policy: what "exhausted" must trigger

Vague policies do not survive contact with a deadline. You need a written rule with hard thresholds and an automatic consequence. Here is the policy I have rolled out, and it is deliberately blunt.

Budget healthy (more than 50% remaining): ship freely. Take risks. This is explicit permission to move fast. If a team is sitting on a healthy budget and being timid, that is also a problem worth flagging — you are leaving velocity on the table.

Budget low (under 25% remaining): a warning state. New risky changes need a named approver. Reliability work gets prioritised into the next planning cycle. No freeze yet, but the team is on notice and leadership has been told.

Budget exhausted (zero or burning faster than the window allows): feature freeze. Full stop. No new feature work ships to that service until the budget is restored to a healthy level. The team's capacity is reallocated to reliability — fixing the actual causes, not patching symptoms. The freeze lifts when the budget recovers, not when someone decides they have waited long enough.

The freeze is the part with teeth, and it is the part everyone wants to negotiate away. So write down the exceptions in advance, because there will be exceptions and you want them to be rare and explicit rather than improvised under pressure.

A reasonable exception clause: the only changes that ship during a freeze are ones that improve reliability, or genuine security and compliance fixes. Overriding the freeze for a feature requires a named executive to sign off, in writing, and that override gets logged and reviewed at the next exec meeting. Make the override possible but expensive and visible. If overriding the freeze is easy and invisible, you are back to advisory, and the budget is dead again.

Wiring it into the decisions leadership already makes

You do not get adoption by adding a new meeting. You get it by attaching the budget to meetings and decisions that already happen and already have authority. Three places matter.

Roadmap and planning

Every planning cycle should open with budget status for each critical service, before anyone discusses new features. The order matters. If a service is out of budget, its reliability work goes into the plan first, and feature capacity for that service is what is left over. This makes the tradeoff visible at the exact moment capacity is allocated, which is the only moment it can actually be changed. Done this way, reliability stops being the thing that gets bumped and becomes a line item with the same standing as features.

Executive reviews

Budget status belongs in whatever your regular leadership review is — the monthly business review, the ops review, whatever you call it. One slide. Green, amber, red per service, plus burn trend. The point is not the slide. The point is that executives are now accountable for the number in front of their peers. When a service is chronically red, that is a conversation the responsible VP has to have at the top table. That accountability is what gives the freeze its backbone, because the same leaders who will enforce the freeze are the ones being measured on it.

Feature-freeze conversations

When the budget is blown and the freeze triggers, the conversation should already be settled. That is the entire benefit of writing the policy in advance. Nobody is debating whether to freeze in the heat of a missed deadline — the rule decided that weeks ago, in calm conditions, with leadership's signature on it. The only live question is the narrow exception process. You have moved the hard decision out of the pressured moment and into a cool one. That is what a good policy buys you: not the absence of pressure, but a pre-agreed answer to it.

The org buy-in you actually need

None of this works without a specific set of agreements locked down before you turn anything on. From experience, the minimum is:

  1. An executive sponsor who owns the policy — not a champion, an owner — and who will enforce the freeze against their own roadmap when it comes to it.
  2. A written, public policy with the thresholds and the freeze rule, agreed by both engineering and product leadership. Both signatures. Product has to own this as much as engineering, or it becomes engineering versus product the first time it bites.
  3. A small number of SLOs that genuinely matter. Pick the two or three services where downtime hurts the business. Do not boil the ocean. A budget on a service nobody cares about teaches everyone to ignore budgets.
  4. A logged, visible override path so exceptions are possible but accountable.

Get those four and the budget has teeth. Miss the first one and nothing else matters — you will have built a very rigorous chart.

The bottom line

Error budgets are not hard because the maths is hard. The maths is trivial. They are hard because they are a way of forcing an organisation to make an honest, explicit tradeoff between shipping and reliability — and most organisations would rather keep that tradeoff implicit so they can ship now and pay later.

If you are an engineer trying to introduce one from the bottom up, I will save you some pain: you cannot, not on your own. What you can do is build the measurement cleanly and then go find the executive who will own the policy. Hand them a number that, when it goes red, forces a decision they have already agreed to make.

That is the real test. Not "do we have an SLO?" The real test is: the next time the budget is blown the same week as a big launch, what happens? If the answer is "the launch ships and the budget gets ignored," you do not have an error budget yet. You have homework. The number has to be able to stop the launch. Build the authority before you build the dashboard, and the rest follows.

Hit like if you enjoyed this post!