SRE Org Design: Centralized, Embedded, or Platform?

Most companies do not choose an SRE org model. They back into one.
Someone gets tired of being paged for a service they did not write. A staff engineer who is good at reliability gets quietly turned into a one-person SRE team. A platform group forms because three teams built three different deployment pipelines and somebody had to stop the bleeding. None of this is a decision. It is sediment.
That works until it doesn't. Around the point where you have more than a handful of product teams and reliability stops being one person's hobby, the shape of your SRE function starts to matter more than the talent inside it. The wrong shape will burn out good engineers, hide your real risks, and quietly slow every team down.
There are three models worth knowing: centralized, embedded, and platform. Each one solves a real problem. Each one breaks in a specific, predictable way. The job is not to find the "best" one. It is to know which problem you actually have, and to recognise when you have outgrown the model you are running.
Centralized: one team owns reliability
In the centralized model, there is a single SRE team. They own the on-call rotation for production, they own the incident process, they own the dashboards, and in the strong version of this model they own whether a service is allowed to ship.
This is where almost everyone starts, and for good reason. When you have five or ten services and a couple of dozen engineers, you cannot afford reliability specialists scattered across every team. You concentrate them. One team builds the monitoring, sets the standards, runs the postmortems, and carries the pager.
When it works
It works when the surface area is small enough for one team to actually hold in their heads.
A central SRE team can be genuinely excellent when the number of services is low and the architecture is reasonably uniform. They develop deep expertise. They see every incident, so they spot patterns no single product team would ever notice. They enforce consistency: one logging standard, one alerting philosophy, one way to do runbooks. For a company under roughly fifty engineers, this is usually the right answer and I would not talk anyone out of it.
It also gives you a clean answer to "who owns reliability," which matters more than people admit. Diffuse ownership is how things rot.
When it fails
It fails at scale, and it fails through a specific mechanism: cognitive load.
A central team that is responsible for everything is responsible for nothing in depth. When you have forty services across twelve product teams, no central SRE can understand all of them well enough to debug a 3am page. They become a routing layer: a group of tired people who get woken up, realise they do not understand the failing service, and escalate to the team who actually wrote it. You have added a hop, not removed a problem.
The other failure is organisational. The central team becomes a bottleneck and then a villain. Every team needs them to ship. They cannot say yes fast enough, so they become the department of no. Meanwhile the product teams have offloaded reliability entirely ("that's SRE's job"), which means nobody who writes the code feels the pain of operating it. That gap is where bad systems get built.
The warning signs are easy to name. SRE is in the critical path of every release. The same five engineers are on every major incident regardless of which service broke. People talk about "throwing it over the wall to SRE." Your most reliable systems are the ones SRE happens to like.
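One of those signs, the same few engineers turning up on every incident, can be checked against your incident history rather than argued from memory. What follows is a minimal sketch, not a real tool: the record shape, the names in INCIDENTS, and the 75% threshold are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical incident records: (service that broke, engineers who responded).
INCIDENTS = [
    ("payments", {"ana", "bo"}),
    ("search", {"ana", "bo", "cy"}),
    ("video", {"ana", "dee"}),
    ("payments", {"ana", "bo"}),
]

def recurring_responders(incidents, min_share=0.75):
    """Map each engineer who responded to at least `min_share` of all
    incidents to (share of incidents, number of distinct services)."""
    counts = defaultdict(int)
    services = defaultdict(set)
    for service, responders in incidents:
        for person in responders:
            counts[person] += 1
            services[person].add(service)
    total = len(incidents)
    return {
        person: (counts[person] / total, len(services[person]))
        for person in counts
        if counts[person] / total >= min_share  # assumed threshold, tune to taste
    }

print(recurring_responders(INCIDENTS))
```

If the people this returns span most of your services, the central team is acting as a routing layer rather than as owners with depth.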
Embedded: SREs inside product teams
In the embedded model, you take your SREs and you put them inside the product teams. Each squad has one or two reliability-minded engineers who sit in the same standups, share the same backlog, and carry the pager for that team's services alongside the developers who built them.
This directly attacks the cognitive load problem. The embedded SRE only has to understand their team's services. They are in the room when design decisions get made, so reliability is baked in rather than bolted on. And on-call burden is shared with the people who actually wrote the code, which is the single healthiest thing you can do to an on-call rotation.
When it works
It works when your teams are large enough to absorb a specialist and your services are genuinely different from one another.
If your payments team and your search team and your video team have almost nothing in common architecturally, a central team trying to cover all three is a fantasy. Embedded SREs let each team go deep. Blast radius shrinks too, because an incident in one team's services is contained by people who understand exactly that domain, and they are not also distracted by an unrelated outage somewhere else.
The cultural win is real. When the person who can fix the reliability problem sits next to the person who caused it, the feedback loop is hours, not sprints.
When it fails
It fails through fragmentation and through people.
Fragmentation first. Twelve embedded SREs in twelve teams will build twelve subtly different ways of doing everything. Twelve alerting setups. Twelve definitions of what "down" means. Knowledge stops flowing. The thing a central team was good at, seeing patterns across the whole estate, disappears entirely. You trade a bottleneck for a thousand silos.
Then the people problem, which is worse and which nobody warns you about. A lone SRE embedded in a product team is isolated. They have no peers who do what they do. They get pulled into feature work because the team is under pressure and reliability is invisible until it breaks. They have no career ladder that makes sense, no one senior in their discipline to learn from. Your best reliability engineers will quietly leave, and you will not understand why until the third one resigns.
The warning signs: every team's monitoring looks different and none of it is portable. Incident reviews from different teams read like they came from different companies. Your embedded SREs never talk to each other. And the moment one of them goes on holiday, that team's reliability knowledge goes with them.
Platform: build the paved road
The platform model is the most mature and the most often misunderstood. Here, SRE does not run your services and does not sit in your team. SRE builds the paved road: the deployment pipeline, the observability stack, the standard runbook tooling, the autoscaling, the SLO framework, the incident tooling. Product teams run their own services on top of it.
The deal is explicit. The platform team gives you golden paths that make the reliable thing the easy thing. In exchange, you own your services. You carry your own pager. You write your own runbooks. The platform makes that bearable; it does not do it for you.
This is the model that scales, and it is the one large engineering organisations converge on for a reason.
When it works
It works when you have enough teams that building leverage beats providing labour.
Below a certain size, a platform team is overkill; you are building tooling for an audience of three. Above it, the maths flips hard. One platform engineer who ships a better deployment pipeline improves reliability for forty teams at once. That is leverage no embedded or central model can match. Cognitive load is handled structurally: the platform absorbs the generic hard parts (how do I roll back safely, how do I get a dashboard, how do I define an SLO) so each product team only carries the cognitive load of its own domain.
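To make "the maths flips" concrete, here is a rough back-of-envelope sketch. Every number in it is an assumption chosen for illustration (one platform engineer costing 40 hours a week, each team saving 2 hours a week of operational toil); plug in your own figures before drawing conclusions.

```python
import math

def weekly_net_hours(teams, hours_saved_per_team, platform_hours):
    """Hours saved across all teams minus hours spent building the platform."""
    return teams * hours_saved_per_team - platform_hours

def break_even_teams(hours_saved_per_team, platform_hours):
    """Smallest number of teams at which the platform pays for itself."""
    return math.ceil(platform_hours / hours_saved_per_team)

# Assumed figures, not measurements.
PLATFORM_HOURS = 40.0   # one platform engineer, per week
SAVED_PER_TEAM = 2.0    # toil each product team avoids, per week

print(weekly_net_hours(3, SAVED_PER_TEAM, PLATFORM_HOURS))    # audience of three
print(weekly_net_hours(40, SAVED_PER_TEAM, PLATFORM_HOURS))   # forty teams
print(break_even_teams(SAVED_PER_TEAM, PLATFORM_HOURS))
```

Under these assumptions the platform loses 34 hours a week with three teams, gains 40 with forty teams, and breaks even at twenty. The exact crossover matters less than the shape: cost is roughly fixed, benefit grows with every team that adopts.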
Blast radius is the quiet superpower here. A good platform makes safe deploys, progressive rollouts, and fast rollback the default, so the average incident is smaller before anyone is even paged. You are buying reliability at the infrastructure layer instead of paying for it one heroic incident at a time.
When it fails
It fails in two ways, and I have seen both.
The first is building a platform nobody asked for. A platform team that operates without product teams as customers builds beautiful, abstract infrastructure that solves problems no one has. The road is paved; it just does not go anywhere people want to drive. The discipline that prevents this is brutal and simple: the platform is a product, and product teams are your customers. If adoption is voluntary and nobody adopts, you have failed, no matter how elegant the code.
The second failure is the abdication fantasy. Leadership hears "product teams run their own services" and decides on-call expertise will simply materialise. It will not. The platform lowers the floor; it does not eliminate the need for engineers who can operate software. Hand a team a paved road and zero operational skill and you get teams confidently driving off a cliff with excellent telemetry.
The warning signs: platform adoption is mandated rather than chosen. The platform team has not spoken to a product engineer in a month. Product teams are building shadow tooling to route around the platform. Or, conversely, product teams treat the platform as a magic box and have no idea how their own service fails.
How orgs actually migrate
Almost nobody designs the end state up front. You evolve, and there are two common paths out of centralized.

Centralized to platform is the path most large product organisations take. You start with a central SRE team. They get overloaded. Instead of cloning themselves into every product team, they reframe their work as building leverage, turning the things they used to do by hand into self-service tooling. The on-call rotation pushes outward to the teams; the central group becomes a platform group. This is the cleanest evolution and the one I would steer most companies toward, because the reliability headcount it needs grows sub-linearly with the number of product teams.
Centralized to embedded is the other path, and it usually happens when your services diverge faster than they converge. If every team's domain is genuinely different, deep specialists embedded in teams beat generic tooling. Fintech, healthcare, anything with hard domain-specific reliability requirements tends to drift this way.
In practice, big organisations end up with a hybrid, and that is fine. A platform team builds the paved road. The largest or highest-stakes product teams get an embedded SRE on top of the platform. A small central group still owns company-wide incident command and the things that genuinely have to be uniform: security response, the status page, the major-incident process. This is not indecision. It is matching the model to the blast radius and cognitive load of each part of the org.
The mistake is migrating for fashion rather than for pain. You move models when the current one is actively hurting you: when the central team is the bottleneck on every release, when the embedded SREs are quitting, when the platform is gathering dust. Not because a conference talk told you platform engineering is the future.
The bottom line
Pick the model that matches your size and the problem in front of you.

Small and uniform, centralize. Large with divergent domains, embed. Large and scaling, build a platform and push on-call out to the teams who own the code. Most companies will travel from the first to the third over their lifetime, and most will end up with a pragmatic hybrid rather than a pure form of anything.
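That rule of thumb is simple enough to write down as a function. Treat this as a sketch of the heuristic, not a decision engine: the Org type and suggest_model are hypothetical names, the fifty-engineer cut-off comes from the "roughly fifty" figure earlier in this post, and reducing "divergent domains" to a single boolean is a deliberate simplification.

```python
from dataclasses import dataclass

@dataclass
class Org:
    engineers: int
    domains_diverge: bool  # do teams' architectures have little in common?

def suggest_model(org, small_org_limit=50):
    """Return the starting-point model for an org under the post's heuristic."""
    if org.engineers < small_org_limit:
        return "centralized"   # small enough for one team to hold in its head
    if org.domains_diverge:
        return "embedded"      # deep specialists beat generic tooling
    return "platform"          # leverage beats labour at scale

print(suggest_model(Org(engineers=30, domains_diverge=False)))
print(suggest_model(Org(engineers=400, domains_diverge=True)))
print(suggest_model(Org(engineers=400, domains_diverge=False)))
```

A real organisation will rarely get one clean answer for the whole company. Running the same questions per part of the org is closer to how the hybrid actually emerges.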
The three questions that actually decide it are always the same. Where does the blast radius of a failure land, and who is best placed to contain it? Whose cognitive load are you trying to protect? And who carries the pager at 3am? That is who truly owns reliability, no matter what the org chart says.
Get those three right and the model mostly designs itself. Get them wrong and no reorg will save you.