Why high availability architecture matters more than your SLA

An SLA tells you how fast someone will pick up the phone. Good architecture means the phone never rings.

Alan DeKok, CEO, InkBridge Networks

What is high availability architecture?

High availability architecture is the design of a system so that it keeps running even when something goes wrong - whether that’s a hardware failure, a botched upgrade, a security incident, or a power outage in the wee hours. A highly available system has redundancy built in, automated recovery mechanisms, and monitoring that catches problems before they become outages.

In networking, high availability architecture typically means: no single point of failure, automated failover between redundant components, deployment scripts that rebuild systems reliably without human intervention, and ongoing log analysis that surfaces issues before users notice them.

That last part - the ongoing piece - is where most deployments fall short. I’ll come back to it.

Why enterprises lead with SLAs - and why that’s the wrong question

When enterprises come to us looking for network authentication support, they often press us on service levels. What’s your SLA? Four 9s of uptime? Two-hour response times? Guaranteed resolution within 24 hours?

I understand why. SLAs feel concrete. They’re something you can put in a contract, something you can point to if things go wrong. They set expectations and, at minimum, give you grounds for a complaint when a vendor doesn’t perform.

But an SLA is really a security blanket. What you’re buying is the reassurance that someone has agreed to pick up the phone. The SLA doesn’t fix your architecture. And if your architecture is bad, it will go down again and again - regardless of how fast your vendor responds.

We accept reasonable SLAs for our support customers. But I’ll be direct: we’re cautious about resolution SLAs, because in our experience, many outages are not caused by the RADIUS server. They’re caused by something upstream - a network change made during the night shift, an upgraded BNG (broadband network gateway) with different behaviour, a third-party system that changed without notice. We work in good faith on root cause analysis, and we propose fixes and workarounds. But no vendor can promise a resolution time for a problem that lives in someone else’s product.

An SLA for response time: reasonable. An SLA for resolution: be suspicious of anyone who agrees to that without asking a lot of questions first.

What a 2017 database incident tells us about architecture

In January 2017, GitLab suffered a significant database outage. An engineer, working at 2 a.m. to resolve a database replication problem, accidentally deleted the primary database instead of the secondary. It turned out that two separate backup systems had both been failing, unnoticed, for months. GitLab ultimately recovered - but only because they found a partially usable backup from a different source.

GitLab is a code-hosting provider, not a network authentication vendor. But the lesson applies directly to any infrastructure team managing critical systems.

An SLA wouldn’t have addressed this. The problem was that they had no tested recovery process. No automated scripts. No way for an engineer, exhausted and under pressure at 2 a.m., to simply run a documented procedure and watch it succeed.
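One way to keep backup failures from going unnoticed for months is a scheduled check that the newest backup is both recent and plausibly sized. What follows is a minimal sketch, not a description of GitLab's tooling or of ours. The function name, the `*.dump` file pattern, and both thresholds are hypothetical and would need to match your own backup job.

```python
import time
from pathlib import Path


def check_latest_backup(backup_dir, max_age_hours=26.0, min_bytes=1024, now=None):
    """Return (ok, message) for the newest *.dump file in backup_dir.

    The *.dump pattern and both thresholds are illustrative assumptions.
    """
    now = time.time() if now is None else now
    # Sort by modification time so the last entry is the most recent backup.
    backups = sorted(Path(backup_dir).glob("*.dump"), key=lambda p: p.stat().st_mtime)
    if not backups:
        return False, "no backup files found"
    newest = backups[-1]
    info = newest.stat()
    age_hours = (now - info.st_mtime) / 3600
    if age_hours > max_age_hours:
        return False, f"{newest.name} is {age_hours:.1f} hours old"
    if info.st_size < min_bytes:
        return False, f"{newest.name} is only {info.st_size} bytes"
    return True, f"{newest.name} looks healthy"
```

A check like this only proves that a file exists and is fresh. It does not prove the backup can be restored; that still takes a regular, scripted restore test.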

I have thought about that scenario many times. At 2 a.m., I am not at my best. I am not going to design a recovery process under pressure and execute it correctly. But I can read a script called “restore secondary database from primary” and run it. I can watch the output and verify it completed successfully. If I was smart at 10 a.m. with a cup of coffee, and I wrote that script properly, and I tested it - then I can run it at 2 a.m. without making the problem worse.

That’s what automation does. You write the process when your brain is working. You run the script when it isn’t.
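To make that concrete, here is a rough sketch of the safety checks such a script might run before it touches anything. It is illustrative only: `plan_restore`, the role names, and the step descriptions are all hypothetical, and the steps are plain-language placeholders, not real commands. The point is that the script itself refuses to wipe a primary, so the tired engineer does not have to remember which host is which.

```python
def plan_restore(target_host, roles, confirm_host):
    """Return the ordered steps for rebuilding a secondary, or raise.

    `roles` maps host name -> "primary" or "secondary". A real script would
    query this from the database itself; passing it in is an assumption
    made to keep the sketch self-contained.
    """
    role = roles.get(target_host)
    if role is None:
        raise ValueError(f"unknown host: {target_host}")
    if role != "secondary":
        raise PermissionError(f"refusing to wipe {target_host}: role is {role}")
    # Make the operator type the host name again before anything destructive.
    if confirm_host != target_host:
        raise PermissionError("confirmation does not match target host")
    primaries = [host for host, r in roles.items() if r == "primary"]
    if len(primaries) != 1:
        raise RuntimeError(f"expected exactly one primary, found {len(primaries)}")
    source = primaries[0]
    # Placeholder descriptions, not real commands.
    return [
        f"stop database on {target_host}",
        f"move old data directory aside on {target_host}",
        f"copy base backup from {source} to {target_host}",
        f"start database on {target_host}",
        f"verify replication from {source} to {target_host}",
    ]
```

Calling `plan_restore("db2", {"db1": "primary", "db2": "secondary"}, confirm_host="db2")` returns the five steps in order, while pointing it at `db1` raises an error before anything is deleted.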

What good high availability architecture looks like

In our experience, a RADIUS deployment built for high availability has four non-negotiable elements.

Full documentation. Every configuration decision should be written down: what the settings are, and why. When an engineer who didn’t build the system has to diagnose a problem at an inconvenient hour, documentation is the difference between a controlled recovery and a guess.

Full automation. No manual installation steps. No “remember to do X before you do Y.” Every deployment, every upgrade, every rebuild should be scriptable and repeatable. If something breaks in a weird way, the answer should be: run the rebuild script and bring up a clean system in minutes.

Redundancy and failover. No single point of failure in a production authentication system. For ISP-scale deployments especially, this means multiple RADIUS servers, properly designed failover, and replication that is configured AND tested. The architecture required depends on the size and structure of the network.

Ongoing monitoring. A well-configured system still needs to be watched. Log files should be analysed on a continuous basis. Anomalies should surface before they become outages. This is not a “set it and forget it” situation - and if you’re treating it that way, the first sign of a problem will be users calling to say they can’t connect. Our blog post on ongoing RADIUS maintenance covers why this discipline matters beyond the initial deployment.
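As one small illustration of continuous log analysis, the sketch below flags time windows in which the share of rejected authentications is unusually high. It assumes the log lines have already been parsed into `(unix_timestamp, outcome)` pairs; that input shape, the function name, and the default thresholds are hypothetical, and real RADIUS logs will need their own parser.

```python
from collections import Counter


def reject_rate_alerts(events, window_seconds=300, threshold=0.2, min_events=20):
    """Return [(window_start, reject_rate)] for windows above the threshold.

    `events` is an iterable of (unix_timestamp, outcome) pairs, where outcome
    is "accept" or "reject". Parsing real log lines into this shape is not shown.
    """
    totals = Counter()
    rejects = Counter()
    for timestamp, outcome in events:
        # Bucket each event into a fixed window, keyed by the window's start time.
        window = int(timestamp) // window_seconds * window_seconds
        totals[window] += 1
        if outcome == "reject":
            rejects[window] += 1
    alerts = []
    for window in sorted(totals):
        if totals[window] < min_events:
            continue  # too few samples to say anything useful
        rate = rejects[window] / totals[window]
        if rate > threshold:
            alerts.append((window, rate))
    return alerts
```

Fixed windows and a static threshold are the simplest possible choice. A production setup would more likely compare each window against a historical baseline, but even a crude check like this surfaces a problem before users start calling.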

When these four elements are in place, SLAs become much less urgent. Some problems still happen, but most problems either resolve automatically or are caught early enough that a scheduled response is sufficient.
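The “resolve automatically” part usually comes down to failover logic along these lines. This is a simplified sketch: `send` is a hypothetical caller-supplied function standing in for a real RADIUS exchange, and in practice the NAS or a RADIUS proxy handles this rather than application code.

```python
def authenticate_with_failover(servers, send, attempts_per_server=2):
    """Try each server in order; return (server, reply) from the first that answers.

    `send(server)` is a caller-supplied function that returns a reply or raises
    TimeoutError. It stands in for a real RADIUS exchange, which is not shown.
    """
    failures = []
    for server in servers:
        for attempt in range(1, attempts_per_server + 1):
            try:
                return server, send(server)
            except TimeoutError:
                # No answer: record it, then retry or move to the next server.
                failures.append((server, attempt))
    raise ConnectionError(f"no server answered; timeouts: {failures}")
```

Note that only a timeout triggers failover here. A rejection is still an answer from a working server, so it is returned as-is and is not retried elsewhere.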

The architecture conversation we have with new customers

When we take on a new customer - particularly a migration from another system - our first step is a configuration review. We look at how the system is built, how it’s documented, how recovery and failover are handled, and what monitoring is in place.

What we find varies considerably. Some systems have been well-maintained. Others have accumulated years of manual changes with no documentation, no automation, and no tested recovery path.

After the review, we make recommendations and implement them. Then we recommend that people don’t change the system unnecessarily. A stable, well-configured, well-monitored system is its own form of high availability architecture. The most common source of production incidents is a change made without full understanding of its downstream effects.

Discipline around change management is not glamorous. But it is far more effective than a two-hour response SLA.

A faster response is moot. The goal is fewer calls.

The underlying purpose of an SLA is to fix the problem. The best way to fix problems is to automate the response - so the system handles it before anyone is paged.

We build RADIUS deployments that our customers rarely need to call us about. That’s the architecture goal. When something unusual does happen, our response is fast and our knowledge of the system is deep - because we built it. But the calls are rare, because the architecture is sound.

If you are evaluating RADIUS support options and your first question is about SLA response times, I’d encourage you to ask a different question first: what does the vendor’s approach to architecture and automation look like? That answer will tell you far more about your actual risk.

If you’d like an assessment of your current RADIUS architecture, contact the InkBridge Networks team.

Frequently asked questions about high availability architecture

What is the difference between high availability and disaster recovery?

High availability (HA) focuses on keeping systems running continuously and minimising downtime through redundancy and automated failover.

Disaster recovery (DR) is the process of restoring systems after a major failure - a more significant event that HA systems are designed to prevent from happening in the first place.

Can an SLA guarantee uptime?

No. An SLA defines response and resolution commitments from your vendor - it does not change the underlying architecture of your systems. SLA uptime guarantees are only meaningful if the architecture supporting them is built correctly. A fast response from a vendor cannot fix structural problems; it can only begin working on them after the outage has already started.
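For a sense of scale, the arithmetic behind uptime figures is short enough to sketch. “Four 9s” (99.99%) allows roughly 52.6 minutes of downtime per year. The second function shows why redundancy matters, under the loudly stated assumption that server failures are fully independent, which they rarely are when servers share power, a network, or a configuration change. Both function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability):
    """Minutes of downtime per year permitted by an availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR


def combined_availability(single, servers):
    """Availability of `servers` replicas if failures were fully independent."""
    return 1 - (1 - single) ** servers
```

On paper, two servers that are each 99% available combine to about 99.99%. In practice that figure is only reached if failover actually works and the servers do not fail together, which is a property of the architecture and not of the contract.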

What does high-availability RADIUS architecture look like in practice?

A production-grade RADIUS deployment needs multiple redundant servers, tested failover between them, automated deployment and rebuild scripts, and ongoing log monitoring. Single-server RADIUS deployments are a liability in any environment where authentication downtime has direct business impact.

How do I know if my current RADIUS architecture is at risk?

If your current deployment has undocumented configurations, manual upgrade procedures, no tested recovery process, or limited monitoring, it carries meaningful risk - regardless of what SLA your vendor has agreed to. A configuration review is the first step.

