Active-active architecture — eliminating single point of failure

What it actually takes for an API gateway to disappear from the failure list. Active-active, nodeAffinity, and geographic redundancy on Kubernetes.

May 16, 2025 · 6 min read · Serkan, Engineering Lead · Engineering

Tags: #high-availability · #kubernetes · #spof · #architecture

API platform teams have one expectation that doesn't move: the system must be reachable, reliable, and ready to absorb traffic spikes — under every condition. Shipping an API is no longer enough. The gateway in front of it has to stay up, everywhere, and respond now.

So what happens when your gateway becomes unreachable?

The single point of failure problem

A common weakness in API management deployments is the single point of failure (SPOF) — one component whose loss takes the whole platform down. Route every API call through one gateway, and that gateway becomes the risk. When it fails, every downstream service fails with it.

This is exactly where Apinizer's active-active topology earns its keep. It preserves the operational benefits of a single logical entry point while making sure no physical node carries the whole platform on its back.

Single point of entry, multiple active nodes

A single point of entry means external traffic enters the system at one well-known address. That helps with security, identity, and governance — every request goes through the same checks. The downside, in a naive setup, is that the single point of entry becomes a single point of failure.

Apinizer separates the two. Logically, callers see one gateway. Physically, the gateway is multiple active replicas. The entry point is unified; the failure domain is split.

What active-active actually means

In an active-active topology, every replica handles traffic at the same time. A load balancer in front of them distributes requests across the pool. If a replica fails, the rest absorb its share and keep serving — no failover dance, no transition window, no warm spare to wake up.
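On Kubernetes, one way to express this is a Service in front of the replicas. The sketch below is an illustration rather than Apinizer's shipped manifest: the Service name, type, and port numbers are assumptions, and the selector reuses the app: apinizer-worker label from the deployment snippet later in this post.

```yaml
# Sketch only: a Service that spreads requests across every gateway pod.
apiVersion: v1
kind: Service
metadata:
  name: apinizer-gateway        # hypothetical name
spec:
  type: LoadBalancer            # assumes the cluster can provision an external load balancer
  selector:
    app: apinizer-worker        # every pod carrying this label joins the pool
  ports:
    - name: http
      port: 80                  # port exposed to callers (assumed)
      targetPort: 8080          # port the gateway container listens on (assumed)
```

Any pod carrying the label joins the pool, which is what makes every replica active at once rather than one primary plus spares.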

Why teams move to it

High availability: losing one replica is operationally boring. The rest continue, and the gateway tier is no longer a SPOF.
Load balancing: capacity is the sum of every active replica, not the capacity of one box with a spare.
Horizontal scalability: new replicas slot in, and traffic redistributes without manual intervention.
Higher throughput: replicas working in parallel serve more requests per second than a single active node backed by idle standbys on the same hardware budget.
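If the gateway runs as a Deployment, the horizontal scaling described above can be automated with a HorizontalPodAutoscaler. This is a minimal sketch under assumptions: the Deployment name apinizer-worker, the replica bounds, and the 70% CPU target are illustrative values, not Apinizer defaults.

```yaml
# Sketch only: scale the gateway between 2 and 6 replicas based on CPU.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: apinizer-worker          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: apinizer-worker        # hypothetical Deployment name
  minReplicas: 2                 # never fall back to a single replica
  maxReplicas: 6                 # assumed upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # assumed target, tune per workload
```

This only works if the cluster exposes resource metrics (for example through metrics-server) and the gateway containers declare CPU requests. Keeping minReplicas at 2 or higher preserves the active-active property even at low traffic.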

Each of these matters in API management, where uninterrupted service is not a "nice to have" — it's the product.

Active-active vs. active-passive

Traditional active-passive deployments run one primary and one or more standbys. The primary takes traffic; the rest sit idle. When the primary fails, a standby is promoted to take its place.

That's a real architecture and it has a place. But:

Failover takes time. Detection, promotion, DNS or load-balancer rewire — measurable downtime.
Standby capacity is paid for and unused. You're paying for hardware you can't route traffic to until the primary dies.
SPOF is reduced, not removed. The promotion logic itself is a single point of failure; the standby can be stale; the cut-over can fail.

Active-active flips this. There is no primary to lose. Capacity is paid for and used. Failover is "the rest keep going."
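"The rest keep going" still depends on the load balancer noticing that a replica is unhealthy. On Kubernetes that is usually the job of a readiness probe. The fragment below is a sketch: the container name, health path, port, and timings are all assumptions, so check the Apinizer documentation for the real health endpoint.

```yaml
# Sketch only: Deployment fragment adding a readiness probe to the gateway.
spec:
  template:
    spec:
      containers:
        - name: apinizer-worker      # hypothetical container name
          readinessProbe:
            httpGet:
              path: /healthz         # hypothetical health endpoint
              port: 8080             # assumed container port
            periodSeconds: 5         # check every 5 seconds
            failureThreshold: 2      # mark unready after 2 consecutive failures
```

A pod that fails its readiness probe is taken out of the Service's endpoints, so new requests stop reaching it while the remaining replicas carry on. Nothing is promoted and nothing is rewired.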

What this looks like on Kubernetes

Apinizer uses Kubernetes nodeAffinity and podAntiAffinity to keep replicas spread across nodes. The cluster gets two things from this:

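nodeAffinity keeps Apinizer pods on worker nodes marked as eligible. With at least as many replicas as eligible nodes, each of those nodes runs a pod, so there is no single-node concentration risk.
Pods don't pile up on the same node. Anti-affinity rules steer the scheduler away from the "all replicas on one node, that node dies, everything dies" failure mode. The preferred form is a strong hint to the scheduler; the required form turns it into a hard rule.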
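The "marked as eligible" part is the nodeAffinity half of the setup. A minimal sketch of such a rule follows. The node label key apinizer.example/gateway is hypothetical and chosen only for illustration; use whatever label your cluster applies to gateway nodes.

```yaml
# Sketch only: restrict gateway pods to nodes labelled as eligible.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: apinizer.example/gateway   # hypothetical node label
                    operator: In
                    values:
                      - "true"
```

A node becomes eligible once it carries the label, for example via kubectl label node <node-name> apinizer.example/gateway=true. Nodes without it are never considered for gateway pods.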

With a small cluster of two worker nodes, you can already run active-active: one Apinizer pod per node, both serving. Lose a node, and the surviving pod absorbs the traffic. Add a third node and raise the replica count to three, and the scheduler places the new pod on the empty node.

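# Worker Deployment fragment: spread replicas across nodes
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: the scheduler prefers a node without a gateway pod,
          # but will co-locate pods if no other node fits.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100                  # highest preference (valid range 1-100)
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: apinizer-worker   # the pods to keep apart
                # Treat each hostname (each node) as a separate failure domain
                topologyKey: "kubernetes.io/hostname"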

The architecture's load-bearing property is simple: if any host or node becomes unavailable, the pods running on the rest keep serving. API access stays uninterrupted, and the operator doesn't have to be paged to recover.
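The snippet above expresses a preference, so under resource pressure the scheduler may still place two replicas on one node. Teams that want the spread guaranteed can use the required form instead. This is a sketch of that variant, not necessarily how Apinizer ships it:

```yaml
# Sketch only: hard anti-affinity, at most one gateway pod per node.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: apinizer-worker
              topologyKey: "kubernetes.io/hostname"
```

The trade-off is that a replica with no free node stays Pending rather than doubling up, so the replica count should not exceed the number of eligible nodes.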

Geographic redundancy — when one data center isn't enough

Active-active inside one cluster solves the hardware-failure story. Some platforms need more than that — natural disasters, regional outages, backbone failures. For those teams, Apinizer supports clusters running in physically separate locations as one logical platform.

The single point of entry stays a single point of entry. The geographically distant clusters — say Istanbul and Ankara — present as one. Each region runs active pods; a higher-level load balancer distributes between them.

What that buys you

If one region's infrastructure goes completely dark, the other region continues serving.
Geographic redundancy covers disaster scenarios that single-region active-active doesn't.
Load distributes dynamically across regions.
Users see lower latency — the load balancer can prefer the geographically closer cluster.

The platform isn't tied to a physical location. It stays highly available through hardware failures, region failures, and disaster scenarios, without giving up the unified entry point you spent so long building.

What an outage actually costs

Most customers reach your services through APIs now. Payment flows, mobile clients, partner integrations — they all live or die on the API plane. When the gateway in front of them is unreachable, every minute:

Loses direct revenue
Erodes customer trust
Disrupts internal operations
May break a regulatory obligation
May halt a public service

Industry analyses repeatedly land in the same range: large API outages cost tens of thousands of dollars per hour and reach into the millions for the worst incidents. API resilience isn't optional. It's the business continuity story.

What you get from this on Apinizer

Through Apinizer's active-active topology, you can:

Take SPOF out of the API platform, making it resilient against outages of any single node.
Use system resources more efficiently — no idle standby pool.
Run maintenance and upgrades with near-zero downtime — drain one replica at a time, the rest keep serving.
Scale horizontally in response to growing traffic without rebuilding the architecture.
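For the maintenance case, "drain one replica at a time" can be enforced by the cluster rather than left to discipline. One common way is a PodDisruptionBudget. The sketch below assumes the app: apinizer-worker label from the deployment snippet, and the resource name is hypothetical.

```yaml
# Sketch only: allow at most one gateway pod to be evicted at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: apinizer-worker          # hypothetical name
spec:
  maxUnavailable: 1              # voluntary evictions may take down one pod at a time
  selector:
    matchLabels:
      app: apinizer-worker
```

With this in place, kubectl drain evicts a gateway pod only while the budget allows it, so a node upgrade cannot take out every replica at once. The budget covers voluntary disruptions only; it does not protect against a node crashing.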

Combined with geographic redundancy, the gateway becomes more than an API management platform — it becomes the foundation other systems can build on without designing around its failure.

Next steps

If you're running an API gateway today as a single replica because "that's how we started," it's worth asking what the next outage would cost you and what a migration to active-active would cost. For most teams running on Kubernetes, the second number is smaller than expected, because the primitives are already there.

If you want a 30-minute walkthrough of what an active-active Apinizer deployment looks like on your cluster, the door is open.
