Step 22: Service Fault Tolerance

Understand how ControlBird provides automatic failover and high availability for services.

Full reference

For complete details, field tables, and limitations, see the Fault Tolerance reference.

In the previous step, you learned how multiple nodes keep their data in sync. But what happens when a service crashes? ControlBird provides automatic failover so a healthy instance takes over and keeps your system running without manual intervention.

How high availability works

When you deploy the same service across multiple nodes, only one instance actively processes requests at a time. This active instance is the leader. The others stay on standby, ready to take over if the leader fails.

Service group: Device Manager
  • Leader: node-a / device-manager
  • Standby: node-b / device-manager
  • Standby: node-c / device-manager

Think of it like a team of pilots: one flies the plane while the others are co-pilots ready to take the controls if needed. The system automatically decides who's in charge based on availability and health.

What can participate

Two kinds of components can participate in fault tolerance: background services (such as the device manager) and protocol endpoints (such as an MQTT or Modbus connection). Both report their health and compete for leadership the same way.

  • Service: Device Manager (Status, Health)
  • Endpoint: MQTT Broker (TLS, Status, Traffic counters)

How health detection works

Each instance reports its health at regular intervals (typically once a second). If an instance stops reporting for longer than the failure detection window (about 5 seconds), the system marks it as unavailable so a healthy instance can take over.

  • Instance reports health: sends a health signal about once a second
  • Last-seen time recorded: the platform tracks when each instance was last healthy
  • Health is evaluated: if an instance goes silent past the timeout, it's treated as failed

Failure detection window: ~5 s, measured from the last heartbeat (0 s) to the point where the instance is declared failed (5 s).
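The detection logic described above can be sketched in a few lines of Python. This is an illustration only, not ControlBird's implementation: the `HealthTracker` class and its method names are hypothetical, and the 5-second window is taken from the figure above.

```python
from dataclasses import dataclass, field

DETECTION_WINDOW_S = 5.0  # assumed value, taken from the "~5 s" window above


@dataclass
class HealthTracker:
    """Hypothetical helper that records when each instance last reported healthy."""

    detection_window_s: float = DETECTION_WINDOW_S
    last_seen: dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, instance: str, now: float) -> None:
        # Store the timestamp of the most recent health report.
        self.last_seen[instance] = now

    def is_healthy(self, instance: str, now: float) -> bool:
        # An instance that never reported, or that has been silent for the
        # full detection window, is treated as failed.
        seen = self.last_seen.get(instance)
        return seen is not None and (now - seen) < self.detection_window_s


tracker = HealthTracker()
tracker.record_heartbeat("node-a", now=0.0)
tracker.record_heartbeat("node-b", now=0.0)
tracker.record_heartbeat("node-b", now=4.0)   # node-b keeps reporting

print(tracker.is_healthy("node-a", now=4.0))  # True: silent for 4 s, inside the window
print(tracker.is_healthy("node-a", now=5.0))  # False: silent for the full 5 s window
print(tracker.is_healthy("node-b", now=5.0))  # True: last report was 1 s ago
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to reason about.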

Choosing a leader

Once an instance is healthy and ready to lead, the platform considers it available. From the pool of available instances, one is automatically chosen as the active leader.

1. Instances signal readiness: each instance indicates it is healthy and ready to take the lead.
2. The available pool is assembled: only instances that are ready and reporting healthy within the timeout are eligible.
3. A leader is selected: one eligible instance becomes the active leader for the service group.
4. Instances react to the change: every instance learns of the new leader and adjusts its behavior accordingly.
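The first three steps can be sketched as follows. The `Instance` type and both function names are hypothetical, and the tie-breaking rule (keep the current leader while it is eligible, otherwise take the lowest name) is an assumption made for illustration; this page does not say how ControlBird picks among eligible instances. Step 4, notifying the instances, is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    name: str
    ready_to_lead: bool   # step 1: the instance signalled readiness
    last_seen_s: float    # time of its most recent health report


def available_pool(instances, now_s, timeout_s=5.0):
    # Step 2: keep only instances that are ready and reported within the timeout.
    return [i.name for i in instances
            if i.ready_to_lead and (now_s - i.last_seen_s) < timeout_s]


def select_leader(pool, current_leader=None) -> Optional[str]:
    # Step 3. ASSUMPTION: keep the incumbent while it is eligible, otherwise
    # take the lowest name. The real selection rule is not documented here.
    if current_leader in pool:
        return current_leader
    return min(pool) if pool else None


group = [
    Instance("node-a", ready_to_lead=True, last_seen_s=0.0),   # silent since t=0
    Instance("node-b", ready_to_lead=True, last_seen_s=9.5),
    Instance("node-c", ready_to_lead=False, last_seen_s=9.8),  # healthy, not ready
]
pool = available_pool(group, now_s=10.0)
leader = select_leader(pool, current_leader="node-a")
print(pool, leader)  # ['node-b'] node-b
```

An empty pool yields `None`, which corresponds to a service group with no active leader.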

Only the leader works

Only the active leader processes requests. Standby instances keep running but stay idle, reducing resource usage while remaining ready to take over as soon as a failure is detected.

Failover in action

When a leader fails and stops reporting health, the platform automatically promotes the next available instance. This typically happens within 5 to 10 seconds.

1. Normal operation: node-a is the leader and node-b is on standby; both report health as active.
2. node-a crashes.
3. Detection phase: node-a is marked failed after staying silent for 5 s or more; node-b remains on standby with active health.
4. Grace period (~1 s).
5. New leader elected: node-a is offline and no longer eligible; node-b is the leader and processes requests.
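As a worked example of the timing in this sequence, the sketch below adds the detection window and the grace period to the time of the last heartbeat. The function name is hypothetical and the two constants are the approximate values shown on this page. It ignores any additional delay the platform may add, so treat the result as an estimate rather than a measured figure.

```python
DETECTION_WINDOW_S = 5.0  # from the "~5 s" failure detection window
GRACE_PERIOD_S = 1.0      # from the "~1 s" failover grace period


def failover_timeline(last_heartbeat_s,
                      detection_window_s=DETECTION_WINDOW_S,
                      grace_period_s=GRACE_PERIOD_S):
    # The old leader is declared failed once it has been silent for the
    # whole detection window, counted from its last heartbeat.
    declared_failed_s = last_heartbeat_s + detection_window_s
    # The new leader is promoted after the grace period has also elapsed.
    new_leader_s = declared_failed_s + grace_period_s
    return {"declared_failed_s": declared_failed_s, "new_leader_s": new_leader_s}


timeline = failover_timeline(last_heartbeat_s=0.0)
print(timeline)  # {'declared_failed_s': 5.0, 'new_leader_s': 6.0}
```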

Endpoints and connection modes

For protocol endpoints (connections like MQTT or Modbus), you can choose how many instances accept connections at once:

LeaderOnly

Only the leader endpoint accepts connections. Used when you need exactly one active connection to an external device.

Example: Modbus TCP to a PLC, where only one controller should write commands

AllWarm

All available endpoints accept connections. Used for read-only or idempotent operations where multiple readers are safe.

Example: MQTT subscription, where all nodes can receive sensor readings
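The difference between the two modes comes down to one decision per endpoint, sketched below. The `ConnectionMode` enum and the `accepts_connections` function are hypothetical names used for illustration; only the mode names LeaderOnly and AllWarm come from this page, and how a mode is actually configured is not shown here.

```python
from enum import Enum


class ConnectionMode(Enum):
    LEADER_ONLY = "LeaderOnly"
    ALL_WARM = "AllWarm"


def accepts_connections(mode, is_available, is_leader):
    # An unavailable endpoint does not accept connections in either mode.
    if not is_available:
        return False
    if mode is ConnectionMode.LEADER_ONLY:
        # Exactly one active connection: only the leader endpoint accepts.
        return is_leader
    # AllWarm: every available endpoint accepts connections.
    return True


# A healthy standby endpoint under each mode.
print(accepts_connections(ConnectionMode.LEADER_ONLY, is_available=True, is_leader=False))  # False
print(accepts_connections(ConnectionMode.ALL_WARM, is_available=True, is_leader=False))     # True
```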

Viewing status

You can monitor failover status in the Database Browser. Each service group shows the instances that belong to it, which ones are currently available, and which one is the active leader.

Database Browser (Service groups / Services / Device Manager)
  • Instances: [node-a, node-b, node-c]
  • Available: [node-a, node-b]
  • Active leader: node-a
  • Failover grace period: ~1 s
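When reading these fields, two relationships should hold: every available instance belongs to the group, and the active leader is one of the available instances. The sketch below checks both. The function and its parameter names are hypothetical, and the values are typed in by hand from the example view; it does not call any ControlBird API.

```python
def check_group_status(instances, available, active_leader):
    """Return a list of problems; an empty list means the view looks consistent."""
    problems = []
    for name in available:
        if name not in instances:
            problems.append(f"{name} is available but not a member of the group")
    if active_leader is None:
        problems.append("no active leader")
    elif active_leader not in available:
        problems.append(f"leader {active_leader} is not in the available list")
    return problems


# Values copied from the Device Manager example view.
print(check_group_status(
    instances=["node-a", "node-b", "node-c"],
    available=["node-a", "node-b"],
    active_leader="node-a",
))  # []
```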

Troubleshooting

Service not becoming leader

  • Confirm the instance is marked ready to lead
  • Verify it is reporting health within the detection window
  • Make sure the instance belongs to the service group
  • Check whether another instance is already the active leader

Failover taking too long

Total failover time is the failure detection window plus the failover grace period. With the default values of about 5 seconds and about 1 second, that comes to roughly 6 seconds.

  • Shorten the detection window for faster detection (but watch for false positives)
  • Shorten the grace period for faster promotion (but watch for flapping)

Leadership keeps switching (flapping)

  • Lengthen the grace period to debounce rapid changes
  • Check for network instability between nodes
  • Make sure the health-reporting interval is well below the detection window

What happens if all instances fail?

If every instance becomes unavailable, no active leader remains and the service stops processing requests. As soon as any instance recovers and starts reporting health, it is automatically promoted to leader.