Step 22: Service Fault Tolerance
Understand how ControlBird provides automatic failover and high availability for services.
Full reference
For complete details, field tables, and limitations, see the Fault Tolerance reference.
In the previous step, you learned how multiple nodes keep their data in sync. But what happens when a service crashes? ControlBird provides automatic failover so a healthy instance takes over and keeps your system running without manual intervention.
How high availability works
When you deploy the same service across multiple nodes, only one instance actively processes requests at a time. This active instance is the leader. The others stay on standby, ready to take over if the leader fails.
Think of it like a team of pilots: one flies the plane while the others stand by, ready to take the controls if needed. The system automatically decides who's in charge based on availability and health.
What can participate
Two kinds of components can participate in fault tolerance: background services (such as the device manager) and protocol endpoints (such as an MQTT or Modbus connection). Both report their health and compete for leadership the same way.
How health detection works
Each instance reports its health at regular intervals (typically once a second). If an instance stops reporting for longer than the detection window (the health timeout), the system marks it as unavailable so a healthy instance can take over.
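As a rough illustration of the timeout check described above, here is a minimal Python sketch. It is not ControlBird's implementation: the `Instance` class, the `is_available` function, and the 5-second window are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical value for illustration only; not a documented ControlBird default.
DETECTION_WINDOW_S = 5.0


@dataclass
class Instance:
    name: str
    last_report_s: float  # time of the most recent health report, in seconds


def is_available(instance: Instance, now_s: float,
                 window_s: float = DETECTION_WINDOW_S) -> bool:
    """Return True if the instance reported health within the detection window."""
    return (now_s - instance.last_report_s) <= window_s


fresh = Instance("node-a", last_report_s=99.0)
stale = Instance("node-b", last_report_s=90.0)

print(is_available(fresh, now_s=100.0))  # True: last report was 1 s ago
print(is_available(stale, now_s=100.0))  # False: last report was 10 s ago
```

With a one-second reporting interval, a window of several seconds tolerates a few missed reports before an instance is marked unavailable.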
Choosing a leader
Once an instance is healthy and ready to lead, the platform considers it available. From the pool of available instances, one is automatically chosen as the active leader.
- Each instance indicates it is healthy and ready to take the lead
- Only instances that are ready and reporting healthy within the timeout are eligible
- One eligible instance becomes the active leader for the service group
- Every instance learns of the new leader and adjusts its behavior accordingly
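The selection steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how ControlBird picks among eligible instances, so the tie-break used here (keep the current leader if it is still eligible, otherwise take the alphabetically first name) is invented for the example, as are the `Candidate` and `choose_leader` names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    ready: bool            # instance has signalled it is ready to lead
    last_report_s: float   # time of the most recent health report, in seconds


def choose_leader(candidates: List[Candidate], now_s: float, timeout_s: float,
                  current: Optional[str] = None) -> Optional[str]:
    """Pick one leader from the instances that are ready and recently healthy."""
    eligible = sorted(
        c.name for c in candidates
        if c.ready and (now_s - c.last_report_s) <= timeout_s
    )
    if not eligible:
        return None            # nobody can lead right now
    if current in eligible:
        return current         # assumed rule: a healthy leader keeps its role
    return eligible[0]         # assumed rule: deterministic tie-break by name


group = [
    Candidate("node-a", ready=True, last_report_s=99.5),
    Candidate("node-b", ready=True, last_report_s=99.0),
    Candidate("node-c", ready=False, last_report_s=99.8),  # healthy but not ready
]
leader = choose_leader(group, now_s=100.0, timeout_s=5.0)

# node-a stops reporting; its last report is now older than the timeout.
group[0].last_report_s = 90.0
next_leader = choose_leader(group, now_s=100.0, timeout_s=5.0, current=leader)
print(leader, next_leader)
```

In this sketch `node-c` is never chosen because it has not signalled readiness, and the function returns `None` when no instance is eligible.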
Only the leader works
Only the active leader processes requests. Standby instances keep running but stay idle, which reduces resource usage while keeping them ready to take over quickly.
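One way to picture the leader-only rule is an instance that ignores work unless it currently holds leadership. The following Python sketch is hypothetical: the `ServiceInstance` class and its `on_leader_changed` and `handle` methods are illustrative names, not part of any ControlBird API.

```python
from typing import List, Optional


class ServiceInstance:
    """Hypothetical service instance that only works while it is the leader."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.is_leader = False
        self.processed: List[str] = []

    def on_leader_changed(self, leader_name: Optional[str]) -> None:
        # Every instance is told who the leader is and updates its own role.
        self.is_leader = (leader_name == self.name)

    def handle(self, request: str) -> bool:
        # Standby instances stay idle: they skip the request entirely.
        if not self.is_leader:
            return False
        self.processed.append(request)
        return True


a, b = ServiceInstance("node-a"), ServiceInstance("node-b")
for inst in (a, b):
    inst.on_leader_changed("node-a")

results = [a.handle("read-sensor"), b.handle("read-sensor")]
print(results)  # [True, False]
```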
Failover in action
When a leader fails and stops reporting health, the platform automatically promotes the next available instance. This typically happens within 5 to 10 seconds.
Endpoints and connection modes
For protocol endpoints (connections like MQTT or Modbus), you can choose how many instances accept connections at once:
LeaderOnly
Only the leader endpoint accepts connections. Used when you need exactly one active connection to an external device.
AllWarm
All available endpoints accept connections. Used for read-only or idempotent operations where multiple readers are safe.
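The difference between the two modes can be sketched as a small Python function. Only the mode names `LeaderOnly` and `AllWarm` come from the text above; the `ConnectionMode` enum, the `accepting_endpoints` function, and the endpoint names are assumptions made for illustration.

```python
from enum import Enum
from typing import List, Optional


class ConnectionMode(Enum):
    LEADER_ONLY = "LeaderOnly"
    ALL_WARM = "AllWarm"


def accepting_endpoints(mode: ConnectionMode, available: List[str],
                        leader: Optional[str]) -> List[str]:
    """Return the endpoints that accept connections under the given mode."""
    if mode is ConnectionMode.ALL_WARM:
        return list(available)          # every available endpoint accepts
    # LeaderOnly: at most one endpoint accepts, and only if it is available.
    return [leader] if leader in available else []


endpoints = ["ep-1", "ep-2", "ep-3"]
print(accepting_endpoints(ConnectionMode.LEADER_ONLY, endpoints, "ep-2"))  # ['ep-2']
print(accepting_endpoints(ConnectionMode.ALL_WARM, endpoints, "ep-2"))     # ['ep-1', 'ep-2', 'ep-3']
```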
Viewing status
You can monitor failover status in the Database Browser. Each service group shows the instances that belong to it, which ones are currently available, and which one is the active leader.
Troubleshooting
Service not becoming leader
- Confirm the instance is marked ready to lead
- Verify it is reporting health within the detection window
- Make sure the instance belongs to the service group
- Check whether another instance is already the active leader
Failover taking too long
Total failover time is the failure detection window plus a short grace period, around 6 seconds in total by default.
- Shorten the detection window for faster detection (but watch for false positives)
- Shorten the grace period for faster promotion (but watch for flapping)
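The trade-off is simple arithmetic, shown below as a Python sketch. The split of the roughly 6-second default into a 5-second detection window and a 1-second grace period is an assumption for illustration; the text above gives only the total, and `failover_time_s` is a hypothetical helper.

```python
def failover_time_s(detection_window_s: float, grace_period_s: float) -> float:
    """Total failover time: detection window plus grace period."""
    return detection_window_s + grace_period_s


# Assumed split of the ~6 s default; the individual values are not documented here.
default_total = failover_time_s(detection_window_s=5.0, grace_period_s=1.0)

# Shortening both values speeds up failover, at the cost of more false
# positives (short window) and more flapping (short grace period).
tuned_total = failover_time_s(detection_window_s=3.0, grace_period_s=0.5)

print(default_total, tuned_total)  # 6.0 3.5
```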
Leadership keeps switching (flapping)
- Lengthen the grace period to debounce rapid changes
- Check for network instability between nodes
- Make sure the health-reporting interval is well below the detection window
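To see why a longer grace period reduces flapping, consider this minimal debounce sketch in Python. It is an assumption about the general technique, not a description of ControlBird's internals: the `DebouncedLeader` class and its `observe` method are hypothetical, and a leadership change is committed only after the same candidate has been proposed continuously for the whole grace period.

```python
from typing import Optional


class DebouncedLeader:
    """Hypothetical debouncer: commit a leader change only after it holds steady."""

    def __init__(self, grace_period_s: float) -> None:
        self.grace_period_s = grace_period_s
        self.leader: Optional[str] = None
        self._pending: Optional[str] = None
        self._pending_since_s = 0.0

    def observe(self, candidate: str, now_s: float) -> Optional[str]:
        if candidate == self.leader:
            self._pending = None              # proposal matches leader: nothing pending
        elif candidate != self._pending:
            self._pending = candidate         # new proposal: restart the timer
            self._pending_since_s = now_s
        elif now_s - self._pending_since_s >= self.grace_period_s:
            self.leader = candidate           # proposal held for the grace period: commit
            self._pending = None
        return self.leader


d = DebouncedLeader(grace_period_s=1.0)
history = [
    d.observe("node-a", 0.0),  # proposed, not yet committed
    d.observe("node-a", 1.0),  # held for 1 s, committed
    d.observe("node-b", 2.0),  # brief blip toward node-b
    d.observe("node-a", 2.3),  # back to node-a, blip discarded
    d.observe("node-b", 3.0),  # node-b proposed again
    d.observe("node-b", 4.0),  # held for 1 s, committed
]
print(history)  # [None, 'node-a', 'node-a', 'node-a', 'node-a', 'node-b']
```

The short-lived switch to `node-b` at 2.0 s never takes effect, which is the behavior a longer grace period buys at the cost of slower promotion.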
What happens if all instances fail?
If every instance becomes unavailable, no active leader remains and the service stops processing requests. As soon as any instance recovers and starts reporting health, it is automatically promoted to leader.