Fleet OTA Strategy
For single-device OTA implementation details (pre-flight checks, signing, validation window), see firmware_guide.md — OTA Update.
This document defines the over-the-air firmware update strategy for production fleet deployments.
Update Architecture
┌──────────────┐     ┌───────────────┐     ┌───────────────┐
│ Build Server │     │ Update Server │     │ Robots        │
│              │     │ (HTTPS CDN)   │     │               │
│ sign + build ├────►│ host binary   ├────►│ verify + OTA  │
│              │     │               │     │ A/B partition │
└──────────────┘     └───────────────┘     └───────────────┘
Version Compatibility
| Firmware Version | Protocol Version | Compatible App Versions |
|---|---|---|
| 0.1.x | 1 | 0.1.x companion |
| 0.2.x | 1 | 0.1.x, 0.2.x companion |
| 1.0.x | 2 | 1.0.x+ companion |
Rules:
- Minor versions maintain backward compatibility within the same major version
- Major version bumps may break the protocol (require a coordinated companion update)
- Firmware advertises its protocol version; the companion checks compatibility before triggering OTA
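The table and rules above can be encoded as a small check that the companion runs before triggering OTA. This is a minimal sketch rather than the project's implementation: `parse_version` and `companion_is_compatible` are hypothetical names, version strings are assumed to be plain `major.minor.patch`, and "1.0.x+" is read as "1.0 or newer".

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'major.minor.patch' into integers (assumes no pre-release suffix)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


# Firmware release line -> protocol version, copied from the compatibility table.
PROTOCOL_BY_FIRMWARE = {(0, 1): 1, (0, 2): 1, (1, 0): 2}


def companion_is_compatible(firmware: str, companion: str) -> bool:
    """Return True if the table lists this companion line for this firmware line."""
    fw = parse_version(firmware)[:2]
    app = parse_version(companion)[:2]
    if fw == (0, 1):
        return app == (0, 1)
    if fw == (0, 2):
        return app in {(0, 1), (0, 2)}
    if fw == (1, 0):
        return app >= (1, 0)  # "1.0.x+" read as 1.0 or newer (assumption)
    return False  # unknown firmware line: refuse rather than guess
```

For example, `companion_is_compatible("0.2.1", "0.1.4")` is true because both speak protocol 1, while a 0.2.x companion is rejected for 1.0.x firmware.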
Staged Rollout Process
Phase 1: Internal Validation
- Deploy to 1-2 internal test robots
- Run 24-hour stability test
- Verify diagnostics telemetry reports healthy status
- Gate: all test robots report DiagnosticStatus.OK for 24h
Phase 2: Canary (10%)
- Deploy to 10% of fleet (randomly selected)
- Monitor for 48 hours
- Compare error rates against control group
- Gate: no increase in ERROR/WARN diagnostics vs control
Phase 3: Wider Rollout (50%)
- Deploy to 50% of fleet
- Monitor for 24 hours
- Gate: no regressions in topic rates, battery life, or safety events
Phase 4: Full Fleet (100%)
- Deploy to remaining robots
- Monitor for 24 hours post-completion
- Mark release as stable
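One way to pick the Phase 2 to 4 cohorts is to hash each robot ID into a stable bucket, so that the 10% canary group is automatically a subset of the 50% group. This swaps the "randomly selected" wording above for a deterministic hash, which is an assumption of this sketch and not something the process prescribes; `rollout_bucket` and `in_phase` are hypothetical names. Phase 1 robots are chosen by hand and are not covered here.

```python
import hashlib

# Cumulative fleet percentage for Phases 2, 3, and 4.
PHASE_PERCENT = {2: 10, 3: 50, 4: 100}


def rollout_bucket(robot_id: str, release: str) -> int:
    """Map a robot to a stable bucket in [0, 100) for a given release."""
    digest = hashlib.sha256(f"{release}:{robot_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_phase(robot_id: str, release: str, phase: int) -> bool:
    """True if the robot should receive this release by the given phase."""
    return rollout_bucket(robot_id, release) < PHASE_PERCENT[phase]
```

Including the release in the hash input means a different subset of robots acts as canaries for each release, so the same devices do not absorb every risky update.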
Rollback Triggers
Automatically halt rollout and trigger rollback if any of:
- Safety E-stop triggered without operator action (firmware bug)
- Boot loop detected (>3 reboots in 5 minutes)
- DTLS handshake failure rate > 5%
- Battery drain rate > 2x baseline
- Diagnostics topic stops publishing for > 60s
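The triggers above can be evaluated as a simple threshold check over collected telemetry. The sketch below is illustrative only: the `RobotHealth` fields and the reason strings are hypothetical, and it assumes each metric has already been aggregated elsewhere (for instance, the DTLS failure rate may be computed fleet-wide rather than per robot).

```python
from dataclasses import dataclass, field


@dataclass
class RobotHealth:
    unprompted_estop: bool = False          # E-stop fired without operator action
    reboot_times_s: list = field(default_factory=list)  # boot timestamps, seconds
    dtls_failure_rate: float = 0.0          # fraction of handshakes, 0.0 to 1.0
    battery_drain_vs_baseline: float = 1.0  # ratio to the baseline drain rate
    diagnostics_silence_s: float = 0.0      # seconds since last diagnostics message


def boot_loop(reboot_times_s, window_s=300.0, max_reboots=3) -> bool:
    """True if more than max_reboots boots fall inside any window_s span."""
    times = sorted(reboot_times_s)
    return any(
        times[i + max_reboots] - times[i] <= window_s
        for i in range(len(times) - max_reboots)
    )


def rollback_reasons(health: RobotHealth) -> list:
    """Return the names of every trigger that fired; empty means healthy."""
    reasons = []
    if health.unprompted_estop:
        reasons.append("estop_without_operator")
    if boot_loop(health.reboot_times_s):
        reasons.append("boot_loop")
    if health.dtls_failure_rate > 0.05:
        reasons.append("dtls_failures")
    if health.battery_drain_vs_baseline > 2.0:
        reasons.append("battery_drain")
    if health.diagnostics_silence_s > 60.0:
        reasons.append("diagnostics_silent")
    return reasons
```

Returning the list of reasons, instead of a bare boolean, keeps the cause of a halted rollout available for the incident record.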
ESP-IDF A/B Partition Scheme
┌─────────────────────────────────────────────────┐
│ Bootloader │ OTA Data │ Factory │ OTA_0 │ OTA_1 │
│            │          │ (app)   │ (app) │ (app) │
└─────────────────────────────────────────────────┘
- Factory: Initial firmware flashed at manufacturing
- OTA_0/OTA_1: Alternating partitions for updates
- OTA Data: Tracks which partition is active
Boot Validation Flow
- New firmware boots from OTA partition
- 30-second validation window: safety self-test, relay check, sensor initialization
- If validation passes (safety NORMAL, motors idle, no boot-failure escalation): call `esp_ota_mark_app_valid_cancel_rollback()`
- If validation fails or firmware crashes: reboot triggers automatic rollback to previous partition
- After 3 consecutive boot failures (tracked in NVS): firmware refuses to self-confirm, watchdog resets to previous partition
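The decision at the end of the validation window can be modeled as a small pure function. This is a host-side model of the logic described above, not device code: on the robot the confirm action corresponds to `esp_ota_mark_app_valid_cancel_rollback()`, while `BootAction` and `boot_decision` are hypothetical names. It assumes "after 3 consecutive boot failures" means a stored counter of 3 or more.

```python
from enum import Enum

MAX_CONSECUTIVE_BOOT_FAILURES = 3  # from the flow above; the counter lives in NVS


class BootAction(Enum):
    CONFIRM = "mark app valid and cancel rollback"
    ROLLBACK = "reboot into the previous partition"
    REFUSE_CONFIRM = "do not self-confirm; let the watchdog reset"


def boot_decision(validation_passed: bool, consecutive_failures: int) -> BootAction:
    """Choose what to do when the 30-second validation window ends."""
    if consecutive_failures >= MAX_CONSECUTIVE_BOOT_FAILURES:
        # Escalation wins even if this particular boot looked healthy.
        return BootAction.REFUSE_CONFIRM
    if validation_passed:
        return BootAction.CONFIRM
    return BootAction.ROLLBACK
```

Checking the failure counter first reflects the "no boot fail escalation" condition: a firmware that has failed repeatedly is not trusted to confirm itself.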
OTA Trigger Mechanisms
1. Scheduled Check (Recommended)
- Robot checks update server every 6 hours
- URL: `https://updates.robot-platform.org/v1/check?sku={sku}&version={current_version}`
- Response includes download URL and expected SHA-256
- Update only proceeds if battery > 50%
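The scheduled check boils down to three steps: build the query URL, confirm the battery precondition, and compare the downloaded image against the expected SHA-256. The sketch below shows those steps on the host side using only the standard library; the helper names are hypothetical, and it performs no network access.

```python
import hashlib
import hmac
from urllib.parse import urlencode

CHECK_ENDPOINT = "https://updates.robot-platform.org/v1/check"
MIN_BATTERY_PERCENT = 50


def build_check_url(sku: str, current_version: str) -> str:
    """Fill the {sku} and {current_version} placeholders, URL-encoding both."""
    query = urlencode({"sku": sku, "version": current_version})
    return f"{CHECK_ENDPOINT}?{query}"


def may_start_update(battery_percent: float) -> bool:
    """The update proceeds only when the battery is strictly above 50%."""
    return battery_percent > MIN_BATTERY_PERCENT


def sha256_matches(image: bytes, expected_hex: str) -> bool:
    """Compare the image digest with the value from the check response."""
    actual = hashlib.sha256(image).hexdigest()
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

The digest check detects corrupted or truncated downloads; it complements the signature verification and does not replace it.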
2. Push via MQTT
- Fleet management sends MQTT message to `fleet/{robot_id}/commands/ota`
- Payload: `{"url": "...", "sha256": "...", "version": "0.2.1"}`
- Robot validates and initiates download
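The "robot validates" step could look like the following. The exact checks are assumptions of this sketch (the source only names the three payload fields): it requires an `https` URL, a 64-character hex digest, and a `major.minor.patch` version. `parse_ota_command` is a hypothetical helper, and no MQTT client is involved.

```python
import json
import re
from urllib.parse import urlparse

SHA256_HEX = re.compile(r"[0-9a-fA-F]{64}")
VERSION = re.compile(r"\d+\.\d+\.\d+")


def parse_ota_command(raw: bytes) -> dict:
    """Return the validated command, or raise ValueError describing the problem."""
    try:
        cmd = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"payload is not valid JSON: {exc}") from exc
    if not isinstance(cmd, dict):
        raise ValueError("payload must be a JSON object")
    for key in ("url", "sha256", "version"):
        if not isinstance(cmd.get(key), str):
            raise ValueError(f"missing or non-string field: {key}")
    if urlparse(cmd["url"]).scheme != "https":
        raise ValueError("url must use https")
    if not SHA256_HEX.fullmatch(cmd["sha256"]):
        raise ValueError("sha256 must be 64 hex characters")
    if not VERSION.fullmatch(cmd["version"]):
        raise ValueError("version must look like major.minor.patch")
    return cmd
```

Rejecting malformed commands before any download starts keeps a bad push from consuming bandwidth or battery on the robot.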
3. Manual via ROS2 Service
- Operator calls: `ros2 service call /ota/trigger robot_interfaces/srv/OtaTrigger "{url: '...'}"`
Firmware Signing
All OTA images must be signed with ECDSA P-256:
# Generate signing key (one-time, store securely)
openssl ecparam -name prime256v1 -genkey -noout -out ota_signing_key.pem
# Extract public key (provision to devices)
openssl ec -in ota_signing_key.pem -pubout -outform DER -out ota_pubkey.der
# Sign a firmware binary
python3 scripts/sign_firmware.py \
--input build/robot-platform.bin \
--key ota_signing_key.pem \
--output build/robot-platform-signed.bin \
--version 0.2.1
The signed binary format:
Update Server Requirements
- HTTPS only (TLS 1.2+)
- CDN-backed for fleet scale
- Version manifest endpoint: returns latest version per SKU
- Binary download with range request support (resume interrupted downloads)
- Rate limiting: max 100 downloads per robot per day
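Range request support matters because a client can resume from the bytes it already has. The sketch below shows the client side of that exchange with the standard library; `build_resume_request` and `file_mode_for_status` are hypothetical helpers, and the request is only constructed here, not sent.

```python
import os
import urllib.request


def build_resume_request(url: str, partial_path: str) -> urllib.request.Request:
    """Ask only for the bytes that are not yet on disk."""
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    request = urllib.request.Request(url)
    if offset > 0:
        request.add_header("Range", f"bytes={offset}-")
    return request


def file_mode_for_status(status: int) -> str:
    """206 means the server honored the range; 200 means it sent the whole file."""
    if status == 206:
        return "ab"  # append to the partial download
    if status == 200:
        return "wb"  # server ignored the range: start over
    raise ValueError(f"unexpected HTTP status {status}")
```

Handling the plain 200 case explicitly avoids appending a full image to a partial one when a server or cache does not honor the `Range` header.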
Monitoring OTA Health
Track in fleet dashboard:
- Update success rate (target: >99%)
- Average download time
- Rollback rate (target: <1%)
- Fleet version distribution (pie chart)
- Time to full fleet deployment
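The rate metrics above are simple ratios over update attempts. As a rough sketch, assuming each attempt is recorded as a dict with a hypothetical `outcome` field (`"success"`, `"rollback"`, or anything else for other failures), they could be computed like this:

```python
from collections import Counter


def ota_health(attempts: list) -> dict:
    """Compute success and rollback rates over a list of attempt records."""
    total = len(attempts)
    succeeded = sum(1 for a in attempts if a["outcome"] == "success")
    rolled_back = sum(1 for a in attempts if a["outcome"] == "rollback")
    return {
        "success_rate": succeeded / total if total else 0.0,
        "rollback_rate": rolled_back / total if total else 0.0,
    }


def meets_targets(health: dict) -> bool:
    """Targets from the list above: success > 99%, rollback < 1%."""
    return health["success_rate"] > 0.99 and health["rollback_rate"] < 0.01


def version_distribution(fleet_versions: dict) -> dict:
    """Map each firmware version to its share of the fleet (robot_id -> version in)."""
    counts = Counter(fleet_versions.values())
    total = sum(counts.values())
    return {version: n / total for version, n in counts.items()}
```

The output of `version_distribution` is the data behind the pie chart; time to full deployment and average download time need timestamps that this sketch does not model.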
Emergency Procedures
Critical Bug in Deployed Firmware
- Halt rollout immediately (remove binary from update server)
- Push rollback command via MQTT to all affected robots
- If MQTT unreachable: robots auto-rollback after boot failure
- Root cause analysis before re-deploying fix
- Document incident in `CHANGELOG.md`
Signing Key Compromise
- Revoke the compromised key on the update server
- Generate a new signing key pair
- Push an OTA firmware with the new public key embedded, signed with the old key (its last use)
- Verify that all devices now have the new public key
- Destroy the old private key in all storage locations