Gates genuine reconnects PER DEVICE before the heavy register work (DB writes +
playlist build) runs, so a single flapping device can no longer saturate the
event loop and take down the server.
- Actuator is per-device, keyed on device_id (modeled on lastPlayLogAt). A device
is flagged only when it exceeds reconnectBaseMax genuine reconnects per window.
Same-socket playlist refreshes (isPlaylistRefresh) are exempt.
- Load-awareness is BANDED (normal/elevated/critical from the step-2 lag signal),
not a continuous controller. The band only MULTIPLIES an already-flagged
device's backoff; global lag never gates a healthy device.
- Hysteresis: escalate immediately while storming (tighten fast); decay one level
per reconnectReleaseMs of calm (release slow).
- HARD CEILING per device, independent of band and warm-up — a slow-ramp attacker
can't train through it.
- COLD START: for reconnectWarmupMs after boot, force the normal band and apply
only the hard ceiling, so a full-fleet reconnect after a deploy doesn't throttle
healthy screens. State is in-memory, resets on restart.
- Observability: every throttle engagement logs device, band, observed vs allowed
rate, and backoff. Throttled device gets device:throttled + a deferred disconnect.
Tests (api.test.js style):
- unit: healthy-never-throttled, storm-throttled-with-growing-backoff, band
multiplies backoff, hard-ceiling-even-in-warmup, warm-up leniency, neighbor
isolation, slow release.
- integration GATE (the required one): full-fleet reconnect right after restart
throttles NO healthy device; a single device storming IS throttled; a neighbor
stays unaffected while another storms.
- also fixes pre-existing test PORT collisions (my new integration files clashed
with totp.test.js:3979 and totp-keyrotation.test.js:3980 -> moved to 3982/3983);
full suite now green serially AND in parallel.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>