Gates genuine reconnects PER DEVICE before the heavy register work (DB writes +
playlist build) runs, so a single flapping device can no longer saturate the
event loop and take down the server.
- Actuator is per-device, keyed on device_id (modeled on lastPlayLogAt). A device
is flagged only when it exceeds reconnectBaseMax genuine reconnects per window.
Same-socket playlist refreshes (isPlaylistRefresh) are exempt.
- Load-awareness is BANDED (normal/elevated/critical from the step-2 lag signal),
not a continuous controller. The band only MULTIPLIES an already-flagged
device's backoff; global lag never gates a healthy device.
- Hysteresis: escalate immediately while storming (tighten fast); decay one level
per reconnectReleaseMs of calm (release slow).
- HARD CEILING per device, independent of band and warm-up — a slow-ramp attacker
can't train through it.
- COLD START: for reconnectWarmupMs after boot, force the normal band and apply
only the hard ceiling, so a full-fleet reconnect after a deploy doesn't throttle
healthy screens. State is in-memory, resets on restart.
- Observability: every throttle engagement logs device, band, observed vs allowed
rate, and backoff. Throttled device gets device:throttled + a deferred disconnect.
Tests (api.test.js style):
- unit: healthy-never-throttled, storm-throttled-with-growing-backoff, band
multiplies backoff, hard-ceiling-even-in-warmup, warm-up leniency, neighbor
isolation, slow release.
- integration GATE (the required one): full-fleet reconnect right after restart
throttles NO healthy device; a single device storming IS throttled; a neighbor
stays unaffected while another storms.
- also fixes pre-existing test PORT collisions (my new integration files clashed
with totp.test.js:3979 and totp-keyrotation.test.js:3980 -> moved to 3982/3983);
full suite now green serially AND in parallel.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Continuously samples event-loop delay via perf_hooks.monitorEventLoopDelay()
(C++-backed histogram; cheap). Each window persists mean/p50/p99/max to a new
event_loop_lag table and recomputes a coarse load band (normal/elevated/critical)
from the window p99. Standalone value: current lag is exposed on /api/status and
band changes are logged, so site lag is diagnosable independent of throttling.
The band feeds the #142 reconnect throttle (next commit) but ships first as its
own subsystem.
- event_loop_lag is bounded from day one: indexed on sampled_at + scheduled prune
(LAG_TELEMETRY_RETENTION_DAYS, small default) modeled on the play_logs prune.
Deliberately NOT another unbounded-growth table.
- Band transitions are asymmetric: jump up immediately (tighten fast), release one
level at a time after N calm samples below a deadband (release slow, no flap).
Pure nextBand() function, unit-tested deterministically.
- config: LAG_SAMPLE_INTERVAL_MS, LAG_RESOLUTION_MS, LAG_TELEMETRY_RETENTION_DAYS,
LAG_PRUNE_INTERVAL_MS, LAG_ELEVATED_MS, LAG_CRITICAL_MS, LAG_RELEASE_SAMPLES.
- tests: band-transition unit tests; integration proves sampling persists, stays
bounded under the prune, and surfaces on /api/status.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>