#142's content-ack dedup is insufficient: a device cycling 2-4 content IDs makes
every ack look unique so dedup never fires, while aggregate volume from ~30 devices
saturates the event loop (the #142 reconnect throttle kept the server responsive,
which is how this was even observable).
Both steps below are folded into ONE control on the content-ack path, in
lib/content-ack-limiter.js (no competing limiters; reconnect-throttle.js untouched):
- Step 1 — per-device RATE budget: caps TOTAL non-duplicate acks per device per
window regardless of differing content_id (the case dedup misses). Over budget =
DROP silently (the per-ack log+emit is the cost); log ONCE per device per window
when shedding starts. Keeps the #142 dedup (dedup'd repeats don't consume budget).
Per-device, in-memory, resets on restart (modeled on lastPlayLogAt; does NOT reuse
reconnect-throttle's ban-semantics bucket).
Env (TUNING GUESSES, validate vs Bold's fleet): CONTENT_ACK_MAX_PER_WINDOW=20,
CONTENT_ACK_RATE_WINDOW_MS=10000 (= 2/s: above the legit rate of roughly <=1/s,
below the flood).
- Step 2 — global pressure valve: reuses the #142 loop-lag band (+ its hysteresis,
no second control loop). Under the CRITICAL band, shed content-acks even for an
in-budget device; reconnects + dashboard/HTTP are ALWAYS processed; a healthy
device in a non-critical band is never touched by the valve. Valve open/close
logged once at the band edge in services/loop-lag.js (not per shed message).
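
Illustrative sketch of how the two steps can combine into a single decision
(NOT the actual lib/content-ack-limiter.js). The names createContentAckLimiter,
decide and the verdict strings are hypothetical. It assumes a fixed window per
device, that the existing dedup has already flagged duplicates, and that the
rate check runs before the valve so valve-shed acks still count against the
budget; the real ordering may differ.

```javascript
// Hypothetical sketch: per-device rate budget plus a global pressure valve.
// Defaults mirror the env values named above; everything else is assumed.
const DEFAULT_MAX = Number(process.env.CONTENT_ACK_MAX_PER_WINDOW) || 20;
const DEFAULT_WINDOW_MS = Number(process.env.CONTENT_ACK_RATE_WINDOW_MS) || 10000;

function createContentAckLimiter({
  max = DEFAULT_MAX,
  windowMs = DEFAULT_WINDOW_MS,
  log = console.warn,
} = {}) {
  // deviceId -> { windowStart, count, shedLogged }; in-memory, lost on restart.
  // A real implementation would also evict idle devices from this map.
  const buckets = new Map();

  // Returns 'duplicate', 'pass', 'shed-rate' or 'shed-valve'.
  return function decide(deviceId, { now = Date.now(), band = 'normal', duplicate = false } = {}) {
    // Repeats already caught by dedup never touch the budget.
    if (duplicate) return 'duplicate';

    let bucket = buckets.get(deviceId);
    if (!bucket || now - bucket.windowStart >= windowMs) {
      bucket = { windowStart: now, count: 0, shedLogged: false };
      buckets.set(deviceId, bucket);
    }

    // Step 1: count every non-duplicate ack, whatever its content_id.
    bucket.count += 1;
    if (bucket.count > max) {
      if (!bucket.shedLogged) {
        bucket.shedLogged = true; // log once per device per window
        log(`content-ack shedding started for device ${deviceId}`);
      }
      return 'shed-rate';
    }

    // Step 2: under the critical band, shed even in-budget acks.
    if (band === 'critical') return 'shed-valve';
    return 'pass';
  };
}
```

A caller would process the ack only on 'pass' and drop it silently otherwise;
reconnects and HTTP never go through decide().
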
Tests (unique ports 3985/3986, not the 3982/3983/3984 set):
- unit: the #143 regression (cycling ids evading dedup IS rate-limited), under/over
budget, dedup still works + doesn't consume budget, valve sheds in-budget under
critical while normal is untouched, rate precedence, window reset, per-device
isolation.
- integration: socket flood is capped to budget with a single shed-start log;
under-budget passes every ack; valve OPEN sheds content-acks while a reconnect +
/api/status still succeed.
Full suite green serial AND parallel (208 tests).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Continuously samples event-loop delay via perf_hooks.monitorEventLoopDelay()
(C++-backed histogram; cheap). Each window persists mean/p50/p99/max to a new
event_loop_lag table and recomputes a coarse load band (normal/elevated/critical)
from the window p99. Standalone value: current lag is exposed on /api/status and
band changes are logged, so site lag is diagnosable independent of throttling.
The band feeds the #142 reconnect throttle (next commit) but ships first as its
own subsystem.
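
Rough sketch of the sampling loop, for orientation only. monitorEventLoopDelay()
and its histogram fields are the real perf_hooks API (values are in
nanoseconds); startLagSampler, summarizeWindow, onWindow and the default
interval/resolution values are hypothetical and not taken from
services/loop-lag.js.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');

const NS_PER_MS = 1e6; // histogram values are reported in nanoseconds

// Convert one window of the histogram into the stats persisted per window.
function summarizeWindow(histogram) {
  return {
    sampledAt: Date.now(),
    meanMs: histogram.mean / NS_PER_MS,
    p50Ms: histogram.percentile(50) / NS_PER_MS,
    p99Ms: histogram.percentile(99) / NS_PER_MS,
    maxMs: histogram.max / NS_PER_MS,
  };
}

// Start sampling; calls onWindow(stats) once per interval. Returns a stop function.
// The defaults here are placeholders, not the project's configured values.
function startLagSampler({ intervalMs = 5000, resolutionMs = 20, onWindow }) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();

  const timer = setInterval(() => {
    const stats = summarizeWindow(histogram);
    histogram.reset(); // each window starts from an empty histogram
    onWindow(stats);   // e.g. insert a row, then recompute the band from p99Ms
  }, intervalMs);
  timer.unref(); // sampling alone should not keep the process alive

  return function stop() {
    clearInterval(timer);
    histogram.disable();
  };
}
```
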
- event_loop_lag is bounded from day one: indexed on sampled_at + scheduled prune
(LAG_TELEMETRY_RETENTION_DAYS, small default) modeled on the play_logs prune.
Deliberately NOT another unbounded-growth table.
- Band transitions are asymmetric: jump up immediately (tighten fast), release one
level at a time after N calm samples below a deadband (release slow, no flap).
Pure nextBand() function, unit-tested deterministically.
- config: LAG_SAMPLE_INTERVAL_MS, LAG_RESOLUTION_MS, LAG_TELEMETRY_RETENTION_DAYS,
LAG_PRUNE_INTERVAL_MS, LAG_ELEVATED_MS, LAG_CRITICAL_MS, LAG_RELEASE_SAMPLES.
- tests: band-transition unit tests; integration proves sampling persists, stays
bounded under the prune, and surfaces on /api/status.
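
One way the asymmetric transition could look as a pure function. This is an
assumption-laden sketch, not the shipped nextBand(): the state shape
({ band, calm }), the cfg field names, and treating the deadband as a fraction
of the current band's entry threshold are all hypothetical.

```javascript
const BANDS = ['normal', 'elevated', 'critical'];

// Band index implied by a single window's p99, ignoring hysteresis.
function rawBandIndex(p99Ms, cfg) {
  if (p99Ms >= cfg.criticalMs) return 2;
  if (p99Ms >= cfg.elevatedMs) return 1;
  return 0;
}

// Pure: (previous state, window p99, config) -> next state.
function nextBand(state, p99Ms, cfg) {
  const current = BANDS.indexOf(state.band);
  const raw = rawBandIndex(p99Ms, cfg);

  // Tighten fast: jump straight to the higher band.
  if (raw > current) return { band: BANDS[raw], calm: 0 };
  if (current === 0) return { band: 'normal', calm: 0 };

  // Release slow: a sample is "calm" only below the deadband under the
  // threshold that put us in the current band.
  const entryMs = current === 2 ? cfg.criticalMs : cfg.elevatedMs;
  if (p99Ms >= entryMs * cfg.deadband) return { band: state.band, calm: 0 };

  const calm = state.calm + 1;
  if (calm >= cfg.releaseSamples) {
    return { band: BANDS[current - 1], calm: 0 }; // one level at a time
  }
  return { band: state.band, calm };
}
```

With cfg = { elevatedMs: 50, criticalMs: 200, releaseSamples: 3, deadband: 0.8 },
a single 250 ms window moves normal to critical at once, while dropping back to
normal takes three calm windows per level.
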
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>