Archive/screentinker

Fork 0

mirror of https://github.com/screentinker/screentinker.git synced 2026-06-29 09:23:16 -06:00

Commit graph

Author	SHA1	Message	Date
ScreenTinker	dbac699854	fix(#143 ): content-ack flood control — per-device rate budget + loop-lag valve #142's content-ack dedup is insufficient: a device cycling 2-4 content IDs makes every ack look unique so dedup never fires, while aggregate volume from ~30 devices saturates the event loop (the #142 reconnect throttle kept the server responsive, which is how this was even observable). Folded ONE control on the content-ack path (no competing limiters; reconnect- throttle.js untouched) in lib/content-ack-limiter.js: - Step 1 — per-device RATE budget: caps TOTAL non-duplicate acks per device per window regardless of differing content_id (the case dedup misses). Over budget = DROP silently (the per-ack log+emit is the cost); log ONCE per device per window when shedding starts. Keeps the #142 dedup (dedup'd repeats don't consume budget). Per-device, in-memory, resets on restart (modeled on lastPlayLogAt; does NOT reuse reconnect-throttle's ban-semantics bucket). Env (TUNING GUESSES, validate vs Bold's fleet): CONTENT_ACK_MAX_PER_WINDOW=20, CONTENT_ACK_RATE_WINDOW_MS=10000 (=2/s, above legit ~<=1/s, below the flood). - Step 2 — global pressure valve: reuses the #142 loop-lag band (+ its hysteresis, no second control loop). Under CRITICAL band, shed content-acks even for an in-budget device; reconnects + dashboard/HTTP are ALWAYS processed; a healthy device in a non-critical band is never touched by the valve. Valve open/close logged once at the band edge in services/loop-lag.js (not per shed message). Tests (unique ports 3985/3986, not the 3982/3983/3984 set): - unit: the #143 regression (cycling ids evading dedup IS rate-limited), under/over budget, dedup still works + doesn't consume budget, valve sheds in-budget under critical while normal is untouched, rate precedence, window reset, per-device isolation. - integration: socket flood is capped to budget with a single shed-start log; under-budget passes every ack; valve OPEN sheds content-acks while a reconnect + /api/status still succeed. Full suite green serial AND parallel (208 tests). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 22:21:57 -05:00
ScreenTinker	ed3cf72b82	feat(#142 ): event-loop lag telemetry (perf_hooks) + bounded storage Continuously samples event-loop delay via perf_hooks.monitorEventLoopDelay() (C++-backed histogram; cheap). Each window persists mean/p50/p99/max to a new event_loop_lag table and recomputes a coarse load band (normal/elevated/critical) from the window p99. Standalone value: current lag is exposed on /api/status and band changes are logged, so site lag is diagnosable independent of throttling. The band feeds the #142 reconnect throttle (next commit) but ships first as its own subsystem. - event_loop_lag is bounded from day one: indexed on sampled_at + scheduled prune (LAG_TELEMETRY_RETENTION_DAYS, small default) modeled on the play_logs prune. Deliberately NOT another unbounded-growth table. - Band transitions are asymmetric: jump up immediately (tighten fast), release one level at a time after N calm samples below a deadband (release slow, no flap). Pure nextBand() function, unit-tested deterministically. - config: LAG_SAMPLE_INTERVAL_MS, LAG_RESOLUTION_MS, LAG_TELEMETRY_RETENTION_DAYS, LAG_PRUNE_INTERVAL_MS, LAG_ELEVATED_MS, LAG_CRITICAL_MS, LAG_RELEASE_SAMPLES. - tests: band-transition unit tests; integration proves sampling persists, stays bounded under the prune, and surfaces on /api/status. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 19:01:08 -05:00

Author

SHA1

Message

Date

ScreenTinker

dbac699854

fix(#143 ): content-ack flood control — per-device rate budget + loop-lag valve

#142's content-ack dedup is insufficient: a device cycling 2-4 content IDs makes
every ack look unique so dedup never fires, while aggregate volume from ~30 devices
saturates the event loop (the #142 reconnect throttle kept the server responsive,
which is how this was even observable).

Folded ONE control on the content-ack path (no competing limiters; reconnect-
throttle.js untouched) in lib/content-ack-limiter.js:
- Step 1 — per-device RATE budget: caps TOTAL non-duplicate acks per device per
  window regardless of differing content_id (the case dedup misses). Over budget =
  DROP silently (the per-ack log+emit is the cost); log ONCE per device per window
  when shedding starts. Keeps the #142 dedup (dedup'd repeats don't consume budget).
  Per-device, in-memory, resets on restart (modeled on lastPlayLogAt; does NOT reuse
  reconnect-throttle's ban-semantics bucket).
  Env (TUNING GUESSES, validate vs Bold's fleet): CONTENT_ACK_MAX_PER_WINDOW=20,
  CONTENT_ACK_RATE_WINDOW_MS=10000 (=2/s, above legit ~<=1/s, below the flood).
- Step 2 — global pressure valve: reuses the #142 loop-lag band (+ its hysteresis,
  no second control loop). Under CRITICAL band, shed content-acks even for an
  in-budget device; reconnects + dashboard/HTTP are ALWAYS processed; a healthy
  device in a non-critical band is never touched by the valve. Valve open/close
  logged once at the band edge in services/loop-lag.js (not per shed message).

Tests (unique ports 3985/3986, not the 3982/3983/3984 set):
- unit: the #143 regression (cycling ids evading dedup IS rate-limited), under/over
  budget, dedup still works + doesn't consume budget, valve sheds in-budget under
  critical while normal is untouched, rate precedence, window reset, per-device
  isolation.
- integration: socket flood is capped to budget with a single shed-start log;
  under-budget passes every ack; valve OPEN sheds content-acks while a reconnect +
  /api/status still succeed.
Full suite green serial AND parallel (208 tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-27 22:21:57 -05:00

ScreenTinker

ed3cf72b82

feat(#142 ): event-loop lag telemetry (perf_hooks) + bounded storage

Continuously samples event-loop delay via perf_hooks.monitorEventLoopDelay()
(C++-backed histogram; cheap). Each window persists mean/p50/p99/max to a new
event_loop_lag table and recomputes a coarse load band (normal/elevated/critical)
from the window p99. Standalone value: current lag is exposed on /api/status and
band changes are logged, so site lag is diagnosable independent of throttling.

The band feeds the #142 reconnect throttle (next commit) but ships first as its
own subsystem.

- event_loop_lag is bounded from day one: indexed on sampled_at + scheduled prune
  (LAG_TELEMETRY_RETENTION_DAYS, small default) modeled on the play_logs prune.
  Deliberately NOT another unbounded-growth table.
- Band transitions are asymmetric: jump up immediately (tighten fast), release one
  level at a time after N calm samples below a deadband (release slow, no flap).
  Pure nextBand() function, unit-tested deterministically.
- config: LAG_SAMPLE_INTERVAL_MS, LAG_RESOLUTION_MS, LAG_TELEMETRY_RETENTION_DAYS,
  LAG_PRUNE_INTERVAL_MS, LAG_ELEVATED_MS, LAG_CRITICAL_MS, LAG_RELEASE_SAMPLES.
- tests: band-transition unit tests; integration proves sampling persists, stays
  bounded under the prune, and surfaces on /api/status.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-27 19:01:08 -05:00

2 commits