16 commits

Author SHA1 Message Date
ScreenTinker e73428182d fix(#143): fingerprint-reclaim stuck loop — reclaim by runtime liveness, throttle log
Bold beta1: three devices spam "Fingerprint reclaim rejected ... device active
(status=offline, ~2500s since heartbeat, liveConn=false)" twice/~2s indefinitely —
contradictory: gone by every signal yet treated as active.

Root cause (NOT a missing clear — corrected the hypothesis). The reject condition
was `liveConn || status==='online' || secondsSince < RECLAIM_GRACE_SECONDS(24h)`.
For the observed devices liveConn=false and status=offline, so the ONLY true term
is `secondsSince < 24h` — an effective 24h CALENDAR grace, not a stale flag. Audited
the clears: liveConn (deviceConnections) is removed on the debounced disconnect
(heartbeat.removeConnection) AND the offline_timeout sweep (deviceConnections.delete);
status is set 'offline' on both. liveConn=false + status=offline PROVE the clears
ran — there is nothing stale to clear. The 24h time gate (mislabeled "device active")
blocked a legitimately-gone device from reclaiming for up to 24h, so it retried
every ~2s forever-in-practice. The "twice per ~2s" is two reclaim ATTEMPTS per cycle
(client reconnect + re-pair-on-auth-error), each hitting the single console.warn —
not double-logging in one attempt.

Fix:
- Decide "still alive" from RUNTIME signals: `!!liveConn || secondsSince <
  reclaimSettleSeconds`. A device with no live socket and a heartbeat older than the
  settle window is gone -> reclaimable. A live (or just-seen) device is still
  rejected, so reclaim-abuse protection holds. NOT just ignoring "active" — it fixes
  WHY it was stuck (the 24h gate). RECLAIM_SETTLE_SECONDS default 300 (was 24h).
  SECURITY TRADEOFF flagged in config: shortens the anti-fingerprint-theft window;
  raise to re-tighten. Tuning guess to validate vs Bold.
- Log throttle: the deferral logs at most once per device per
  RECLAIM_REJECT_LOG_WINDOW_MS (default 60s), which collapses the double-log and
  the per-2s flood (same discipline as the content-ack shed log). Cleared when a
  reclaim proceeds.
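
A minimal sketch of the decision and the log throttle described in the Fix list. This is not the shipped code: only the two config names and their defaults come from the message above, while `isStillAlive`, `shouldLogReject`, `decideReclaim`, and the in-memory map are hypothetical names chosen for illustration.

```javascript
// Defaults quoted in the commit message; everything else here is assumed.
const RECLAIM_SETTLE_SECONDS = 300;
const RECLAIM_REJECT_LOG_WINDOW_MS = 60000;

// "Still alive" is decided from runtime signals only: a live socket, or a
// heartbeat newer than the settle window. Stored status is not consulted.
function isStillAlive(liveConn, secondsSinceHeartbeat, settleSeconds = RECLAIM_SETTLE_SECONDS) {
  return !!liveConn || secondsSinceHeartbeat < settleSeconds;
}

const lastRejectLogAt = new Map(); // device_id -> ms timestamp of last deferral log

// True when the deferral may be logged (at most once per device per window).
function shouldLogReject(deviceId, now = Date.now(), windowMs = RECLAIM_REJECT_LOG_WINDOW_MS) {
  const last = lastRejectLogAt.get(deviceId);
  if (last !== undefined && now - last < windowMs) return false;
  lastRejectLogAt.set(deviceId, now);
  return true;
}

// Decide one reclaim attempt; the log throttle is cleared when a reclaim proceeds.
function decideReclaim(deviceId, liveConn, secondsSinceHeartbeat, now = Date.now()) {
  if (isStillAlive(liveConn, secondsSinceHeartbeat)) {
    return { allowed: false, log: shouldLogReject(deviceId, now) };
  }
  lastRejectLogAt.delete(deviceId);
  return { allowed: true, log: false };
}
```

Under these assumptions, a device with no live socket and a ~2500s-old heartbeat is past the 300s settle window, so `decideReclaim` allows it on the first attempt.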

Recovery of the 3 wedged devices (2febcaa9, 1984694c, 139159eb): they SELF-HEAL on
their next reclaim attempt (~2s) once this ships — their heartbeats are ~2500s stale
(>300s settle) and liveConn=false, so the reclaim now succeeds. No operator SQL needed.

Tests (port 3988): gone device reclaims; live device still rejected; clear-on-leave
(disconnect clears liveConn -> stale device reclaims); deferral log <=1 per window.
Full suite green serial+parallel (217). reconnect-throttle.js, the dbac699 content-ack
limiter, and the 404c330 block/auth code untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 22:56:48 -05:00
ScreenTinker dbac699854 fix(#143): content-ack flood control — per-device rate budget + loop-lag valve
#142's content-ack dedup is insufficient: a device cycling 2-4 content IDs makes
every ack look unique so dedup never fires, while aggregate volume from ~30 devices
saturates the event loop (the #142 reconnect throttle kept the server responsive,
which is how this was even observable).

Folded ONE control on the content-ack path (no competing limiters;
reconnect-throttle.js untouched) in lib/content-ack-limiter.js:
- Step 1 — per-device RATE budget: caps TOTAL non-duplicate acks per device per
  window regardless of differing content_id (the case dedup misses). Over budget =
  DROP silently (the per-ack log+emit is the cost); log ONCE per device per window
  when shedding starts. Keeps the #142 dedup (dedup'd repeats don't consume budget).
  Per-device, in-memory, resets on restart (modeled on lastPlayLogAt; does NOT reuse
  reconnect-throttle's ban-semantics bucket).
  Env (TUNING GUESSES, validate vs Bold's fleet): CONTENT_ACK_MAX_PER_WINDOW=20,
  CONTENT_ACK_RATE_WINDOW_MS=10000 (=2/s, above legit ~<=1/s, below the flood).
- Step 2 — global pressure valve: reuses the #142 loop-lag band (+ its hysteresis,
  no second control loop). Under CRITICAL band, shed content-acks even for an
  in-budget device; reconnects + dashboard/HTTP are ALWAYS processed; a healthy
  device in a non-critical band is never touched by the valve. Valve open/close
  logged once at the band edge in services/loop-lag.js (not per shed message).
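
The two steps can be sketched roughly as follows. This is an illustration, not the contents of lib/content-ack-limiter.js: `createAckLimiter`, the return shape, and the choice to check the rate budget before the valve are assumptions, and the #142 dedup is assumed to have run before this function is called.

```javascript
// Hypothetical limiter factory; defaults mirror the env values quoted above.
function createAckLimiter({ maxPerWindow = 20, windowMs = 10000 } = {}) {
  const buckets = new Map(); // device_id -> { windowStart, count, shedLogged }

  // band is the loop-lag band: 'normal' | 'elevated' | 'critical'.
  return function check(deviceId, band, now = Date.now()) {
    let b = buckets.get(deviceId);
    if (!b || now - b.windowStart >= windowMs) {
      b = { windowStart: now, count: 0, shedLogged: false };
      buckets.set(deviceId, b);
    }
    // Step 1: per-device budget on TOTAL acks, whatever their content_id.
    if (b.count >= maxPerWindow) {
      const log = !b.shedLogged; // log once per device per window
      b.shedLogged = true;
      return { accept: false, reason: 'rate', log };
    }
    // Step 2: global valve; under the critical band even in-budget acks are shed.
    if (band === 'critical') return { accept: false, reason: 'valve', log: false };
    b.count += 1;
    return { accept: true, reason: null, log: false };
  };
}
```

With the defaults, a device cycling through several content IDs is still capped at 20 accepted acks per 10s, which is the case the dedup alone misses.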

Tests (unique ports 3985/3986, not the 3982/3983/3984 set):
- unit: the #143 regression (cycling ids evading dedup IS rate-limited), under/over
  budget, dedup still works + doesn't consume budget, valve sheds in-budget under
  critical while normal is untouched, rate precedence, window reset, per-device
  isolation.
- integration: socket flood is capped to budget with a single shed-start log;
  under-budget passes every ack; valve OPEN sheds content-acks while a reconnect +
  /api/status still succeed.
Full suite green serial AND parallel (208 tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 22:21:57 -05:00
ScreenTinker 15448d1c5d fix(#142): dedup repeated content-ack reports (secondary load)
device:content-ack logged + emitted every message, so a device repeatedly
reporting the same "content <id>: ready" (observed from an older app version)
added avoidable load per message.

- Suppress identical (device_id, content_id, status) reports within
  config.contentAckDedupMs (default 10s), modeled on the lastPlayLogAt throttle.
  A status change has a different key and passes immediately; a fresh report after
  the window passes too. In-memory, resets on restart. The handler does no DB
  writes, so this is purely shedding redundant log+emit work.
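
A small sketch of the dedup key, assuming an in-memory map keyed on the (device_id, content_id, status) triple. The function name `isDuplicateAck` and the map are hypothetical; the 10s default is the one stated above.

```javascript
const CONTENT_ACK_DEDUP_MS = 10000; // default for config.contentAckDedupMs
const lastAckAt = new Map(); // JSON key of [device, content, status] -> ms timestamp

// True when an identical report was already seen inside the window.
function isDuplicateAck(deviceId, contentId, status, now = Date.now(), windowMs = CONTENT_ACK_DEDUP_MS) {
  const key = JSON.stringify([deviceId, contentId, status]);
  const last = lastAckAt.get(key);
  if (last !== undefined && now - last < windowMs) return true;
  lastAckAt.set(key, now); // a status change has a different key and lands here
  return false;
}
```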

test: integration over a real authenticated device socket — a burst of identical
"ready" collapses to one log/emit, a "ready" after the window passes, and a status
change is never deduped. Unique PORT (3984).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 19:35:04 -05:00
ScreenTinker 29a8896aa8 fix(#142): global device_status_log retention sweep + STATUS_LOG_RETENTION_DAYS
The per-device insert-time prune (deviceSocket.js) only ever touches a device
that is actively inserting, so it misses two paths: removed/idle devices whose
rows linger forever, and heartbeat.js's offline_timeout insert that bypasses
logDeviceStatus entirely. The reporter's 1.2M-row bloat accumulated UNDER a 7-day
per-device prune for exactly this reason.

- pruneStatusLog() (db/database.js): a GLOBAL time-range sweep across ALL devices,
  modeled on the play_logs prune. Run once on startup (recovers a bloated table
  right after deploy) and on the heartbeat interval (services/heartbeat.js).
- STATUS_LOG_RETENTION_DAYS env, default 3 (lower than the old hardcoded 7d; the
  dashboard only shows a 24h uptime window, so 2-3d is ample for diagnostics).
- Deliberately NO per-device row cap: Step 3's throttle already bounds how fast a
  storming device can generate status rows, so a cap would add sweep complexity
  for little gain (noted for later if needed).
- NO VACUUM / auto_vacuum here (kept off the hot path); space reclaim is left as a
  separate decision (see report).

test: deterministic in-process unit test proves the sweep deletes over-retention
rows across all devices — including a device absent from the devices table and an
offline_timeout row — while keeping recent rows; idempotent on an empty table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 19:34:19 -05:00
ScreenTinker 101f086204 fix(#142): load-aware per-device reconnect throttle (the outage fix)
Gates genuine reconnects PER DEVICE before the heavy register work (DB writes +
playlist build) runs, so a single flapping device can no longer saturate the
event loop and take down the server.

- Actuator is per-device, keyed on device_id (modeled on lastPlayLogAt). A device
  is flagged only when it exceeds reconnectBaseMax genuine reconnects per window.
  Same-socket playlist refreshes (isPlaylistRefresh) are exempt.
- Load-awareness is BANDED (normal/elevated/critical from the step-2 lag signal),
  not a continuous controller. The band only MULTIPLIES an already-flagged
  device's backoff; global lag never gates a healthy device.
- Hysteresis: escalate immediately while storming (tighten fast); decay one level
  per reconnectReleaseMs of calm (release slow).
- HARD CEILING per device, independent of band and warm-up — a slow-ramp attacker
  can't train through it.
- COLD START: for reconnectWarmupMs after boot, force the normal band and apply
  only the hard ceiling, so a full-fleet reconnect after a deploy doesn't throttle
  healthy screens. State is in-memory, resets on restart.
- Observability: every throttle engagement logs device, band, observed vs allowed
  rate, and backoff. Throttled device gets device:throttled + a deferred disconnect.
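
A heavily simplified sketch of the gating rules above, leaving out the hysteresis (escalation and slow release). All names and numbers are illustrative assumptions, not the project's config values; only the rules come from the message: the band multiplies the backoff of an already-flagged device, the hard ceiling applies even during warm-up, and warm-up forces the normal band.

```javascript
// Illustrative multipliers; the real values are not stated in the commit.
const BAND_MULTIPLIER = { normal: 1, elevated: 2, critical: 4 };

function createReconnectThrottle({
  baseMax = 5, hardCeiling = 30, windowMs = 60000,
  baseBackoffMs = 1000, warmupMs = 0, bootAt = 0,
} = {}) {
  const devices = new Map(); // device_id -> { windowStart, count }

  return function onReconnect(deviceId, loadBand, now) {
    let d = devices.get(deviceId);
    if (!d || now - d.windowStart >= windowMs) {
      d = { windowStart: now, count: 0 };
      devices.set(deviceId, d);
    }
    d.count += 1;

    const warming = now - bootAt < warmupMs;
    const band = warming ? 'normal' : loadBand; // cold start forces the normal band
    const overCeiling = d.count > hardCeiling;  // independent of band and warm-up
    const overBase = !warming && d.count > baseMax;

    // A device inside its own budget is never gated, whatever the global lag.
    if (!overCeiling && !overBase) return { throttled: false, backoffMs: 0 };
    return { throttled: true, backoffMs: baseBackoffMs * BAND_MULTIPLIER[band] };
  };
}
```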

Tests (api.test.js style):
- unit: healthy-never-throttled, storm-throttled-with-growing-backoff, band
  multiplies backoff, hard-ceiling-even-in-warmup, warm-up leniency, neighbor
  isolation, slow release.
- integration GATE (the required one): full-fleet reconnect right after restart
  throttles NO healthy device; a single device storming IS throttled; a neighbor
  stays unaffected while another storms.
- also fixes pre-existing test PORT collisions (my new integration files clashed
  with totp.test.js:3979 and totp-keyrotation.test.js:3980 -> moved to 3982/3983);
  full suite now green serially AND in parallel.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 19:18:00 -05:00
ScreenTinker ed3cf72b82 feat(#142): event-loop lag telemetry (perf_hooks) + bounded storage
Continuously samples event-loop delay via perf_hooks.monitorEventLoopDelay()
(C++-backed histogram; cheap). Each window persists mean/p50/p99/max to a new
event_loop_lag table and recomputes a coarse load band (normal/elevated/critical)
from the window p99. Standalone value: current lag is exposed on /api/status and
band changes are logged, so site lag is diagnosable independent of throttling.

The band feeds the #142 reconnect throttle (next commit) but ships first as its
own subsystem.

- event_loop_lag is bounded from day one: indexed on sampled_at + scheduled prune
  (LAG_TELEMETRY_RETENTION_DAYS, small default) modeled on the play_logs prune.
  Deliberately NOT another unbounded-growth table.
- Band transitions are asymmetric: jump up immediately (tighten fast), release one
  level at a time after N calm samples below a deadband (release slow, no flap).
  Pure nextBand() function, unit-tested deterministically.
- config: LAG_SAMPLE_INTERVAL_MS, LAG_RESOLUTION_MS, LAG_TELEMETRY_RETENTION_DAYS,
  LAG_PRUNE_INTERVAL_MS, LAG_ELEVATED_MS, LAG_CRITICAL_MS, LAG_RELEASE_SAMPLES.
- tests: band-transition unit tests; integration proves sampling persists, stays
  bounded under the prune, and surfaces on /api/status.
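
One way the asymmetric transition could look as a pure function. The signature, the state shape, the threshold numbers, and the definition of "calm" (p99 below a fraction of the current band's entry threshold) are assumptions for illustration; the commit only states the behavior, not the actual nextBand() code.

```javascript
const LEVELS = ['normal', 'elevated', 'critical'];

// state: { band, calmSamples }. Option values are illustrative, not defaults.
function nextBand(state, p99Ms, opts = { elevatedMs: 50, criticalMs: 200, releaseSamples: 3, deadband: 0.8 }) {
  const target = p99Ms >= opts.criticalMs ? 2 : p99Ms >= opts.elevatedMs ? 1 : 0;
  const current = LEVELS.indexOf(state.band);

  // Tighten fast: jump straight up to the higher band.
  if (target > current) return { band: LEVELS[target], calmSamples: 0 };
  if (current === 0) return { band: 'normal', calmSamples: 0 };

  // Release slow: a sample is calm only below the deadband under the
  // threshold that was crossed to enter the current band.
  const entry = current === 2 ? opts.criticalMs : opts.elevatedMs;
  if (p99Ms < entry * opts.deadband) {
    const calm = state.calmSamples + 1;
    if (calm >= opts.releaseSamples) return { band: LEVELS[current - 1], calmSamples: 0 };
    return { band: state.band, calmSamples: calm };
  }
  return { band: state.band, calmSamples: 0 }; // not calm: the streak resets
}
```

Because the function takes and returns plain state, a test can feed it a fixed sequence of p99 values and check each transition without timers.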

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 19:01:08 -05:00
ScreenTinker 674a34ba45 feat(config): HIDE_BILLING flag to hide the Subscription/billing UI (#116)
Opt-in, default-off UI gate (per strobe's spec; verified his file refs first).
When set, hides the Subscription sidebar item + billing view and bounces
#/billing to the dashboard. Billing shown by default -> existing deployments
unchanged. UI-only: /api/subscription/* untouched (internal usage reads stay).

- config.js: config.hideBilling from HIDE_BILLING (mirrors selfHosted).
- auth.js: surface hide_billing on GET /api/auth/me (client already fetches it
  at boot, stored on the user object).
- index.html: id="billingNavItem" on the Subscription <li> (mirrors adminNavItem).
- app.js: toggle billingNavItem in updateSidebarUser (next to the admin toggle);
  guard #/billing -> history.replaceState('#/') + render dashboard (replaceState
  so the back button doesn't loop into the guard).
- .env.example + README documented.

Spec assumptions verified against code: adminNavItem toggle pattern exists;
/me is fetched at boot and updateSidebarUser runs both at boot (cached user)
and post-/me, so no-flash holds on warm loads (one-time flash possible on the
first load after the flag flips — same as the admin nav, minor); route dispatch
is an if/else chain. Nav label is static (no data-i18n) so no i18n change.

Validated (headless Chrome, both states):
- flag unset -> Subscription tab present, #/billing renders (backward-compat).
- HIDE_BILLING=true -> tab hidden, #/billing redirects to #/.
- config maps HIDE_BILLING both ways; live /me default hide_billing=false.
- 149 server tests green. Default-off = zero change for existing deployments.

Known cosmetic (harmless): after the redirect the billing nav LINK keeps its
'active' class, but the nav item is display:none so it's never visible.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 09:19:24 -05:00
ScreenTinker 52b10408be chore(version): single-source VERSION, env-configurable data paths, bump tooling
- server/version.js: shared version helper that reads the root VERSION file once
  (fallback 0.0.0). Replaces the stale hardcoded 1.2.0 / 1.5.1 / 1.0.0 fallbacks
  in /api/version, /api/update/check, and /api/status.
- config.js: DATA_DIR / DB_PATH / UPLOADS_DIR / CERTS_DIR env overrides for the
  db, uploads, and certs/jwt-secret locations. Unset resolves to exactly the
  legacy in-repo paths, so existing installs (including production) are
  byte-for-byte unchanged. Guarded by test/config-paths.test.js.
- package.json: rename remote-display-server -> screentinker (+ lockfile name).
- scripts/bump-version.sh: one-shot bump across VERSION, package.json (+lock),
  android (versionName and versionCode + 1), and the tizen widget version; makes
  one commit plus an annotated tag; prints the push command, never pushes.
- .gitignore: global *.db / *.db-wal / *.db-shm / *.db.* so no database file
  (including .db.devbak backups, at any path) can be committed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 12:56:03 -05:00
ScreenTinker 54549420e7 feat(signup): optional org-on-create for self-service signups (#12)
MSP-style deployments want self-service signups created WITHOUT a personal
org, so an admin/operator can assign them into an existing customer org
afterward.

- config.autoCreateOrgOnSignup (AUTO_CREATE_ORG_ON_SIGNUP env), default
  true - single-tenant and the hosted self-service flow are unchanged.
- ensureDefaultOrgForUser gains { allowCreate }: an existing membership is
  always returned (idempotent); the MINT path is gated. allowCreate=false +
  no membership -> returns null (user created org-less).
- register accepts a per-request createOrg flag overriding the deployment
  default; the first-ever user is always given an org (never headless).
  login / Google / Microsoft pass allowCreate from the global config, so an
  org-less user is not silently given an org on next sign-in.
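
The gating of the mint path can be sketched like this, with an in-memory map standing in for the memberships table. The storage and the org-id counter are hypothetical; the function name and the `{ allowCreate }` option are from the message above.

```javascript
// In-memory stand-in for the real membership lookup (assumption).
const memberships = new Map(); // user_id -> org_id
let nextOrgId = 1;

function ensureDefaultOrgForUser(userId, { allowCreate = true } = {}) {
  // An existing membership is always returned, so the call is idempotent.
  if (memberships.has(userId)) return memberships.get(userId);
  // Only the mint path is gated: with allowCreate=false the user stays org-less.
  if (!allowCreate) return null;
  const orgId = nextOrgId++;
  memberships.set(userId, orgId);
  return orgId;
}
```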

Edge case: a non-platform user with zero workspaces now lands on a "no
workspaces yet" empty state (new no-workspace view) instead of being bounced
into onboarding (whose pairing step needs a workspace). route() redirects
them there, and refreshCurrentUser() redirects once /me reveals zero
accessible_workspaces (covers the first-load race). The workspace switcher
already rendered an empty placeholder and resource routes already return []
for a null workspace, so nothing crashes in between.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:16:27 -05:00
ScreenTinker 742d8c4b09 feat(socket): delivery queue for offline-device emits
Short-lived per-device queue covers the TV-flap window (issue #3):
when a device is mid-reconnect, prior code emitted to an empty room
and the event vanished. Now playlist-updates and commands targeting
an offline device are queued and flushed in order on the next
device:register for that device_id.

server/lib/command-queue.js (new):
- pendingPlaylistUpdate: per-device marker (rebuild via builder on
  flush -> always fresh DB state, no stale snapshots)
- pendingCommands: per-device Map<type, payload> with last-of-type
  dedup (most recent screen_off wins)
- TTL via COMMAND_QUEUE_TTL_MS env (default 30000)
- Active sweep every 30s prunes expired entries

Memory bounds: ~6 entries per device worst case (1 playlist marker
+ 5 command types), unref'd sweep timer.
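
A compact sketch of the queue shape described above, under stated assumptions: `createCommandQueue` and the method names are hypothetical, timestamps are passed in for testability, and the 30s sweep timer is omitted (expired entries are simply dropped at flush time here).

```javascript
function createCommandQueue({ ttlMs = 30000 } = {}) {
  const pendingPlaylistUpdate = new Map(); // device_id -> queuedAt (marker only)
  const pendingCommands = new Map();       // device_id -> Map<type, { payload, queuedAt }>

  return {
    queuePlaylistUpdate(deviceId, now = Date.now()) {
      pendingPlaylistUpdate.set(deviceId, now);
    },
    queueCommand(deviceId, type, payload, now = Date.now()) {
      if (!pendingCommands.has(deviceId)) pendingCommands.set(deviceId, new Map());
      // Last-of-type dedup: a newer command of the same type overwrites the older one.
      pendingCommands.get(deviceId).set(type, { payload, queuedAt: now });
    },
    // Returns the unexpired entries for a device and clears its queue.
    flush(deviceId, now = Date.now()) {
      const fresh = (queuedAt) => now - queuedAt < ttlMs;
      const marker = pendingPlaylistUpdate.get(deviceId);
      const cmds = pendingCommands.get(deviceId) || new Map();
      pendingPlaylistUpdate.delete(deviceId);
      pendingCommands.delete(deviceId);
      return {
        playlistUpdate: marker !== undefined && fresh(marker),
        commands: [...cmds]
          .filter(([, c]) => fresh(c.queuedAt))
          .map(([type, c]) => ({ type, payload: c.payload })),
      };
    },
  };
}
```

Storing only a marker for the playlist update means the caller rebuilds the payload at flush time, which matches the "always fresh DB state" note above.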

Wired emit sites (8 total; the four direct socket.emit calls in
deviceSocket register handlers are intentionally NOT queued because
the socket is alive by definition at those points):
- server/routes/video-walls.js   (pushWallPayloadToDevice)
- server/routes/device-groups.js (pushPlaylistToDevice)
- server/routes/content.js       (content-delete fan-out)
- server/routes/playlists.js     (pushToDevices + assign)
- server/services/scheduler.js   (scheduled rotations)
- server/ws/deviceSocket.js x2   (wall leader reclaim/reassign)

server/ws/deviceSocket.js register paths now call flushQueue after
heartbeat.registerConnection + socket.join. Existing
socket.emit('device:playlist-update', ...) lines kept - they send
the initial state on register; the flush replays any queued events.
Player's handlePlaylistUpdate fingerprint check dedupes the
overlap.

Refs #3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 13:06:43 -05:00
ScreenTinker 3da49ec79c chore(config): env-configurable heartbeat timing
Make HEARTBEAT_INTERVAL and HEARTBEAT_TIMEOUT env-tunable so
self-hosters with slow/jittery networks don't have to edit
config.js (issue #3 reporter did exactly this to confirm the
diagnosis). Defaults unchanged at 10000ms / 45000ms so existing
deployments keep current behavior.

Same parseInt(env) || default pattern as PORT/HTTPS_PORT/PING_*.
README env table extended.
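
For reference, the parseInt(env) || default pattern looks roughly like this. The helper `intFromEnv` is a hypothetical wrapper added for illustration; the env names and defaults are the ones stated above.

```javascript
// Hypothetical helper around the parseInt(env) || default pattern.
function intFromEnv(env, name, fallback) {
  // parseInt yields NaN for unset or non-numeric input, and NaN || fallback
  // picks the default. Note that an explicit 0 also falls back.
  return parseInt(env[name], 10) || fallback;
}

const heartbeat = {
  interval: intFromEnv(process.env, 'HEARTBEAT_INTERVAL', 10000), // ms
  timeout: intFromEnv(process.env, 'HEARTBEAT_TIMEOUT', 45000),   // ms
};
```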

Refs #3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 13:03:02 -05:00
ScreenTinker 1aee4f2d5b fix(socket): raise Engine.IO ping/pong + prefer WebSocket transport
Connection-stability layer for issue #3. LG webOS WebKit (and other
TV-grade clients) miss Engine.IO pongs under decode load with the
Socket.IO defaults of 25s ping / 20s timeout, causing spurious
transport drops and a connect/reconnect/evict/disconnect loop on
the device. Default polling-first transport adds another fragility
layer via the polling->WebSocket upgrade dance.

- pingInterval / pingTimeout default to 30000 / 30000 (worst-case
  dead-socket detection 60s, up from ~45s). Both env-configurable
  via PING_INTERVAL / PING_TIMEOUT.
- Player Socket.IO client: transports: ['websocket', 'polling'].
  Tries WebSocket first; falls back to polling on the same connect
  attempt if WebSocket fails. Polling fallback preserved for
  firewall-restricted networks.

App-level heartbeat checker is unchanged and remains the safety net
for clients that miss the transport-level ping/pong window.

Tradeoffs documented in inline comments. README env table extended
with PING_INTERVAL and PING_TIMEOUT rows.

Refs #3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 13:02:34 -05:00
ScreenTinker 3dfec5d2f9 feat(config): DISABLE_HOMEPAGE env var to redirect / to the app
Self-hosters running internal-only deployments don't need the
marketing homepage. With DISABLE_HOMEPAGE=true, requests to /
302-redirect to /app instead of serving the landing page.
Unset/false preserves current behavior.

Requested via Discord feedback.
2026-05-14 12:03:29 -05:00
ScreenTinker c71c4016ca feat(email): Microsoft Graph send + alert spam protection + preferences UI
Replaces the unused EMAIL_WEBHOOK_URL stub with a real Microsoft Graph
Mail.Send pipeline via @azure/msal-node client-credentials flow. Prior
state on prod: every alert email was logged to journalctl and never
sent (21 fallback log lines per hour for the chronic-offline devices).

Four coordinated changes shipped as one commit since they're all part
of making email delivery actually work responsibly:

1. services/email.js (NEW): Graph send via plain HTTPS (no SDK), in-memory
   MSAL token cache (refresh 60s pre-expiry), graceful stdout fallback
   when GRAPH_* env vars absent. Drop-in replacement for the old webhook.

2. services/alerts.js refactored: sequential await around sendEmail (was
   parallel fire-and-forget; first run hit Graph's MailboxConcurrency 429
   ApplicationThrottled on a 30-device backlog). Sequential at ~250ms per
   send takes 5-8s for the full backlog, well within the 60s tick. Also:
   24h long-offline cutoff to stop nagging about chronic-offline devices
   (the 20,000+ minute ones); 2-hour dedup window (was 1h) via a generic
   shouldSendAlert(type, id, windowMs) helper that future alert types
   (payment_failed, plan_limit_hit, etc.) can reuse.

3. Preferences UI: single checkbox in settings.js Account section bound
   to users.email_alerts. Saved via the existing Save Profile button. PUT
   /api/auth/me extended to accept email_alerts. requireAuth middleware
   SELECT now includes email_alerts so it propagates via req.user.

4. Dev safety net: GRAPH_DEV_RESTRICT_TO env var as an allow-list. When
   set, only listed recipients reach Graph; everyone else is suppressed
   with a log line. Prevents local dev (which often runs against fresh
   prod DB copies) from accidentally emailing real prod users. UNSET on
   prod systemd unit so production fans out normally.
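
A sketch of the dedup helper from item 2 and the allow-list from item 4. The `shouldSendAlert(type, id, windowMs)` signature is from the message; the in-memory map, the extra `now` parameter, `isRecipientAllowed`, and the comma-separated format of GRAPH_DEV_RESTRICT_TO are assumptions, and the real helper may well be backed by the database.

```javascript
const lastAlertAt = new Map(); // "type:id" -> ms timestamp (in-memory, assumed)

// Generic dedup: true at most once per (type, id) per window.
function shouldSendAlert(type, id, windowMs, now = Date.now()) {
  const key = `${type}:${id}`;
  const last = lastAlertAt.get(key);
  if (last !== undefined && now - last < windowMs) return false;
  lastAlertAt.set(key, now);
  return true;
}

// Dev allow-list: when restrictTo is unset every recipient passes;
// otherwise only listed addresses do. List format is an assumption.
function isRecipientAllowed(recipient, restrictTo) {
  if (!restrictTo) return true;
  const allowed = restrictTo.split(',').map((s) => s.trim().toLowerCase()).filter(Boolean);
  return allowed.includes(recipient.toLowerCase());
}
```

Usage under these assumptions: `shouldSendAlert('device_offline', deviceId, 2 * 60 * 60 * 1000)` gives the 2-hour window, and a future `payment_failed` alert would reuse the same helper with its own window.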

Also: package.json scripts use --env-file-if-exists=.env so local dev
picks up .env automatically (Node 20.6+ built-in, no dotenv dep). Prod
runs via systemd ExecStart and is unaffected. server/.gitignore added
to keep .env out of git.

Smoke verified end-to-end:
- Sequential send pattern verified (a prior parallel-send tick had hit
  Graph's MailboxConcurrency 429 on 30 simultaneous sends; sequential
  at ~250ms each completes the same backlog without throttling)
- 24h cutoff silenced 20/21 prod devices on the next tick
- Dev restrict suppressed the 1 within-24h send
- User-preference toggle flipped via UI -> DB -> the alert loop silently skipped
  that user before even reaching the suppression log
2026-05-12 18:16:40 -05:00
ScreenTinker 4392bb460a Add DISABLE_REGISTRATION env var to block public sign-ups
When DISABLE_REGISTRATION=true (or 1), POST /api/auth/register returns
403 with a clear error. OAuth endpoints (/google, /microsoft) also
refuse to auto-create new accounts — existing OAuth users can still
sign in. First-user setup (empty users table) is always allowed so a
fresh install can still be initialized.
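
The gate can be sketched as two small predicates. Both function names are hypothetical, and `userCount` stands in for whatever query the server uses to detect an empty users table.

```javascript
// True when DISABLE_REGISTRATION is "true" or "1", as described above.
function isRegistrationDisabled(env) {
  const v = env.DISABLE_REGISTRATION;
  return v === 'true' || v === '1';
}

// Whether a new account may be created; false would map to the 403 response.
function canCreateAccount(env, userCount) {
  if (userCount === 0) return true; // first-user setup is always allowed
  return !isRegistrationDisabled(env);
}
```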

GET /api/auth/config now returns registration_enabled so the login
view can hide the "Create Account" button and the trial banner when
registration is off. Absence of the flag is treated as enabled for
back-compat with older servers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 19:35:32 -05:00
ScreenTinker 1594a9d4a4 Initial open source release
ScreenTinker - open source digital signage management software.
MIT License, all features included, no license gates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-08 12:14:53 -05:00