Compare commits

22 commits

Author SHA1 Message Date
ScreenTinker d9fb914b9e chore(release): v1.9.2-beta1
2026-06-27 19:59:34 -05:00
ScreenTinker ce78d0dde4 docs(#142): 1.9.2-beta1 changelog + device_status_log VACUUM maintenance note
Documents the #142 changes and tells operators with an already-bloated
device_status_log to reclaim space with a one-time manual VACUUM in a maintenance
window (retention now bounds further growth). Explains why auto-VACUUM is not
enabled. New doc: docs/maintenance-device-status-log.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 19:59:17 -05:00
ScreenTinker f206537fed Merge #142 (reconnect-storm hardening) into main for 1.9.2-beta1
Brings the full #142 stack onto main on top of the 1.9.1 stable cut:
- device_status_log index + de-dupe
- event-loop lag telemetry (bounded)
- load-aware per-device reconnect throttle (the outage fix)
- global device_status_log retention sweep (STATUS_LOG_RETENTION_DAYS)
- content-ack dedup
- provisioning-row cleanup window 365d -> 24h
2026-06-27 19:56:46 -05:00
ScreenTinker 139d7d09fa fix(#142): provisioning-row cleanup window 365d -> 24h (matches its own comment)
services/heartbeat.js deleted unclaimed provisioning devices with
created_at < now - (365 * 86400) — a YEAR — while its own comment said "older
than 24 hours". So socket-register pairing junk lingered ~365x longer than
intended. Change the window to 24 * 3600 to match the comment.

Correctness fix only — does NOT touch the pre-auth register path or add a rate
limiter (that pre-auth hardening is a separate security issue, out of this cut).

Extracted the sweep into pruneProvisioningDevices() (still in heartbeat.js, called
from the same interval) so it is unit-testable. Test asserts a >24h unclaimed
provisioning row is swept while a <24h row, an imported row (user_id set), and a
non-provisioning row are kept.
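
A minimal sketch of the corrected window, assuming an in-memory list of rows rather than the real database query. `selectStaleProvisioning`, `PROVISIONING_TTL_SECONDS`, and the `status` field are hypothetical names used only for illustration; `created_at` and `user_id` come from the message above, and timestamps are assumed to be Unix seconds.

```javascript
// Hypothetical sketch, not the project's heartbeat.js. Names are assumptions.
const PROVISIONING_TTL_SECONDS = 24 * 3600; // 24 hours (the old value was 365 * 86400)

// Returns the rows a sweep would delete: unclaimed provisioning devices
// created before the cutoff. Rows with user_id set (imported) are kept.
function selectStaleProvisioning(devices, nowSeconds) {
  const cutoff = nowSeconds - PROVISIONING_TTL_SECONDS;
  return devices.filter(
    (d) => d.status === 'provisioning' && d.user_id == null && d.created_at < cutoff
  );
}
```

Keeping the selection as a pure function of `(rows, now)` mirrors the reason given above for extracting the sweep: it can be asserted on without a running interval.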

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 19:56:32 -05:00
ScreenTinker 852219cb45 chore(release): v1.9.1 2026-06-27 19:50:09 -05:00
ScreenTinker 15448d1c5d fix(#142): dedup repeated content-ack reports (secondary load)
device:content-ack logged + emitted every message, so a device repeatedly
reporting the same "content <id>: ready" (observed from an older app version)
added avoidable load per message.

- Suppress identical (device_id, content_id, status) reports within
  config.contentAckDedupMs (default 10s), modeled on the lastPlayLogAt throttle.
  A status change has a different key and passes immediately; a fresh report after
  the window passes too. In-memory, resets on restart. The handler does no DB
  writes, so this is purely shedding redundant log+emit work.

test: integration over a real authenticated device socket — a burst of identical
"ready" collapses to one log/emit, a "ready" after the window passes, and a status
change is never deduped. Unique PORT (3984).
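
The suppression rule can be sketched roughly as follows. This is an illustration under assumptions, not the handler's code: `createAckDeduper` and the key format are hypothetical, and only the `(device_id, content_id, status)` key and the 10s default come from the message above.

```javascript
// Hypothetical sketch: createAckDeduper and the key format are assumptions.
function createAckDeduper(windowMs = 10000) {
  const lastSeen = new Map(); // key -> timestamp (ms) of the last report that passed

  // Returns true when the report should be logged and emitted, false when suppressed.
  return function shouldProcess(deviceId, contentId, status, now = Date.now()) {
    const key = `${deviceId}|${contentId}|${status}`;
    const prev = lastSeen.get(key);
    if (prev !== undefined && now - prev < windowMs) return false;
    lastSeen.set(key, now);
    return true;
  };
}
```

Because the status is part of the key, a status change never collides with an earlier report. The sketch never evicts old keys; a long-running process would likely want to prune the map.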

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 19:35:04 -05:00
ScreenTinker 29a8896aa8 fix(#142): global device_status_log retention sweep + STATUS_LOG_RETENTION_DAYS
The per-device insert-time prune (deviceSocket.js) only ever touches a device
that is actively inserting, so it misses two paths: removed/idle devices whose
rows linger forever, and heartbeat.js's offline_timeout insert that bypasses
logDeviceStatus entirely. The reporter's 1.2M-row bloat accumulated UNDER a 7-day
per-device prune for exactly this reason.

- pruneStatusLog() (db/database.js): a GLOBAL time-range sweep across ALL devices,
  modeled on the play_logs prune. Run once on startup (recovers a bloated table
  right after deploy) and on the heartbeat interval (services/heartbeat.js).
- STATUS_LOG_RETENTION_DAYS env, default 3 (lower than the old hardcoded 7d; the
  dashboard only shows a 24h uptime window, so 2-3d is ample for diagnostics).
- Deliberately NO per-device row cap: Step 3's throttle already bounds how fast a
  storming device can generate status rows, so a cap would add sweep complexity
  for little gain (noted for later if needed).
- NO VACUUM / auto_vacuum here (kept off the hot path); space reclaim is left as a
  separate decision (see report).

test: deterministic in-process unit test proves the sweep deletes over-retention
rows across all devices — including a device absent from the devices table and an
offline_timeout row — while keeping recent rows; idempotent on an empty table.
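
As a rough illustration of why a global sweep catches what the per-device prune misses, the sketch below filters purely on time and never on device. It assumes in-memory rows with a `timestamp` in Unix seconds; `retentionCutoff` and `sweepStatusLog` are hypothetical names, and the real pruneStatusLog() runs against the database.

```javascript
// Hypothetical sketch; names and the seconds-based timestamp are assumptions.
function retentionCutoff(nowSeconds, retentionDays = 3) {
  return nowSeconds - retentionDays * 86400;
}

// Returns the rows that survive the sweep. There is deliberately no device_id
// filter, so rows for removed or idle devices are swept as well.
function sweepStatusLog(rows, nowSeconds, retentionDays = 3) {
  const cutoff = retentionCutoff(nowSeconds, retentionDays);
  return rows.filter((r) => r.timestamp >= cutoff);
}
```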

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 19:34:19 -05:00
ScreenTinker 101f086204 fix(#142): load-aware per-device reconnect throttle (the outage fix)
Gates genuine reconnects PER DEVICE before the heavy register work (DB writes +
playlist build) runs, so a single flapping device can no longer saturate the
event loop and take down the server.

- Actuator is per-device, keyed on device_id (modeled on lastPlayLogAt). A device
  is flagged only when it exceeds reconnectBaseMax genuine reconnects per window.
  Same-socket playlist refreshes (isPlaylistRefresh) are exempt.
- Load-awareness is BANDED (normal/elevated/critical from the step-2 lag signal),
  not a continuous controller. The band only MULTIPLIES an already-flagged
  device's backoff; global lag never gates a healthy device.
- Hysteresis: escalate immediately while storming (tighten fast); decay one level
  per reconnectReleaseMs of calm (release slow).
- HARD CEILING per device, independent of band and warm-up — a slow-ramp attacker
  can't train through it.
- COLD START: for reconnectWarmupMs after boot, force the normal band and apply
  only the hard ceiling, so a full-fleet reconnect after a deploy doesn't throttle
  healthy screens. State is in-memory, resets on restart.
- Observability: every throttle engagement logs device, band, observed vs allowed
  rate, and backoff. Throttled device gets device:throttled + a deferred disconnect.

Tests (api.test.js style):
- unit: healthy-never-throttled, storm-throttled-with-growing-backoff, band
  multiplies backoff, hard-ceiling-even-in-warmup, warm-up leniency, neighbor
  isolation, slow release.
- integration GATE (the required one): full-fleet reconnect right after restart
  throttles NO healthy device; a single device storming IS throttled; a neighbor
  stays unaffected while another storms.
- also fixes pre-existing test PORT collisions (my new integration files clashed
  with totp.test.js:3979 and totp-keyrotation.test.js:3980 -> moved to 3982/3983);
  full suite now green serially AND in parallel.
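
A simplified sketch of the core idea only: a per-device counter decides whether a device is flagged, and the load band merely scales the backoff of a device that is already flagged. It omits the hysteresis, hard ceiling, and warm-up described above. Every name and number here (`createReconnectThrottle`, `BAND_MULTIPLIER`, the window and backoff values) is an assumption, not taken from lib/reconnect-throttle.js.

```javascript
// Hypothetical sketch: all names and numbers are illustrative assumptions.
const BAND_MULTIPLIER = { normal: 1, elevated: 2, critical: 4 };

function createReconnectThrottle({ windowMs = 60000, baseMax = 5, baseBackoffMs = 1000 } = {}) {
  const state = new Map(); // device_id -> { windowStart, count }

  // Evaluates one genuine reconnect and returns { allowed, backoffMs }.
  return function check(deviceId, band, now) {
    let s = state.get(deviceId);
    if (!s || now - s.windowStart >= windowMs) {
      s = { windowStart: now, count: 0 };
      state.set(deviceId, s);
    }
    s.count += 1;
    // A device at or under its own limit is never gated, whatever the band.
    if (s.count <= baseMax) return { allowed: true, backoffMs: 0 };
    // Only an already-flagged device gets here; the band scales its backoff.
    return { allowed: false, backoffMs: baseBackoffMs * BAND_MULTIPLIER[band] };
  };
}
```

Keying the state on the device id is what gives neighbor isolation: one device exhausting its window has no effect on another device's counter.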

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 19:18:00 -05:00
ScreenTinker ed3cf72b82 feat(#142): event-loop lag telemetry (perf_hooks) + bounded storage
Continuously samples event-loop delay via perf_hooks.monitorEventLoopDelay()
(C++-backed histogram; cheap). Each window persists mean/p50/p99/max to a new
event_loop_lag table and recomputes a coarse load band (normal/elevated/critical)
from the window p99. Standalone value: current lag is exposed on /api/status and
band changes are logged, so site lag is diagnosable independent of throttling.

The band feeds the #142 reconnect throttle (next commit) but ships first as its
own subsystem.

- event_loop_lag is bounded from day one: indexed on sampled_at + scheduled prune
  (LAG_TELEMETRY_RETENTION_DAYS, small default) modeled on the play_logs prune.
  Deliberately NOT another unbounded-growth table.
- Band transitions are asymmetric: jump up immediately (tighten fast), release one
  level at a time after N calm samples below a deadband (release slow, no flap).
  Pure nextBand() function, unit-tested deterministically.
- config: LAG_SAMPLE_INTERVAL_MS, LAG_RESOLUTION_MS, LAG_TELEMETRY_RETENTION_DAYS,
  LAG_PRUNE_INTERVAL_MS, LAG_ELEVATED_MS, LAG_CRITICAL_MS, LAG_RELEASE_SAMPLES.
- tests: band-transition unit tests; integration proves sampling persists, stays
  bounded under the prune, and surfaces on /api/status.
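
For illustration, a sketch of the asymmetric band transition and of reading one window from the histogram. The thresholds, the state shape, and the names `bandFor`, `readWindowMs`, and `createLagHistogram` are assumptions; this `nextBand` is a simplified stand-in that leaves out the deadband. `monitorEventLoopDelay()` is the real `perf_hooks` API and reports values in nanoseconds.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');

const ORDER = ['normal', 'elevated', 'critical'];

// Classify a window p99 (ms) into a band. Thresholds are illustrative assumptions.
function bandFor(p99Ms, elevatedMs = 100, criticalMs = 500) {
  if (p99Ms >= criticalMs) return 'critical';
  if (p99Ms >= elevatedMs) return 'elevated';
  return 'normal';
}

// Asymmetric transition: jump up immediately, step down one level only after
// `releaseSamples` consecutive windows that classify below the current band.
function nextBand(state, p99Ms, releaseSamples = 3) {
  const cur = ORDER.indexOf(state.band);
  const tgt = ORDER.indexOf(bandFor(p99Ms));
  if (tgt > cur) return { band: ORDER[tgt], calm: 0 };
  if (tgt === cur) return { band: state.band, calm: 0 };
  const calm = state.calm + 1;
  if (calm >= releaseSamples) return { band: ORDER[cur - 1], calm: 0 };
  return { band: state.band, calm };
}

// Read one window from a histogram and reset it; nanoseconds become milliseconds.
function readWindowMs(h) {
  const ms = (ns) => ns / 1e6;
  const w = { mean: ms(h.mean), p50: ms(h.percentile(50)), p99: ms(h.percentile(99)), max: ms(h.max) };
  h.reset();
  return w;
}

// Create and enable a histogram; resolutionMs is the sampling resolution.
function createLagHistogram(resolutionMs = 20) {
  const h = monitorEventLoopDelay({ resolution: resolutionMs });
  h.enable();
  return h;
}
```

A sampler would call `readWindowMs(h)` once per interval and feed the window's `p99` into `nextBand`, starting from `{ band: 'normal', calm: 0 }`.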

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 19:01:08 -05:00
ScreenTinker d90cfb3986 fix(#142): index device_status_log + de-dupe its CREATE TABLE
The dashboard uptime query (WHERE device_id=? AND timestamp>?) and the
per-device retention prune (WHERE device_id=? AND timestamp<?) were both full
table scans. At 1M+ rows (the outage report) this was the dashboard-degradation
cause that persisted even after the reconnect storm stopped.

- schema.sql: add idx_device_status_log_device_ts(device_id, timestamp); both
  queries now SEARCH ... USING INDEX instead of SCAN (verified via EXPLAIN).
- database.js: same index as a migration for existing DBs (idempotent).
- schema.sql defined device_status_log twice; drop the duplicate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 18:54:57 -05:00
ScreenTinker f96b65576f chore(release): guard bump-version.sh against a diverged origin/main
Add a pre-push fast-forward check: fetch origin/main and abort if it has commits not in local HEAD, BEFORE the annotated tag is created. Prevents the beta9 incident where origin/main had advanced by one commit so 'git push origin main' was rejected, but the tag pushed anyway and fired release.yml from a commit not on main. Best-effort fetch — warns and proceeds when offline (the push stays the backstop).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 12:26:23 -05:00
ScreenTinker ed164647b8 Merge origin/main (Update SECURITY.md) into beta9 cut 2026-06-25 12:16:47 -05:00
ScreenTinker ae018b8eea chore(release): v1.9.1-beta9 2026-06-25 12:06:44 -05:00
ScreenTinker 071d7cc9c3 fix(server): persist per-item mute into the published snapshot (#129)
A mute toggle wrote the draft playlist_items + emitted a live device:mute-changed but only markDraft()'d — it never updated playlists.published_snapshot, the copy the device actually plays. So the device's item.muted stayed 0 and every loop/reload re-applied full volume: dashboard icon red but audio kept playing (Android; web's native <video> loop masked it). emitMuteChanged now surgically patches the matching item's muted (0/1) inside the published_snapshot and re-pushes the playlist, so loops re-apply the correct flag. Surgical patch (not publishPlaylist) so a mute toggle can't prematurely publish other draft edits or flip publish state. Adds a regression test that fails without the patch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 12:06:29 -05:00
screentinker 1e1ed7e29a
Update SECURITY.md
2026-06-24 12:09:25 -05:00
ScreenTinker 36c4bf523f chore(release): v1.9.1-beta8 2026-06-24 11:43:31 -05:00
ScreenTinker 16c381254b fix(android): lower minSdk 26 -> 24 to support Android 7.0/7.1 panels (#141)
Covers API 24 (7.0) + 25 (7.1.2); all 26+ APIs were already guarded with graceful else branches; no dependency bumps. Validated on API 24 + 25 emulators: install, foreground service, #139 OTA verify on the legacy GET_SIGNATURES path (incl. tampered-refuse), EncryptedSharedPreferences, and playback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 11:38:56 -05:00
Christopher Cookman 01e5b10f53
feat(setup): Debian 13 player/server install script (#137)
Community contribution from @ChrisChrome (tested on Debian 13 headless). Adds scripts/debian-13-setup.sh — server/player/both modes, systemd units, kiosk autologin, and management scripts (status/update/logs) — modeled on the Raspberry Pi setup. Also fixes Chromium fullscreen by detecting screen resolution at runtime (replacing --start-fullscreen), applied to both the Debian and Pi scripts, plus a README entry.

Maintainer review fix: the kiosk wait-loop now polls /api/status (the server's real readiness endpoint) instead of the non-existent /api/health, which had been silently burning the ~120s timeout on every all-in-one boot (bug inherited from the Pi script, fixed in both).
2026-06-23 23:47:22 -05:00
ScreenTinker 9c990ff91f chore(release): v1.9.1-beta7
2026-06-23 23:23:00 -05:00
ScreenTinker a6fe849c67 Merge fix/ota-redownload-loop (#140): stop OTA re-download loop on devices that can't silently install (#139) 2026-06-23 23:22:29 -05:00
ScreenTinker 0c0a8dd68a fix(ota): surface stuck OTA on dashboard + read APK signer correctly on API 28/29 (#139)
Follow-up to the cache/backoff loop fix (aa23cf0): make a device that can't
self-install visible to operators, and fix the signature-verify bug that kept the
whole #139 fix from engaging on the actual Fire OS target.

Dashboard surface (Phase 2):
- devices gains ota_status / ota_target_version / ota_attempts / ota_updated_at
  via the idempotent ALTER TABLE ADD COLUMN migration (non-destructive,
  default-backfilled, idempotent on re-run).
- The device reports ota_status (OtaThrottle.statusFor -> none | pending |
  manual_update_required) in device_info; the server persists it on register
  (the reconnect backstop). devices d.* already surfaces it to the dashboard.
- Dashboard shows a non-blocking amber badge when manual_update_required
  ("Update available (vX) - install failed N times, manual update required");
  i18n key in en.js (non-en inherits via the en fallback). Server suite +1 test.

Event-driven status (Option B):
- New device:ota-status WS message, emitted on STATE TRANSITIONS only
  (enter-backoff -> manual_update_required, clear -> none), so the badge updates
  promptly without waiting for a reconnect and without per-poll/heartbeat chatter.
  Server handler persists the same fields; an unknown/forged device_id is a safe
  no-op. The register-path persist stays as the reconnect backstop.

Signature-verify fix (the critical piece):
verifyApkSignature read the downloaded APK's signer via
getPackageArchiveInfo(GET_SIGNING_CERTIFICATES).signingInfo, but that field is
null for ARCHIVE files on API 28/29 (populated only from API 30). On Fire OS 8
(Android 9 / API 28) - the actual deployment target - this returned 0 certs from
a correctly-signed APK, so every OTA was refused as "tampered," the cache was
deleted, and the full APK re-downloaded every check cycle. This was the real
cause of the #139 re-download loop, NOT a silent-install failure: the cache and
backoff added in this branch sit behind this verify gate and never engaged on
the target.

Fix: below API 30, read the archive's signer via the legacy GET_SIGNATURES +
.signatures (its v1/JAR cert, which IS populated on 28/29). Keep
GET_SIGNING_CERTIFICATES + signingInfo for API >= 30 and for the installed-app
read (which works on 28+). The archive's signer is still extracted and compared
to the installed app's signer; a mismatch or zero-cert APK is still rejected.
This reads the cert correctly on old APIs - it does not weaken verification.

Verified on emulators:
- API 28: verify now passes for a legit APK (was: 0 certs, refused). Full backoff
  then engages - 8.5MB pulled once, cache-hit on retries, backoff after 3,
  manual_update_required emitted once; clears on successful update.
- API 28 negative: a re-signed (different-key) APK is still refused on cert
  MISMATCH - no hole opened.
- API 30: unchanged path still passes (no regression).
- server suite 173/173, OtaThrottleTest 7/7.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:49:01 -05:00
ScreenTinker aa23cf02dd fix(ota): stop OTA re-download loop on devices that cannot silently install (#139)
Devices that download an OTA APK but cannot silently install it (Fire TV: no
device-owner path) re-downloaded the full APK every check cycle indefinitely -
install never completes, version never advances, next check re-triggers.

Client (UpdateChecker.kt, ServerConfig.kt, OtaThrottle.kt):
- Reuse a cached, signature-verified APK instead of re-downloading every cycle;
  delete leftover invalid files; keep the verified APK on disk as the
  manual-install artifact.
- Persisted per-version attempt budget (EncryptedSharedPreferences) so it
  survives the Fire OS app restarts that drive the loop. An attempt is counted
  only when an install is launched - a download/verify failure does not consume
  the budget, so a transient network problem cannot park a healthy device in
  backoff. After 3 failed installs, back off to one retry per 24h.
- Clear OTA state and caches when a check returns update_available=false while
  state is pending (app relaunched as the new version).
- Report OTA status to the dashboard via device:log (tag ota) on state
  transitions only (enter-backoff, clear) to avoid flooding the channel.
- Extract throttle decision logic into a pure OtaThrottle object (no Android
  deps) with JUnit coverage (OtaThrottleTest) for the state transitions.

Server (server.js):
- Reword /download/apk log from "OTA update in progress" to "APK served" and
  rate-limit to once per IP / 10 min so a looping device cannot flood the log.

Note: client-cooperative fix - prevents the loop in cohorts running this APK.
Currently-stuck beta4 devices still require a one-time manual update.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 19:53:55 -05:00
40 changed files with 2078 additions and 57 deletions

@@ -1,5 +1,36 @@
# Changelog
## 1.9.2-beta1 — unreleased
### Fixed — server resilience (#142)
- **A single flapping device can no longer saturate the event loop.** A new
load-aware, per-device reconnect throttle (`lib/reconnect-throttle.js`) gates
genuine reconnects *before* the heavy register work (DB writes + playlist build).
The verdict is per-device; global event-loop lag only multiplies an
already-flagged device's backoff and never throttles a healthy one. Hard ceiling
+ cold-start warm-up so a full-fleet reconnect after a deploy is never throttled.
- **`device_status_log` growth is bounded.** Added
`idx_device_status_log_device_ts`, a global retention sweep (`pruneStatusLog`,
`STATUS_LOG_RETENTION_DAYS` default 3) covering removed/idle devices and the
`offline_timeout` path, and de-duplicated the table's `CREATE TABLE`.
- **`content-ack` spam de-duplicated.** Repeated identical
`(device_id, content_id, status)` reports are suppressed within
`CONTENT_ACK_DEDUP_MS` (default 10s).
- **Provisioning cleanup window corrected.** Unclaimed provisioning devices are now
swept after 24h (the code used `365 * 86400` — a year — contradicting its own
comment).
### Added — observability (#142)
- **Event-loop lag telemetry** via `perf_hooks.monitorEventLoopDelay()`. Sampled to
a bounded `event_loop_lag` table (indexed + pruned, `LAG_TELEMETRY_RETENTION_DAYS`)
and surfaced on `/api/status` as `loop_lag` (mean/p50/p99/max + band).
### Maintenance
- Operators whose `device_status_log` is already bloated from a pre-1.9.2 deployment
should reclaim disk with a **one-time manual `VACUUM`** in a maintenance window;
retention now bounds further growth. Auto-VACUUM is intentionally not enabled.
See [`docs/maintenance-device-status-log.md`](docs/maintenance-device-status-log.md).
## 1.9.1-beta3 — unreleased
### Fixed — Tizen player

@@ -426,6 +426,7 @@ keytool -genkey -v -keystore android/release-key.jks -keyalg RSA -keysize 2048 -
3. Install the ScreenTinker app on your device:
- **Android TV / tablets**: Download the APK from your instance (`/download/apk`) or build it from source (see above)
- **Raspberry Pi**: `curl -sSL https://your-instance/scripts/raspberry-pi-setup.sh | bash`
- **Debian 13 (headless)**: `curl -sSL https://your-instance/scripts/debian-13-setup.sh | sudo bash`
- **Windows**: Run the setup script from `scripts/windows-setup.bat`
- **Samsung Tizen TV / signage**: point the TV's URL Launcher (or browser) at `https://your-instance/player` - no signing needed. For an installed native app, see [tizen/README.md](tizen/README.md)
- **Any browser**: Open `https://your-instance/player` in kiosk/fullscreen mode

@@ -95,3 +95,28 @@ by name in release notes and (when applicable) in the GitHub advisory
itself. Let me know in your report whether you'd like credit and how
you'd like to be named. Anonymous reports are also welcome — no credit
is required.
## Uploaded content access model
Uploaded content (images, videos) served under /uploads/content is
**public by unguessable URL**, not access-controlled:
- Filenames are UUIDv4 (122 bits of randomness), so URLs are not enumerable
or guessable.
- There is no per-request authentication on content bytes, and CORS is open
(Access-Control-Allow-Origin: *) because the web player's canvas-based
screenshot capture requires cross-origin access.
- Anyone who obtains a content URL can read that file, cross-tenant, with no
expiry (immutable 30-day cache) and no revocation short of deleting the file.
This is an intentional design choice for digital signage, where content is
destined for public display. It is **security-through-unguessability, not
access control.**
**Do not upload content you require to remain confidential** - including
material that is destined for a screen but not yet public (e.g. a scheduled
promotion before its reveal, or an internal board containing names or other
sensitive details). Such content is world-readable from the moment of upload.
If pre-launch or tenant-private confidentiality is a requirement for your
deployment, open an issue - signed/expiring URLs are tracked but not yet
implemented.

@@ -1 +1 @@
-1.9.1-beta6
+1.9.2-beta1

@@ -9,10 +9,10 @@ android {
defaultConfig {
applicationId = "com.remotedisplay.player"
-minSdk = 26
+minSdk = 24
targetSdk = 34
-versionCode = 26
-versionName = "1.9.1-beta6"
+versionCode = 31
+versionName = "1.9.2-beta1"
}
signingConfigs {

@@ -240,6 +240,12 @@ class MainActivity : AppCompatActivity() {
// Start auto-update checker
updateChecker = UpdateChecker(this)
// #139: surface OTA status (applying / backing off / manual-update-required) to the
// dashboard. wsService is read lazily — it binds after this runs.
updateChecker.otaLogReporter = { level, msg -> wsService?.sendLog("ota", level, msg) }
// #139 Phase 2 (Option B): announce OTA status transitions (clear / enter-backoff) so the
// dashboard badge clears/lights up promptly without waiting for a reconnect.
updateChecker.otaStatusReporter = { wsService?.sendOtaStatus() }
updateChecker.startPeriodicCheck()
}

@@ -71,4 +71,37 @@ class ServerConfig(context: Context) {
fun clearPlaylistCache() {
prefs.edit().remove("cached_playlist").apply()
}
// #139 OTA attempt state. Persisted (not in-memory) on purpose: the OTA loop is driven
// by Fire OS restarting the app, which re-fires the update check; an in-memory counter
// would reset on every restart and never back off. `otaTargetVersion` is the version we
// are currently trying to install; `otaAttempts` counts install attempts for it;
// `otaLastAttemptAt` gates the post-cap retry backoff.
var otaTargetVersion: String
get() = prefs.getString("ota_target_version", "") ?: ""
set(value) = prefs.edit().putString("ota_target_version", value).apply()
var otaAttempts: Int
get() = prefs.getInt("ota_attempts", 0)
set(value) = prefs.edit().putInt("ota_attempts", value).apply()
var otaLastAttemptAt: Long
get() = prefs.getLong("ota_last_attempt_at", 0L)
set(value) = prefs.edit().putLong("ota_last_attempt_at", value).apply()
// #139: true once the "entering backoff" status has been reported for the current target,
// so the dashboard line fires on the transition only — not on every backed-off poll (Fire OS
// restarts re-fire the check constantly). Reset on a new target / on clear.
var otaBackoffReported: Boolean
get() = prefs.getBoolean("ota_backoff_reported", false)
set(value) = prefs.edit().putBoolean("ota_backoff_reported", value).apply()
fun clearOtaState() {
prefs.edit()
.remove("ota_target_version")
.remove("ota_attempts")
.remove("ota_last_attempt_at")
.remove("ota_backoff_reported")
.apply()
}
}

@@ -0,0 +1,74 @@
package com.remotedisplay.player.service
/**
* #139: pure OTA throttle decision logic with no Android dependencies, so it's unit-testable
* (see OtaThrottleTest). UpdateChecker is the imperative shell: it reads/writes the persisted
* fields (ServerConfig / EncryptedSharedPreferences) and performs the actual download + install;
* this object owns the stateful RULES so they have coverage beyond a compile:
*
* - a new target version resets the attempt budget,
* - a check NEVER consumes the budget; only a launched install does (so a transient
* download/network failure can't park a healthy device in backoff),
* - after MAX_INSTALL_ATTEMPTS failed installs, back off to one retry per BACKOFF_MS,
* - the "entering backoff" signal fires on the crossing only (report-on-transition).
*/
object OtaThrottle {
const val MAX_INSTALL_ATTEMPTS = 3
const val BACKOFF_MS = 24L * 60 * 60 * 1000
/** Persisted OTA state for the version we are currently trying to install. */
data class State(
val targetVersion: String = "",
val attempts: Int = 0,
val lastAttemptAt: Long = 0L,
val backoffReported: Boolean = false
)
enum class Action { ATTEMPT, BACKOFF }
/** True when [latestVersion] differs from the persisted target — caller drops stale APKs. */
fun isNewTarget(state: State, latestVersion: String): Boolean = state.targetVersion != latestVersion
/**
* A check found [latestVersion] available. Returns the state to persist (reset on a new
* target) and whether to attempt now. Does NOT count an attempt: the budget is consumed
* only once an install is actually launched (see [onInstallLaunched]).
*/
fun onUpdateAvailable(state: State, latestVersion: String, now: Long): Pair<State, Action> {
val s = if (isNewTarget(state, latestVersion)) State(targetVersion = latestVersion) else state
if (s.attempts >= MAX_INSTALL_ATTEMPTS && now - s.lastAttemptAt < BACKOFF_MS) {
return s to Action.BACKOFF
}
return s to Action.ATTEMPT
}
/**
* An install was actually launched (a verified APK was in hand). Consumes one attempt and
* returns the new state plus whether this attempt is the FIRST to cross the cap into backoff
* (true => caller reports "manual update required" once; false on all later polls).
*/
fun onInstallLaunched(state: State, now: Long): Pair<State, Boolean> {
val attempts = state.attempts + 1
var s = state.copy(attempts = attempts, lastAttemptAt = now)
val enteredBackoff = attempts >= MAX_INSTALL_ATTEMPTS && !s.backoffReported
if (enteredBackoff) s = s.copy(backoffReported = true)
return s to enteredBackoff
}
/** A check found us already on the latest. True if there was pending OTA state to clear. */
fun shouldClearOnUpToDate(state: State): Boolean = state.targetVersion.isNotEmpty()
/**
* #139 Phase 2: operator-facing status for the dashboard.
* - "none" : no update pending.
* - "manual_update_required" : capped AND still inside the backoff window; this device
* can't self-install, so a human needs to update it.
* - "pending" : an update is in progress / will retry (under the cap, or the
* window has elapsed so a retry is due).
*/
fun statusFor(state: State, now: Long): String = when {
state.targetVersion.isEmpty() -> "none"
state.attempts >= MAX_INSTALL_ATTEMPTS && now - state.lastAttemptAt < BACKOFF_MS -> "manual_update_required"
else -> "pending"
}
}

@@ -39,6 +39,25 @@ class UpdateChecker(private val context: Context) {
private var installReceiverRegistered = false
// #139: report OTA status to the dashboard (device:log, tag "ota"). Wired by MainActivity
// to WebSocketService.sendLog; null until then. Read lazily so binding order doesn't matter.
// The throttle thresholds + decision rules live in OtaThrottle (pure, unit-tested); this
// class is the imperative shell that persists state and does the download/install.
var otaLogReporter: ((level: String, message: String) -> Unit)? = null
private fun report(level: String, message: String) {
when (level) { "error" -> Log.e(TAG, message); "warn" -> Log.w(TAG, message); else -> Log.i(TAG, message) }
try { otaLogReporter?.invoke(level, message) } catch (_: Throwable) {}
}
// #139 Phase 2 (Option B): announce an OTA status TRANSITION to the server (wired by
// MainActivity to WebSocketService.sendOtaStatus, which reads the just-persisted state).
// Fired ONLY at the two transitions — clear and enter-backoff — so the dashboard badge
// updates promptly without waiting for a reconnect, with no per-poll/heartbeat chatter.
// Lazy/null-safe so binding order doesn't matter, same as otaLogReporter.
var otaStatusReporter: (() -> Unit)? = null
private fun announceOtaStatus() { try { otaStatusReporter?.invoke() } catch (_: Throwable) {} }
// The PackageInstaller session reports its status (incl. STATUS_PENDING_USER_ACTION,
// which Android 13+ returns for non-device-owner installers) via this broadcast.
// Without handling it the committed session just stalls and the update never
@@ -59,6 +78,8 @@ class UpdateChecker(private val context: Context) {
catch (e: Exception) { Log.e(TAG, "Confirm launch failed: ${e.message}") }
}
}
// Logcat only — NOT report(): these fire per attempt, and #139 keeps the
// device:log/dashboard channel to state transitions (enter-backoff, clear).
android.content.pm.PackageInstaller.STATUS_SUCCESS -> Log.i(TAG, "Update installed successfully")
else -> Log.w(TAG, "Install status: ${intent.getStringExtra(android.content.pm.PackageInstaller.EXTRA_STATUS_MESSAGE)}")
}
@@ -116,9 +137,17 @@ class UpdateChecker(private val context: Context) {
Log.i(TAG, "Current: $currentVersion, Latest: $latestVersion, Update: $updateAvailable")
-if (updateAvailable && downloadUrl.isNotEmpty()) {
-Log.i(TAG, "Update available! Downloading...")
-downloadAndInstall("${config.serverUrl}$downloadUrl", latestVersion)
if (!updateAvailable) {
// #139: on the latest version now. If OTA state was pending, the install
// landed (the app relaunched as the new version) — clear state + caches once.
if (OtaThrottle.shouldClearOnUpToDate(otaState())) {
report("info", "OTA complete: now on $currentVersion — clearing update state")
config.clearOtaState()
cleanupApks(null)
announceOtaStatus() // transition -> emits 'none' so the badge clears promptly
}
} else if (downloadUrl.isNotEmpty()) {
maybeUpdate(latestVersion, "${config.serverUrl}$downloadUrl")
}
} catch (e: Exception) {
Log.e(TAG, "Update check error: ${e.message}")
@@ -126,20 +155,89 @@ class UpdateChecker(private val context: Context) {
}.start()
}
-private fun downloadAndInstall(url: String, version: String) {
private fun otaState() = OtaThrottle.State(
config.otaTargetVersion, config.otaAttempts, config.otaLastAttemptAt, config.otaBackoffReported)
private fun persistOta(s: OtaThrottle.State) {
config.otaTargetVersion = s.targetVersion
config.otaAttempts = s.attempts
config.otaLastAttemptAt = s.lastAttemptAt
config.otaBackoffReported = s.backoffReported
}
// #139 imperative shell over OtaThrottle (the pure, unit-tested decision logic). A device
// that can't silently install (Fire TV: no device-owner) stops re-pulling the full APK every
// cycle. Only a COMMITTED install consumes the attempt budget — a transient download/verify
// failure on a HEALTHY device must never park it in backoff.
private fun maybeUpdate(latestVersion: String, downloadUrl: String) {
val now = System.currentTimeMillis()
val cur = otaState()
if (OtaThrottle.isNewTarget(cur, latestVersion)) cleanupApks(latestVersion)
val (afterCheck, action) = OtaThrottle.onUpdateAvailable(cur, latestVersion, now)
persistOta(afterCheck)
// Capped + still inside the window: do nothing AND stay silent. Fire OS restarts re-fire
// this check constantly; reporting here would just move the flood onto the WS channel.
// The enter-backoff line was already sent once on the crossing (below).
if (action == OtaThrottle.Action.BACKOFF) return
// download/verify failure → retry on the normal cadence; do NOT count it as an attempt.
if (!downloadAndInstall(downloadUrl, latestVersion)) {
Log.w(TAG, "Update $latestVersion: download/verify failed — retry next check (no attempt consumed)")
return
}
val (afterLaunch, enteredBackoff) = OtaThrottle.onInstallLaunched(afterCheck, now)
persistOta(afterLaunch)
Log.i(TAG, "Install launched for $latestVersion (attempt ${afterLaunch.attempts}/${OtaThrottle.MAX_INSTALL_ATTEMPTS})")
if (enteredBackoff) {
report("warn", "Update $latestVersion available but not installing after ${afterLaunch.attempts} attempts — manual update required (backing off to one retry per ${OtaThrottle.BACKOFF_MS / 3_600_000L}h)")
announceOtaStatus() // transition -> emits 'manual_update_required'
}
}
// #139: remove cached OTA APKs other than `keep` (null = remove all). Keeps the external
// files dir from accumulating one stale APK per superseded version.
private fun cleanupApks(keep: String?) {
try {
val dir = context.getExternalFilesDir(Environment.DIRECTORY_DOWNLOADS) ?: return
val keepName = keep?.let { "ScreenTinker-$it.apk" }
dir.listFiles { f ->
f.name.startsWith("ScreenTinker-") && f.name.endsWith(".apk") && f.name != keepName
}?.forEach { it.delete() }
} catch (e: Exception) {
Log.w(TAG, "APK cleanup failed: ${e.message}")
}
}
// Returns TRUE only when a verified APK is in hand and an install has been launched (the
// caller may then count an attempt); FALSE on any download/verify failure — the caller must
// NOT count those, so a transient network problem can't burn a healthy device's budget. #139
private fun downloadAndInstall(url: String, version: String): Boolean {
try {
val apkFile = File(context.getExternalFilesDir(Environment.DIRECTORY_DOWNLOADS),
"ScreenTinker-$version.apk")
// #139: reuse a previously-downloaded, verified APK for this version instead of
// re-pulling ~8.7 MB every cycle. The file also stays on disk as the artifact for a
// manual install when silent install isn't possible.
if (apkFile.exists() && verifyApkSignature(apkFile)) {
Log.i(TAG, "Reusing cached verified APK: ${apkFile.absolutePath} (${apkFile.length()} bytes)")
handler.post { installApk(apkFile) }
return true
}
// A leftover but invalid file (partial/corrupt/tampered) must never be reused.
if (apkFile.exists()) apkFile.delete()
// Download to a temp file
val request = Request.Builder().url(url).build()
val response = client.newCall(request).execute()
if (!response.isSuccessful) {
Log.e(TAG, "Download failed: ${response.code}")
return
return false
}
val apkFile = File(context.getExternalFilesDir(Environment.DIRECTORY_DOWNLOADS),
"ScreenTinker-$version.apk")
response.body?.byteStream()?.use { input ->
apkFile.outputStream().use { output ->
input.copyTo(output)
@@ -158,7 +256,7 @@ class UpdateChecker(private val context: Context) {
if (!verifyApkSignature(apkFile)) {
Log.e(TAG, "Refusing update: APK signature/package verification failed (tampered or MITM'd APK)")
apkFile.delete()
return
return false
}
Log.i(TAG, "APK signature verified against installed app - proceeding to install")
@@ -166,8 +264,10 @@ class UpdateChecker(private val context: Context) {
handler.post {
installApk(apkFile)
}
return true
} catch (e: Exception) {
Log.e(TAG, "Download/install error: ${e.message}")
return false
}
}
@@ -245,9 +345,18 @@ class UpdateChecker(private val context: Context) {
private fun verifyApkSignature(apkFile: File): Boolean {
return try {
val pm = context.packageManager
val flags = if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.P)
// #139: getPackageArchiveInfo(GET_SIGNING_CERTIFICATES).signingInfo is NULL for
// ARCHIVE files on API 28/29 (it's only populated from API 30) — so the modern flag
// reads 0 certs from a downloaded APK and we'd wrongly REFUSE a legitimate update,
// which is the real Fire OS 8 / Android 9 OTA-loop cause. Below API 30, read the
// archive's signer via the legacy GET_SIGNATURES + .signatures (its v1/JAR cert,
// which IS populated on 28/29). This reads the cert CORRECTLY — it does not weaken
// verification: the archive's signer is still extracted and compared to the installed
// app's signer below, and a mismatch / zero-cert APK is still rejected.
val archiveUsesSigningInfo = Build.VERSION.SDK_INT >= Build.VERSION_CODES.R // API 30
val archiveFlags = if (archiveUsesSigningInfo)
PackageManager.GET_SIGNING_CERTIFICATES else @Suppress("DEPRECATION") PackageManager.GET_SIGNATURES
val downloaded = pm.getPackageArchiveInfo(apkFile.absolutePath, flags)
val downloaded = pm.getPackageArchiveInfo(apkFile.absolutePath, archiveFlags)
if (downloaded == null) {
Log.e(TAG, "Could not parse downloaded APK")
return false
@@ -256,14 +365,20 @@ class UpdateChecker(private val context: Context) {
Log.e(TAG, "APK package mismatch: ${downloaded.packageName} != ${context.packageName}")
return false
}
val installed = pm.getPackageInfo(context.packageName, flags)
val downloadedSigs = signingCertHashes(downloaded)
val installedSigs = signingCertHashes(installed)
// INSTALLED-app read: signingInfo IS populated for installed packages on API 28+,
// so keep the modern flag there (this side already worked).
val installedUsesSigningInfo = Build.VERSION.SDK_INT >= Build.VERSION_CODES.P // API 28
val installedFlags = if (installedUsesSigningInfo)
PackageManager.GET_SIGNING_CERTIFICATES else @Suppress("DEPRECATION") PackageManager.GET_SIGNATURES
val installed = pm.getPackageInfo(context.packageName, installedFlags)
val downloadedSigs = signingCertHashes(downloaded, archiveUsesSigningInfo)
val installedSigs = signingCertHashes(installed, installedUsesSigningInfo)
if (downloadedSigs.isEmpty() || installedSigs.isEmpty()) {
Log.e(TAG, "Missing signing certificates (downloaded=${downloadedSigs.size}, installed=${installedSigs.size})")
return false
}
// Share at least one current signing certificate.
// Require a non-empty overlap of signer certs (handles multi-signer / cert-rotation
// the same way the API>=30 path does: compare the full current signer sets).
val match = downloadedSigs.any { it in installedSigs }
if (!match) Log.e(TAG, "APK signing certificate does not match installed app")
match
@@ -273,8 +388,13 @@ class UpdateChecker(private val context: Context) {
}
}
private fun signingCertHashes(info: PackageInfo): Set<String> {
val sigs: Array<Signature>? = if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.P) {
// Read the signer-cert SHA-256 set from a PackageInfo. `useSigningInfo` must match the flag
// it was fetched with: GET_SIGNING_CERTIFICATES -> signingInfo.apkContentsSigners (modern;
// multi-signer + rotation aware), GET_SIGNATURES -> legacy .signatures (the only field
// populated for ARCHIVE reads on API 28/29). Both yield the same cert for a normally-signed
// APK; the caller compares as sets so an overlapping signer still verifies.
private fun signingCertHashes(info: PackageInfo, useSigningInfo: Boolean): Set<String> {
val sigs: Array<Signature>? = if (useSigningInfo) {
info.signingInfo?.apkContentsSigners
} else {
@Suppress("DEPRECATION") info.signatures


@@ -560,6 +560,22 @@ class WebSocketService : Service() {
} catch (e: Throwable) { Log.w("WebSocketService", "sendLog: ${e.message}") }
}
// #139 Phase 2 (Option B): announce an OTA status transition to the server so the dashboard
// badge updates promptly (not only on reconnect). Reads the just-persisted throttle state —
// the emit always reflects the stored truth. Called by UpdateChecker at clear / enter-backoff.
fun sendOtaStatus() {
if (socket?.connected() != true) return
try {
val s = OtaThrottle.State(config.otaTargetVersion, config.otaAttempts, config.otaLastAttemptAt, config.otaBackoffReported)
socket?.emit("device:ota-status", JSONObject().apply {
put("device_id", config.deviceId)
put("ota_status", OtaThrottle.statusFor(s, System.currentTimeMillis()))
put("ota_target_version", config.otaTargetVersion)
put("ota_attempts", config.otaAttempts)
})
} catch (e: Throwable) { Log.w("WebSocketService", "sendOtaStatus: ${e.message}") }
}
fun sendPlaybackState(contentId: String, positionSec: Float) {
if (socket?.connected() != true) return
try {


@@ -13,6 +13,8 @@ import android.os.SystemClock
import android.provider.Settings
import android.util.DisplayMetrics
import android.view.WindowManager
import com.remotedisplay.player.data.ServerConfig
import com.remotedisplay.player.service.OtaThrottle
import java.security.MessageDigest
import org.json.JSONObject
@@ -49,6 +51,13 @@ class DeviceInfo(private val context: Context) {
put("screen_height", outH)
put("render_width", renW)
put("render_height", renH)
// #139 Phase 2: report OTA backoff state (alongside app_version) so the dashboard can
// flag screens stuck in manual-update-required. Read from the persisted throttle state.
val cfg = ServerConfig(context)
val ota = OtaThrottle.State(cfg.otaTargetVersion, cfg.otaAttempts, cfg.otaLastAttemptAt, cfg.otaBackoffReported)
put("ota_status", OtaThrottle.statusFor(ota, System.currentTimeMillis()))
put("ota_target_version", cfg.otaTargetVersion)
put("ota_attempts", cfg.otaAttempts)
}
}


@@ -0,0 +1,97 @@
package com.remotedisplay.player.service
import org.junit.Assert.assertEquals
import org.junit.Assert.assertFalse
import org.junit.Assert.assertTrue
import org.junit.Test
/**
* #139: coverage for the OTA throttle state machine (the stateful core that the OTA
* re-download-loop fix depends on), independent of Android. UpdateChecker is just the shell.
*/
class OtaThrottleTest {
private val V = "1.9.1-beta6"
private val MAX = OtaThrottle.MAX_INSTALL_ATTEMPTS
private val WINDOW = OtaThrottle.BACKOFF_MS
// Launch `n` installs from `start`, returning the resulting state.
private fun launch(start: OtaThrottle.State, n: Int, now: Long = 1000L): OtaThrottle.State {
var s = start
repeat(n) { s = OtaThrottle.onInstallLaunched(s, now + it).first }
return s
}
@Test fun newTargetResetsBudget() {
val stale = OtaThrottle.State(targetVersion = "1.9.1-beta5", attempts = 2, lastAttemptAt = 1000, backoffReported = true)
assertTrue(OtaThrottle.isNewTarget(stale, V))
val (s, action) = OtaThrottle.onUpdateAvailable(stale, V, now = 5000)
assertEquals(V, s.targetVersion)
assertEquals(0, s.attempts)
assertEquals(0L, s.lastAttemptAt)
assertFalse(s.backoffReported)
assertEquals(OtaThrottle.Action.ATTEMPT, action)
}
@Test fun aCheckNeverConsumesBudget_onlyInstallLaunchedDoes() {
var s = OtaThrottle.State(targetVersion = V, attempts = 0)
// Repeated checks (e.g. each followed by a failed download) must not advance the counter.
repeat(5) {
val (ns, action) = OtaThrottle.onUpdateAvailable(s, V, now = 100)
assertEquals(OtaThrottle.Action.ATTEMPT, action)
assertEquals(0, ns.attempts)
s = ns
}
// Only a launched install increments.
assertEquals(1, OtaThrottle.onInstallLaunched(s, now = 200).first.attempts)
}
@Test fun capThenBackoffWithinWindow() {
val s = launch(OtaThrottle.State(targetVersion = V), MAX, now = 1000L)
assertEquals(MAX, s.attempts)
assertTrue(s.backoffReported)
// A check inside the window → BACKOFF, no further attempt, state unchanged.
val (ns, action) = OtaThrottle.onUpdateAvailable(s, V, now = 1000L + WINDOW - 1)
assertEquals(OtaThrottle.Action.BACKOFF, action)
assertEquals(MAX, ns.attempts)
}
@Test fun enterBackoffSignalsExactlyOnce() {
var s = OtaThrottle.State(targetVersion = V)
var crossings = 0
repeat(MAX + 3) { i ->
val (ns, entered) = OtaThrottle.onInstallLaunched(s, now = i.toLong())
if (entered) crossings++
s = ns
}
assertEquals("enter-backoff fires only on the crossing", 1, crossings)
}
@Test fun retryAfterWindowElapsedDoesNotReReport() {
val capped = OtaThrottle.State(targetVersion = V, attempts = MAX, lastAttemptAt = 0L, backoffReported = true)
val (afterCheck, action) = OtaThrottle.onUpdateAvailable(capped, V, now = WINDOW + 1)
assertEquals(OtaThrottle.Action.ATTEMPT, action) // window elapsed → one retry allowed
val (_, entered) = OtaThrottle.onInstallLaunched(afterCheck, now = WINDOW + 2)
assertFalse("already reported entering backoff — must not report again", entered)
}
@Test fun clearsOnSuccessOnlyWhenPending() {
assertTrue(OtaThrottle.shouldClearOnUpToDate(OtaThrottle.State(targetVersion = V, attempts = 2)))
assertFalse(OtaThrottle.shouldClearOnUpToDate(OtaThrottle.State())) // nothing pending
}
@Test fun statusForReflectsBackoffWindow() {
val now = 10_000L
// no target → none
assertEquals("none", OtaThrottle.statusFor(OtaThrottle.State(), now))
// under the cap → pending
assertEquals("pending", OtaThrottle.statusFor(
OtaThrottle.State(targetVersion = V, attempts = 1, lastAttemptAt = now), now))
// capped AND inside the window → manual update required
assertEquals("manual_update_required", OtaThrottle.statusFor(
OtaThrottle.State(targetVersion = V, attempts = MAX, lastAttemptAt = now), now + WINDOW - 1))
// capped but window elapsed (a retry is due) → pending, not stuck
assertEquals("pending", OtaThrottle.statusFor(
OtaThrottle.State(targetVersion = V, attempts = MAX, lastAttemptAt = now), now + WINDOW + 1))
}
}


@@ -0,0 +1,44 @@
# Maintenance: `device_status_log` growth & space reclaim (#142)
## What changed in 1.9.2-beta1
`device_status_log` previously grew without an effective bound (the per-device
insert-time prune missed removed/idle devices and the heartbeat `offline_timeout`
insert). In one deployment it reached ~1.2M rows / ~119 MB over ~23 days and
degraded dashboard performance.
1.9.2-beta1 bounds further growth:
- **Index** `idx_device_status_log_device_ts(device_id, timestamp)` — the dashboard
uptime query and the prunes now use an index instead of a full scan.
- **Global retention sweep** (`pruneStatusLog()`), run on startup and on the
heartbeat interval, deletes rows older than **`STATUS_LOG_RETENTION_DAYS`**
(default **3**) across *all* devices — including removed/idle devices and the
`offline_timeout` rows the per-device prune never revisited.
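One way to confirm the sweep is keeping up after an upgrade is to count the rows that are already past the retention cutoff. The sketch below is an illustration only, not part of the release. It assumes the `timestamp` column stores Unix epoch seconds (this document does not state the unit), that the `sqlite3` CLI is installed, and that the database lives at the path used in the `VACUUM` command in this document. `retention_cutoff` and `count_expired_rows` are hypothetical helper names.

```sh
# Hypothetical helpers; the epoch-seconds unit for `timestamp` is an assumption.
DB="${DB:-/opt/screentinker/server/db/remote_display.db}"
RETENTION_DAYS="${STATUS_LOG_RETENTION_DAYS:-3}"   # default named in this document

# Pure helper: cutoff = now - days * 86400 (seconds per day).
retention_cutoff() {
  local now="$1" days="$2"
  echo $(( now - days * 86400 ))
}

# Read-only check: count rows older than the cutoff, i.e. rows the sweep should remove.
count_expired_rows() {
  local cutoff
  cutoff="$(retention_cutoff "$(date +%s)" "$RETENTION_DAYS")"
  sqlite3 "$1" "SELECT COUNT(*) FROM device_status_log WHERE timestamp < ${cutoff};"
}

# Usage: count_expired_rows "$DB"
```

A count that stays near zero between heartbeat intervals suggests the sweep is running; a count that keeps growing suggests it is not, or that the unit assumption above is wrong for your schema.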
## Reclaiming space on an already-bloated database
> **Operator action — only needed once, only if your `device_status_log` is already
> bloated from a pre-1.9.2 deployment.**
Retention bounds *future* growth, but SQLite does **not** return freed pages to the
filesystem on `DELETE` — the file stays at its high-water mark until a `VACUUM`.
After upgrading (which prunes the old rows), reclaim the disk with a **one-time
manual `VACUUM` in a maintenance window**:
```sh
# stop the server (or do this during a low-traffic window — VACUUM takes a global
# write lock and rewrites the whole DB file; the app cannot write during it)
sqlite3 /opt/screentinker/server/db/remote_display.db 'VACUUM;'
```
In the reference incident this took the DB from **119 MB → 39 MB**.
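Before scheduling the maintenance window, it may help to estimate how much a `VACUUM` would return. A minimal sketch, assuming the `sqlite3` CLI is available and the database path matches the command above: SQLite tracks freed pages on a freelist, so `PRAGMA freelist_count` times `PRAGMA page_size` approximates the reclaimable bytes. `reclaimable_mib` and `estimate_reclaim` are hypothetical helper names, and the figure is a lower-bound estimate because `VACUUM` also compacts partially filled pages.

```sh
# Hypothetical helpers for sizing a VACUUM before running it.
DB="${DB:-/opt/screentinker/server/db/remote_display.db}"

# Pure helper: free pages * page size, expressed in whole MiB (integer division).
reclaimable_mib() {
  local free_pages="$1" page_size="$2"
  echo $(( free_pages * page_size / 1048576 ))
}

# Read-only: queries the two PRAGMAs and prints the estimate for the given DB file.
estimate_reclaim() {
  local free_pages page_size
  free_pages="$(sqlite3 "$1" 'PRAGMA freelist_count;')"
  page_size="$(sqlite3 "$1" 'PRAGMA page_size;')"
  echo "about $(reclaimable_mib "$free_pages" "$page_size") MiB reclaimable in $1"
}

# Usage: estimate_reclaim "$DB"
```

If the estimate is small, the one-time `VACUUM` is probably not worth a maintenance window.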
### Why VACUUM is not automatic
`VACUUM` locks the database and rewrites the entire file — unacceptable on the hot
path. `PRAGMA auto_vacuum=INCREMENTAL` is **not** enabled either: it only takes
effect on a freshly-created database (set before the first table) or after a
one-time full `VACUUM` to convert an existing DB, so enabling it would be a no-op on
existing installs and a silent behavior change on new ones. Space reclaim is left as
a deliberate operator decision; ongoing growth is already bounded by retention.
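Operators who want to see which mode an existing database file is in can read `PRAGMA auto_vacuum`, which SQLite reports as an integer (0 = NONE, 1 = FULL, 2 = INCREMENTAL). The sketch below only inspects the setting and changes nothing; it assumes the `sqlite3` CLI is installed and reuses the database path from the `VACUUM` command in this document. `auto_vacuum_mode` and `show_auto_vacuum` are hypothetical helper names.

```sh
# Hypothetical helpers; inspection only, no setting is modified.
DB="${DB:-/opt/screentinker/server/db/remote_display.db}"

# Pure helper: map the integer returned by `PRAGMA auto_vacuum;` to its mode name.
auto_vacuum_mode() {
  case "$1" in
    0) echo "NONE" ;;
    1) echo "FULL" ;;
    2) echo "INCREMENTAL" ;;
    *) echo "UNKNOWN" ;;
  esac
}

# Read-only: report the mode stored in the given database file.
show_auto_vacuum() {
  auto_vacuum_mode "$(sqlite3 "$1" 'PRAGMA auto_vacuum;')"
}

# Usage: show_auto_vacuum "$DB"
```

On an install created without the pragma this is expected to report `NONE`, which is consistent with the note above that switching modes on an existing file requires a full `VACUUM`.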


@@ -6,6 +6,8 @@ export default {
'device.pl_item.orphan_zone_tip': "This item's zone isn't part of the device's current layout. It still plays (recovered into the largest zone), but reassign it to a zone in this layout.",
'dashboard.device_orphan_tip_one': "{n} item assigned to a zone that isn't in this device's layout — open the device to reassign",
'dashboard.device_orphan_tip_other': "{n} items assigned to a zone that isn't in this device's layout — open the device to reassign",
// #139: device stuck in OTA backoff (can't self-install — e.g. Fire TV) — needs a manual update.
'dashboard.device_ota_stuck': 'Update available (v{version}) — install failed {n}×, manual update required',
// Nav (sidebar)
'nav.displays': 'Displays',
'nav.content': 'Content',


@@ -117,6 +117,9 @@ function renderDeviceCard(device) {
<div class="device-card-name">${esc(device.name)}${device.orphan_count > 0 ? `
<span class="device-orphan-badge" title="${tn('dashboard.device_orphan_tip', device.orphan_count)}" style="margin-left:6px;display:inline-flex;align-items:center;gap:3px;font-size:11px;color:var(--danger);vertical-align:middle">
<svg width="12" height="12" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M10.29 3.86L1.82 18a2 2 0 0 0 1.71 3h16.94a2 2 0 0 0 1.71-3L13.71 3.86a2 2 0 0 0-3.42 0z"/><line x1="12" y1="9" x2="12" y2="13"/><line x1="12" y1="17" x2="12.01" y2="17"/></svg>${device.orphan_count}
</span>` : ''}${device.ota_status === 'manual_update_required' ? `
<span class="device-ota-badge" title="${esc(t('dashboard.device_ota_stuck', { version: device.ota_target_version || '?', n: device.ota_attempts || 0 }))}" style="margin-left:6px;display:inline-flex;align-items:center;gap:3px;font-size:11px;color:var(--warning);vertical-align:middle">
<svg width="12" height="12" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M21 15v4a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2v-4"/><polyline points="7 10 12 15 17 10"/><line x1="12" y1="15" x2="12" y2="3"/></svg>update
</span>` : ''}</div>
${device.owner_name || device.owner_email ? `<div style="font-size:11px;color:var(--text-muted);margin-bottom:4px">
<svg width="10" height="10" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" style="vertical-align:-1px">


@@ -17,6 +17,25 @@ if [ -n "$(git status --porcelain)" ]; then
exit 1
fi
# Pre-push fast-forward guard. This script creates an annotated tag locally; if
# origin/main has advanced past the commit we're bumping from, `git push origin main`
# is rejected as a non-fast-forward - and if the tag gets pushed anyway it fires the
# release workflow from a commit that isn't even on main (the beta9 divergence
# incident). Catch the divergence HERE, before the tag exists, so nothing can fire.
# Best-effort: when the fetch can't run (offline), warn and proceed rather than block
# a local bump - the push itself is still the backstop.
if git fetch --quiet origin main 2>/dev/null; then
if ! git merge-base --is-ancestor FETCH_HEAD HEAD; then
echo "ERROR: origin/main ($(git rev-parse --short FETCH_HEAD)) has commits not in your" >&2
echo " HEAD ($(git rev-parse --short HEAD)) - 'git push origin main' would be rejected." >&2
echo " Merge origin/main into your branch first, then re-run the bump." >&2
exit 1
fi
else
echo "WARNING: could not fetch origin/main - skipping the fast-forward check (offline?)." >&2
echo " Confirm 'git push origin main' will fast-forward before pushing the tag." >&2
fi
CURRENT="$(cat VERSION)"
IFS=. read -r MAJ MIN PAT <<< "$CURRENT"

scripts/debian-13-setup.sh Executable file

@@ -0,0 +1,549 @@
#!/bin/bash
# ScreenTinker - Debian 13 Setup Script
#
# Modes:
# - Server + Player (both)
# - Server only
# - Player only
#
# Usage:
# curl -sSL https://screentinker.com/scripts/debian-13-setup.sh | sudo bash
# curl -sSL https://screentinker.com/scripts/debian-13-setup.sh | sudo bash -s -- --server-only
# curl -sSL https://screentinker.com/scripts/debian-13-setup.sh | sudo bash -s -- --player-only https://screentinker.com
set -euo pipefail
# -- Configuration --
SCREENTINKER_DIR="/opt/screentinker"
SCREENTINKER_PORT=3001
NODE_MAJOR=20
LOG_FILE="/var/log/screentinker-debian-setup.log"
# -- Colors --
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'
log() { echo -e "${GREEN}[ScreenTinker]${NC} $1"; }
warn() { echo -e "${YELLOW}[WARNING]${NC} $1"; }
err() { echo -e "${RED}[ERROR]${NC} $1"; exit 1; }
MODE="both"
MODE_SET=false
SERVER_URL=""
while [[ $# -gt 0 ]]; do
case "$1" in
--server-only)
MODE="server"
MODE_SET=true
shift
;;
--player-only)
MODE="player"
MODE_SET=true
shift
if [[ $# -gt 0 && "$1" == http* ]]; then
SERVER_URL="$1"
shift
fi
;;
--both)
MODE="both"
MODE_SET=true
shift
;;
--help|-h)
echo "Usage: sudo ./debian-13-setup.sh [OPTIONS] [SERVER_URL]"
echo ""
echo "Options:"
echo " --server-only Install only the server"
echo " --player-only [URL] Install only the player (URL required)"
echo " --both Install both server and player (default)"
echo " --help Show this help"
echo ""
echo "Examples:"
echo " sudo ./debian-13-setup.sh"
echo " sudo ./debian-13-setup.sh --server-only"
echo " sudo ./debian-13-setup.sh --player-only https://screentinker.com"
exit 0
;;
http*)
SERVER_URL="$1"
shift
;;
*)
shift
;;
esac
done
if [ "$(id -u)" -ne 0 ]; then
err "This script must be run as root. Try: sudo bash debian-13-setup.sh"
fi
if [ -r /etc/os-release ]; then
. /etc/os-release
if [ "${ID:-}" != "debian" ] || [ "${VERSION_ID:-}" != "13" ]; then
warn "Detected ${PRETTY_NAME:-unknown}. This script targets Debian 13."
read -p "Continue anyway? (y/N) " -n 1 -r; echo
[[ ! $REPLY =~ ^[Yy]$ ]] && exit 1
else
log "Detected Debian 13"
fi
fi
if [ "$MODE" = "player" ] && [ -z "$SERVER_URL" ]; then
echo ""
echo -e "${BLUE}======================================${NC}"
echo -e "${BLUE} ScreenTinker Debian 13 Setup${NC}"
echo -e "${BLUE}======================================${NC}"
echo ""
read -p "Server URL (e.g., https://screentinker.com): " SERVER_URL
elif [ "$MODE" = "both" ] && [ "$MODE_SET" = false ] && [ -z "$SERVER_URL" ]; then
echo ""
echo -e "${BLUE}======================================${NC}"
echo -e "${BLUE} ScreenTinker Debian 13 Setup${NC}"
echo -e "${BLUE}======================================${NC}"
echo ""
echo " 1) Server + Player (recommended for single-screen host)"
echo " 2) Server Only"
echo " 3) Player Only"
echo ""
read -p "Choose [1/2/3]: " MODE_CHOICE
case "$MODE_CHOICE" in
2)
MODE="server"
;;
3)
MODE="player"
read -p "Server URL (e.g., https://screentinker.com): " SERVER_URL
;;
*)
MODE="both"
;;
esac
fi
SERVER_URL="${SERVER_URL%/}"
NEED_SERVER=false
NEED_PLAYER=false
case "$MODE" in
server)
NEED_SERVER=true
;;
player)
NEED_PLAYER=true
;;
both)
NEED_SERVER=true
NEED_PLAYER=true
;;
*)
err "Unknown mode: $MODE"
;;
esac
if [ "$NEED_PLAYER" = true ] && [ "$MODE" = "player" ] && [ -z "$SERVER_URL" ]; then
err "Player-only mode requires a server URL"
fi
if [ "$NEED_PLAYER" = true ]; then
if [ "$MODE" = "player" ]; then
KIOSK_URL="${SERVER_URL}/player"
else
KIOSK_URL="http://localhost:${SCREENTINKER_PORT}/player"
fi
fi
echo ""
log "Setup log: $LOG_FILE"
exec > >(tee -a "$LOG_FILE") 2>&1
log "Updating system packages..."
apt-get update -qq
apt-get upgrade -y -qq
log "Installing base dependencies..."
apt-get install -y -qq \
git curl wget unzip htop \
avahi-daemon \
fonts-liberation fonts-noto-color-emoji \
>> "$LOG_FILE" 2>&1
RUNTIME_USER="${SUDO_USER:-$(logname 2>/dev/null || echo root)}"
if ! id "$RUNTIME_USER" &>/dev/null; then
warn "Could not resolve invoking user; defaulting to root"
RUNTIME_USER="root"
fi
RUNTIME_HOME=$(eval echo "~$RUNTIME_USER")
if [ "$NEED_SERVER" = true ]; then
NEED_NODE=true
if command -v node &>/dev/null; then
CUR=$(node -v | cut -d'v' -f2 | cut -d'.' -f1)
if [ "$CUR" -ge "$NODE_MAJOR" ]; then
log "Node.js $(node -v) already installed"
NEED_NODE=false
fi
fi
if [ "$NEED_NODE" = true ]; then
log "Installing Node.js ${NODE_MAJOR}.x..."
curl -fsSL "https://deb.nodesource.com/setup_${NODE_MAJOR}.x" | bash - >> "$LOG_FILE" 2>&1
apt-get install -y -qq nodejs >> "$LOG_FILE" 2>&1
log "Node.js $(node -v) installed"
fi
if [ -d "$SCREENTINKER_DIR/.git" ]; then
log "Repo exists at $SCREENTINKER_DIR, pulling latest..."
cd "$SCREENTINKER_DIR" && git pull origin main >> "$LOG_FILE" 2>&1
else
log "Cloning ScreenTinker..."
git clone https://github.com/screentinker/screentinker.git "$SCREENTINKER_DIR" >> "$LOG_FILE" 2>&1
fi
log "Installing server dependencies..."
cd "$SCREENTINKER_DIR/server"
npm install --production >> "$LOG_FILE" 2>&1
mkdir -p "$SCREENTINKER_DIR/server/db"
mkdir -p "$SCREENTINKER_DIR/server/uploads"
chown -R "$RUNTIME_USER":"$RUNTIME_USER" "$SCREENTINKER_DIR"
log "Creating screentinker-server service..."
cat > /etc/systemd/system/screentinker-server.service << SERVICEEOF
[Unit]
Description=ScreenTinker Digital Signage Server
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=${RUNTIME_USER}
WorkingDirectory=${SCREENTINKER_DIR}/server
ExecStart=/usr/bin/node server.js
Restart=always
RestartSec=5
StartLimitBurst=5
StartLimitIntervalSec=60
Environment=NODE_ENV=production
Environment=PORT=${SCREENTINKER_PORT}
Environment=SELF_HOSTED=true
Environment=HOST=0.0.0.0
StandardOutput=journal
StandardError=journal
SyslogIdentifier=screentinker-server
[Install]
WantedBy=multi-user.target
SERVICEEOF
systemctl daemon-reload
systemctl enable screentinker-server.service
log "Server service enabled"
fi
if [ "$NEED_PLAYER" = true ]; then
log "Installing player packages..."
apt-get install -y -qq \
xserver-xorg xserver-xorg-legacy x11-xserver-utils xinit \
chromium unclutter xdotool \
>> "$LOG_FILE" 2>&1 || {
warn "Failed to install chromium package, trying chromium-browser..."
apt-get install -y -qq xserver-xorg xserver-xorg-legacy x11-xserver-utils xinit chromium-browser unclutter xdotool >> "$LOG_FILE" 2>&1
}
CHROMIUM_BIN=$(command -v chromium 2>/dev/null || command -v chromium-browser 2>/dev/null || echo "/usr/bin/chromium")
log "Allowing non-root X server startup..."
mkdir -p /etc/X11
cat > /etc/X11/Xwrapper.config << 'XWRAPEOF'
allowed_users=anybody
needs_root_rights=yes
XWRAPEOF
log "Creating kiosk launcher..."
cat > "$RUNTIME_HOME/screentinker-kiosk.sh" << KIOSKEOF
#!/bin/bash
KIOSK_URL="${KIOSK_URL}"
sleep 2
# Disable screen blanking and power management
xset s off
xset s noblank
xset -dpms
xset s 0 0
# Hide cursor after 3 seconds of inactivity
unclutter -idle 3 -root &
# Clean Chromium crash flags (prevents restore session dialogs)
CDIR="\$HOME/.config/chromium/Default"
mkdir -p "\$CDIR"
if [ -f "\$CDIR/Preferences" ]; then
sed -i 's/"exited_cleanly":false/"exited_cleanly":true/' "\$CDIR/Preferences" 2>/dev/null || true
sed -i 's/"exit_type":"Crashed"/"exit_type":"Normal"/' "\$CDIR/Preferences" 2>/dev/null || true
fi
# Wait for local server if running all-in-one
if echo "\$KIOSK_URL" | grep -q "localhost"; then
echo "Waiting for ScreenTinker server..."
for i in \$(seq 1 60); do
if curl -sf "http://localhost:${SCREENTINKER_PORT}/api/status" >/dev/null 2>&1; then
echo "Server ready after \${i}x2s"
break
fi
sleep 2
done
fi
# Detect screen resolution so Chromium fills the display on minimal X11 (no WM)
SCREEN_RES=\$(xrandr 2>/dev/null | grep ' connected' | grep -oE '[0-9]+x[0-9]+' | head -1)
SCREEN_W=\${SCREEN_RES%%x*}
SCREEN_H=\${SCREEN_RES##*x}
if [ -z "\$SCREEN_W" ] || [ -z "\$SCREEN_H" ]; then
SCREEN_W=1920
SCREEN_H=1080
fi
exec ${CHROMIUM_BIN} \\
--kiosk \\
--window-position=0,0 \\
--window-size=\${SCREEN_W},\${SCREEN_H} \\
--noerrdialogs \\
--disable-infobars \\
--disable-session-crashed-bubble \\
--disable-features=TranslateUI \\
--disable-component-update \\
--check-for-update-interval=31536000 \\
--autoplay-policy=no-user-gesture-required \\
--no-first-run \\
--disable-pinch \\
--overscroll-history-navigation=0 \\
--disable-translate \\
--disable-sync \\
--disable-background-networking \\
--disable-default-apps \\
--disable-extensions \\
--disable-hang-monitor \\
--disable-popup-blocking \\
--disable-prompt-on-repost \\
--metrics-recording-only \\
--safebrowsing-disable-auto-update \\
--ignore-certificate-errors \\
"\$KIOSK_URL"
KIOSKEOF
chmod +x "$RUNTIME_HOME/screentinker-kiosk.sh"
chown "$RUNTIME_USER":"$RUNTIME_USER" "$RUNTIME_HOME/screentinker-kiosk.sh"
cat > "$RUNTIME_HOME/.xinitrc" << 'XINITEOF'
#!/bin/bash
exec ~/screentinker-kiosk.sh
XINITEOF
chmod +x "$RUNTIME_HOME/.xinitrc"
chown "$RUNTIME_USER":"$RUNTIME_USER" "$RUNTIME_HOME/.xinitrc"
if [ "$NEED_SERVER" = true ]; then
KIOSK_AFTER="After=screentinker-server.service"
KIOSK_REQ="Requires=screentinker-server.service"
else
KIOSK_AFTER="After=network-online.target"
KIOSK_REQ="Wants=network-online.target"
fi
log "Creating kiosk service..."
cat > /etc/systemd/system/screentinker-kiosk.service << SERVICEEOF
[Unit]
Description=ScreenTinker Kiosk Display
${KIOSK_AFTER}
${KIOSK_REQ}
# Prevent conflicts with getty on tty1
Conflicts=getty@tty1.service
After=getty@tty1.service
[Service]
Type=simple
User=${RUNTIME_USER}
Environment=DISPLAY=:0
Environment=XAUTHORITY=${RUNTIME_HOME}/.Xauthority
# Remove stale X lock files from previous crashes before starting
ExecStartPre=/bin/bash -c 'rm -f /tmp/.X0-lock /tmp/.X11-unix/X0'
ExecStartPre=/bin/sleep 3
ExecStart=/usr/bin/startx ${RUNTIME_HOME}/.xinitrc -- :0 -nolisten tcp vt1
Restart=on-failure
RestartSec=10
StartLimitBurst=5
StartLimitIntervalSec=120
TTYPath=/dev/tty1
StandardInput=tty
StandardOutput=journal
StandardError=journal
SyslogIdentifier=screentinker-kiosk
[Install]
WantedBy=multi-user.target
SERVICEEOF
systemctl daemon-reload
systemctl enable screentinker-kiosk.service
log "Kiosk service enabled"
log "Configuring auto-login on tty1..."
mkdir -p /etc/systemd/system/getty@tty1.service.d
cat > /etc/systemd/system/getty@tty1.service.d/autologin.conf << AUTOLOGINEOF
[Service]
ExecStart=
ExecStart=-/sbin/agetty --autologin ${RUNTIME_USER} --noclear %I \$TERM
AUTOLOGINEOF
# Disable getty on tty1 so it doesn't conflict with the kiosk service
systemctl disable getty@tty1.service 2>/dev/null || true
systemctl mask getty@tty1.service 2>/dev/null || true
fi
if [ "$NEED_SERVER" = true ]; then
log "Creating management scripts..."
cat > /usr/local/bin/screentinker-update << 'UPDATEEOF'
#!/bin/bash
echo "Stopping services..."
sudo systemctl stop screentinker-kiosk.service 2>/dev/null || true
sudo systemctl stop screentinker-server.service 2>/dev/null || true
echo "Pulling latest..."
cd /opt/screentinker && git pull origin main
echo "Installing dependencies..."
cd server && npm install --production
echo "Starting services..."
sudo systemctl start screentinker-server.service
if systemctl list-unit-files | grep -q '^screentinker-kiosk.service'; then
sleep 3
sudo systemctl start screentinker-kiosk.service
fi
echo ""
echo "Done! Server: $(systemctl is-active screentinker-server.service)"
if systemctl list-unit-files | grep -q '^screentinker-kiosk.service'; then
echo " Kiosk: $(systemctl is-active screentinker-kiosk.service)"
fi
UPDATEEOF
chmod +x /usr/local/bin/screentinker-update
cat > /usr/local/bin/screentinker-status << 'STATUSEOF'
#!/bin/bash
echo ""
echo "=== ScreenTinker Status ==="
echo ""
IP=$(hostname -I | awk '{print $1}')
if systemctl is-active screentinker-server.service &>/dev/null; then
echo "Server: RUNNING (PID $(systemctl show screentinker-server.service -p MainPID --value))"
else
echo "Server: STOPPED"
fi
if systemctl list-unit-files | grep -q '^screentinker-kiosk.service'; then
if systemctl is-active screentinker-kiosk.service &>/dev/null; then
echo "Kiosk: RUNNING"
else
echo "Kiosk: STOPPED"
fi
fi
echo ""
echo "Uptime: $(uptime -p)"
echo "Disk: $(df -h /opt/screentinker 2>/dev/null | tail -1 | awk '{print $3 "/" $2 " (" $5 " used)"}')"
echo "Memory: $(free -h | awk '/Mem:/ {print $3 " / " $2}')"
echo ""
echo "Dashboard: http://${IP}:3001"
echo "Player: http://${IP}:3001/player"
echo "mDNS: http://$(hostname).local:3001"
echo ""
STATUSEOF
chmod +x /usr/local/bin/screentinker-status
cat > /usr/local/bin/screentinker-logs << 'LOGSEOF'
#!/bin/bash
case "${1:-server}" in
server) journalctl -u screentinker-server.service -f --no-hostname ;;
kiosk) journalctl -u screentinker-kiosk.service -f --no-hostname ;;
all) journalctl -u screentinker-server.service -u screentinker-kiosk.service -f --no-hostname ;;
*) echo "Usage: screentinker-logs [server|kiosk|all]" ;;
esac
LOGSEOF
chmod +x /usr/local/bin/screentinker-logs
fi
cat > /etc/motd << 'MOTDEOF'
____ _____ _
/ ___| ___ _ __ ___ ___ |_ _|_ _ __ | | _____ _ __
\___ \ / __| '__/ _ \/ _ \ | || | '_ \| |/ / _ \ '__|
___) | (__| | | __/ __/ | || | | | | < __/ |
|____/ \___|_| \___|\___| |_||_|_| |_|_|\_\___|_|
Open-Source Digital Signage for Any Screen
Commands:
screentinker-status Show system info and URLs
screentinker-update Pull latest and restart
screentinker-logs Follow logs (server|kiosk|all)
MOTDEOF
if grep -q "#RuntimeWatchdogSec=0" /etc/systemd/system.conf 2>/dev/null; then
sed -i 's/#RuntimeWatchdogSec=0/RuntimeWatchdogSec=10/' /etc/systemd/system.conf
log "Hardware watchdog enabled (10s)"
fi
# Disable console blanking so the screen stays on during boot
if [ -f /etc/default/grub ]; then
if ! grep -q "consoleblank=0" /etc/default/grub; then
sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 consoleblank=0"/' /etc/default/grub
update-grub >> "$LOG_FILE" 2>&1 && log "Console blanking disabled in GRUB" || warn "update-grub failed (non-fatal)"
fi
fi
echo ""
echo -e "${GREEN}======================================${NC}"
echo -e "${GREEN} ScreenTinker Setup Complete!${NC}"
echo -e "${GREEN}======================================${NC}"
echo ""
IP=$(hostname -I | awk '{print $1}')
if [ "$MODE" = "both" ]; then
echo "Mode: Server + Player"
echo "Dashboard: http://${IP}:${SCREENTINKER_PORT}"
echo "Player: http://${IP}:${SCREENTINKER_PORT}/player"
elif [ "$MODE" = "server" ]; then
echo "Mode: Server Only"
echo "Dashboard: http://${IP}:${SCREENTINKER_PORT}"
else
echo "Mode: Player Only"
echo "Server: $SERVER_URL"
fi
echo ""
echo "Services:"
if [ "$NEED_SERVER" = true ]; then
echo " sudo systemctl [start|stop|restart] screentinker-server"
fi
if [ "$NEED_PLAYER" = true ]; then
echo " sudo systemctl [start|stop|restart] screentinker-kiosk"
fi
echo ""
echo -e "${YELLOW}Reboot to start: sudo reboot${NC}"
echo ""


@ -280,7 +280,7 @@ fi
if echo "\$KIOSK_URL" | grep -q "localhost"; then
echo "Waiting for ScreenTinker server..."
for i in \$(seq 1 30); do
-if curl -sf "http://localhost:${SCREENTINKER_PORT}/api/health" >/dev/null 2>&1; then
+if curl -sf "http://localhost:${SCREENTINKER_PORT}/api/status" >/dev/null 2>&1; then
echo "Server ready"
break
fi
@ -288,8 +288,19 @@ if echo "\$KIOSK_URL" | grep -q "localhost"; then
done
fi
# Detect screen resolution so Chromium fills the display on minimal X11 (no WM)
SCREEN_RES=\$(xrandr 2>/dev/null | grep ' connected' | grep -oE '[0-9]+x[0-9]+' | head -1)
SCREEN_W=\${SCREEN_RES%%x*}
SCREEN_H=\${SCREEN_RES##*x}
if [ -z "\$SCREEN_W" ] || [ -z "\$SCREEN_H" ]; then
SCREEN_W=1920
SCREEN_H=1080
fi
exec ${CHROMIUM_BIN} \\
--kiosk \\
--window-position=0,0 \\
--window-size=\${SCREEN_W},\${SCREEN_H} \\
--noerrdialogs \\
--disable-infobars \\
--disable-session-crashed-bubble \\
@ -298,7 +309,6 @@ exec ${CHROMIUM_BIN} \\
--check-for-update-interval=31536000 \\
--autoplay-policy=no-user-gesture-required \\
--no-first-run \\
--start-fullscreen \\
--disable-pinch \\
--overscroll-history-navigation=0 \\
--disable-translate \\


@ -90,4 +90,63 @@ module.exports = {
// on MSP-style deployments where an admin/operator assigns users to existing
// orgs after signup instead.
autoCreateOrgOnSignup: !['false', '0'].includes(String(process.env.AUTO_CREATE_ORG_ON_SIGNUP || '').toLowerCase()),
// #142 event-loop lag telemetry (services/loop-lag.js). perf_hooks
// monitorEventLoopDelay is C++-backed, so continuous sampling is cheap. Each
// window's p99 is persisted to event_loop_lag (bounded: indexed + pruned from
// day one) and drives the banded load level the reconnect throttle reads.
lagSampleIntervalMs: parseInt(process.env.LAG_SAMPLE_INTERVAL_MS) || 1000,
lagResolutionMs: parseInt(process.env.LAG_RESOLUTION_MS) || 20,
lagTelemetryRetentionDays: parseFloat(process.env.LAG_TELEMETRY_RETENTION_DAYS) || 3,
lagPruneIntervalMs: parseInt(process.env.LAG_PRUNE_INTERVAL_MS) || 3600000,
// Banded load levels from the window p99 (ms). Asymmetric by design: a band is
// entered immediately when its up-threshold is crossed (tighten fast), but
// released only one step at a time after lagReleaseSamples consecutive samples
// fall below a deadband (release slow), so small fluctuations don't flap it.
// Bands ONLY scale how hard an already-flagged device is throttled; a healthy
// device is never gated by global lag.
lagElevatedMs: parseInt(process.env.LAG_ELEVATED_MS) || 100,
lagCriticalMs: parseInt(process.env.LAG_CRITICAL_MS) || 250,
lagReleaseSamples: parseInt(process.env.LAG_RELEASE_SAMPLES) || 5,
// #142 load-aware per-device reconnect throttle (lib/reconnect-throttle.js).
// The verdict of WHO is misbehaving is ALWAYS per-device (keyed on device_id):
// a device is flagged only when it exceeds reconnectBaseMax genuine reconnects
// per reconnectWindowMs. Global lag never flags a healthy device — the lag band
// only MULTIPLIES how hard an already-flagged device is backed off.
reconnectWindowMs: parseInt(process.env.RECONNECT_WINDOW_MS) || 10000,
reconnectBaseMax: parseInt(process.env.RECONNECT_BASE_MAX) || 5,
// Absolute per-device ceiling, independent of band AND of warm-up: no device may
// exceed this many reconnects/window no matter what the adaptive logic computes,
// so a slow-ramp attacker can't train its way through.
reconnectHardCeiling: parseInt(process.env.RECONNECT_HARD_CEILING) || 20,
// Server-enforced backoff for a flagged device: baseBackoff * 2^(level-1) * band
// multiplier, capped at maxBackoff. Level escalates while it keeps storming
// (tighten fast) and decays one step per reconnectReleaseMs of calm (release slow).
reconnectBaseBackoffMs: parseInt(process.env.RECONNECT_BASE_BACKOFF_MS) || 1000,
reconnectMaxBackoffMs: parseInt(process.env.RECONNECT_MAX_BACKOFF_MS) || 60000,
reconnectMaxLevel: parseInt(process.env.RECONNECT_MAX_LEVEL) || 10,
reconnectReleaseMs: parseInt(process.env.RECONNECT_RELEASE_MS) || 30000,
// Cold start: for this long after process start, lag is high while the whole
// fleet reconnects at once. Treat leniently — force the 'normal' band and apply
// only the hard ceiling (no rate-band throttle) so a deploy can't throttle
// healthy screens. Throttle state is in-memory and resets on restart.
reconnectWarmupMs: parseInt(process.env.RECONNECT_WARMUP_MS) || 30000,
reconnectBandElevatedMult: parseFloat(process.env.RECONNECT_BAND_ELEVATED_MULT) || 2,
reconnectBandCriticalMult: parseFloat(process.env.RECONNECT_BAND_CRITICAL_MULT) || 4,
// #142 device_status_log retention. A GLOBAL scheduled sweep (pruneStatusLog in
// db/database.js, run on startup + the heartbeat interval) deletes rows older
// than this across ALL devices — covering what the per-device insert-time prune
// in deviceSocket.js misses: removed/idle devices that never insert again, and
// the heartbeat.js offline_timeout insert that bypasses logDeviceStatus. Default
// is LOWER than the old hardcoded 7 days (the reporter's bloat happened under 7d);
// 2-3 days is plenty for the dashboard's 24h uptime view + diagnostics.
statusLogRetentionDays: parseFloat(process.env.STATUS_LOG_RETENTION_DAYS) || 3,
// #142 content-ack dedup window (deviceSocket.js). A device (esp. older apps)
// can spam "content <id>: ready" for the same item; suppress identical
// (device_id, content_id, status) reports within this window. A status CHANGE
// has a different key and passes immediately. In-memory; resets on restart.
contentAckDedupMs: parseInt(process.env.CONTENT_ACK_DEDUP_MS) || 10000,
};
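
Review note: the backoff comment above (baseBackoff * 2^(level-1) * band multiplier, capped at maxBackoff) is easier to sanity-check with numbers. The sketch below tabulates it for the default values shown in this hunk. `backoffMs` and `DEFAULTS` are illustrative names, not part of the codebase, and the real computation lives in lib/reconnect-throttle.js.

```javascript
// Hypothetical helper mirroring the formula in the config comment, using the
// defaults from this hunk: 1000 ms base, 60000 ms cap, x2 / x4 band multipliers.
const DEFAULTS = {
  baseBackoffMs: 1000,
  maxBackoffMs: 60000,
  bandMult: { normal: 1, elevated: 2, critical: 4 },
};

function backoffMs(level, band, cfg = DEFAULTS) {
  // Exponential in the escalation level, scaled by the load band, then capped.
  const raw = cfg.baseBackoffMs * Math.pow(2, level - 1) * cfg.bandMult[band];
  return Math.min(raw, cfg.maxBackoffMs);
}

// Print one row per band for escalation levels 1..7.
for (const band of ['normal', 'elevated', 'critical']) {
  const row = [1, 2, 3, 4, 5, 6, 7].map((lvl) => backoffMs(lvl, band));
  console.log(band.padEnd(8), row.join(' '));
}
```

With these defaults the cap is reached at level 7 in the normal band, level 6 when elevated, and level 5 when critical. Levels above that (up to the default reconnectMaxLevel of 10) do not lengthen the backoff further; they only mean more release steps before the device is back at level 0.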


@ -216,6 +216,24 @@ const migrations = [
// signal, so the two differ — surfacing both explains "reports 720 but monitor sees 1080".
"ALTER TABLE devices ADD COLUMN render_width INTEGER",
"ALTER TABLE devices ADD COLUMN render_height INTEGER",
// #139 Phase 2: device-reported OTA backoff status, so the dashboard can flag screens that
// can't self-install (Fire TV: no device-owner path) and need a hands-on update. ADD COLUMN
// with defaults is non-destructive in SQLite, and the apply loop below swallows "duplicate
// column" — so this is idempotent and upgrades an existing populated db without data loss.
// ota_updated_at = server receipt time (s), stamped on each register persist.
"ALTER TABLE devices ADD COLUMN ota_status TEXT DEFAULT 'none'",
"ALTER TABLE devices ADD COLUMN ota_target_version TEXT",
"ALTER TABLE devices ADD COLUMN ota_attempts INTEGER DEFAULT 0",
"ALTER TABLE devices ADD COLUMN ota_updated_at INTEGER",
// #142: index device_status_log for the per-device + time-window access pattern.
// schema.sql creates this on fresh installs; this migration covers existing DBs.
// Both the dashboard uptime query and the retention prune were full scans — the
// dashboard-degradation cause once the table reached 1M+ rows.
"CREATE INDEX IF NOT EXISTS idx_device_status_log_device_ts ON device_status_log(device_id, timestamp)",
// #142: event-loop lag telemetry table (bounded: indexed + scheduled prune).
// schema.sql creates these on fresh installs; this covers existing DBs.
"CREATE TABLE IF NOT EXISTS event_loop_lag (id INTEGER PRIMARY KEY AUTOINCREMENT, sampled_at INTEGER NOT NULL DEFAULT (strftime('%s','now')), mean_ms REAL NOT NULL, p50_ms REAL NOT NULL, p99_ms REAL NOT NULL, max_ms REAL NOT NULL, band TEXT NOT NULL DEFAULT 'normal')",
"CREATE INDEX IF NOT EXISTS idx_event_loop_lag_sampled ON event_loop_lag(sampled_at)",
];
// Apply each ALTER idempotently. A "duplicate column name" / "already exists"
// error means the column is already present (expected on a migrated DB) - benign.
@ -732,6 +750,21 @@ const { applyTenantDeleteCascade } = require('../lib/tenant-cascade-migration');
}
})();
// #142 GLOBAL device_status_log retention sweep across ALL devices. Run on startup
// and on the heartbeat interval (services/heartbeat.js). This covers the rows the
// per-device insert-time prune in deviceSocket.js misses: removed/idle devices that
// never insert again, and the heartbeat offline_timeout insert that bypasses
// logDeviceStatus. A plain time-range delete (like the play_logs prune) — runs off
// the hot path; after the first sweep the table is small, so the cost is negligible.
function pruneStatusLog() {
try {
const maxAgeSec = Math.round(config.statusLogRetentionDays * 86400);
const n = db.prepare("DELETE FROM device_status_log WHERE timestamp < strftime('%s','now') - ?").run(maxAgeSec).changes;
if (n > 0) console.log(`[status-log] pruned ${n} row(s) older than ${config.statusLogRetentionDays}d`);
return n;
} catch (_) { return 0; }
}
// Prune old telemetry (keep last 24h worth at 15s intervals = ~5760, cap at 6000)
function pruneTelemetry(deviceId) {
db.prepare(`
@ -804,4 +837,4 @@ try {
const { verifyAndRepairSchema } = require('../lib/schema-check');
verifyAndRepairSchema(db);
-module.exports = { db, pruneTelemetry, pruneScreenshots };
+module.exports = { db, pruneTelemetry, pruneScreenshots, pruneStatusLog };
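
Review note: pruneStatusLog converts statusLogRetentionDays to seconds with Math.round(days * 86400), and the config parses the env var as `parseFloat(...) || 3`. The sketch below applies that same expression to a few candidate values to show what cutoff each produces. `retentionSeconds` is a hypothetical name used only for illustration.

```javascript
// Illustrative only: the parse-then-default expression used for
// statusLogRetentionDays, followed by the days -> seconds conversion
// that pruneStatusLog performs.
function retentionSeconds(envValue) {
  // 0, '', undefined and non-numeric input are all falsy/NaN, so they fall back to 3.
  const days = parseFloat(envValue) || 3;
  return Math.round(days * 86400);
}

for (const v of [undefined, '3', '0.5', '0', 'abc']) {
  console.log(String(v).padEnd(9), retentionSeconds(v));
}
```

Fractional days work as expected ('0.5' gives 43200 seconds). One consequence of the `|| 3` idiom worth knowing: STATUS_LOG_RETENTION_DAYS=0 is not treated as "keep nothing" or "disable"; it silently falls back to the 3-day default.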


@ -463,6 +463,27 @@ CREATE TABLE IF NOT EXISTS device_status_log (
status TEXT NOT NULL,
timestamp INTEGER NOT NULL DEFAULT (strftime('%s','now'))
);
-- #142: index the per-device + time-window access pattern. Both the dashboard
-- uptime query (WHERE device_id=? AND timestamp>?) and the retention prune
-- (WHERE device_id=? AND timestamp<?) were full table scans; at 1M+ rows that
-- was the dashboard-degradation cause in the outage report.
CREATE INDEX IF NOT EXISTS idx_device_status_log_device_ts ON device_status_log(device_id, timestamp);
-- ===================== EVENT LOOP LAG (#142) =====================
-- Event-loop delay telemetry from perf_hooks.monitorEventLoopDelay(). Bounded
-- from day one: indexed on sampled_at and pruned on a schedule (see
-- services/loop-lag.js, LAG_TELEMETRY_RETENTION_DAYS) so it can never become a
-- second unbounded-growth table.
CREATE TABLE IF NOT EXISTS event_loop_lag (
id INTEGER PRIMARY KEY AUTOINCREMENT,
sampled_at INTEGER NOT NULL DEFAULT (strftime('%s','now')),
mean_ms REAL NOT NULL,
p50_ms REAL NOT NULL,
p99_ms REAL NOT NULL,
max_ms REAL NOT NULL,
band TEXT NOT NULL DEFAULT 'normal'
);
CREATE INDEX IF NOT EXISTS idx_event_loop_lag_sampled ON event_loop_lag(sampled_at);
-- ===================== DEVICE FINGERPRINTS =====================
@ -484,13 +505,6 @@ CREATE TABLE IF NOT EXISTS alert_configs (
created_at INTEGER NOT NULL DEFAULT (strftime('%s','now'))
);
-CREATE TABLE IF NOT EXISTS device_status_log (
-id INTEGER PRIMARY KEY AUTOINCREMENT,
-device_id TEXT NOT NULL,
-status TEXT NOT NULL,
-timestamp INTEGER NOT NULL DEFAULT (strftime('%s','now'))
-);
-- ===================== PLAYER DEBUG LOGS =====================
-- Smart TVs (Tizen, WebOS, Fire TV, etc.) have no accessible devtools. The
-- player captures errors into window.__debugLog client-side and POSTs them


@ -0,0 +1,98 @@
// #142 step 3 — load-aware per-device reconnect throttle (the outage fix).
//
// A single device stuck in a tight websocket reconnect loop can flood the server
// with full register cycles (DB writes + playlist build) and saturate the event
// loop. This module gates genuine reconnects PER DEVICE, before that heavy work
// runs in deviceSocket.js.
//
// Design (mirrors the issue's suggested mitigation + the lastPlayLogAt pattern):
// - WHO is always per-device: a device is "flagged" only when it exceeds
// reconnectBaseMax genuine reconnects within reconnectWindowMs. Global lag
// NEVER flags a healthy device.
// - Load-awareness is BANDED (normal/elevated/critical from services/loop-lag),
// not a continuous controller — deterministic and testable. The band only
// MULTIPLIES the backoff applied to an ALREADY-flagged device.
// - Hysteresis: escalate immediately while storming (tighten fast); decay the
// escalation level one step per reconnectReleaseMs of calm (release slow).
// - HARD CEILING: independent of band and of warm-up, no device may exceed
// reconnectHardCeiling/window — a slow-ramp attacker can't train through it.
// - COLD START: for reconnectWarmupMs after process start, force the 'normal'
// band and apply only the hard ceiling, so a full-fleet reconnect right after
// a deploy doesn't throttle healthy screens.
// - State is in-memory (resets on restart), like pair-lockout / totp-lockout.
const config = require('../config');
const loopLag = require('../services/loop-lag');
// deviceId -> { hits: number[], level: number, blockedUntil: ms, lastThrottleAt: ms }
const state = new Map();
let startedAt = Date.now();
function bandMultiplier(band) {
if (band === 'critical') return config.reconnectBandCriticalMult;
if (band === 'elevated') return config.reconnectBandElevatedMult;
return 1;
}
function reject(s, now, band, reason, observed, allowed) {
s.level = Math.min(s.level + 1, config.reconnectMaxLevel);
const backoff = Math.min(
config.reconnectBaseBackoffMs * Math.pow(2, s.level - 1) * bandMultiplier(band),
config.reconnectMaxBackoffMs
);
s.blockedUntil = now + backoff;
s.lastThrottleAt = now;
return { allow: false, retryAfterMs: backoff, reason, observed, allowed, band, level: s.level };
}
// Decide whether to allow a genuine reconnect for `deviceId`.
// `now` and `bandOverride` are injectable for deterministic tests; production
// passes only deviceId.
function check(deviceId, now = Date.now(), bandOverride = null) {
const warmup = (now - startedAt) < config.reconnectWarmupMs;
const band = bandOverride !== null ? bandOverride : (warmup ? 'normal' : loopLag.getBand());
let s = state.get(deviceId);
if (!s) { s = { hits: [], level: 0, blockedUntil: 0, lastThrottleAt: 0 }; state.set(deviceId, s); }
// Already inside an enforced backoff window: reject and escalate (tighten fast).
if (now < s.blockedUntil) {
return reject(s, now, band, 'in-backoff', s.hits.length, config.reconnectBaseMax);
}
// Sliding window of genuine reconnects.
s.hits = s.hits.filter((t) => now - t < config.reconnectWindowMs);
s.hits.push(now);
const observed = s.hits.length;
// Hard ceiling — always enforced, regardless of band or warm-up.
if (observed > config.reconnectHardCeiling) {
return reject(s, now, band, 'hard-ceiling', observed, config.reconnectHardCeiling);
}
// Cold start: only the hard ceiling applies; never rate-throttle during warm-up.
if (warmup) return allow(s, now, band);
// Healthy device: under the per-device threshold -> always allowed.
if (observed <= config.reconnectBaseMax) return allow(s, now, band);
// Flagged: storming beyond the per-device threshold -> throttle (band-scaled).
return reject(s, now, band, 'rate', observed, config.reconnectBaseMax);
}
function allow(s, now, band) {
// Release slow: decay one escalation level per reconnectReleaseMs of calm.
if (s.level > 0 && now - s.lastThrottleAt > config.reconnectReleaseMs) {
s.level = Math.max(0, s.level - 1);
s.lastThrottleAt = now;
}
return { allow: true, band, level: s.level };
}
// Test-only: clear state and optionally rewind the warm-up origin.
function __resetForTest(opts = {}) {
state.clear();
if (opts.startedAt !== undefined) startedAt = opts.startedAt;
}
module.exports = { check, __resetForTest };
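
Review note: the decision order in check() (hard ceiling, then warm-up leniency, then the per-device threshold) can be exercised in isolation. The sketch below is a simplified, stateless restatement using the default window (10000 ms), threshold (5) and ceiling (20). It leaves out backoff, escalation and the lag band on purpose, and `classify`, `run` and `CFG` are illustrative names rather than anything in the module.

```javascript
// Simplified restatement of the verdict order in check(); not the real module.
const CFG = { windowMs: 10000, baseMax: 5, hardCeiling: 20 };

function classify(hits, now, warmup, cfg = CFG) {
  // Keep only reconnects inside the sliding window, then count this one.
  const recent = hits.filter((t) => now - t < cfg.windowMs);
  recent.push(now);
  const observed = recent.length;
  // Hard ceiling applies first and ignores warm-up.
  if (observed > cfg.hardCeiling) return { verdict: 'hard-ceiling', observed, hits: recent };
  // During warm-up nothing else can throttle the device.
  if (warmup) return { verdict: 'allow', observed, hits: recent };
  // Healthy device: at or under the per-device threshold.
  if (observed <= cfg.baseMax) return { verdict: 'allow', observed, hits: recent };
  return { verdict: 'rate', observed, hits: recent };
}

// Replay `count` reconnects spaced `stepMs` apart and collect the verdicts.
function run(count, stepMs, warmup) {
  let hits = [];
  const verdicts = [];
  for (let i = 0; i < count; i++) {
    const r = classify(hits, i * stepMs, warmup);
    hits = r.hits;
    verdicts.push(r.verdict);
  }
  return verdicts;
}

console.log(run(6, 1000, false));            // one reconnect per second
console.log(run(6, 3000, false));            // one reconnect every three seconds
console.log(run(21, 100, true).slice(-2));   // burst during warm-up
```

At one reconnect per second the sixth attempt is the first to be flagged ('rate'). At one every three seconds the window never holds more than four hits, so every attempt is allowed. During warm-up a burst is allowed up to the 20th reconnect, and the 21st trips the hard ceiling.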


@ -1,12 +1,12 @@
{
"name": "screentinker",
-"version": "1.9.1-beta6",
+"version": "1.9.2-beta1",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "screentinker",
-"version": "1.9.1-beta6",
+"version": "1.9.2-beta1",
"dependencies": {
"@azure/msal-node": "^5.2.1",
"archiver": "^7.0.1",


@ -1,6 +1,6 @@
{
"name": "screentinker",
-"version": "1.9.1-beta6",
+"version": "1.9.2-beta1",
"description": "ScreenTinker - Digital Signage Management Server",
"main": "server.js",
"scripts": {


@ -160,20 +160,58 @@ function checkItemWrite(req, res) {
return item;
}
-// #129: real-time mute. Tell every device on this playlist to toggle the volume of the
-// matching currently-playing item NOW (decoupled from publish — the device matches by
-// content_id/widget_id and applies it live). The new value is also written to the row, so
-// it lands in the next published snapshot and persists across playlist reloads.
+// #129 + mute-fix: per-item mute has to do TWO things, because the device plays from
+// playlists.published_snapshot (deviceSocket.buildPlaylistPayload), NOT the draft
+// playlist_items the toggle writes:
+// (1) LIVE — tell every device on this playlist to silence the matching currently-playing
+// item NOW (device matches by content_id/widget_id). Mutes the in-progress playthrough.
+// (2) PERSIST — patch the matching item's `muted` inside the published_snapshot the device
+// actually plays, then re-push the playlist. Without this the snapshot kept muted=0, so
+// every loop/reload re-applied full volume — the "icon red but audio plays across 3
+// playthroughs" bug (Android re-loads each loop; web's native <video> loop masked it).
+// We patch the snapshot SURGICALLY (just the muted field of matching items) rather than calling
+// publishPlaylist, so a mute toggle can't prematurely publish other pending draft edits or flip
+// the playlist's draft/published status. muted is written as 0/1 to match buildSnapshotItems'
+// format (the player reads it via optInt). playlist_items.muted is still updated by the caller,
+// so a later full publish stays consistent.
function emitMuteChanged(req, item, muted) {
try {
const io = req.app.get('io');
if (!io) return;
const deviceNs = io.of('/device');
+const m = !!muted;
+// (2) PERSIST: patch the published snapshot the device reads from.
+const pl = db.prepare('SELECT published_snapshot FROM playlists WHERE id = ?').get(item.playlist_id);
+if (pl && pl.published_snapshot) {
+let snap = null;
+try { snap = JSON.parse(pl.published_snapshot); } catch (e) { snap = null; }
+if (Array.isArray(snap)) {
+let changed = false;
+for (const s of snap) {
+const match = item.content_id ? s.content_id === item.content_id
+: (item.widget_id ? s.widget_id === item.widget_id : false);
+if (match && (s.muted ? 1 : 0) !== (m ? 1 : 0)) { s.muted = m ? 1 : 0; changed = true; }
+}
+if (changed) {
+db.prepare('UPDATE playlists SET published_snapshot = ? WHERE id = ?')
+.run(JSON.stringify(snap), item.playlist_id);
+}
+}
+}
+// (1) LIVE toggle + re-deliver the patched snapshot so loops re-apply the correct flag.
+// Lazy require (matches playlists.pushToDevices) to avoid a route<->ws circular import.
+const { buildPlaylistPayload } = require('../ws/deviceSocket');
+const commandQueue = require('../lib/command-queue');
const devices = db.prepare('SELECT id FROM devices WHERE playlist_id = ?').all(item.playlist_id);
-const payload = { content_id: item.content_id || null, widget_id: item.widget_id || null, muted: !!muted };
-for (const d of devices) deviceNs.to(d.id).emit('device:mute-changed', payload);
-console.log(`[mute] item ${item.id} (content ${item.content_id || item.widget_id}) -> ${muted ? 'MUTED' : 'unmuted'}; notified ${devices.length} device(s)`);
-} catch (e) { /* best-effort live toggle; the published snapshot is the source of truth */ }
+const payload = { content_id: item.content_id || null, widget_id: item.widget_id || null, muted: m };
+for (const d of devices) {
+deviceNs.to(d.id).emit('device:mute-changed', payload); // current playthrough
+commandQueue.queueOrEmitPlaylistUpdate(deviceNs, d.id, buildPlaylistPayload); // future loads (no reload of current item)
+}
+console.log(`[mute] item ${item.id} (content ${item.content_id || item.widget_id}) -> ${m ? 'MUTED' : 'unmuted'}; snapshot patched + notified ${devices.length} device(s)`);
+} catch (e) { /* best-effort; playlist_items.muted is still updated for the next full publish */ }
}
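
Review note: the snapshot patch inside emitMuteChanged is the part most worth unit-testing, and it is easy to isolate from the DB and socket calls. The sketch below pulls the same matching and 0/1 write into a pure function. `patchSnapshotMuted` is a hypothetical name and shape, offered as one possible extraction rather than how the codebase is organised.

```javascript
// Hypothetical pure extraction of the snapshot patch: takes the stored JSON
// string, returns whether anything changed and the JSON to write back.
function patchSnapshotMuted(snapshotJson, item, muted) {
  let snap;
  try { snap = JSON.parse(snapshotJson); } catch (e) { return { changed: false, json: snapshotJson }; }
  if (!Array.isArray(snap)) return { changed: false, json: snapshotJson };
  const flag = muted ? 1 : 0; // stored as 0/1, matching the snapshot format
  let changed = false;
  for (const s of snap) {
    // Match by content_id when the item has one, otherwise by widget_id.
    const match = item.content_id ? s.content_id === item.content_id
      : (item.widget_id ? s.widget_id === item.widget_id : false);
    if (match && (s.muted ? 1 : 0) !== flag) { s.muted = flag; changed = true; }
  }
  return { changed, json: changed ? JSON.stringify(snap) : snapshotJson };
}

const snapshot = JSON.stringify([
  { content_id: 'c1', muted: 0 },
  { content_id: 'c2', muted: 0 },
  { widget_id: 'w1', muted: 0 },
]);
const first = patchSnapshotMuted(snapshot, { content_id: 'c1' }, true);
console.log(first.changed, first.json);
const second = patchSnapshotMuted(first.json, { content_id: 'c1' }, true);
console.log(second.changed);
```

The second call reports no change, which is the property that lets the route skip the UPDATE when the snapshot already carries the requested flag. Unparseable or non-array snapshots are returned untouched.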
// Update playlist item


@ -7,6 +7,7 @@ const fs = require('fs');
const config = require('../config');
const VERSION = require('../version');
const { PLATFORM_ROLES } = require('../middleware/auth');
const loopLag = require('../services/loop-lag');
// Public status page
router.get('/', (req, res) => {
@ -24,6 +25,9 @@ router.get('/', (req, res) => {
version,
uptime_human: formatUptime(uptime),
timestamp: new Date().toISOString(),
// #142: current event-loop lag snapshot, so site lag is diagnosable from the
// health endpoint independent of any throttling. Cheap (in-memory read).
loop_lag: loopLag.getLag(),
});
});


@ -625,6 +625,10 @@ app.set('io', io);
const { startHeartbeatChecker } = require('./services/heartbeat');
startHeartbeatChecker(io);
// #142: start event-loop lag sampling (feeds /api/status + the reconnect throttle)
const { startLoopLagMonitor } = require('./services/loop-lag');
startLoopLagMonitor();
// Start command-queue sweep (prunes expired entries for offline devices)
const commandQueue = require('./lib/command-queue');
commandQueue.startSweep();
@ -710,13 +714,22 @@ function resolveApkPath() {
return null;
}
// #139: a device that can't silently install re-downloads the APK every check cycle. Don't
// word a download as "in progress" (it may be a stuck loop, not progress), and rate-limit the
// line to once per IP per window so a looping device can't flood the log.
const otaDownloadLoggedAt = new Map(); // ip -> last-logged ms
const OTA_DOWNLOAD_LOG_WINDOW_MS = 10 * 60 * 1000;
// Serve APK download
app.get('/download/apk', (req, res) => {
const apkPath = resolveApkPath();
if (apkPath) {
// #96: an APK download means a device is actually applying an OTA - log it so the
// update is observable end to end (check -> download -> [relaunch]).
-console.log(`[ota] APK download by ${getClientIp(req)} (${fs.statSync(apkPath).size} bytes) - OTA update in progress`);
+const ip = getClientIp(req);
+const now = Date.now();
+if (now - (otaDownloadLoggedAt.get(ip) || 0) > OTA_DOWNLOAD_LOG_WINDOW_MS) {
+otaDownloadLoggedAt.set(ip, now);
+console.log(`[ota] APK served to ${ip} (${fs.statSync(apkPath).size} bytes)`);
+}
res.setHeader('Content-Type', 'application/vnd.android.package-archive');
res.setHeader('Content-Disposition', 'attachment; filename="ScreenTinker.apk"');
res.setHeader('Cache-Control', 'no-cache');


@ -1,4 +1,4 @@
-const { db } = require('../db/database');
+const { db, pruneStatusLog } = require('../db/database');
const config = require('../config');
const { deviceRoom, emitToWorkspace } = require('../lib/socket-rooms');
@ -6,6 +6,10 @@ const { deviceRoom, emitToWorkspace } = require('../lib/socket-rooms');
const deviceConnections = new Map();
function startHeartbeatChecker(io) {
// #142: sweep stale device_status_log rows once at startup (recovers a bloated
// table immediately after a deploy), then again on each interval below.
pruneStatusLog();
setInterval(() => {
const now = Date.now();
const dashboardNs = io.of('/dashboard');
@ -36,19 +40,18 @@ function startHeartbeatChecker(io) {
}
}
-// Cleanup: delete unclaimed provisioning devices older than 24 hours
-// Keep imported devices (they have user_id set) so users can re-pair them
-db.prepare(`
-DELETE FROM devices WHERE status = 'provisioning'
-AND user_id IS NULL
-AND created_at < strftime('%s','now') - (365 * 86400)
-`).run();
+// Cleanup: delete unclaimed provisioning devices older than 24 hours.
+pruneProvisioningDevices();
// Cleanup: prune play logs older than 90 days
db.prepare(`
DELETE FROM play_logs WHERE started_at < strftime('%s','now') - (90 * 86400)
`).run();
// #142: global device_status_log retention sweep (all devices, incl. removed/idle
// and the offline_timeout insert path that bypasses the per-device prune).
pruneStatusLog();
// Cleanup: expired team invites
db.prepare(`
DELETE FROM team_invites WHERE expires_at < strftime('%s','now')
@ -83,11 +86,25 @@ function getAllConnections() {
return deviceConnections;
}
// #142: sweep unclaimed provisioning devices older than 24h. The window previously
// read `365 * 86400` (a YEAR), contradicting its own "older than 24 hours" comment,
// so socket-register pairing junk lingered far longer than intended. Imported
// devices keep a user_id and are preserved so they can be re-paired. Extracted from
// the interval above so the correctness fix is unit-testable. Returns rows deleted.
function pruneProvisioningDevices() {
return db.prepare(`
DELETE FROM devices
WHERE status = 'provisioning' AND user_id IS NULL
AND created_at < strftime('%s','now') - (24 * 3600)
`).run().changes;
}
module.exports = {
startHeartbeatChecker,
registerConnection,
updateHeartbeat,
removeConnection,
getConnection,
-getAllConnections
+getAllConnections,
+pruneProvisioningDevices
};

server/services/loop-lag.js (new file, 107 lines)

@ -0,0 +1,107 @@
// #142 — Event-loop lag telemetry (the data subsystem; ships before the throttle).
//
// Continuously samples event-loop delay via perf_hooks.monitorEventLoopDelay()
// (a C++-backed histogram — cheap). Each window we read mean/p50/p99/max, persist
// a row to the bounded `event_loop_lag` table, and recompute a coarse load BAND
// (normal | elevated | critical) from the window p99.
//
// The band is consumed by the reconnect throttle (#142 step 3), but this module
// has standalone value: getLag() is surfaced on /api/status and band changes are
// logged, so site connectivity/lag is diagnosable independent of any throttling.
//
// Band transitions are deliberately asymmetric (see nextBand): jump UP immediately
// when an up-threshold is crossed (tighten fast), step DOWN only one level at a
// time after lagReleaseSamples consecutive calm samples below a deadband (release
// slow). This avoids band flap from transient blips.
const { monitorEventLoopDelay } = require('perf_hooks');
const { db } = require('../db/database');
const config = require('../config');
const NS_PER_MS = 1e6;
// A band releases only once p99 falls below this fraction of the band's entry
// threshold — the deadband that stops small fluctuations from flapping the band.
const DEADBAND = 0.5;
const LEVEL = { normal: 0, elevated: 1, critical: 2 };
let histogram = null;
let band = 'normal';
let calmSamples = 0;
let current = { mean_ms: 0, p50_ms: 0, p99_ms: 0, max_ms: 0, band: 'normal', sampled_at: 0 };
// Pure band-transition function (exported for deterministic unit tests). Given the
// current band, the window p99 (ms), and the running calm-sample count, returns the
// next [band, calmSamples]. Up is immediate (may skip a level); down is one step
// per release window, gated by a deadband.
function nextBand(cur, p99, calm) {
const level = LEVEL[cur] ?? 0;
// UP — immediate, tighten fast (normal can jump straight to critical).
if (p99 >= config.lagCriticalMs && level < LEVEL.critical) return ['critical', 0];
if (p99 >= config.lagElevatedMs && level < LEVEL.elevated) return ['elevated', 0];
// DOWN — slow, one step, only below the current band's deadband.
if (level === LEVEL.critical && p99 <= config.lagCriticalMs * DEADBAND) {
const c = calm + 1;
return c >= config.lagReleaseSamples ? ['elevated', 0] : ['critical', c];
}
if (level === LEVEL.elevated && p99 <= config.lagElevatedMs * DEADBAND) {
const c = calm + 1;
return c >= config.lagReleaseSamples ? ['normal', 0] : ['elevated', c];
}
// Hold (inside deadband, or already normal): reset the calm counter.
return [cur, 0];
}
const round2 = (x) => Math.round(x * 100) / 100;
function sample() {
const p99 = histogram.percentile(99) / NS_PER_MS;
const snap = {
mean_ms: round2(histogram.mean / NS_PER_MS),
p50_ms: round2(histogram.percentile(50) / NS_PER_MS),
p99_ms: round2(p99),
max_ms: round2(histogram.max / NS_PER_MS),
};
histogram.reset();
const prev = band;
[band, calmSamples] = nextBand(band, snap.p99_ms, calmSamples);
current = { ...snap, band, sampled_at: Math.floor(Date.now() / 1000) };
try {
db.prepare(
'INSERT INTO event_loop_lag (sampled_at, mean_ms, p50_ms, p99_ms, max_ms, band) VALUES (?, ?, ?, ?, ?, ?)'
).run(current.sampled_at, snap.mean_ms, snap.p50_ms, snap.p99_ms, snap.max_ms, band);
} catch (_) { /* table may not exist on a partially-migrated DB */ }
// Observable: log whenever we're loaded or when the band changes (incl. back to
// normal). Healthy steady state stays quiet.
if (band !== 'normal' || prev !== 'normal') {
const tag = band !== prev ? ` (was ${prev})` : '';
console.log(`[loop-lag] band=${band}${tag} mean=${snap.mean_ms}ms p99=${snap.p99_ms}ms max=${snap.max_ms}ms`);
}
}
function pruneLag() {
try {
const cutoff = Math.floor(Date.now() / 1000) - Math.round(config.lagTelemetryRetentionDays * 86400);
const n = db.prepare('DELETE FROM event_loop_lag WHERE sampled_at < ?').run(cutoff).changes;
if (n > 0) console.log(`[loop-lag] pruned ${n} sample(s) older than ${config.lagTelemetryRetentionDays}d`);
} catch (_) { /* ignore */ }
}
function startLoopLagMonitor() {
if (histogram) return; // idempotent
histogram = monitorEventLoopDelay({ resolution: config.lagResolutionMs });
histogram.enable();
const t1 = setInterval(sample, config.lagSampleIntervalMs);
pruneLag(); // sweep stale rows on boot
const t2 = setInterval(pruneLag, config.lagPruneIntervalMs);
// Don't keep the process alive on these timers (matters for tests / clean exit).
if (t1.unref) t1.unref();
if (t2.unref) t2.unref();
}
function getBand() { return band; }
function getLag() { return { ...current }; }
module.exports = { startLoopLagMonitor, getBand, getLag, nextBand };
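
Review note: the asymmetric transitions in nextBand are easiest to follow by feeding it a sequence of window p99 values. The sketch below is a self-contained copy of the same logic with the default thresholds inlined (elevated 100 ms, critical 250 ms, 5 release samples, 0.5 deadband), plus a small `walk` driver. `walk` and `CFG` are illustrative names; the real function reads its thresholds from config.

```javascript
// Standalone copy of the nextBand logic with default thresholds inlined.
const CFG = { lagElevatedMs: 100, lagCriticalMs: 250, lagReleaseSamples: 5 };
const DEADBAND = 0.5;
const LEVEL = { normal: 0, elevated: 1, critical: 2 };

function nextBand(cur, p99, calm, cfg = CFG) {
  const level = LEVEL[cur] ?? 0;
  // UP: immediate, and may skip a level.
  if (p99 >= cfg.lagCriticalMs && level < LEVEL.critical) return ['critical', 0];
  if (p99 >= cfg.lagElevatedMs && level < LEVEL.elevated) return ['elevated', 0];
  // DOWN: one step, only after enough consecutive samples below the deadband.
  if (level === LEVEL.critical && p99 <= cfg.lagCriticalMs * DEADBAND) {
    const c = calm + 1;
    return c >= cfg.lagReleaseSamples ? ['elevated', 0] : ['critical', c];
  }
  if (level === LEVEL.elevated && p99 <= cfg.lagElevatedMs * DEADBAND) {
    const c = calm + 1;
    return c >= cfg.lagReleaseSamples ? ['normal', 0] : ['elevated', c];
  }
  // HOLD: reset the calm counter.
  return [cur, 0];
}

// Feed a sequence of window p99 values (ms) and record the band after each.
function walk(p99s) {
  let band = 'normal';
  let calm = 0;
  const out = [];
  for (const p of p99s) {
    [band, calm] = nextBand(band, p, calm);
    out.push(band);
  }
  return out;
}

console.log(walk([20, 300, 200, 120, 120, 120, 120, 120, 40, 40, 40, 40, 40]));
```

In this sequence a single 300 ms window jumps straight from normal to critical. A 200 ms window is above the critical deadband (125 ms), so it holds critical and resets the calm counter. Five consecutive 120 ms windows then release one step to elevated, and five 40 ms windows release the final step to normal. Note that 120 ms counts as calm for the critical band even though it is above the elevated entry threshold.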


@ -259,6 +259,32 @@ test('device WS: wrong device_token is rejected (auth-error, never registered)',
assert.ok(!got.registered, 'wrong token must not register');
});
// #139 Phase 2 (Option B): event-driven OTA status. Registers (which, with no ota fields in
// device_info, persists ota_status='none' via the backstop), then emits a valid ota-status and
// a foreign-id one in order on the authenticated socket.
function deviceOtaSeq(payload, otaEvents, timeoutMs = 4000) {
return new Promise((resolve) => {
const sock = ioClient(`${BASE}/device`, { transports: ['websocket'], reconnection: false, forceNew: true });
const finish = () => { try { sock.close(); } catch { /* */ } resolve(); };
sock.on('connect', () => sock.emit('device:register', payload));
sock.on('device:registered', () => { for (const e of otaEvents) sock.emit('device:ota-status', e); setTimeout(finish, 500); });
sock.on('device:auth-error', finish);
setTimeout(finish, timeoutMs);
});
}
test('device WS: device:ota-status persists the fields; a foreign device_id is a safe no-op (#139)', async () => {
await deviceOtaSeq(
{ device_id: S.deviceId, device_token: S.deviceToken, device_info: { app_version: 'test' } },
[
{ device_id: S.deviceId, ota_status: 'manual_update_required', ota_target_version: '1.9.1-beta6', ota_attempts: 3 },
{ device_id: 'nope-not-a-device', ota_status: 'none', ota_target_version: null, ota_attempts: 0 }, // foreign id -> no-op, no throw
]);
const dev = await jfetch(`/api/devices/${S.deviceId}`, auth(S.jwt));
assert.equal(dev.body.ota_status, 'manual_update_required', 'valid ota-status persisted');
assert.equal(dev.body.ota_target_version, '1.9.1-beta6');
assert.equal(dev.body.ota_attempts, 3, 'and the foreign-id event did not overwrite it');
});
// ───────────────────────── TIER 4: #92 FOLLOW-UP COVERAGE ─────────────────────────
// The non-security gaps named in the self-review (issue #92): the gap-fix fields + the
// cross-tenant guard (the security-relevant one), docs serving, and the token lifecycle

@ -0,0 +1,85 @@
'use strict';
// #142 step 5 — content-ack dedup. Repeated identical (device_id, content_id, status)
// reports are suppressed within config.contentAckDedupMs; a status change or a report
// after the window passes. Observed via the server log (the handler logs+emits only
// when it does NOT dedup). Unique PORT (3984) to avoid the collision class.
const { test, before, after } = require('node:test');
const assert = require('node:assert/strict');
const { spawn } = require('node:child_process');
const path = require('node:path');
const os = require('node:os');
const fs = require('node:fs');
const crypto = require('node:crypto');
const ioClient = require('socket.io-client');
const PORT = 3984;
const BASE = `http://127.0.0.1:${PORT}`;
const DATA_DIR = path.join(os.tmpdir(), 'st-ack-' + crypto.randomBytes(4).toString('hex'));
const LOG = path.join(os.tmpdir(), 'st-ack-' + crypto.randomBytes(4).toString('hex') + '.log');
const DEDUP_MS = 600;
let proc;
const sleep = (ms) => new Promise(r => setTimeout(r, ms));
before(async () => {
const logFd = fs.openSync(LOG, 'w');
proc = spawn('node', ['server.js'], {
cwd: path.join(__dirname, '..'),
env: { ...process.env, DATA_DIR, SELF_HOSTED: 'true', PORT: String(PORT), NODE_ENV: 'test', CONTENT_ACK_DEDUP_MS: String(DEDUP_MS) },
stdio: ['ignore', logFd, logFd],
});
let up = false;
for (let i = 0; i < 80; i++) {
try { const r = await fetch(BASE + '/api/status'); if (r.ok) { up = true; break; } } catch { /* */ }
await sleep(250);
}
if (!up) throw new Error('server did not boot:\n' + fs.readFileSync(LOG, 'utf8').slice(-2000));
});
after(() => { try { proc.kill('SIGKILL'); } catch { /* */ } });
function provision() {
const code = String(crypto.randomInt(100000, 1000000));
return new Promise((resolve) => {
const sock = ioClient(`${BASE}/device`, { transports: ['websocket'], reconnection: false, forceNew: true });
sock.on('connect', () => sock.emit('device:register', { pairing_code: code }));
sock.on('device:registered', (d) => { try { sock.close(); } catch { /* */ } resolve({ id: d.device_id, token: d.device_token }); });
setTimeout(() => { try { sock.close(); } catch { /* */ } resolve(null); }, 4000); // close the socket on timeout so it does not leak
});
}
function openRegistered(dev) {
return new Promise((resolve, reject) => {
const sock = ioClient(`${BASE}/device`, { transports: ['websocket'], reconnection: false, forceNew: true });
sock.on('connect', () => sock.emit('device:register', { device_id: dev.id, device_token: dev.token, device_info: { app_version: 'test' } }));
sock.on('device:registered', () => resolve(sock));
sock.on('device:auth-error', () => reject(new Error('auth-error')));
setTimeout(() => reject(new Error('register timeout')), 4000);
});
}
test('repeated identical content-acks are deduped; window-expiry and status-change pass', async () => {
const dev = await provision();
assert.ok(dev, 'device provisioned');
const sock = await openRegistered(dev);
const cid = 'cid-' + crypto.randomBytes(3).toString('hex');
// 5 rapid identical "ready" within the dedup window -> only ONE should log/emit
for (let i = 0; i < 5; i++) { sock.emit('device:content-ack', { device_id: dev.id, content_id: cid, status: 'ready' }); await sleep(40); }
// wait past the window, then "ready" again -> passes (a fresh report)
await sleep(DEDUP_MS + 250);
sock.emit('device:content-ack', { device_id: dev.id, content_id: cid, status: 'ready' });
// a status CHANGE has a different key -> passes immediately
await sleep(60);
sock.emit('device:content-ack', { device_id: dev.id, content_id: cid, status: 'error' });
await sleep(400);
try { sock.close(); } catch { /* */ }
const log = fs.readFileSync(LOG, 'utf8');
const ready = (log.match(new RegExp(`content ${cid}: ready`, 'g')) || []).length;
const err = (log.match(new RegExp(`content ${cid}: error`, 'g')) || []).length;
assert.equal(ready, 2, 'a burst of identical "ready" collapses to one; a second after the window passes -> 2 total');
assert.equal(err, 1, 'a status change is not deduped');
});
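Reviewer note: the dedup rule this test exercises can be sketched as a pure function with an injected clock, which makes the three cases (burst, window expiry, status change) checkable without a server. `createAckDedup` and `shouldEmit` are hypothetical names. The handler in the device socket uses a module-level `Map` and `config.contentAckDedupMs` instead.

```javascript
'use strict';

// Hypothetical factory: returns shouldEmit(deviceId, contentId, status, now).
// A report passes when no identical (device, content, status) report has
// passed within the last windowMs milliseconds.
function createAckDedup(windowMs) {
  const lastPassed = new Map(); // key -> timestamp (ms) of the last report that passed
  return function shouldEmit(deviceId, contentId, status, now = Date.now()) {
    const key = `${deviceId}|${contentId}|${status}`;
    const prev = lastPassed.get(key);
    if (prev !== undefined && now - prev < windowMs) return false; // duplicate inside the window
    lastPassed.set(key, now); // only reports that pass move the window forward
    return true;
  };
}
```

Because suppressed duplicates do not refresh the timestamp, a device that repeats the same status continuously still gets one report through per window.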

@ -0,0 +1,64 @@
'use strict';
// #142 step 2 — integration: the lag monitor samples, persists to a BOUNDED table,
// and surfaces current lag on /api/status. Boots the real server with fast sampling
// and a tiny (fractional-day) retention so the prune is observable within the test.
const { test, before, after } = require('node:test');
const assert = require('node:assert/strict');
const { spawn } = require('node:child_process');
const path = require('node:path');
const os = require('node:os');
const fs = require('node:fs');
const crypto = require('node:crypto');
const Database = require('better-sqlite3');
const PORT = 3982;
const BASE = `http://127.0.0.1:${PORT}`;
const DATA_DIR = path.join(os.tmpdir(), 'st-lag-int-' + crypto.randomBytes(4).toString('hex'));
const LOG = path.join(os.tmpdir(), 'st-lag-int-' + crypto.randomBytes(4).toString('hex') + '.log');
let proc;
before(async () => {
const logFd = fs.openSync(LOG, 'w');
proc = spawn('node', ['server.js'], {
cwd: path.join(__dirname, '..'),
env: {
...process.env, DATA_DIR, SELF_HOSTED: 'true', PORT: String(PORT), NODE_ENV: 'test',
LAG_SAMPLE_INTERVAL_MS: '200', // sample fast
LAG_TELEMETRY_RETENTION_DAYS: '0.00001', // ~0.86s retention
LAG_PRUNE_INTERVAL_MS: '400', // prune often
},
stdio: ['ignore', logFd, logFd],
});
let up = false;
for (let i = 0; i < 80; i++) {
try { const r = await fetch(BASE + '/api/status'); if (r.ok) { up = true; break; } } catch { /* not yet */ }
await new Promise(r => setTimeout(r, 250));
}
if (!up) throw new Error('server did not boot:\n' + fs.readFileSync(LOG, 'utf8').slice(-2000));
});
after(() => { try { proc.kill('SIGKILL'); } catch { /* */ } });
test('/api/status exposes a current loop_lag snapshot', async () => {
const r = await fetch(BASE + '/api/status');
const body = await r.json();
assert.ok(body.loop_lag, 'loop_lag present on /api/status');
assert.ok(['normal', 'elevated', 'critical'].includes(body.loop_lag.band), 'band is a valid level');
assert.equal(typeof body.loop_lag.p99_ms, 'number', 'p99_ms is numeric');
assert.equal(typeof body.loop_lag.mean_ms, 'number', 'mean_ms is numeric');
});
test('lag samples are persisted AND bounded by retention prune (not unbounded)', async () => {
// Let it sample for ~3s. At 200ms/sample that is ~15 inserts, but with ~0.86s
// retention pruned every 400ms the table must stay small — proving the table
// can never become a second unbounded-growth table.
await new Promise(r => setTimeout(r, 3000)); // ~3s, matching the comment above and the assertion message below
const dbPath = path.join(DATA_DIR, 'db', 'remote_display.db');
const db = new Database(dbPath, { readonly: true });
const count = db.prepare('SELECT COUNT(*) c FROM event_loop_lag').get().c;
db.close();
assert.ok(count >= 1, 'lag samples are being persisted');
assert.ok(count < 15, `table is bounded by the prune (held ${count} rows over ~3s of 200ms sampling)`);
});

@ -0,0 +1,57 @@
'use strict';
// #142 step 2 — deterministic unit tests for the event-loop-lag band transitions.
// Pure function, no sockets/timing. Isolate the DB to a temp dir BEFORE requiring
// the module (requiring it pulls in db/database, which initialises a DB on load).
const os = require('node:os');
const path = require('node:path');
const crypto = require('node:crypto');
process.env.DATA_DIR = path.join(os.tmpdir(), 'st-lag-unit-' + crypto.randomBytes(4).toString('hex'));
const { test } = require('node:test');
const assert = require('node:assert/strict');
const { nextBand } = require('../services/loop-lag');
// config defaults exercised here: elevated=100ms, critical=250ms, releaseSamples=5,
// deadband=0.5 -> release-below thresholds: elevated@50ms, critical@125ms.
test('UP is immediate and can skip a level (tighten fast)', () => {
assert.deepEqual(nextBand('normal', 50, 0), ['normal', 0], 'below elevated stays normal');
assert.deepEqual(nextBand('normal', 100, 0), ['elevated', 0], 'crossing elevated up-threshold jumps immediately');
assert.deepEqual(nextBand('normal', 250, 0), ['critical', 0], 'a big spike jumps normal->critical in one sample');
assert.deepEqual(nextBand('elevated', 250, 0), ['critical', 0]);
});
test('deadband holds the band for small fluctuations (no flap)', () => {
// elevated, p99 between release(50) and up(100) -> hold elevated, calm reset
assert.deepEqual(nextBand('elevated', 80, 3), ['elevated', 0]);
// critical, p99 between release(125) and up(250) -> hold critical
assert.deepEqual(nextBand('critical', 200, 4), ['critical', 0]);
});
test('DOWN is slow: requires lagReleaseSamples calm samples below the deadband', () => {
// elevated -> normal only after 5 consecutive calm samples
let band = 'elevated', calm = 0;
for (let i = 0; i < 4; i++) {
[band, calm] = nextBand(band, 20, calm);
assert.equal(band, 'elevated', `still elevated after ${i + 1} calm sample(s)`);
}
[band, calm] = nextBand(band, 20, calm); // 5th
assert.deepEqual([band, calm], ['normal', 0], 'drops to normal on the 5th calm sample');
});
test('DOWN releases one level at a time: critical -> elevated -> normal', () => {
let band = 'critical', calm = 0;
for (let i = 0; i < 5; i++) [band, calm] = nextBand(band, 10, calm);
assert.equal(band, 'elevated', 'critical releases to elevated, never straight to normal');
for (let i = 0; i < 5; i++) [band, calm] = nextBand(band, 10, calm);
assert.equal(band, 'normal', 'then elevated releases to normal');
});
test('a single calm sample does not release (calm counter resets on a non-calm sample)', () => {
let [band, calm] = nextBand('elevated', 20, 0); // calm=1
assert.deepEqual([band, calm], ['elevated', 1]);
[band, calm] = nextBand(band, 80, calm); // back inside deadband -> reset
assert.deepEqual([band, calm], ['elevated', 0], 'one blip resets the release counter');
});
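Reviewer note: the behaviour these tests pin down (immediate rise, deadband hold, slow one-level release) can be sketched as the pure function below. It is a reconstruction from the test vectors and the defaults quoted in the comment (elevated=100ms, critical=250ms, releaseSamples=5, deadband=0.5), not the actual `nextBand` in `services/loop-lag`. The name `nextBandSketch` and the constants are assumptions.

```javascript
'use strict';

const LEVELS = ['normal', 'elevated', 'critical'];
const UP_MS = { elevated: 100, critical: 250 }; // up-thresholds (assumed defaults)
const DEADBAND = 0.5;                           // release below up-threshold * 0.5
const RELEASE_SAMPLES = 5;                      // calm samples needed to step down

// Returns [band, calmCount] for the next sample.
function nextBandSketch(band, p99Ms, calm) {
  // UP: immediate, and may skip a level.
  let target = 'normal';
  if (p99Ms >= UP_MS.critical) target = 'critical';
  else if (p99Ms >= UP_MS.elevated) target = 'elevated';
  if (LEVELS.indexOf(target) > LEVELS.indexOf(band)) return [target, 0];

  if (band === 'normal') return ['normal', 0];

  // HOLD: inside the deadband the band stays and the calm counter resets.
  const releaseBelow = UP_MS[band] * DEADBAND;
  if (p99Ms >= releaseBelow) return [band, 0];

  // DOWN: one level at a time, only after enough consecutive calm samples.
  const nextCalm = calm + 1;
  if (nextCalm >= RELEASE_SAMPLES) return [LEVELS[LEVELS.indexOf(band) - 1], 0];
  return [band, nextCalm];
}
```

The asymmetry is deliberate in this design: tightening on one bad sample limits damage during a spike, while requiring several calm samples to relax avoids flapping around a threshold.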

@ -91,6 +91,24 @@ test('muted reaches the device via the published snapshot (buildSnapshotItems)',
assert.equal(item.muted, 1, 'snapshot (device payload) carries muted=1');
});
test('mute toggle patches the published snapshot WITHOUT a manual republish (the beta7 bug)', async () => {
// Baseline: publish once so the device has a snapshot carrying muted=0.
await jfetch(`/api/assignments/${S.itemId}`, put(S.jwt, { muted: false }));
await jfetch(`/api/playlists/${S.playlistId}/publish`, post(S.jwt, {}));
const read = () => JSON.parse(db.prepare('SELECT published_snapshot FROM playlists WHERE id = ?').get(S.playlistId).published_snapshot)
.find((i) => i.content_id === S.contentId).muted;
assert.equal(read(), 0, 'baseline: snapshot the device plays carries muted=0');
// The actual bug: a mute toggle ALONE (no /publish) must reach the played snapshot.
// On beta7 this stayed 0 (markDraft only) so every loop re-applied full volume.
await jfetch(`/api/assignments/${S.itemId}`, put(S.jwt, { muted: true }));
assert.equal(read(), 1, 'mute toggle patched the snapshot the device plays — no manual republish needed');
// Unmute toggle reverts the snapshot too.
await jfetch(`/api/assignments/${S.itemId}`, put(S.jwt, { muted: false }));
assert.equal(read(), 0, 'unmute toggle patched the snapshot back to 0');
});
test('PUT ignoring muted (other field) leaves muted untouched', async () => {
await jfetch(`/api/assignments/${S.itemId}`, put(S.jwt, { muted: true }));
const r = await jfetch(`/api/assignments/${S.itemId}`, put(S.jwt, { duration_sec: 15 }));

@ -0,0 +1,41 @@
'use strict';
// #142 (cut 2) — provisioning-row cleanup window correctness. The sweep deletes
// UNCLAIMED provisioning devices older than 24h (it previously used 365*86400 — a
// year — contradicting its own comment). Imported devices (user_id set) and
// non-provisioning devices are preserved. Deterministic, in-process (no server).
const os = require('node:os');
const path = require('node:path');
const crypto = require('node:crypto');
process.env.DATA_DIR = path.join(os.tmpdir(), 'st-provclean-' + crypto.randomBytes(4).toString('hex'));
const { test } = require('node:test');
const assert = require('node:assert/strict');
const { db } = require('../db/database');
const { pruneProvisioningDevices } = require('../services/heartbeat');
test('sweeps unclaimed provisioning devices older than 24h, keeps the rest', () => {
db.pragma('foreign_keys = OFF'); // seed user_id without a real users row
db.exec('DELETE FROM devices');
const ins = db.prepare("INSERT INTO devices (id, status, user_id, created_at) VALUES (?, ?, ?, strftime('%s','now') - ?)");
ins.run('old-unclaimed', 'provisioning', null, 25 * 3600); // >24h, unclaimed -> SWEPT
ins.run('new-unclaimed', 'provisioning', null, 1 * 3600); // <24h, unclaimed -> kept
ins.run('old-imported', 'provisioning', 'u-imported', 25 * 3600); // >24h but imported (user_id) -> kept
ins.run('old-online', 'online', null, 25 * 3600); // >24h but not provisioning -> kept
db.pragma('foreign_keys = ON');
assert.equal(db.prepare('SELECT COUNT(*) c FROM devices').get().c, 4, 'seeded 4');
const deleted = pruneProvisioningDevices();
assert.equal(deleted, 1, 'only the >24h unclaimed provisioning device is swept');
const ids = db.prepare('SELECT id FROM devices ORDER BY id').all().map(r => r.id);
assert.deepEqual(ids, ['new-unclaimed', 'old-imported', 'old-online']);
// regression guard: a 25h-old row sits well inside the OLD 365-day window, so this
// would have survived before the fix.
});
test('idempotent: a second sweep with nothing stale deletes nothing', () => {
assert.equal(pruneProvisioningDevices(), 0);
});
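Reviewer note: the selection rule the sweep test asserts (provisioning status, no `user_id`, older than the window) can be sketched over plain objects, which also shows why the old 365-day window swept nothing in this scenario. `selectStaleProvisioning` is a hypothetical helper. The real `pruneProvisioningDevices()` runs a SQL `DELETE` against the `devices` table.

```javascript
'use strict';

const PROVISIONING_TTL_SEC = 24 * 3600; // the corrected window (was 365 * 86400)

// Hypothetical helper: returns the ids a sweep would delete at time nowSec.
function selectStaleProvisioning(rows, nowSec, ttlSec = PROVISIONING_TTL_SEC) {
  const cutoff = nowSec - ttlSec;
  return rows
    .filter((r) => r.status === 'provisioning' && r.user_id == null && r.created_at < cutoff)
    .map((r) => r.id);
}

// Usage with the same four rows the test seeds (ages in seconds before "now").
const now = 1_800_000_000;
const rows = [
  { id: 'old-unclaimed', status: 'provisioning', user_id: null, created_at: now - 25 * 3600 },
  { id: 'new-unclaimed', status: 'provisioning', user_id: null, created_at: now - 1 * 3600 },
  { id: 'old-imported', status: 'provisioning', user_id: 'u-imported', created_at: now - 25 * 3600 },
  { id: 'old-online', status: 'online', user_id: null, created_at: now - 25 * 3600 },
];
const swept = selectStaleProvisioning(rows, now);
const sweptWithOldWindow = selectStaleProvisioning(rows, now, 365 * 86400);
```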

@ -0,0 +1,113 @@
'use strict';
// #142 step 3 — REQUIRED GATE TEST + storm + neighbor, over real sockets.
//
// Boots the real server with warm-up ACTIVE (default) so the whole suite runs in
// the cold-start window — the exact "right after a deploy" scenario. Hard ceiling
// and window are tightened so the storm trips quickly without thousands of connects;
// fleet devices stay well under the ceiling.
const { test, before, after } = require('node:test');
const assert = require('node:assert/strict');
const { spawn } = require('node:child_process');
const path = require('node:path');
const os = require('node:os');
const fs = require('node:fs');
const crypto = require('node:crypto');
const ioClient = require('socket.io-client');
const PORT = 3983;
const BASE = `http://127.0.0.1:${PORT}`;
const DATA_DIR = path.join(os.tmpdir(), 'st-thr-int-' + crypto.randomBytes(4).toString('hex'));
const LOG = path.join(os.tmpdir(), 'st-thr-int-' + crypto.randomBytes(4).toString('hex') + '.log');
let proc;
before(async () => {
const logFd = fs.openSync(LOG, 'w');
proc = spawn('node', ['server.js'], {
cwd: path.join(__dirname, '..'),
env: {
...process.env, DATA_DIR, SELF_HOSTED: 'true', PORT: String(PORT), NODE_ENV: 'test',
// warm-up left at default (30s) so the whole test runs in the cold-start window
RECONNECT_HARD_CEILING: '8',
RECONNECT_WINDOW_MS: '5000',
RECONNECT_BASE_MAX: '3',
},
stdio: ['ignore', logFd, logFd],
});
let up = false;
for (let i = 0; i < 80; i++) {
try { const r = await fetch(BASE + '/api/status'); if (r.ok) { up = true; break; } } catch { /* */ }
await new Promise(r => setTimeout(r, 250));
}
if (!up) throw new Error('server did not boot:\n' + fs.readFileSync(LOG, 'utf8').slice(-2000));
});
after(() => { try { proc.kill('SIGKILL'); } catch { /* */ } });
// Provision a brand-new device via a UNIQUE pairing code -> returns {device_id, device_token}.
function provision() {
const code = String(crypto.randomInt(100000, 1000000));
return new Promise((resolve) => {
const sock = ioClient(`${BASE}/device`, { transports: ['websocket'], reconnection: false, forceNew: true });
sock.on('connect', () => sock.emit('device:register', { pairing_code: code }));
sock.on('device:registered', (d) => { try { sock.close(); } catch { /* */ } resolve({ id: d.device_id, token: d.device_token }); });
setTimeout(() => { try { sock.close(); } catch { /* */ } resolve(null); }, 4000);
});
}
// One genuine reconnect (new socket). Resolves {registered, throttled}.
function reconnect(dev) {
return new Promise((resolve) => {
const sock = ioClient(`${BASE}/device`, { transports: ['websocket'], reconnection: false, forceNew: true });
let done = false;
const finish = (r) => { if (done) return; done = true; try { sock.close(); } catch { /* */ } resolve(r); };
sock.on('connect', () => sock.emit('device:register', { device_id: dev.id, device_token: dev.token, device_info: { app_version: 'test' } }));
sock.on('device:registered', () => finish({ registered: true, throttled: false }));
sock.on('device:throttled', () => finish({ registered: false, throttled: true }));
setTimeout(() => finish({ registered: false, throttled: false }), 1500);
});
}
test('GATE: full-fleet reconnect right after restart throttles NO healthy device', async () => {
// 12 distinct devices, each reconnecting twice in quick succession — a deploy-time
// herd. The loop is transiently busy, but per-device keying means none is flagged.
const fleet = [];
for (let i = 0; i < 12; i++) { const d = await provision(); assert.ok(d, 'device provisioned'); fleet.push(d); }
let registered = 0, throttled = 0;
// two reconnect rounds across the whole fleet
for (let round = 0; round < 2; round++) {
const results = await Promise.all(fleet.map(reconnect));
for (const r of results) { if (r.registered) registered++; if (r.throttled) throttled++; }
}
assert.equal(throttled, 0, 'NO healthy fleet device may be throttled at cold start');
assert.equal(registered, 24, 'every fleet reconnect registered');
});
test('a single device storming IS throttled (backoff engages)', async () => {
const dev = await provision();
assert.ok(dev);
let registered = 0, throttled = 0;
// 12 sequential reconnects within the 5s window -> exceeds the hard ceiling (8)
for (let i = 0; i < 12; i++) {
const r = await reconnect(dev);
if (r.registered) registered++;
if (r.throttled) throttled++;
}
assert.ok(throttled >= 1, `storming device must be throttled (got ${throttled} throttle(s))`);
assert.ok(registered < 12, `not all storm reconnects should succeed (got ${registered}/12)`);
});
test('neighbor isolation: a healthy device is unaffected while another storms', async () => {
const stormer = await provision();
const neighbor = await provision();
assert.ok(stormer && neighbor);
// storm the stormer hard
for (let i = 0; i < 12; i++) await reconnect(stormer);
// neighbor reconnects normally a couple of times -> must still register
const a = await reconnect(neighbor);
const b = await reconnect(neighbor);
assert.ok(a.registered && b.registered, 'neighbor must register normally while another device storms');
assert.ok(!a.throttled && !b.throttled, 'neighbor must not be throttled by another device');
});

@ -0,0 +1,98 @@
'use strict';
// #142 step 3 — deterministic unit tests for the per-device reconnect throttle.
// Pure logic with injected `now` / band; isolate the DB before require (the module
// pulls in services/loop-lag -> db/database which initialises a DB on load).
const os = require('node:os');
const path = require('node:path');
const crypto = require('node:crypto');
process.env.DATA_DIR = path.join(os.tmpdir(), 'st-thr-unit-' + crypto.randomBytes(4).toString('hex'));
const { test, beforeEach } = require('node:test');
const assert = require('node:assert/strict');
const throttle = require('../lib/reconnect-throttle');
// config defaults: window=10000, baseMax=5, hardCeiling=20, baseBackoff=1000,
// maxBackoff=60000, releaseMs=30000, warmup=30000, elevMult=2, critMult=4.
const T0 = 1_000_000; // arbitrary epoch-ms origin for the warm-up clock
const POST = T0 + 40_000; // safely past the 30s warm-up
const WARM = T0 + 1_000; // inside the warm-up window
beforeEach(() => throttle.__resetForTest({ startedAt: T0 }));
test('healthy device is never throttled (<= baseMax genuine reconnects)', () => {
for (let i = 0; i < 5; i++) {
const v = throttle.check('A', POST + i, 'normal');
assert.ok(v.allow, `reconnect ${i + 1} (<=baseMax) must be allowed`);
}
});
test('a per-device storm IS throttled and the backoff GROWS (tighten fast)', () => {
let v;
for (let i = 0; i < 5; i++) v = throttle.check('B', POST + i, 'normal'); // 5 allowed
v = throttle.check('B', POST + 5, 'normal'); // 6th -> flagged
assert.equal(v.allow, false);
assert.equal(v.reason, 'rate');
assert.equal(v.observed, 6);
assert.equal(v.allowed, 5);
const b1 = v.retryAfterMs;
// keep hammering while blocked -> escalate, longer backoff each time
const b2 = throttle.check('B', POST + 6, 'normal').retryAfterMs;
const b3 = throttle.check('B', POST + 7, 'normal').retryAfterMs;
assert.ok(b2 > b1 && b3 > b2, `backoff must grow: ${b1} < ${b2} < ${b3}`);
});
test('lag band multiplies an already-flagged device\'s backoff (critical > normal)', () => {
let v;
for (let i = 0; i < 5; i++) throttle.check('N', POST + i, 'normal');
v = throttle.check('N', POST + 5, 'normal');
const normalBackoff = v.retryAfterMs;
throttle.__resetForTest({ startedAt: T0 });
for (let i = 0; i < 5; i++) throttle.check('C', POST + i, 'critical');
v = throttle.check('C', POST + 5, 'critical');
assert.ok(v.retryAfterMs > normalBackoff, `critical backoff ${v.retryAfterMs} > normal ${normalBackoff}`);
});
test('a healthy device is NOT throttled even when the band is critical (lag never gates the healthy)', () => {
for (let i = 0; i < 5; i++) {
const v = throttle.check('H', POST + i, 'critical');
assert.ok(v.allow, 'healthy device stays allowed regardless of band');
}
});
test('COLD START: during warm-up, moderate flapping (>baseMax, <ceiling) is NOT throttled', () => {
for (let i = 0; i < 12; i++) { // 12 > baseMax(5) but < hardCeiling(20)
const v = throttle.check('W', WARM + i, 'critical'); // band forced normal in warm-up anyway
assert.ok(v.allow, `warm-up reconnect ${i + 1} must be lenient`);
}
});
test('HARD CEILING is enforced even during warm-up (slow-ramp cannot train through)', () => {
let v;
for (let i = 0; i < 20; i++) {
v = throttle.check('K', WARM + i, 'normal');
assert.ok(v.allow, `warm-up reconnect ${i + 1} (<=ceiling) allowed`);
}
v = throttle.check('K', WARM + 20, 'normal'); // 21st -> over ceiling(20)
assert.equal(v.allow, false);
assert.equal(v.reason, 'hard-ceiling');
});
test('neighbor isolation: one device storming does not throttle another', () => {
for (let i = 0; i < 10; i++) throttle.check('STORM', POST + i, 'normal'); // STORM gets throttled
const v = throttle.check('NEIGHBOR', POST + 11, 'normal');
assert.ok(v.allow, 'a different device must be unaffected');
});
test('release slow: escalation level decays after a calm period', () => {
let v;
for (let i = 0; i < 6; i++) v = throttle.check('R', POST + i, 'normal'); // flagged, level 1
assert.ok(v.level >= 1);
const peak = v.level;
// a calm reconnect well past the window AND past releaseMs(30000)
v = throttle.check('R', POST + 6 + 40_000, 'normal');
assert.ok(v.allow, 'calm reconnect after the storm is allowed');
assert.ok(v.level < peak, `level decays after calm: ${v.level} < ${peak}`);
});
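Reviewer note: a reduced sketch of the per-device sliding-window idea these tests cover, with an injected `now`. It keeps only the window count, the warm-up leniency up to the hard ceiling, and a doubling backoff. The lag-band multiplier and the level decay after `releaseMs` are deliberately left out. `createThrottle`, its option names, and the doubling rule are assumptions, not the implementation in `lib/reconnect-throttle`.

```javascript
'use strict';

// Hypothetical factory: returns check(deviceId, now) -> verdict.
function createThrottle({
  windowMs = 10000, baseMax = 5, hardCeiling = 20,
  baseBackoffMs = 1000, maxBackoffMs = 60000,
  warmupMs = 30000, startedAt = Date.now(),
} = {}) {
  const state = new Map(); // deviceId -> { hits: number[], level: number }

  return function check(deviceId, now = Date.now()) {
    let s = state.get(deviceId);
    if (!s) { s = { hits: [], level: 0 }; state.set(deviceId, s); }

    // Sliding window: keep only attempts inside the last windowMs, then count this one.
    s.hits = s.hits.filter((t) => now - t < windowMs);
    s.hits.push(now);

    // During warm-up only the hard ceiling applies; afterwards the base limit does.
    const inWarmup = now - startedAt < warmupMs;
    const allowed = inWarmup ? hardCeiling : baseMax;
    const observed = s.hits.length;
    if (observed <= allowed) return { allow: true, observed, allowed, level: s.level };

    // Over the limit: escalate and double the backoff, capped at maxBackoffMs.
    s.level += 1;
    const retryAfterMs = Math.min(maxBackoffMs, baseBackoffMs * 2 ** (s.level - 1));
    return { allow: false, observed, allowed, level: s.level, retryAfterMs };
  };
}
```

Keying the state by device id is what gives neighbor isolation: one device exhausting its window never changes another device's count.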

@ -0,0 +1,48 @@
'use strict';
// #142 step 4 — global device_status_log retention sweep. Deterministic, in-process
// (no server/port). Isolate the DB and set retention BEFORE requiring the module
// (config reads env at load; database.js initialises a DB on load).
const os = require('node:os');
const path = require('node:path');
const crypto = require('node:crypto');
process.env.DATA_DIR = path.join(os.tmpdir(), 'st-statusprune-' + crypto.randomBytes(4).toString('hex'));
process.env.STATUS_LOG_RETENTION_DAYS = '2';
const { test } = require('node:test');
const assert = require('node:assert/strict');
const { db, pruneStatusLog } = require('../db/database');
test('global sweep deletes rows older than retention across ALL devices, keeps recent', () => {
db.exec('DELETE FROM device_status_log'); // clean slate
const old = db.prepare("INSERT INTO device_status_log (device_id, status, timestamp) VALUES (?, ?, strftime('%s','now') - ?)");
// 5 days old (> 2d retention): an active device, a device NOT in the devices
// table (removed/idle — what the per-device insert-time prune never revisits),
// and the heartbeat offline_timeout status that bypasses logDeviceStatus.
old.run('live-dev', 'online', 5 * 86400);
old.run('removed-idle-dev', 'offline', 5 * 86400);
old.run('hb-dev', 'offline_timeout', 5 * 86400);
// recent (< retention): must survive, regardless of device existence / status.
old.run('live-dev', 'online', 0);
old.run('hb-dev', 'offline_timeout', 3600);
assert.equal(db.prepare('SELECT COUNT(*) c FROM device_status_log').get().c, 5, 'seeded 5 rows');
const deleted = pruneStatusLog();
assert.equal(deleted, 3, 'the 3 over-retention rows pruned (incl. removed-idle + offline_timeout paths)');
const remaining = db.prepare('SELECT device_id, status FROM device_status_log ORDER BY device_id').all();
assert.equal(remaining.length, 2);
// both survivors are the recent rows; no old row of any device/status survived
assert.deepEqual(remaining.map(r => r.device_id).sort(), ['hb-dev', 'live-dev']);
const oldestNow = db.prepare("SELECT MIN(timestamp) m FROM device_status_log").get().m;
const cutoff = Math.floor(Date.now() / 1000) - 2 * 86400;
assert.ok(oldestNow >= cutoff, 'no surviving row is older than the retention cutoff');
});
test('sweep is safe and idempotent on an empty/already-clean table', () => {
db.exec('DELETE FROM device_status_log');
assert.equal(pruneStatusLog(), 0, 'nothing to delete -> 0, no throw');
});

@ -6,6 +6,7 @@ const { db, pruneTelemetry, pruneScreenshots } = require('../db/database');
const config = require('../config');
const heartbeat = require('../services/heartbeat');
const commandQueue = require('../lib/command-queue');
const reconnectThrottle = require('../lib/reconnect-throttle');
// Debounce window for marking a device offline on socket disconnect. Brief
// flap (Wi-Fi blip, Engine.IO ping miss, server-side eviction-then-reconnect)
@ -27,6 +28,12 @@ const OFFLINE_DEBOUNCE_MS = 5000;
// event is still forwarded every time, so the UI is unaffected. In-memory only.
const lastPlayLogAt = new Map();
const PLAY_LOG_MIN_GAP_MS = 2000;
// #142 content-ack dedup. An older app can spam "content <id>: ready" for the same
// item; each was logged + emitted individually (secondary load). Suppress identical
// (device_id, content_id, status) reports within config.contentAckDedupMs. A status
// CHANGE has a different key and passes immediately. In-memory; resets on restart.
const lastContentAck = new Map();
const { getUserPlan, getUserDeviceCount } = require('../middleware/subscription');
// Phase 2.3: deviceRoom() resolves a device_id to its workspace room so
// dashboardNs.emit can be scoped instead of broadcast platform-wide.
@ -353,6 +360,23 @@ module.exports = function setupDeviceSocket(io) {
return;
}
// #142: per-device reconnect throttle. Only GENUINE reconnects (a new
// socket) count — same-socket playlist refreshes (isPlaylistRefresh) are
// exempt. This runs BEFORE the heavy register work (DB writes, playlist
// build) so a single flapping device cannot saturate the event loop. The
// verdict is per-device; global lag only scales an already-flagged
// device's backoff, never gates a healthy one.
if (!isPlaylistRefresh) {
const verdict = reconnectThrottle.check(device_id);
if (!verdict.allow) {
console.warn(`[throttle] device ${device_id} reconnect throttled: reason=${verdict.reason} band=${verdict.band} observed=${verdict.observed}/${verdict.allowed} per ${config.reconnectWindowMs}ms -> backoff ${verdict.retryAfterMs}ms (level ${verdict.level})`);
socket.emit('device:throttled', { retry_after_ms: verdict.retryAfterMs, reason: 'reconnect_rate' });
// nextTick disconnect so the throttle notice flushes first.
process.nextTick(() => { try { socket.disconnect(true); } catch (_) { /* */ } });
return;
}
}
currentDeviceId = device_id;
authenticated = true;
// Cancel any pending offline timer - device is back in the grace window
@ -372,8 +396,12 @@ module.exports = function setupDeviceSocket(io) {
}
if (device_info) {
db.prepare('UPDATE devices SET android_version = ?, app_version = ?, screen_width = ?, screen_height = ?, render_width = ?, render_height = ? WHERE id = ?')
.run(device_info.android_version, device_info.app_version, device_info.screen_width, device_info.screen_height, device_info.render_width ?? null, device_info.render_height ?? null, device_id);
db.prepare(`UPDATE devices SET android_version = ?, app_version = ?, screen_width = ?, screen_height = ?, render_width = ?, render_height = ?,
ota_status = ?, ota_target_version = ?, ota_attempts = ?, ota_updated_at = strftime('%s','now') WHERE id = ?`)
.run(device_info.android_version, device_info.app_version, device_info.screen_width, device_info.screen_height, device_info.render_width ?? null, device_info.render_height ?? null,
// #139 Phase 2: older APKs don't send these — default to a clean 'none' state.
device_info.ota_status ?? 'none', device_info.ota_target_version ?? null, device_info.ota_attempts ?? 0,
device_id);
}
heartbeat.registerConnection(device_id, socket.id);
@ -557,6 +585,13 @@ module.exports = function setupDeviceSocket(io) {
if (!requireDeviceAuth()) return;
const { device_id, content_id, status } = data;
if (device_id !== currentDeviceId) return;
// #142: drop repeats of the same (device, content, status) within the dedup
// window. Only a change (new content/status) or a report after the window
// logs+emits, so a device spamming the same "ready" can't add load.
const ackKey = `${device_id}|${content_id}|${status}`;
const nowAck = Date.now();
if (nowAck - (lastContentAck.get(ackKey) || 0) < config.contentAckDedupMs) return;
lastContentAck.set(ackKey, nowAck);
console.log(`Device ${device_id} content ${content_id}: ${status}`);
emitToDeviceWorkspace(dashboardNs, device_id, 'dashboard:content-ack', { device_id, content_id, status });
});
@ -585,6 +620,20 @@ module.exports = function setupDeviceSocket(io) {
});
});
// #139 Phase 2 (Option B): event-driven OTA status. The device announces a status TRANSITION
// ('manual_update_required' on enter-backoff, 'none' on clear) so the dashboard badge updates
// promptly without waiting for a reconnect. The register path still persists these fields too
// (the reconnect backstop if a transition event is missed). Same columns + ?? defaults.
socket.on('device:ota-status', (data) => {
if (!requireDeviceAuth()) return;
const { device_id, ota_status, ota_target_version, ota_attempts } = data || {};
// Unknown / forged / mismatched id -> no-op. WHERE id = ? also makes an unregistered id a
// 0-row update (never throws), so a stray event can't error the socket.
if (!device_id || device_id !== currentDeviceId) return;
db.prepare("UPDATE devices SET ota_status = ?, ota_target_version = ?, ota_attempts = ?, ota_updated_at = strftime('%s','now') WHERE id = ?")
.run(ota_status ?? 'none', ota_target_version ?? null, ota_attempts ?? 0, device_id);
});
// Play event logging (proof-of-play)
socket.on('device:play-event', (data) => {
if (!requireDeviceAuth()) return;
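
Reviewer note: both the register path and the `device:ota-status` handler apply the same `??` defaults before writing the OTA columns. A small sketch of that normalisation as a standalone function, assuming a hypothetical `normalizeOta` helper (the diff inlines the defaults at each call site instead):

```javascript
'use strict';

// Hypothetical helper: apply the "clean 'none' state" defaults used when an
// older APK omits the OTA fields. `??` only replaces null/undefined, so an
// explicit ota_attempts of 0 or a real status string is passed through as-is.
function normalizeOta(data) {
  const { ota_status, ota_target_version, ota_attempts } = data || {};
  return {
    ota_status: ota_status ?? 'none',
    ota_target_version: ota_target_version ?? null,
    ota_attempts: ota_attempts ?? 0,
  };
}
```

Sharing one helper between the two call sites would keep the register backstop and the event path from drifting apart, at the cost of one more indirection.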

@ -1,6 +1,6 @@
<?xml version="1.0" encoding="UTF-8"?>
<widget xmlns="http://www.w3.org/ns/widgets" xmlns:tizen="http://tizen.org/ns/widgets"
id="http://screentinker.com/player" version="1.9.1" viewmodes="maximized">
id="http://screentinker.com/player" version="1.9.2" viewmodes="maximized">
<tizen:application id="ScrnTinkr1.ScreenTinker" package="ScrnTinkr1" required_version="2.4"/>
<tizen:profile name="tv"/>
<name>ScreenTinker</name>