chore(release): v1.9.2-beta1

docs(#142): 1.9.2-beta1 changelog + device_status_log VACUUM maintenance note
Documents the #142 changes and tells operators with an already-bloated device_status_log to reclaim space with a one-time manual VACUUM in a maintenance window (retention now bounds further growth). Explains why auto-VACUUM is not enabled. New doc: docs/maintenance-device-status-log.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 09:23:16 -06:00 · 2026-06-27 19:59:34 -05:00 · 2026-06-27 19:59:17 -05:00 · 2026-06-27 19:56:46 -05:00 · 2026-06-27 19:56:32 -05:00 · 2026-06-27 19:50:09 -05:00
40 changed files with 2078 additions and 57 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,36 @@
 # Changelog

+## 1.9.2-beta1 — unreleased
+
+### Fixed — server resilience (#142)
+- **A single flapping device can no longer saturate the event loop.** A new
+  load-aware, per-device reconnect throttle (`lib/reconnect-throttle.js`) gates
+  genuine reconnects *before* the heavy register work (DB writes + playlist build).
+  The verdict is per-device; global event-loop lag only multiplies an
+  already-flagged device's backoff and never throttles a healthy one. Hard ceiling
+  + cold-start warm-up so a full-fleet reconnect after a deploy is never throttled.
+- **`device_status_log` growth is bounded.** Added
+  `idx_device_status_log_device_ts`, a global retention sweep (`pruneStatusLog`,
+  `STATUS_LOG_RETENTION_DAYS` default 3) covering removed/idle devices and the
+  `offline_timeout` path, and de-duplicated the table's `CREATE TABLE`.
+- **`content-ack` spam de-duplicated.** Repeated identical
+  `(device_id, content_id, status)` reports are suppressed within
+  `CONTENT_ACK_DEDUP_MS` (default 10s).
+- **Provisioning cleanup window corrected.** Unclaimed provisioning devices are now
+  swept after 24h (the code used `365 * 86400` — a year — contradicting its own
+  comment).
+
+### Added — observability (#142)
+- **Event-loop lag telemetry** via `perf_hooks.monitorEventLoopDelay()`. Sampled to
+  a bounded `event_loop_lag` table (indexed + pruned, `LAG_TELEMETRY_RETENTION_DAYS`)
+  and surfaced on `/api/status` as `loop_lag` (mean/p50/p99/max + band).
+
+### Maintenance
+- Operators whose `device_status_log` is already bloated from a pre-1.9.2 deployment
+  should reclaim disk with a **one-time manual `VACUUM`** in a maintenance window;
+  retention now bounds further growth. Auto-VACUUM is intentionally not enabled.
+  See [`docs/maintenance-device-status-log.md`](docs/maintenance-device-status-log.md).
+
 ## 1.9.1-beta3 — unreleased

 ### Fixed — Tizen player
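
Note on the 1.9.2-beta1 `content-ack` entry above: the changelog says repeated identical `(device_id, content_id, status)` reports are suppressed within `CONTENT_ACK_DEDUP_MS` (default 10s), but the server code is not part of this excerpt. The following is only a minimal sketch of one way such a window could work. `createAckDeduper`, `shouldProcess`, and `prune` are hypothetical names, and measuring the window from the last accepted report is an assumption, not something the changelog states.

```javascript
// Hypothetical sketch, not the project's implementation.
// Suppresses identical (device_id, content_id, status) reports that arrive
// within a dedup window measured from the last ACCEPTED report (assumption).
const CONTENT_ACK_DEDUP_MS = 10_000; // default from the changelog: 10s

function createAckDeduper(windowMs = CONTENT_ACK_DEDUP_MS) {
  // key -> timestamp (ms) of the last report that was let through
  const lastAccepted = new Map();

  return {
    // Returns true when the report should be processed, false when suppressed.
    shouldProcess(deviceId, contentId, status, now = Date.now()) {
      const key = JSON.stringify([deviceId, contentId, status]);
      const prev = lastAccepted.get(key);
      if (prev !== undefined && now - prev < windowMs) return false;
      lastAccepted.set(key, now);
      return true;
    },

    // Drops entries older than the window so the map stays bounded.
    // Returns the number of entries still tracked.
    prune(now = Date.now()) {
      for (const [key, ts] of lastAccepted) {
        if (now - ts >= windowMs) lastAccepted.delete(key);
      }
      return lastAccepted.size;
    },
  };
}
```

In this sketch a changed `status` for the same device and content is a different key, so a genuine state change is never suppressed; only exact repeats inside the window are dropped.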
--- a/README.md
+++ b/README.md
@@ -426,6 +426,7 @@ keytool -genkey -v -keystore android/release-key.jks -keyalg RSA -keysize 2048 -
 3. Install the ScreenTinker app on your device:
   - **Android TV / tablets**: Download the APK from your instance (`/download/apk`) or build it from source (see above)
   - **Raspberry Pi**: `curl -sSL https://your-instance/scripts/raspberry-pi-setup.sh | bash`
+   - **Debian 13 (headless)**: `curl -sSL https://your-instance/scripts/debian-13-setup.sh | sudo bash`
   - **Windows**: Run the setup script from `scripts/windows-setup.bat`
   - **Samsung Tizen TV / signage**: point the TV's URL Launcher (or browser) at `https://your-instance/player` - no signing needed. For an installed native app, see [tizen/README.md](tizen/README.md)
   - **Any browser**: Open `https://your-instance/player` in kiosk/fullscreen mode
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -95,3 +95,28 @@ by name in release notes and (when applicable) in the GitHub advisory
 itself. Let me know in your report whether you'd like credit and how
 you'd like to be named. Anonymous reports are also welcome — no credit
 is required.
+
+## Uploaded content access model
+
+Uploaded content (images, videos) served under /uploads/content is
+**public by unguessable URL**, not access-controlled:
+
+- Filenames are UUIDv4 (122 bits of randomness), so URLs are not enumerable
+  or guessable.
+- There is no per-request authentication on content bytes, and CORS is open
+  (Access-Control-Allow-Origin: *) because the web player's canvas-based
+  screenshot capture requires cross-origin access.
+- Anyone who obtains a content URL can read that file, cross-tenant, with no
+  expiry (immutable 30-day cache) and no revocation short of deleting the file.
+
+This is an intentional design choice for digital signage, where content is
+destined for public display. It is **security-through-unguessability, not
+access control.**
+
+**Do not upload content you require to remain confidential** - including
+material that is destined for a screen but not yet public (e.g. a scheduled
+promotion before its reveal, or an internal board containing names or other
+sensitive details). Such content is world-readable from the moment of upload.
+If pre-launch or tenant-private confidentiality is a requirement for your
+deployment, open an issue - signed/expiring URLs are tracked but not yet
+implemented.
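
Note on the access model above: the "122 bits of randomness" figure follows from the UUIDv4 layout, where 6 of the 128 bits are fixed (4 version bits and 2 variant bits). The upload code is not shown in this excerpt, so the sketch below is only an illustration of that naming scheme using Node's built-in `crypto.randomUUID()`. `storedNameFor` and `isUuidV4Name` are hypothetical helpers, and keeping the original file extension is an assumption.

```javascript
// Hypothetical sketch, not the project's upload code.
const { randomUUID } = require('node:crypto');
const path = require('node:path');

// UUIDv4 text form: the version nibble is always 4 and the variant nibble is
// one of 8, 9, a, b. Those 6 fixed bits leave 122 random bits.
const UUID_V4_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/;

// Builds a storage filename that carries nothing from the original name
// except its (lower-cased) extension.
function storedNameFor(originalName) {
  const ext = path.extname(originalName).toLowerCase();
  return `${randomUUID()}${ext}`;
}

// Checks that a stored filename's base is a well-formed UUIDv4.
function isUuidV4Name(storedName) {
  const base = path.basename(storedName, path.extname(storedName));
  return UUID_V4_RE.test(base);
}
```

A name generated this way is hard to guess, but as the section stresses, that is not access control: anyone who learns the URL can still fetch the file.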
--- a/2
+++ b/2
@@ -1 +1 @@
-1.9.1-beta6
+1.9.2-beta1
--- a/android/app/build.gradle.kts
+++ b/android/app/build.gradle.kts
@@ -9,10 +9,10 @@ android {

    defaultConfig {
        applicationId = "com.remotedisplay.player"
-        minSdk = 26
+        minSdk = 24
        targetSdk = 34
-        versionCode = 26
-        versionName = "1.9.1-beta6"
+        versionCode = 31
+        versionName = "1.9.2-beta1"
    }

    signingConfigs {
--- a/android/app/src/main/java/com/remotedisplay/player/MainActivity.kt
+++ b/android/app/src/main/java/com/remotedisplay/player/MainActivity.kt
@@ -240,6 +240,12 @@ class MainActivity : AppCompatActivity() {

        // Start auto-update checker
        updateChecker = UpdateChecker(this)
+        // #139: surface OTA status (applying / backing off / manual-update-required) to the
+        // dashboard. wsService is read lazily — it binds after this runs.
+        updateChecker.otaLogReporter = { level, msg -> wsService?.sendLog("ota", level, msg) }
+        // #139 Phase 2 (Option B): announce OTA status transitions (clear / enter-backoff) so the
+        // dashboard badge clears/lights up promptly without waiting for a reconnect.
+        updateChecker.otaStatusReporter = { wsService?.sendOtaStatus() }
        updateChecker.startPeriodicCheck()

    }
--- a/android/app/src/main/java/com/remotedisplay/player/data/ServerConfig.kt
+++ b/android/app/src/main/java/com/remotedisplay/player/data/ServerConfig.kt
@@ -71,4 +71,37 @@ class ServerConfig(context: Context) {
    fun clearPlaylistCache() {
        prefs.edit().remove("cached_playlist").apply()
    }
+
+    // #139 OTA attempt state. Persisted (not in-memory) on purpose: the OTA loop is driven
+    // by Fire OS restarting the app, which re-fires the update check; an in-memory counter
+    // would reset on every restart and never back off. `otaTargetVersion` is the version we
+    // are currently trying to install; `otaAttempts` counts install attempts for it;
+    // `otaLastAttemptAt` gates the post-cap retry backoff.
+    var otaTargetVersion: String
+        get() = prefs.getString("ota_target_version", "") ?: ""
+        set(value) = prefs.edit().putString("ota_target_version", value).apply()
+
+    var otaAttempts: Int
+        get() = prefs.getInt("ota_attempts", 0)
+        set(value) = prefs.edit().putInt("ota_attempts", value).apply()
+
+    var otaLastAttemptAt: Long
+        get() = prefs.getLong("ota_last_attempt_at", 0L)
+        set(value) = prefs.edit().putLong("ota_last_attempt_at", value).apply()
+
+    // #139: true once the "entering backoff" status has been reported for the current target,
+    // so the dashboard line fires on the transition only — not on every backed-off poll (Fire OS
+    // restarts re-fire the check constantly). Reset on a new target / on clear.
+    var otaBackoffReported: Boolean
+        get() = prefs.getBoolean("ota_backoff_reported", false)
+        set(value) = prefs.edit().putBoolean("ota_backoff_reported", value).apply()
+
+    fun clearOtaState() {
+        prefs.edit()
+            .remove("ota_target_version")
+            .remove("ota_attempts")
+            .remove("ota_last_attempt_at")
+            .remove("ota_backoff_reported")
+            .apply()
+    }
 }
--- a/android/app/src/main/java/com/remotedisplay/player/service/OtaThrottle.kt
+++ b/android/app/src/main/java/com/remotedisplay/player/service/OtaThrottle.kt
@@ -0,0 +1,74 @@
+package com.remotedisplay.player.service
+
+/**
+ * #139: pure OTA throttle decision logic — no Android dependencies, so it's unit-testable
+ * (see OtaThrottleTest). UpdateChecker is the imperative shell: it reads/writes the persisted
+ * fields (ServerConfig / EncryptedSharedPreferences) and performs the actual download + install;
+ * this object owns the stateful RULES so they have coverage beyond a compile:
+ *
+ *  - a new target version resets the attempt budget,
+ *  - a check NEVER consumes the budget — only a launched install does (so a transient
+ *    download/network failure can't park a healthy device in backoff),
+ *  - after MAX_INSTALL_ATTEMPTS failed installs, back off to one retry per BACKOFF_MS,
+ *  - the "entering backoff" signal fires on the crossing only (report-on-transition).
+ */
+object OtaThrottle {
+    const val MAX_INSTALL_ATTEMPTS = 3
+    const val BACKOFF_MS = 24L * 60 * 60 * 1000
+
+    /** Persisted OTA state for the version we are currently trying to install. */
+    data class State(
+        val targetVersion: String = "",
+        val attempts: Int = 0,
+        val lastAttemptAt: Long = 0L,
+        val backoffReported: Boolean = false
+    )
+
+    enum class Action { ATTEMPT, BACKOFF }
+
+    /** True when [latestVersion] differs from the persisted target — caller drops stale APKs. */
+    fun isNewTarget(state: State, latestVersion: String): Boolean = state.targetVersion != latestVersion
+
+    /**
+     * A check found [latestVersion] available. Returns the state to persist (reset on a new
+     * target) and whether to attempt now. Does NOT count an attempt: the budget is consumed
+     * only once an install is actually launched (see [onInstallLaunched]).
+     */
+    fun onUpdateAvailable(state: State, latestVersion: String, now: Long): Pair<State, Action> {
+        val s = if (isNewTarget(state, latestVersion)) State(targetVersion = latestVersion) else state
+        if (s.attempts >= MAX_INSTALL_ATTEMPTS && now - s.lastAttemptAt < BACKOFF_MS) {
+            return s to Action.BACKOFF
+        }
+        return s to Action.ATTEMPT
+    }
+
+    /**
+     * An install was actually launched (a verified APK was in hand). Consumes one attempt and
+     * returns the new state plus whether this attempt is the FIRST to cross the cap into backoff
+     * (true => caller reports "manual update required" once; false on all later polls).
+     */
+    fun onInstallLaunched(state: State, now: Long): Pair<State, Boolean> {
+        val attempts = state.attempts + 1
+        var s = state.copy(attempts = attempts, lastAttemptAt = now)
+        val enteredBackoff = attempts >= MAX_INSTALL_ATTEMPTS && !s.backoffReported
+        if (enteredBackoff) s = s.copy(backoffReported = true)
+        return s to enteredBackoff
+    }
+
+    /** A check found us already on the latest. True if there was pending OTA state to clear. */
+    fun shouldClearOnUpToDate(state: State): Boolean = state.targetVersion.isNotEmpty()
+
+    /**
+     * #139 Phase 2: operator-facing status for the dashboard.
+     *  - "none"                    : no update pending.
+     *  - "manual_update_required"  : capped AND still inside the backoff window — this device
+     *                                can't self-install; a human needs to update it.
+     *  - "pending"                 : an update is in progress / will retry (under the cap, or the
+     *                                window has elapsed so a retry is due).
+     */
+    fun statusFor(state: State, now: Long): String = when {
+        state.targetVersion.isEmpty() -> "none"
+        state.attempts >= MAX_INSTALL_ATTEMPTS && now - state.lastAttemptAt < BACKOFF_MS -> "manual_update_required"
+        else -> "pending"
+    }
+}
--- a/android/app/src/main/java/com/remotedisplay/player/service/UpdateChecker.kt
+++ b/android/app/src/main/java/com/remotedisplay/player/service/UpdateChecker.kt
@@ -39,6 +39,25 @@ class UpdateChecker(private val context: Context) {

    private var installReceiverRegistered = false

+    // #139: report OTA status to the dashboard (device:log, tag "ota"). Wired by MainActivity
+    // to WebSocketService.sendLog; null until then. Read lazily so binding order doesn't matter.
+    // The throttle thresholds + decision rules live in OtaThrottle (pure, unit-tested); this
+    // class is the imperative shell that persists state and does the download/install.
+    var otaLogReporter: ((level: String, message: String) -> Unit)? = null
+
+    private fun report(level: String, message: String) {
+        when (level) { "error" -> Log.e(TAG, message); "warn" -> Log.w(TAG, message); else -> Log.i(TAG, message) }
+        try { otaLogReporter?.invoke(level, message) } catch (_: Throwable) {}
+    }
+
+    // #139 Phase 2 (Option B): announce an OTA status TRANSITION to the server (wired by
+    // MainActivity to WebSocketService.sendOtaStatus, which reads the just-persisted state).
+    // Fired ONLY at the two transitions — clear and enter-backoff — so the dashboard badge
+    // updates promptly without waiting for a reconnect, with no per-poll/heartbeat chatter.
+    // Lazy/null-safe so binding order doesn't matter, same as otaLogReporter.
+    var otaStatusReporter: (() -> Unit)? = null
+    private fun announceOtaStatus() { try { otaStatusReporter?.invoke() } catch (_: Throwable) {} }
+
    // The PackageInstaller session reports its status (incl. STATUS_PENDING_USER_ACTION,
    // which Android 13+ returns for non-device-owner installers) via this broadcast.
    // Without handling it the committed session just stalls and the update never
@@ -59,6 +78,8 @@ class UpdateChecker(private val context: Context) {
                            catch (e: Exception) { Log.e(TAG, "Confirm launch failed: ${e.message}") }
                        }
                    }
+                    // Logcat only — NOT report(): these fire per attempt, and #139 keeps the
+                    // device:log/dashboard channel to state transitions (enter-backoff, clear).
                    android.content.pm.PackageInstaller.STATUS_SUCCESS -> Log.i(TAG, "Update installed successfully")
                    else -> Log.w(TAG, "Install status: ${intent.getStringExtra(android.content.pm.PackageInstaller.EXTRA_STATUS_MESSAGE)}")
                }
@@ -116,9 +137,17 @@ class UpdateChecker(private val context: Context) {

                Log.i(TAG, "Current: $currentVersion, Latest: $latestVersion, Update: $updateAvailable")

-                if (updateAvailable && downloadUrl.isNotEmpty()) {
-                    Log.i(TAG, "Update available! Downloading...")
-                    downloadAndInstall("${config.serverUrl}$downloadUrl", latestVersion)
+                if (!updateAvailable) {
+                    // #139: on the latest version now. If OTA state was pending, the install
+                    // landed (the app relaunched as the new version) — clear state + caches once.
+                    if (OtaThrottle.shouldClearOnUpToDate(otaState())) {
+                        report("info", "OTA complete: now on $currentVersion — clearing update state")
+                        config.clearOtaState()
+                        cleanupApks(null)
+                        announceOtaStatus() // transition -> emits 'none' so the badge clears promptly
+                    }
+                } else if (downloadUrl.isNotEmpty()) {
+                    maybeUpdate(latestVersion, "${config.serverUrl}$downloadUrl")
                }
            } catch (e: Exception) {
                Log.e(TAG, "Update check error: ${e.message}")
@@ -126,20 +155,89 @@ class UpdateChecker(private val context: Context) {
        }.start()
    }

-    private fun downloadAndInstall(url: String, version: String) {
+    private fun otaState() = OtaThrottle.State(
+        config.otaTargetVersion, config.otaAttempts, config.otaLastAttemptAt, config.otaBackoffReported)
+
+    private fun persistOta(s: OtaThrottle.State) {
+        config.otaTargetVersion = s.targetVersion
+        config.otaAttempts = s.attempts
+        config.otaLastAttemptAt = s.lastAttemptAt
+        config.otaBackoffReported = s.backoffReported
+    }
+
+    // #139 imperative shell over OtaThrottle (the pure, unit-tested decision logic). A device
+    // that can't silently install (Fire TV: no device-owner) stops re-pulling the full APK every
+    // cycle. Only a COMMITTED install consumes the attempt budget — a transient download/verify
+    // failure on a HEALTHY device must never park it in backoff.
+    private fun maybeUpdate(latestVersion: String, downloadUrl: String) {
+        val now = System.currentTimeMillis()
+        val cur = otaState()
+        if (OtaThrottle.isNewTarget(cur, latestVersion)) cleanupApks(latestVersion)
+
+        val (afterCheck, action) = OtaThrottle.onUpdateAvailable(cur, latestVersion, now)
+        persistOta(afterCheck)
+        // Capped + still inside the window: do nothing AND stay silent. Fire OS restarts re-fire
+        // this check constantly; reporting here would just move the flood onto the WS channel.
+        // The enter-backoff line was already sent once on the crossing (below).
+        if (action == OtaThrottle.Action.BACKOFF) return
+
+        // download/verify failure → retry on the normal cadence; do NOT count it as an attempt.
+        if (!downloadAndInstall(downloadUrl, latestVersion)) {
+            Log.w(TAG, "Update $latestVersion: download/verify failed — retry next check (no attempt consumed)")
+            return
+        }
+
+        val (afterLaunch, enteredBackoff) = OtaThrottle.onInstallLaunched(afterCheck, now)
+        persistOta(afterLaunch)
+        Log.i(TAG, "Install launched for $latestVersion (attempt ${afterLaunch.attempts}/${OtaThrottle.MAX_INSTALL_ATTEMPTS})")
+        if (enteredBackoff) {
+            report("warn", "Update $latestVersion available but not installing after ${afterLaunch.attempts} attempts — manual update required (backing off to one retry per ${OtaThrottle.BACKOFF_MS / 3_600_000L}h)")
+            announceOtaStatus() // transition -> emits 'manual_update_required'
+        }
+    }
+
+    // #139: remove cached OTA APKs other than `keep` (null = remove all). Keeps the external
+    // files dir from accumulating one stale APK per superseded version.
+    private fun cleanupApks(keep: String?) {
        try {
+            val dir = context.getExternalFilesDir(Environment.DIRECTORY_DOWNLOADS) ?: return
+            val keepName = keep?.let { "ScreenTinker-$it.apk" }
+            dir.listFiles { f ->
+                f.name.startsWith("ScreenTinker-") && f.name.endsWith(".apk") && f.name != keepName
+            }?.forEach { it.delete() }
+        } catch (e: Exception) {
+            Log.w(TAG, "APK cleanup failed: ${e.message}")
+        }
+    }
+
+    // Returns TRUE only when a verified APK is in hand and an install has been launched (the
+    // caller may then count an attempt); FALSE on any download/verify failure — the caller must
+    // NOT count those, so a transient network problem can't burn a healthy device's budget. #139
+    private fun downloadAndInstall(url: String, version: String): Boolean {
+        try {
+            val apkFile = File(context.getExternalFilesDir(Environment.DIRECTORY_DOWNLOADS),
+                "ScreenTinker-$version.apk")
+
+            // #139: reuse a previously-downloaded, verified APK for this version instead of
+            // re-pulling ~8.7 MB every cycle. The file also stays on disk as the artifact for a
+            // manual install when silent install isn't possible.
+            if (apkFile.exists() && verifyApkSignature(apkFile)) {
+                Log.i(TAG, "Reusing cached verified APK: ${apkFile.absolutePath} (${apkFile.length()} bytes)")
+                handler.post { installApk(apkFile) }
+                return true
+            }
+            // A leftover but invalid file (partial/corrupt/tampered) must never be reused.
+            if (apkFile.exists()) apkFile.delete()
+
            // Download to a temp file
            val request = Request.Builder().url(url).build()
            val response = client.newCall(request).execute()

            if (!response.isSuccessful) {
                Log.e(TAG, "Download failed: ${response.code}")
-                return
+                return false
            }

-            val apkFile = File(context.getExternalFilesDir(Environment.DIRECTORY_DOWNLOADS),
-                "ScreenTinker-$version.apk")
-
            response.body?.byteStream()?.use { input ->
                apkFile.outputStream().use { output ->
                    input.copyTo(output)
@@ -158,7 +256,7 @@ class UpdateChecker(private val context: Context) {
            if (!verifyApkSignature(apkFile)) {
                Log.e(TAG, "Refusing update: APK signature/package verification failed (tampered or MITM'd APK)")
                apkFile.delete()
-                return
+                return false
            }
            Log.i(TAG, "APK signature verified against installed app - proceeding to install")

@@ -166,8 +264,10 @@ class UpdateChecker(private val context: Context) {
            handler.post {
                installApk(apkFile)
            }
+            return true
        } catch (e: Exception) {
            Log.e(TAG, "Download/install error: ${e.message}")
+            return false
        }
    }

@@ -245,9 +345,18 @@ class UpdateChecker(private val context: Context) {
    private fun verifyApkSignature(apkFile: File): Boolean {
        return try {
            val pm = context.packageManager
-            val flags = if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.P)
+            // #139: getPackageArchiveInfo(GET_SIGNING_CERTIFICATES).signingInfo is NULL for
+            // ARCHIVE files on API 28/29 (it's only populated from API 30) — so the modern flag
+            // reads 0 certs from a downloaded APK and we'd wrongly REFUSE a legitimate update,
+            // which is the real Fire OS 8 / Android 9 OTA-loop cause. Below API 30, read the
+            // archive's signer via the legacy GET_SIGNATURES + .signatures (its v1/JAR cert,
+            // which IS populated on 28/29). This reads the cert CORRECTLY — it does not weaken
+            // verification: the archive's signer is still extracted and compared to the installed
+            // app's signer below, and a mismatch / zero-cert APK is still rejected.
+            val archiveUsesSigningInfo = Build.VERSION.SDK_INT >= Build.VERSION_CODES.R // API 30
+            val archiveFlags = if (archiveUsesSigningInfo)
                PackageManager.GET_SIGNING_CERTIFICATES else @Suppress("DEPRECATION") PackageManager.GET_SIGNATURES
-            val downloaded = pm.getPackageArchiveInfo(apkFile.absolutePath, flags)
+            val downloaded = pm.getPackageArchiveInfo(apkFile.absolutePath, archiveFlags)
            if (downloaded == null) {
                Log.e(TAG, "Could not parse downloaded APK")
                return false
@@ -256,14 +365,20 @@ class UpdateChecker(private val context: Context) {
                Log.e(TAG, "APK package mismatch: ${downloaded.packageName} != ${context.packageName}")
                return false
            }
-            val installed = pm.getPackageInfo(context.packageName, flags)
-            val downloadedSigs = signingCertHashes(downloaded)
-            val installedSigs = signingCertHashes(installed)
+            // INSTALLED-app read: signingInfo IS populated for installed packages on API 28+,
+            // so keep the modern flag there (this side already worked).
+            val installedUsesSigningInfo = Build.VERSION.SDK_INT >= Build.VERSION_CODES.P // API 28
+            val installedFlags = if (installedUsesSigningInfo)
+                PackageManager.GET_SIGNING_CERTIFICATES else @Suppress("DEPRECATION") PackageManager.GET_SIGNATURES
+            val installed = pm.getPackageInfo(context.packageName, installedFlags)
+            val downloadedSigs = signingCertHashes(downloaded, archiveUsesSigningInfo)
+            val installedSigs = signingCertHashes(installed, installedUsesSigningInfo)
            if (downloadedSigs.isEmpty() || installedSigs.isEmpty()) {
                Log.e(TAG, "Missing signing certificates (downloaded=${downloadedSigs.size}, installed=${installedSigs.size})")
                return false
            }
-            // Share at least one current signing certificate.
+            // Require a non-empty overlap of signer certs (handles multi-signer / cert-rotation
+            // the same way the API>=30 path does: compare the full current signer sets).
            val match = downloadedSigs.any { it in installedSigs }
            if (!match) Log.e(TAG, "APK signing certificate does not match installed app")
            match
@@ -273,8 +388,13 @@ class UpdateChecker(private val context: Context) {
        }
    }

-    private fun signingCertHashes(info: PackageInfo): Set<String> {
-        val sigs: Array<Signature>? = if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.P) {
+    // Read the signer-cert SHA-256 set from a PackageInfo. `useSigningInfo` must match the flag
+    // it was fetched with: GET_SIGNING_CERTIFICATES -> signingInfo.apkContentsSigners (modern;
+    // multi-signer + rotation aware), GET_SIGNATURES -> legacy .signatures (the only field
+    // populated for ARCHIVE reads on API 28/29). Both yield the same cert for a normally-signed
+    // APK; the caller compares as sets so an overlapping signer still verifies.
+    private fun signingCertHashes(info: PackageInfo, useSigningInfo: Boolean): Set<String> {
+        val sigs: Array<Signature>? = if (useSigningInfo) {
            info.signingInfo?.apkContentsSigners
        } else {
            @Suppress("DEPRECATION") info.signatures
--- a/android/app/src/main/java/com/remotedisplay/player/service/WebSocketService.kt
+++ b/android/app/src/main/java/com/remotedisplay/player/service/WebSocketService.kt
@@ -560,6 +560,22 @@ class WebSocketService : Service() {
        } catch (e: Throwable) { Log.w("WebSocketService", "sendLog: ${e.message}") }
    }

+    // #139 Phase 2 (Option B): announce an OTA status transition to the server so the dashboard
+    // badge updates promptly (not only on reconnect). Reads the just-persisted throttle state —
+    // the emit always reflects the stored truth. Called by UpdateChecker at clear / enter-backoff.
+    fun sendOtaStatus() {
+        if (socket?.connected() != true) return
+        try {
+            val s = OtaThrottle.State(config.otaTargetVersion, config.otaAttempts, config.otaLastAttemptAt, config.otaBackoffReported)
+            socket?.emit("device:ota-status", JSONObject().apply {
+                put("device_id", config.deviceId)
+                put("ota_status", OtaThrottle.statusFor(s, System.currentTimeMillis()))
+                put("ota_target_version", config.otaTargetVersion)
+                put("ota_attempts", config.otaAttempts)
+            })
+        } catch (e: Throwable) { Log.w("WebSocketService", "sendOtaStatus: ${e.message}") }
+    }
+
    fun sendPlaybackState(contentId: String, positionSec: Float) {
        if (socket?.connected() != true) return
        try {
--- a/android/app/src/main/java/com/remotedisplay/player/telemetry/DeviceInfo.kt
+++ b/android/app/src/main/java/com/remotedisplay/player/telemetry/DeviceInfo.kt
@@ -13,6 +13,8 @@ import android.os.SystemClock
 import android.provider.Settings
 import android.util.DisplayMetrics
 import android.view.WindowManager
+import com.remotedisplay.player.data.ServerConfig
+import com.remotedisplay.player.service.OtaThrottle
 import java.security.MessageDigest
 import org.json.JSONObject

@@ -49,6 +51,13 @@ class DeviceInfo(private val context: Context) {
            put("screen_height", outH)
            put("render_width", renW)
            put("render_height", renH)
+            // #139 Phase 2: report OTA backoff state (alongside app_version) so the dashboard can
+            // flag screens stuck in manual-update-required. Read from the persisted throttle state.
+            val cfg = ServerConfig(context)
+            val ota = OtaThrottle.State(cfg.otaTargetVersion, cfg.otaAttempts, cfg.otaLastAttemptAt, cfg.otaBackoffReported)
+            put("ota_status", OtaThrottle.statusFor(ota, System.currentTimeMillis()))
+            put("ota_target_version", cfg.otaTargetVersion)
+            put("ota_attempts", cfg.otaAttempts)
        }
    }

--- a/android/app/src/test/java/com/remotedisplay/player/service/OtaThrottleTest.kt
+++ b/android/app/src/test/java/com/remotedisplay/player/service/OtaThrottleTest.kt
@@ -0,0 +1,97 @@
+package com.remotedisplay.player.service
+
+import org.junit.Assert.assertEquals
+import org.junit.Assert.assertFalse
+import org.junit.Assert.assertTrue
+import org.junit.Test
+
+/**
+ * #139: coverage for the OTA throttle state machine (the stateful core that the OTA
+ * re-download-loop fix depends on), independent of Android. UpdateChecker is just the shell.
+ */
+class OtaThrottleTest {
+
+    private val V = "1.9.1-beta6"
+    private val MAX = OtaThrottle.MAX_INSTALL_ATTEMPTS
+    private val WINDOW = OtaThrottle.BACKOFF_MS
+
+    // Launch `n` installs from `start`, returning the resulting state.
+    private fun launch(start: OtaThrottle.State, n: Int, now: Long = 1000L): OtaThrottle.State {
+        var s = start
+        repeat(n) { s = OtaThrottle.onInstallLaunched(s, now + it).first }
+        return s
+    }
+
+    @Test fun newTargetResetsBudget() {
+        val stale = OtaThrottle.State(targetVersion = "1.9.1-beta5", attempts = 2, lastAttemptAt = 1000, backoffReported = true)
+        assertTrue(OtaThrottle.isNewTarget(stale, V))
+        val (s, action) = OtaThrottle.onUpdateAvailable(stale, V, now = 5000)
+        assertEquals(V, s.targetVersion)
+        assertEquals(0, s.attempts)
+        assertEquals(0L, s.lastAttemptAt)
+        assertFalse(s.backoffReported)
+        assertEquals(OtaThrottle.Action.ATTEMPT, action)
+    }
+
+    @Test fun aCheckNeverConsumesBudget_onlyInstallLaunchedDoes() {
+        var s = OtaThrottle.State(targetVersion = V, attempts = 0)
+        // Repeated checks (e.g. each followed by a failed download) must not advance the counter.
+        repeat(5) {
+            val (ns, action) = OtaThrottle.onUpdateAvailable(s, V, now = 100)
+            assertEquals(OtaThrottle.Action.ATTEMPT, action)
+            assertEquals(0, ns.attempts)
+            s = ns
+        }
+        // Only a launched install increments.
+        assertEquals(1, OtaThrottle.onInstallLaunched(s, now = 200).first.attempts)
+    }
+
+    @Test fun capThenBackoffWithinWindow() {
+        val s = launch(OtaThrottle.State(targetVersion = V), MAX, now = 1000L)
+        assertEquals(MAX, s.attempts)
+        assertTrue(s.backoffReported)
+        // A check inside the window → BACKOFF, no further attempt, state unchanged.
+        val (ns, action) = OtaThrottle.onUpdateAvailable(s, V, now = 1000L + WINDOW - 1)
+        assertEquals(OtaThrottle.Action.BACKOFF, action)
+        assertEquals(MAX, ns.attempts)
+    }
+
+    @Test fun enterBackoffSignalsExactlyOnce() {
+        var s = OtaThrottle.State(targetVersion = V)
+        var crossings = 0
+        repeat(MAX + 3) { i ->
+            val (ns, entered) = OtaThrottle.onInstallLaunched(s, now = i.toLong())
+            if (entered) crossings++
+            s = ns
+        }
+        assertEquals("enter-backoff fires only on the crossing", 1, crossings)
+    }
+
+    @Test fun retryAfterWindowElapsedDoesNotReReport() {
+        val capped = OtaThrottle.State(targetVersion = V, attempts = MAX, lastAttemptAt = 0L, backoffReported = true)
+        val (afterCheck, action) = OtaThrottle.onUpdateAvailable(capped, V, now = WINDOW + 1)
+        assertEquals(OtaThrottle.Action.ATTEMPT, action) // window elapsed → one retry allowed
+        val (_, entered) = OtaThrottle.onInstallLaunched(afterCheck, now = WINDOW + 2)
+        assertFalse("already reported entering backoff — must not report again", entered)
+    }
+
+    @Test fun clearsOnSuccessOnlyWhenPending() {
+        assertTrue(OtaThrottle.shouldClearOnUpToDate(OtaThrottle.State(targetVersion = V, attempts = 2)))
+        assertFalse(OtaThrottle.shouldClearOnUpToDate(OtaThrottle.State())) // nothing pending
+    }
+
+    @Test fun statusForReflectsBackoffWindow() {
+        val now = 10_000L
+        // no target → none
+        assertEquals("none", OtaThrottle.statusFor(OtaThrottle.State(), now))
+        // under the cap → pending
+        assertEquals("pending", OtaThrottle.statusFor(
+            OtaThrottle.State(targetVersion = V, attempts = 1, lastAttemptAt = now), now))
+        // capped AND inside the window → manual update required
+        assertEquals("manual_update_required", OtaThrottle.statusFor(
+            OtaThrottle.State(targetVersion = V, attempts = MAX, lastAttemptAt = now), now + WINDOW - 1))
+        // capped but window elapsed (a retry is due) → pending, not stuck
+        assertEquals("pending", OtaThrottle.statusFor(
+            OtaThrottle.State(targetVersion = V, attempts = MAX, lastAttemptAt = now), now + WINDOW + 1))
+    }
+}
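Reviewer note: the behaviour these tests pin down can be modelled in a few lines. The sketch below is a hypothetical JavaScript model (written in the server's language for consistency with the rest of this commit), not the Kotlin `OtaThrottle` itself. `MAX_ATTEMPTS` and `WINDOW_MS` are placeholder values, since the real `MAX` and `WINDOW` constants are defined outside the lines shown here, and the strict `<` comparison at the window boundary is an assumption.

```javascript
// Hypothetical model of the throttle described by the tests above.
// MAX_ATTEMPTS and WINDOW_MS are placeholders, not the real constants.
const MAX_ATTEMPTS = 3;
const WINDOW_MS = 6 * 60 * 60 * 1000;

const emptyState = () => ({
  targetVersion: null, attempts: 0, lastAttemptAt: 0, backoffReported: false,
});

// A check: a new target resets the budget; a check never spends budget.
function onUpdateAvailable(state, version, now) {
  const s = state.targetVersion === version
    ? state
    : { ...emptyState(), targetVersion: version };
  const capped = s.attempts >= MAX_ATTEMPTS;
  const insideWindow = now - s.lastAttemptAt < WINDOW_MS;
  return [s, capped && insideWindow ? 'BACKOFF' : 'ATTEMPT'];
}

// A launched install: the only transition that increments attempts.
// `entered` is true only on the launch that first reaches the cap.
function onInstallLaunched(state, now) {
  const attempts = state.attempts + 1;
  const entered = attempts >= MAX_ATTEMPTS && !state.backoffReported;
  const next = {
    ...state,
    attempts,
    lastAttemptAt: now,
    backoffReported: state.backoffReported || entered,
  };
  return [next, entered];
}

// Status reported to the server: "stuck" only while capped AND inside the window.
function statusFor(state, now) {
  if (!state.targetVersion) return 'none';
  const stuck = state.attempts >= MAX_ATTEMPTS
    && now - state.lastAttemptAt < WINDOW_MS;
  return stuck ? 'manual_update_required' : 'pending';
}

let [state] = onUpdateAvailable(emptyState(), '1.9.2-beta1', 0);
for (let i = 0; i < MAX_ATTEMPTS; i++) {
  [state] = onInstallLaunched(state, 1000);
}
console.log(statusFor(state, 1000 + WINDOW_MS - 1)); // manual_update_required
console.log(statusFor(state, 1000 + WINDOW_MS + 1)); // pending
```

Keeping the check and the launch as separate transitions is what lets a failed download repeat indefinitely without exhausting the install budget.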
--- a/docs/maintenance-device-status-log.md
+++ b/docs/maintenance-device-status-log.md
@ -0,0 +1,44 @@
+# Maintenance: `device_status_log` growth & space reclaim (#142)
+
+## What changed in 1.9.2-beta1
+
+`device_status_log` previously grew without an effective bound (the per-device
+insert-time prune missed removed/idle devices and the heartbeat `offline_timeout`
+insert). In one deployment it reached ~1.2M rows / ~119 MB over ~23 days and
+degraded dashboard performance.
+
+1.9.2-beta1 bounds further growth:
+
+- **Index** `idx_device_status_log_device_ts(device_id, timestamp)` — the dashboard
+  uptime query and the prunes now use an index instead of a full scan.
+- **Global retention sweep** (`pruneStatusLog()`), run on startup and on the
+  heartbeat interval, deletes rows older than **`STATUS_LOG_RETENTION_DAYS`**
+  (default **3**) across *all* devices — including removed/idle devices and the
+  `offline_timeout` rows the per-device prune never revisited.
+
+## Reclaiming space on an already-bloated database
+
+> **Operator action — only needed once, only if your `device_status_log` is already
+> bloated from a pre-1.9.2 deployment.**
+
+Retention bounds *future* growth, but SQLite does **not** return freed pages to the
+filesystem on `DELETE` — the file stays at its high-water mark until a `VACUUM`.
+After upgrading (which prunes the old rows), reclaim the disk with a **one-time
+manual `VACUUM` in a maintenance window**:
+
+```sh
+# stop the server (or do this during a low-traffic window — VACUUM takes a global
+# write lock and rewrites the whole DB file; the app cannot write during it)
+sqlite3 /opt/screentinker/server/db/remote_display.db 'VACUUM;'
+```
+
+In the reference incident this took the DB from **119 MB → 39 MB**.
+
+### Why VACUUM is not automatic
+
+`VACUUM` locks the database and rewrites the entire file — unacceptable on the hot
+path. `PRAGMA auto_vacuum=INCREMENTAL` is **not** enabled either: it only takes
+effect on a freshly-created database (set before the first table) or after a
+one-time full `VACUUM` to convert an existing DB, so enabling it would be a no-op on
+existing installs and a silent behavior change on new ones. Space reclaim is left as
+a deliberate operator decision; ongoing growth is already bounded by retention.
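Reviewer note: the global retention sweep described in this doc amounts to a time-range cutoff over epoch-second timestamps. A minimal in-memory sketch, assuming rows shaped like `{ device_id, timestamp }`; the `expiredRows` helper and the sample rows are illustrative and are not part of the server code:

```javascript
// A row is expired when its timestamp (epoch seconds) is older than the
// retention window. The device it belongs to is irrelevant, which is why the
// sweep also covers removed and idle devices.
function expiredRows(rows, retentionDays, nowSec) {
  const maxAgeSec = Math.round(retentionDays * 86400);
  const cutoff = nowSec - maxAgeSec;
  return rows.filter((row) => row.timestamp < cutoff);
}

const nowSec = 1_000_000;
const rows = [
  { device_id: 'removed-device', timestamp: nowSec - 4 * 86400 }, // 4 days old
  { device_id: 'active-device', timestamp: nowSec - 3600 },       // 1 hour old
];

// With the default STATUS_LOG_RETENTION_DAYS of 3, only the 4-day-old row goes.
const expired = expiredRows(rows, 3, nowSec);
console.log(expired.map((r) => r.device_id)); // [ 'removed-device' ]
```

Deleting rows this way bounds the row count only; the file size stays at its high-water mark until the manual `VACUUM` described above.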
--- a/frontend/js/i18n/en.js
+++ b/frontend/js/i18n/en.js
@ -6,6 +6,8 @@ export default {
  'device.pl_item.orphan_zone_tip': "This item's zone isn't part of the device's current layout. It still plays (recovered into the largest zone), but reassign it to a zone in this layout.",
  'dashboard.device_orphan_tip_one': "{n} item assigned to a zone that isn't in this device's layout — open the device to reassign",
  'dashboard.device_orphan_tip_other': "{n} items assigned to a zone that isn't in this device's layout — open the device to reassign",
+  // #139: device stuck in OTA backoff (can't self-install — e.g. Fire TV) — needs a manual update.
+  'dashboard.device_ota_stuck': 'Update available (v{version}) — install failed {n}×, manual update required',
  // Nav (sidebar)
  'nav.displays': 'Displays',
  'nav.content': 'Content',
--- a/frontend/js/views/dashboard.js
+++ b/frontend/js/views/dashboard.js
@ -117,6 +117,9 @@ function renderDeviceCard(device) {
        <div class="device-card-name">${esc(device.name)}${device.orphan_count > 0 ? `
          <span class="device-orphan-badge" title="${tn('dashboard.device_orphan_tip', device.orphan_count)}" style="margin-left:6px;display:inline-flex;align-items:center;gap:3px;font-size:11px;color:var(--danger);vertical-align:middle">
            <svg width="12" height="12" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M10.29 3.86L1.82 18a2 2 0 0 0 1.71 3h16.94a2 2 0 0 0 1.71-3L13.71 3.86a2 2 0 0 0-3.42 0z"/><line x1="12" y1="9" x2="12" y2="13"/><line x1="12" y1="17" x2="12.01" y2="17"/></svg>${device.orphan_count}
+          </span>` : ''}${device.ota_status === 'manual_update_required' ? `
+          <span class="device-ota-badge" title="${esc(t('dashboard.device_ota_stuck', { version: device.ota_target_version || '?', n: device.ota_attempts || 0 }))}" style="margin-left:6px;display:inline-flex;align-items:center;gap:3px;font-size:11px;color:var(--warning);vertical-align:middle">
+            <svg width="12" height="12" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M21 15v4a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2v-4"/><polyline points="7 10 12 15 17 10"/><line x1="12" y1="15" x2="12" y2="3"/></svg>update
          </span>` : ''}</div>
        ${device.owner_name || device.owner_email ? `<div style="font-size:11px;color:var(--text-muted);margin-bottom:4px">
          <svg width="10" height="10" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" style="vertical-align:-1px">
--- a/scripts/bump-version.sh
+++ b/scripts/bump-version.sh
@ -17,6 +17,25 @@ if [ -n "$(git status --porcelain)" ]; then
  exit 1
 fi

+# Pre-push fast-forward guard. This script creates an annotated tag locally; if
+# origin/main has advanced past the commit we're bumping from, `git push origin main`
+# is rejected as a non-fast-forward - and if the tag gets pushed anyway it fires the
+# release workflow from a commit that isn't even on main (the beta9 divergence
+# incident). Catch the divergence HERE, before the tag exists, so nothing can fire.
+# Best-effort: when the fetch can't run (offline), warn and proceed rather than block
+# a local bump - the push itself is still the backstop.
+if git fetch --quiet origin main 2>/dev/null; then
+  if ! git merge-base --is-ancestor FETCH_HEAD HEAD; then
+    echo "ERROR: origin/main ($(git rev-parse --short FETCH_HEAD)) has commits not in your" >&2
+    echo "       HEAD ($(git rev-parse --short HEAD)) - 'git push origin main' would be rejected." >&2
+    echo "       Merge origin/main into your branch first, then re-run the bump." >&2
+    exit 1
+  fi
+else
+  echo "WARNING: could not fetch origin/main - skipping the fast-forward check (offline?)." >&2
+  echo "         Confirm 'git push origin main' will fast-forward before pushing the tag." >&2
+fi
+
 CURRENT="$(cat VERSION)"
 IFS=. read -r MAJ MIN PAT <<< "$CURRENT"

--- a/scripts/debian-13-setup.sh
+++ b/scripts/debian-13-setup.sh
@ -0,0 +1,549 @@
+#!/bin/bash
+# ScreenTinker - Debian 13 Setup Script
+#
+# Modes:
+#   - Server + Player (both)
+#   - Server only
+#   - Player only
+#
+# Usage:
+#   curl -sSL https://screentinker.com/scripts/debian-13-setup.sh | sudo bash
+#   curl -sSL https://screentinker.com/scripts/debian-13-setup.sh | sudo bash -s -- --server-only
+#   curl -sSL https://screentinker.com/scripts/debian-13-setup.sh | sudo bash -s -- --player-only https://screentinker.com
+
+set -euo pipefail
+
+# -- Configuration --
+SCREENTINKER_DIR="/opt/screentinker"
+SCREENTINKER_PORT=3001
+NODE_MAJOR=20
+LOG_FILE="/var/log/screentinker-debian-setup.log"
+
+# -- Colors --
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+BLUE='\033[0;34m'
+NC='\033[0m'
+
+log()  { echo -e "${GREEN}[ScreenTinker]${NC} $1"; }
+warn() { echo -e "${YELLOW}[WARNING]${NC} $1"; }
+err()  { echo -e "${RED}[ERROR]${NC} $1"; exit 1; }
+
+MODE="both"
+MODE_SET=false
+SERVER_URL=""
+
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --server-only)
+            MODE="server"
+            MODE_SET=true
+            shift
+            ;;
+        --player-only)
+            MODE="player"
+            MODE_SET=true
+            shift
+            if [[ $# -gt 0 && "$1" == http* ]]; then
+                SERVER_URL="$1"
+                shift
+            fi
+            ;;
+        --both)
+            MODE="both"
+            MODE_SET=true
+            shift
+            ;;
+        --help|-h)
+            echo "Usage: sudo ./debian-13-setup.sh [OPTIONS] [SERVER_URL]"
+            echo ""
+            echo "Options:"
+            echo "  --server-only         Install only the server"
+            echo "  --player-only [URL]   Install only the player (prompts for URL if omitted)"
+            echo "  --both                Install both server and player (default)"
+            echo "  --help                Show this help"
+            echo ""
+            echo "Examples:"
+            echo "  sudo ./debian-13-setup.sh"
+            echo "  sudo ./debian-13-setup.sh --server-only"
+            echo "  sudo ./debian-13-setup.sh --player-only https://screentinker.com"
+            exit 0
+            ;;
+        http*)
+            SERVER_URL="$1"
+            shift
+            ;;
+        *)
+            warn "Ignoring unknown argument: $1"; shift
+            ;;
+    esac
+done
+
+if [ "$(id -u)" -ne 0 ]; then
+    err "This script must be run as root. Try: sudo bash debian-13-setup.sh"
+fi
+
+if [ -r /etc/os-release ]; then
+    . /etc/os-release
+    if [ "${ID:-}" != "debian" ] || [ "${VERSION_ID:-}" != "13" ]; then
+        warn "Detected ${PRETTY_NAME:-unknown}. This script targets Debian 13."
+        read -p "Continue anyway? (y/N) " -n 1 -r </dev/tty; echo
+        [[ ! $REPLY =~ ^[Yy]$ ]] && exit 1
+    else
+        log "Detected Debian 13"
+    fi
+fi
+
+if [ "$MODE" = "player" ] && [ -z "$SERVER_URL" ]; then
+    echo ""
+    echo -e "${BLUE}======================================${NC}"
+    echo -e "${BLUE}   ScreenTinker Debian 13 Setup${NC}"
+    echo -e "${BLUE}======================================${NC}"
+    echo ""
+    read -p "Server URL (e.g., https://screentinker.com): " SERVER_URL </dev/tty
+elif [ "$MODE" = "both" ] && [ "$MODE_SET" = false ] && [ -z "$SERVER_URL" ]; then
+    echo ""
+    echo -e "${BLUE}======================================${NC}"
+    echo -e "${BLUE}   ScreenTinker Debian 13 Setup${NC}"
+    echo -e "${BLUE}======================================${NC}"
+    echo ""
+    echo "  1) Server + Player (recommended for single-screen host)"
+    echo "  2) Server Only"
+    echo "  3) Player Only"
+    echo ""
+    read -p "Choose [1/2/3]: " MODE_CHOICE </dev/tty
+    case "$MODE_CHOICE" in
+        2)
+            MODE="server"
+            ;;
+        3)
+            MODE="player"
+            read -p "Server URL (e.g., https://screentinker.com): " SERVER_URL </dev/tty
+            ;;
+        *)
+            MODE="both"
+            ;;
+    esac
+fi
+
+SERVER_URL="${SERVER_URL%/}"
+
+NEED_SERVER=false
+NEED_PLAYER=false
+
+case "$MODE" in
+    server)
+        NEED_SERVER=true
+        ;;
+    player)
+        NEED_PLAYER=true
+        ;;
+    both)
+        NEED_SERVER=true
+        NEED_PLAYER=true
+        ;;
+    *)
+        err "Unknown mode: $MODE"
+        ;;
+esac
+
+if [ "$NEED_PLAYER" = true ] && [ "$MODE" = "player" ] && [ -z "$SERVER_URL" ]; then
+    err "Player-only mode requires a server URL"
+fi
+
+if [ "$NEED_PLAYER" = true ]; then
+    if [ "$MODE" = "player" ]; then
+        KIOSK_URL="${SERVER_URL}/player"
+    else
+        KIOSK_URL="http://localhost:${SCREENTINKER_PORT}/player"
+    fi
+fi
+
+echo ""
+log "Setup log: $LOG_FILE"
+exec > >(tee -a "$LOG_FILE") 2>&1
+
+log "Updating system packages..."
+apt-get update -qq
+apt-get upgrade -y -qq
+
+log "Installing base dependencies..."
+apt-get install -y -qq \
+    git curl wget unzip htop \
+    avahi-daemon \
+    fonts-liberation fonts-noto-color-emoji \
+    >> "$LOG_FILE" 2>&1
+
+RUNTIME_USER="${SUDO_USER:-$(logname 2>/dev/null || echo root)}"
+if ! id "$RUNTIME_USER" &>/dev/null; then
+    warn "Could not resolve invoking user; defaulting to root"
+    RUNTIME_USER="root"
+fi
+RUNTIME_HOME=$(getent passwd "$RUNTIME_USER" | cut -d: -f6)
+
+if [ "$NEED_SERVER" = true ]; then
+    NEED_NODE=true
+    if command -v node &>/dev/null; then
+        CUR=$(node -v | cut -d'v' -f2 | cut -d'.' -f1)
+        if [ "$CUR" -ge "$NODE_MAJOR" ]; then
+            log "Node.js $(node -v) already installed"
+            NEED_NODE=false
+        fi
+    fi
+
+    if [ "$NEED_NODE" = true ]; then
+        log "Installing Node.js ${NODE_MAJOR}.x..."
+        curl -fsSL "https://deb.nodesource.com/setup_${NODE_MAJOR}.x" | bash - >> "$LOG_FILE" 2>&1
+        apt-get install -y -qq nodejs >> "$LOG_FILE" 2>&1
+        log "Node.js $(node -v) installed"
+    fi
+
+    if [ -d "$SCREENTINKER_DIR/.git" ]; then
+        log "Repo exists at $SCREENTINKER_DIR, pulling latest..."
+        cd "$SCREENTINKER_DIR" && git pull origin main >> "$LOG_FILE" 2>&1
+    else
+        log "Cloning ScreenTinker..."
+        git clone https://github.com/screentinker/screentinker.git "$SCREENTINKER_DIR" >> "$LOG_FILE" 2>&1
+    fi
+
+    log "Installing server dependencies..."
+    cd "$SCREENTINKER_DIR/server"
+    npm install --omit=dev >> "$LOG_FILE" 2>&1
+
+    mkdir -p "$SCREENTINKER_DIR/server/db"
+    mkdir -p "$SCREENTINKER_DIR/server/uploads"
+    chown -R "$RUNTIME_USER":"$RUNTIME_USER" "$SCREENTINKER_DIR"
+
+    log "Creating screentinker-server service..."
+    cat > /etc/systemd/system/screentinker-server.service << SERVICEEOF
+[Unit]
+Description=ScreenTinker Digital Signage Server
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+User=${RUNTIME_USER}
+WorkingDirectory=${SCREENTINKER_DIR}/server
+ExecStart=/usr/bin/node server.js
+Restart=always
+RestartSec=5
+StartLimitBurst=5
+StartLimitIntervalSec=60
+
+Environment=NODE_ENV=production
+Environment=PORT=${SCREENTINKER_PORT}
+Environment=SELF_HOSTED=true
+Environment=HOST=0.0.0.0
+
+StandardOutput=journal
+StandardError=journal
+SyslogIdentifier=screentinker-server
+
+[Install]
+WantedBy=multi-user.target
+SERVICEEOF
+
+    systemctl daemon-reload
+    systemctl enable screentinker-server.service
+    log "Server service enabled"
+fi
+
+if [ "$NEED_PLAYER" = true ]; then
+    log "Installing player packages..."
+    apt-get install -y -qq \
+        xserver-xorg xserver-xorg-legacy x11-xserver-utils xinit \
+        chromium unclutter xdotool \
+        >> "$LOG_FILE" 2>&1 || {
+            warn "Failed to install chromium package, trying chromium-browser..."
+            apt-get install -y -qq xserver-xorg xserver-xorg-legacy x11-xserver-utils xinit chromium-browser unclutter xdotool >> "$LOG_FILE" 2>&1
+        }
+
+    CHROMIUM_BIN=$(command -v chromium 2>/dev/null || command -v chromium-browser 2>/dev/null || echo "/usr/bin/chromium")
+
+    log "Allowing non-root X server startup..."
+    mkdir -p /etc/X11
+    cat > /etc/X11/Xwrapper.config << 'XWRAPEOF'
+allowed_users=anybody
+needs_root_rights=yes
+XWRAPEOF
+
+    log "Creating kiosk launcher..."
+    cat > "$RUNTIME_HOME/screentinker-kiosk.sh" << KIOSKEOF
+#!/bin/bash
+KIOSK_URL="${KIOSK_URL}"
+
+sleep 2
+
+# Disable screen blanking and power management
+xset s off
+xset s noblank
+xset -dpms
+xset s 0 0
+
+# Hide cursor after 3 seconds of inactivity
+unclutter -idle 3 -root &
+
+# Clean Chromium crash flags (prevents restore session dialogs)
+CDIR="\$HOME/.config/chromium/Default"
+mkdir -p "\$CDIR"
+if [ -f "\$CDIR/Preferences" ]; then
+    sed -i 's/"exited_cleanly":false/"exited_cleanly":true/' "\$CDIR/Preferences" 2>/dev/null || true
+    sed -i 's/"exit_type":"Crashed"/"exit_type":"Normal"/' "\$CDIR/Preferences" 2>/dev/null || true
+fi
+
+# Wait for local server if running all-in-one
+if echo "\$KIOSK_URL" | grep -q "localhost"; then
+    echo "Waiting for ScreenTinker server..."
+    for i in \$(seq 1 60); do
+        if curl -sf "http://localhost:${SCREENTINKER_PORT}/api/status" >/dev/null 2>&1; then
+            echo "Server ready (attempt \${i} of 60)"
+            break
+        fi
+        sleep 2
+    done
+fi
+
+# Detect screen resolution so Chromium fills the display on minimal X11 (no WM)
+SCREEN_RES=\$(xrandr 2>/dev/null | grep ' connected' | grep -oE '[0-9]+x[0-9]+' | head -1)
+SCREEN_W=\${SCREEN_RES%%x*}
+SCREEN_H=\${SCREEN_RES##*x}
+if [ -z "\$SCREEN_W" ] || [ -z "\$SCREEN_H" ]; then
+    SCREEN_W=1920
+    SCREEN_H=1080
+fi
+
+exec ${CHROMIUM_BIN} \\
+    --kiosk \\
+    --window-position=0,0 \\
+    --window-size=\${SCREEN_W},\${SCREEN_H} \\
+    --noerrdialogs \\
+    --disable-infobars \\
+    --disable-session-crashed-bubble \\
+    --disable-features=TranslateUI \\
+    --disable-component-update \\
+    --check-for-update-interval=31536000 \\
+    --autoplay-policy=no-user-gesture-required \\
+    --no-first-run \\
+    --disable-pinch \\
+    --overscroll-history-navigation=0 \\
+    --disable-translate \\
+    --disable-sync \\
+    --disable-background-networking \\
+    --disable-default-apps \\
+    --disable-extensions \\
+    --disable-hang-monitor \\
+    --disable-popup-blocking \\
+    --disable-prompt-on-repost \\
+    --metrics-recording-only \\
+    --safebrowsing-disable-auto-update \\
+    --ignore-certificate-errors \\
+    "\$KIOSK_URL"
+KIOSKEOF
+
+    chmod +x "$RUNTIME_HOME/screentinker-kiosk.sh"
+    chown "$RUNTIME_USER":"$RUNTIME_USER" "$RUNTIME_HOME/screentinker-kiosk.sh"
+
+    cat > "$RUNTIME_HOME/.xinitrc" << 'XINITEOF'
+#!/bin/bash
+exec ~/screentinker-kiosk.sh
+XINITEOF
+    chmod +x "$RUNTIME_HOME/.xinitrc"
+    chown "$RUNTIME_USER":"$RUNTIME_USER" "$RUNTIME_HOME/.xinitrc"
+
+    if [ "$NEED_SERVER" = true ]; then
+        KIOSK_AFTER="After=screentinker-server.service"
+        KIOSK_REQ="Requires=screentinker-server.service"
+    else
+        KIOSK_AFTER="After=network-online.target"
+        KIOSK_REQ="Wants=network-online.target"
+    fi
+
+    log "Creating kiosk service..."
+    cat > /etc/systemd/system/screentinker-kiosk.service << SERVICEEOF
+[Unit]
+Description=ScreenTinker Kiosk Display
+${KIOSK_AFTER}
+${KIOSK_REQ}
+# Prevent conflicts with getty on tty1
+Conflicts=getty@tty1.service
+After=getty@tty1.service
+
+[Service]
+Type=simple
+User=${RUNTIME_USER}
+Environment=DISPLAY=:0
+Environment=XAUTHORITY=${RUNTIME_HOME}/.Xauthority
+# Remove stale X lock files from previous crashes before starting
+ExecStartPre=/bin/bash -c 'rm -f /tmp/.X0-lock /tmp/.X11-unix/X0'
+ExecStartPre=/bin/sleep 3
+ExecStart=/usr/bin/startx ${RUNTIME_HOME}/.xinitrc -- :0 -nolisten tcp vt1
+Restart=on-failure
+RestartSec=10
+StartLimitBurst=5
+StartLimitIntervalSec=120
+
+TTYPath=/dev/tty1
+StandardInput=tty
+StandardOutput=journal
+StandardError=journal
+SyslogIdentifier=screentinker-kiosk
+
+[Install]
+WantedBy=multi-user.target
+SERVICEEOF
+
+    systemctl daemon-reload
+    systemctl enable screentinker-kiosk.service
+    log "Kiosk service enabled"
+
+    log "Configuring auto-login on tty1..."
+    mkdir -p /etc/systemd/system/getty@tty1.service.d
+    cat > /etc/systemd/system/getty@tty1.service.d/autologin.conf << AUTOLOGINEOF
+[Service]
+ExecStart=
+ExecStart=-/sbin/agetty --autologin ${RUNTIME_USER} --noclear %I \$TERM
+AUTOLOGINEOF
+
+    # Disable getty on tty1 so it doesn't conflict with the kiosk service
+    systemctl disable getty@tty1.service 2>/dev/null || true
+    systemctl mask getty@tty1.service 2>/dev/null || true
+fi
+
+if [ "$NEED_SERVER" = true ]; then
+    log "Creating management scripts..."
+
+    cat > /usr/local/bin/screentinker-update << 'UPDATEEOF'
+#!/bin/bash
+echo "Stopping services..."
+sudo systemctl stop screentinker-kiosk.service 2>/dev/null || true
+sudo systemctl stop screentinker-server.service 2>/dev/null || true
+
+echo "Pulling latest..."
+cd /opt/screentinker && git pull origin main
+
+echo "Installing dependencies..."
+cd server && npm install --omit=dev
+
+echo "Starting services..."
+sudo systemctl start screentinker-server.service
+if systemctl list-unit-files | grep -q '^screentinker-kiosk.service'; then
+  sleep 3
+  sudo systemctl start screentinker-kiosk.service
+fi
+
+echo ""
+echo "Done! Server: $(systemctl is-active screentinker-server.service)"
+if systemctl list-unit-files | grep -q '^screentinker-kiosk.service'; then
+  echo "      Kiosk:  $(systemctl is-active screentinker-kiosk.service)"
+fi
+UPDATEEOF
+    chmod +x /usr/local/bin/screentinker-update
+
+    cat > /usr/local/bin/screentinker-status << 'STATUSEOF'
+#!/bin/bash
+echo ""
+echo "=== ScreenTinker Status ==="
+echo ""
+IP=$(hostname -I | awk '{print $1}')
+
+if systemctl is-active screentinker-server.service &>/dev/null; then
+    echo "Server:    RUNNING (PID $(systemctl show screentinker-server.service -p MainPID --value))"
+else
+    echo "Server:    STOPPED"
+fi
+
+if systemctl list-unit-files | grep -q '^screentinker-kiosk.service'; then
+    if systemctl is-active screentinker-kiosk.service &>/dev/null; then
+        echo "Kiosk:     RUNNING"
+    else
+        echo "Kiosk:     STOPPED"
+    fi
+fi
+
+echo ""
+echo "Uptime:    $(uptime -p)"
+echo "Disk:      $(df -h /opt/screentinker 2>/dev/null | tail -1 | awk '{print $3 "/" $2 " (" $5 " used)"}')"
+echo "Memory:    $(free -h | awk '/Mem:/ {print $3 " / " $2}')"
+echo ""
+echo "Dashboard: http://${IP}:3001"
+echo "Player:    http://${IP}:3001/player"
+echo "mDNS:      http://$(hostname).local:3001"
+echo ""
+STATUSEOF
+    chmod +x /usr/local/bin/screentinker-status
+
+    cat > /usr/local/bin/screentinker-logs << 'LOGSEOF'
+#!/bin/bash
+case "${1:-server}" in
+    server) journalctl -u screentinker-server.service -f --no-hostname ;;
+    kiosk)  journalctl -u screentinker-kiosk.service -f --no-hostname ;;
+    all)    journalctl -u screentinker-server.service -u screentinker-kiosk.service -f --no-hostname ;;
+    *)      echo "Usage: screentinker-logs [server|kiosk|all]" ;;
+esac
+LOGSEOF
+    chmod +x /usr/local/bin/screentinker-logs
+fi
+
+cat > /etc/motd << 'MOTDEOF'
+
+  ____                        _____          _
+ / ___|  ___ _ __ ___  ___  |_   _|_ _ __ | | _____ _ __
+ \___ \ / __| '__/ _ \/ _ \   | || | '_ \| |/ / _ \ '__|
+  ___) | (__| | |  __/  __/   | || | | | |   <  __/ |
+ |____/ \___|_|  \___|\___|   |_||_|_| |_|_|\_\___|_|
+
+ Open-Source Digital Signage for Any Screen
+
+ Commands:
+   screentinker-status   Show system info and URLs
+   screentinker-update   Pull latest and restart
+   screentinker-logs     Follow logs (server|kiosk|all)
+
+MOTDEOF
+
+if grep -q "#RuntimeWatchdogSec=0" /etc/systemd/system.conf 2>/dev/null; then
+    sed -i 's/#RuntimeWatchdogSec=0/RuntimeWatchdogSec=10/' /etc/systemd/system.conf
+    log "Hardware watchdog enabled (10s)"
+fi
+
+# Disable console blanking so the screen stays on during boot
+if [ -f /etc/default/grub ]; then
+    if ! grep -q "consoleblank=0" /etc/default/grub; then
+        sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 consoleblank=0"/' /etc/default/grub
+        update-grub >> "$LOG_FILE" 2>&1 && log "Console blanking disabled in GRUB" || warn "update-grub failed (non-fatal)"
+    fi
+fi
+
+echo ""
+echo -e "${GREEN}======================================${NC}"
+echo -e "${GREEN}   ScreenTinker Setup Complete!${NC}"
+echo -e "${GREEN}======================================${NC}"
+echo ""
+
+IP=$(hostname -I | awk '{print $1}')
+
+if [ "$MODE" = "both" ]; then
+    echo "Mode: Server + Player"
+    echo "Dashboard: http://${IP}:${SCREENTINKER_PORT}"
+    echo "Player:    http://${IP}:${SCREENTINKER_PORT}/player"
+elif [ "$MODE" = "server" ]; then
+    echo "Mode: Server Only"
+    echo "Dashboard: http://${IP}:${SCREENTINKER_PORT}"
+else
+    echo "Mode: Player Only"
+    echo "Server: $SERVER_URL"
+fi
+
+echo ""
+echo "Services:"
+if [ "$NEED_SERVER" = true ]; then
+    echo "  sudo systemctl [start|stop|restart] screentinker-server"
+fi
+if [ "$NEED_PLAYER" = true ]; then
+    echo "  sudo systemctl [start|stop|restart] screentinker-kiosk"
+fi
+echo ""
+echo -e "${YELLOW}Reboot to start:  sudo reboot${NC}"
+echo ""
--- a/scripts/raspberry-pi-setup.sh
+++ b/scripts/raspberry-pi-setup.sh
@ -280,7 +280,7 @@ fi
 if echo "\$KIOSK_URL" | grep -q "localhost"; then
    echo "Waiting for ScreenTinker server..."
    for i in \$(seq 1 30); do
-        if curl -sf "http://localhost:${SCREENTINKER_PORT}/api/health" >/dev/null 2>&1; then
+        if curl -sf "http://localhost:${SCREENTINKER_PORT}/api/status" >/dev/null 2>&1; then
            echo "Server ready"
            break
        fi
@ -288,8 +288,19 @@ if echo "\$KIOSK_URL" | grep -q "localhost"; then
    done
 fi

+# Detect screen resolution so Chromium fills the display on minimal X11 (no WM)
+SCREEN_RES=\$(xrandr 2>/dev/null | grep ' connected' | grep -oE '[0-9]+x[0-9]+' | head -1)
+SCREEN_W=\${SCREEN_RES%%x*}
+SCREEN_H=\${SCREEN_RES##*x}
+if [ -z "\$SCREEN_W" ] || [ -z "\$SCREEN_H" ]; then
+    SCREEN_W=1920
+    SCREEN_H=1080
+fi
+
 exec ${CHROMIUM_BIN} \\
    --kiosk \\
+    --window-position=0,0 \\
+    --window-size=\${SCREEN_W},\${SCREEN_H} \\
    --noerrdialogs \\
    --disable-infobars \\
    --disable-session-crashed-bubble \\
@ -298,7 +309,6 @@ exec ${CHROMIUM_BIN} \\
    --check-for-update-interval=31536000 \\
    --autoplay-policy=no-user-gesture-required \\
    --no-first-run \\
-    --start-fullscreen \\
    --disable-pinch \\
    --overscroll-history-navigation=0 \\
    --disable-translate \\
--- a/server/config.js
+++ b/server/config.js
@ -90,4 +90,63 @@ module.exports = {
  // on MSP-style deployments where an admin/operator assigns users to existing
  // orgs after signup instead.
  autoCreateOrgOnSignup: !['false', '0'].includes(String(process.env.AUTO_CREATE_ORG_ON_SIGNUP || '').toLowerCase()),
+
+  // #142 event-loop lag telemetry (services/loop-lag.js). perf_hooks
+  // monitorEventLoopDelay is C++-backed, so continuous sampling is cheap. Each
+  // window's p99 is persisted to event_loop_lag (bounded: indexed + pruned from
+  // day one) and drives the banded load level the reconnect throttle reads.
+  lagSampleIntervalMs: parseInt(process.env.LAG_SAMPLE_INTERVAL_MS) || 1000,
+  lagResolutionMs: parseInt(process.env.LAG_RESOLUTION_MS) || 20,
+  lagTelemetryRetentionDays: parseFloat(process.env.LAG_TELEMETRY_RETENTION_DAYS) || 3,
+  lagPruneIntervalMs: parseInt(process.env.LAG_PRUNE_INTERVAL_MS) || 3600000,
+  // Banded load levels from the window p99 (ms). Asymmetric by design: a band is
+  // entered immediately when its up-threshold is crossed (tighten fast), but
+  // released only one step at a time after lagReleaseSamples consecutive samples
+  // fall below a deadband (release slow), so small fluctuations don't flap it.
+  // Bands ONLY scale how hard an already-flagged device is throttled; a healthy
+  // device is never gated by global lag.
+  lagElevatedMs: parseInt(process.env.LAG_ELEVATED_MS) || 100,
+  lagCriticalMs: parseInt(process.env.LAG_CRITICAL_MS) || 250,
+  lagReleaseSamples: parseInt(process.env.LAG_RELEASE_SAMPLES) || 5,
+
+  // #142 load-aware per-device reconnect throttle (lib/reconnect-throttle.js).
+  // The verdict of WHO is misbehaving is ALWAYS per-device (keyed on device_id):
+  // a device is flagged only when it exceeds reconnectBaseMax genuine reconnects
+  // per reconnectWindowMs. Global lag never flags a healthy device — the lag band
+  // only MULTIPLIES how hard an already-flagged device is backed off.
+  reconnectWindowMs: parseInt(process.env.RECONNECT_WINDOW_MS) || 10000,
+  reconnectBaseMax: parseInt(process.env.RECONNECT_BASE_MAX) || 5,
+  // Absolute per-device ceiling, independent of band AND of warm-up: no device may
+  // exceed this many reconnects/window no matter what the adaptive logic computes,
+  // so a slow-ramp attacker can't train its way through.
+  reconnectHardCeiling: parseInt(process.env.RECONNECT_HARD_CEILING) || 20,
+  // Server-enforced backoff for a flagged device: baseBackoff * 2^(level-1) * band
+  // multiplier, capped at maxBackoff. Level escalates while it keeps storming
+  // (tighten fast) and decays one step per reconnectReleaseMs of calm (release slow).
+  reconnectBaseBackoffMs: parseInt(process.env.RECONNECT_BASE_BACKOFF_MS) || 1000,
+  reconnectMaxBackoffMs: parseInt(process.env.RECONNECT_MAX_BACKOFF_MS) || 60000,
+  reconnectMaxLevel: parseInt(process.env.RECONNECT_MAX_LEVEL) || 10,
+  reconnectReleaseMs: parseInt(process.env.RECONNECT_RELEASE_MS) || 30000,
+  // Cold start: for this long after process start, lag is high while the whole
+  // fleet reconnects at once. Treat leniently — force the 'normal' band and apply
+  // only the hard ceiling (no rate-band throttle) so a deploy can't throttle
+  // healthy screens. Throttle state is in-memory and resets on restart.
+  reconnectWarmupMs: parseInt(process.env.RECONNECT_WARMUP_MS) || 30000,
+  reconnectBandElevatedMult: parseFloat(process.env.RECONNECT_BAND_ELEVATED_MULT) || 2,
+  reconnectBandCriticalMult: parseFloat(process.env.RECONNECT_BAND_CRITICAL_MULT) || 4,
+
+  // #142 device_status_log retention. A GLOBAL scheduled sweep (pruneStatusLog in
+  // db/database.js, run on startup + the heartbeat interval) deletes rows older
+  // than this across ALL devices — covering what the per-device insert-time prune
+  // in deviceSocket.js misses: removed/idle devices that never insert again, and
+  // the heartbeat.js offline_timeout insert that bypasses logDeviceStatus. Default
+  // is LOWER than the old hardcoded 7 days (the reporter's bloat happened under 7d);
+  // 2-3 days is plenty for the dashboard's 24h uptime view + diagnostics.
+  statusLogRetentionDays: parseFloat(process.env.STATUS_LOG_RETENTION_DAYS) || 3,
+
+  // #142 content-ack dedup window (deviceSocket.js). A device (esp. older apps)
+  // can spam "content <id>: ready" for the same item; suppress identical
+  // (device_id, content_id, status) reports within this window. A status CHANGE
+  // has a different key and passes immediately. In-memory; resets on restart.
+  contentAckDedupMs: parseInt(process.env.CONTENT_ACK_DEDUP_MS) || 10000,
 };
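Reviewer note: the backoff rule in the comments above (`baseBackoff * 2^(level-1) * band multiplier`, capped at `maxBackoff`) can be worked through with the default values from this diff. This is a sketch only: `backoffMs` and `bandMultiplier` are hypothetical names, the band labels `elevated` and `critical` are assumed from the config key names, and clamping the level to `reconnectMaxLevel` is an assumption. The actual logic lives in `lib/reconnect-throttle.js`, which is not shown here.

```javascript
// Defaults copied from the config values in this commit.
const cfg = {
  reconnectBaseBackoffMs: 1000,
  reconnectMaxBackoffMs: 60000,
  reconnectMaxLevel: 10,
  reconnectBandElevatedMult: 2,
  reconnectBandCriticalMult: 4,
};

// Hypothetical helper: map a lag band name to its backoff multiplier.
function bandMultiplier(band) {
  if (band === 'critical') return cfg.reconnectBandCriticalMult;
  if (band === 'elevated') return cfg.reconnectBandElevatedMult;
  return 1; // 'normal' band: no extra scaling
}

// Backoff for an ALREADY-FLAGGED device. A healthy device never reaches this
// function, so global lag alone cannot throttle it.
function backoffMs(level, band) {
  const lvl = Math.min(Math.max(level, 1), cfg.reconnectMaxLevel);
  const raw = cfg.reconnectBaseBackoffMs * 2 ** (lvl - 1) * bandMultiplier(band);
  return Math.min(raw, cfg.reconnectMaxBackoffMs);
}

console.log(backoffMs(1, 'normal'));   // 1000
console.log(backoffMs(3, 'elevated')); // 8000  (1000 * 4 * 2)
console.log(backoffMs(5, 'critical')); // 60000 (64000 capped at maxBackoff)
```

With these defaults the cap is reached at level 7 in the normal band (1000 * 64 = 64000 ms), so levels 7 through 10 all yield the same 60 s backoff unless `RECONNECT_MAX_BACKOFF_MS` is raised.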
--- a/server/db/database.js
+++ b/server/db/database.js
@ -216,6 +216,24 @@ const migrations = [
  // signal, so the two differ — surfacing both explains "reports 720 but monitor sees 1080".
  "ALTER TABLE devices ADD COLUMN render_width INTEGER",
  "ALTER TABLE devices ADD COLUMN render_height INTEGER",
+  // #139 Phase 2: device-reported OTA backoff status, so the dashboard can flag screens that
+  // can't self-install (Fire TV: no device-owner path) and need a hands-on update. ADD COLUMN
+  // with defaults is non-destructive in SQLite, and the apply loop below swallows "duplicate
+  // column" — so this is idempotent and upgrades an existing populated db without data loss.
+  // ota_updated_at = server receipt time (s), stamped on each register persist.
+  "ALTER TABLE devices ADD COLUMN ota_status TEXT DEFAULT 'none'",
+  "ALTER TABLE devices ADD COLUMN ota_target_version TEXT",
+  "ALTER TABLE devices ADD COLUMN ota_attempts INTEGER DEFAULT 0",
+  "ALTER TABLE devices ADD COLUMN ota_updated_at INTEGER",
+  // #142: index device_status_log for the per-device + time-window access pattern.
+  // schema.sql creates this on fresh installs; this migration covers existing DBs.
+  // Both the dashboard uptime query and the retention prune were full scans — the
+  // dashboard-degradation cause once the table reached 1M+ rows.
+  "CREATE INDEX IF NOT EXISTS idx_device_status_log_device_ts ON device_status_log(device_id, timestamp)",
+  // #142: event-loop lag telemetry table (bounded: indexed + scheduled prune).
+  // schema.sql creates these on fresh installs; this covers existing DBs.
+  "CREATE TABLE IF NOT EXISTS event_loop_lag (id INTEGER PRIMARY KEY AUTOINCREMENT, sampled_at INTEGER NOT NULL DEFAULT (strftime('%s','now')), mean_ms REAL NOT NULL, p50_ms REAL NOT NULL, p99_ms REAL NOT NULL, max_ms REAL NOT NULL, band TEXT NOT NULL DEFAULT 'normal')",
+  "CREATE INDEX IF NOT EXISTS idx_event_loop_lag_sampled ON event_loop_lag(sampled_at)",
 ];
 // Apply each ALTER idempotently. A "duplicate column name" / "already exists"
 // error means the column is already present (expected on a migrated DB) - benign.
@ -732,6 +750,21 @@ const { applyTenantDeleteCascade } = require('../lib/tenant-cascade-migration');
  }
 })();

+// #142 GLOBAL device_status_log retention sweep across ALL devices. Run on startup
+// and on the heartbeat interval (services/heartbeat.js). This covers the rows the
+// per-device insert-time prune in deviceSocket.js misses: removed/idle devices that
+// never insert again, and the heartbeat offline_timeout insert that bypasses
+// logDeviceStatus. A plain time-range delete (like the play_logs prune) — runs off
+// the hot path; after the first sweep the table is small, so the cost is negligible.
+function pruneStatusLog() {
+  try {
+    const maxAgeSec = Math.round(config.statusLogRetentionDays * 86400);
+    const n = db.prepare("DELETE FROM device_status_log WHERE timestamp < strftime('%s','now') - ?").run(maxAgeSec).changes;
+    if (n > 0) console.log(`[status-log] pruned ${n} row(s) older than ${config.statusLogRetentionDays}d`);
+    return n;
+  } catch (_) { return 0; }
+}
+
 // Prune old telemetry (keep last 24h worth at 15s intervals = ~5760, cap at 6000)
 function pruneTelemetry(deviceId) {
  db.prepare(`
@ -804,4 +837,4 @@ try {
 const { verifyAndRepairSchema } = require('../lib/schema-check');
 verifyAndRepairSchema(db);

-module.exports = { db, pruneTelemetry, pruneScreenshots };
+module.exports = { db, pruneTelemetry, pruneScreenshots, pruneStatusLog };
--- a/server/db/schema.sql
+++ b/server/db/schema.sql
@ -463,6 +463,27 @@ CREATE TABLE IF NOT EXISTS device_status_log (
    status          TEXT NOT NULL,
    timestamp       INTEGER NOT NULL DEFAULT (strftime('%s','now'))
 );
+-- #142: index the per-device + time-window access pattern. Both the dashboard
+-- uptime query (WHERE device_id=? AND timestamp>?) and the retention prune
+-- (WHERE device_id=? AND timestamp<?) were full table scans; at 1M+ rows that
+-- was the dashboard-degradation cause in the outage report.
+CREATE INDEX IF NOT EXISTS idx_device_status_log_device_ts ON device_status_log(device_id, timestamp);
+
+-- ===================== EVENT LOOP LAG (#142) =====================
+-- Event-loop delay telemetry from perf_hooks.monitorEventLoopDelay(). Bounded
+-- from day one: indexed on sampled_at and pruned on a schedule (see
+-- services/loop-lag.js, LAG_TELEMETRY_RETENTION_DAYS) so it can never become a
+-- second unbounded-growth table.
+CREATE TABLE IF NOT EXISTS event_loop_lag (
+    id          INTEGER PRIMARY KEY AUTOINCREMENT,
+    sampled_at  INTEGER NOT NULL DEFAULT (strftime('%s','now')),
+    mean_ms     REAL NOT NULL,
+    p50_ms      REAL NOT NULL,
+    p99_ms      REAL NOT NULL,
+    max_ms      REAL NOT NULL,
+    band        TEXT NOT NULL DEFAULT 'normal'
+);
+CREATE INDEX IF NOT EXISTS idx_event_loop_lag_sampled ON event_loop_lag(sampled_at);

 -- ===================== DEVICE FINGERPRINTS =====================

@ -484,13 +505,6 @@ CREATE TABLE IF NOT EXISTS alert_configs (
    created_at      INTEGER NOT NULL DEFAULT (strftime('%s','now'))
 );

-CREATE TABLE IF NOT EXISTS device_status_log (
-    id              INTEGER PRIMARY KEY AUTOINCREMENT,
-    device_id       TEXT NOT NULL,
-    status          TEXT NOT NULL,
-    timestamp       INTEGER NOT NULL DEFAULT (strftime('%s','now'))
-);
-
 -- ===================== PLAYER DEBUG LOGS =====================
 -- Smart TVs (Tizen, WebOS, Fire TV, etc.) have no accessible devtools. The
 -- player captures errors into window.__debugLog client-side and POSTs them
--- a/server/lib/reconnect-throttle.js
+++ b/server/lib/reconnect-throttle.js
@ -0,0 +1,98 @@
+// #142 step 3 — load-aware per-device reconnect throttle (the outage fix).
+//
+// A single device stuck in a tight websocket reconnect loop can flood the server
+// with full register cycles (DB writes + playlist build) and saturate the event
+// loop. This module gates genuine reconnects PER DEVICE, before that heavy work
+// runs in deviceSocket.js.
+//
+// Design (mirrors the issue's suggested mitigation + the lastPlayLogAt pattern):
+//   - WHO is always per-device: a device is "flagged" only when it exceeds
+//     reconnectBaseMax genuine reconnects within reconnectWindowMs. Global lag
+//     NEVER flags a healthy device.
+//   - Load-awareness is BANDED (normal/elevated/critical from services/loop-lag),
+//     not a continuous controller — deterministic and testable. The band only
+//     MULTIPLIES the backoff applied to an ALREADY-flagged device.
+//   - Hysteresis: escalate immediately while storming (tighten fast); decay the
+//     escalation level one step per reconnectReleaseMs of calm (release slow).
+//   - HARD CEILING: independent of band and of warm-up, no device may exceed
+//     reconnectHardCeiling/window — a slow-ramp attacker can't train through it.
+//   - COLD START: for reconnectWarmupMs after process start, force the 'normal'
+//     band and apply only the hard ceiling, so a full-fleet reconnect right after
+//     a deploy doesn't throttle healthy screens.
+//   - State is in-memory (resets on restart), like pair-lockout / totp-lockout.
+
+const config = require('../config');
+const loopLag = require('../services/loop-lag');
+
+// deviceId -> { hits: number[], level: number, blockedUntil: ms, lastThrottleAt: ms }
+const state = new Map();
+let startedAt = Date.now();
+
+function bandMultiplier(band) {
+  if (band === 'critical') return config.reconnectBandCriticalMult;
+  if (band === 'elevated') return config.reconnectBandElevatedMult;
+  return 1;
+}
+
+function reject(s, now, band, reason, observed, allowed) {
+  s.level = Math.min(s.level + 1, config.reconnectMaxLevel);
+  const backoff = Math.min(
+    config.reconnectBaseBackoffMs * Math.pow(2, s.level - 1) * bandMultiplier(band),
+    config.reconnectMaxBackoffMs
+  );
+  s.blockedUntil = now + backoff;
+  s.lastThrottleAt = now;
+  return { allow: false, retryAfterMs: backoff, reason, observed, allowed, band, level: s.level };
+}
+
+// Decide whether to allow a genuine reconnect for `deviceId`.
+// `now` and `bandOverride` are injectable for deterministic tests; production
+// passes only deviceId.
+function check(deviceId, now = Date.now(), bandOverride = null) {
+  const warmup = (now - startedAt) < config.reconnectWarmupMs;
+  const band = bandOverride !== null ? bandOverride : (warmup ? 'normal' : loopLag.getBand());
+
+  let s = state.get(deviceId);
+  if (!s) { s = { hits: [], level: 0, blockedUntil: 0, lastThrottleAt: 0 }; state.set(deviceId, s); }
+
+  // Already inside an enforced backoff window: reject and escalate (tighten fast).
+  if (now < s.blockedUntil) {
+    return reject(s, now, band, 'in-backoff', s.hits.length, config.reconnectBaseMax);
+  }
+
+  // Sliding window of genuine reconnects.
+  s.hits = s.hits.filter((t) => now - t < config.reconnectWindowMs);
+  s.hits.push(now);
+  const observed = s.hits.length;
+
+  // Hard ceiling — always enforced, regardless of band or warm-up.
+  if (observed > config.reconnectHardCeiling) {
+    return reject(s, now, band, 'hard-ceiling', observed, config.reconnectHardCeiling);
+  }
+
+  // Cold start: only the hard ceiling applies; never rate-throttle during warm-up.
+  if (warmup) return allow(s, now, band);
+
+  // Healthy device: under the per-device threshold -> always allowed.
+  if (observed <= config.reconnectBaseMax) return allow(s, now, band);
+
+  // Flagged: storming beyond the per-device threshold -> throttle (band-scaled).
+  return reject(s, now, band, 'rate', observed, config.reconnectBaseMax);
+}
+
+function allow(s, now, band) {
+  // Release slow: decay one escalation level per reconnectReleaseMs of calm.
+  if (s.level > 0 && now - s.lastThrottleAt > config.reconnectReleaseMs) {
+    s.level = Math.max(0, s.level - 1);
+    s.lastThrottleAt = now;
+  }
+  return { allow: true, band, level: s.level };
+}
+
+// Test-only: clear state and optionally rewind the warm-up origin.
+function __resetForTest(opts = {}) {
+  state.clear();
+  if (opts.startedAt !== undefined) startedAt = opts.startedAt;
+}
+
+module.exports = { check, __resetForTest };
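
Reviewer note on lib/reconnect-throttle.js: the backoff that reject() computes is easiest to check as a table. The sketch below reimplements only the formula from reject() as a standalone function. Every number in `cfg` is an illustrative assumption except the critical multiplier of 4, which matches the reconnectBandCriticalMult default in this diff; the real base backoff, cap, max level, and elevated multiplier live in server/config.js and are not shown here.

```javascript
// Illustrative config. Only bandMult.critical (4) is taken from this diff;
// all other values are assumptions chosen to make the arithmetic easy to read.
const cfg = {
  reconnectBaseBackoffMs: 1000, // assumed
  reconnectMaxBackoffMs: 60000, // assumed
  reconnectMaxLevel: 6,         // assumed
  bandMult: { normal: 1, elevated: 2, critical: 4 }, // elevated = 2 is assumed
};

// Same shape as reject(): exponential in the escalation level (the level AFTER
// it has been incremented, so it starts at 1), scaled by the load band, then
// capped at the maximum backoff.
function backoffMs(level, band) {
  const lvl = Math.min(level, cfg.reconnectMaxLevel);
  const raw = cfg.reconnectBaseBackoffMs * Math.pow(2, lvl - 1) * cfg.bandMult[band];
  return Math.min(raw, cfg.reconnectMaxBackoffMs);
}

const levels = [1, 2, 3, 4, 5, 6];
const normalSchedule = levels.map((l) => backoffMs(l, 'normal'));
const criticalSchedule = levels.map((l) => backoffMs(l, 'critical'));

console.log('normal  :', normalSchedule);
console.log('critical:', criticalSchedule);
```

With these assumed values the band multiplier only shifts an already-flagged device further along the same curve, and the cap bounds the worst case, which is the behaviour the header comment describes.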
--- a/server/package-lock.json
+++ b/server/package-lock.json
@ -1,12 +1,12 @@
 {
  "name": "screentinker",
-  "version": "1.9.1-beta6",
+  "version": "1.9.2-beta1",
  "lockfileVersion": 3,
  "requires": true,
  "packages": {
    "": {
      "name": "screentinker",
-      "version": "1.9.1-beta6",
+      "version": "1.9.2-beta1",
      "dependencies": {
        "@azure/msal-node": "^5.2.1",
        "archiver": "^7.0.1",
--- a/server/package.json
+++ b/server/package.json
@ -1,6 +1,6 @@
 {
  "name": "screentinker",
-  "version": "1.9.1-beta6",
+  "version": "1.9.2-beta1",
  "description": "ScreenTinker - Digital Signage Management Server",
  "main": "server.js",
  "scripts": {
--- a/server/routes/assignments.js
+++ b/server/routes/assignments.js
@ -160,20 +160,58 @@ function checkItemWrite(req, res) {
  return item;
 }

-// #129: real-time mute. Tell every device on this playlist to toggle the volume of the
-// matching currently-playing item NOW (decoupled from publish — the device matches by
-// content_id/widget_id and applies it live). The new value is also written to the row, so
-// it lands in the next published snapshot and persists across playlist reloads.
+// #129 + mute-fix: per-item mute has to do TWO things, because the device plays from
+// playlists.published_snapshot (deviceSocket.buildPlaylistPayload), NOT the draft
+// playlist_items the toggle writes:
+//   (1) LIVE — tell every device on this playlist to silence the matching currently-playing
+//       item NOW (device matches by content_id/widget_id). Mutes the in-progress playthrough.
+//   (2) PERSIST — patch the matching item's `muted` inside the published_snapshot the device
+//       actually plays, then re-push the playlist. Without this the snapshot kept muted=0, so
+//       every loop/reload re-applied full volume — the "icon red but audio plays across 3
+//       playthroughs" bug (Android re-loads each loop; web's native <video> loop masked it).
+// We patch the snapshot SURGICALLY (just the muted field of matching items) rather than calling
+// publishPlaylist, so a mute toggle can't prematurely publish other pending draft edits or flip
+// the playlist's draft/published status. muted is written as 0/1 to match buildSnapshotItems'
+// format (the player reads it via optInt). playlist_items.muted is still updated by the caller,
+// so a later full publish stays consistent.
 function emitMuteChanged(req, item, muted) {
  try {
    const io = req.app.get('io');
    if (!io) return;
    const deviceNs = io.of('/device');
+    const m = !!muted;
+
+    // (2) PERSIST: patch the published snapshot the device reads from.
+    const pl = db.prepare('SELECT published_snapshot FROM playlists WHERE id = ?').get(item.playlist_id);
+    if (pl && pl.published_snapshot) {
+      let snap = null;
+      try { snap = JSON.parse(pl.published_snapshot); } catch (e) { snap = null; }
+      if (Array.isArray(snap)) {
+        let changed = false;
+        for (const s of snap) {
+          const match = item.content_id ? s.content_id === item.content_id
+            : (item.widget_id ? s.widget_id === item.widget_id : false);
+          if (match && (s.muted ? 1 : 0) !== (m ? 1 : 0)) { s.muted = m ? 1 : 0; changed = true; }
+        }
+        if (changed) {
+          db.prepare('UPDATE playlists SET published_snapshot = ? WHERE id = ?')
+            .run(JSON.stringify(snap), item.playlist_id);
+        }
+      }
+    }
+
+    // (1) LIVE toggle + re-deliver the patched snapshot so loops re-apply the correct flag.
+    // Lazy require (matches playlists.pushToDevices) to avoid a route<->ws circular import.
+    const { buildPlaylistPayload } = require('../ws/deviceSocket');
+    const commandQueue = require('../lib/command-queue');
    const devices = db.prepare('SELECT id FROM devices WHERE playlist_id = ?').all(item.playlist_id);
-    const payload = { content_id: item.content_id || null, widget_id: item.widget_id || null, muted: !!muted };
-    for (const d of devices) deviceNs.to(d.id).emit('device:mute-changed', payload);
-    console.log(`[mute] item ${item.id} (content ${item.content_id || item.widget_id}) -> ${muted ? 'MUTED' : 'unmuted'}; notified ${devices.length} device(s)`);
-  } catch (e) { /* best-effort live toggle; the published snapshot is the source of truth */ }
+    const payload = { content_id: item.content_id || null, widget_id: item.widget_id || null, muted: m };
+    for (const d of devices) {
+      deviceNs.to(d.id).emit('device:mute-changed', payload);                        // current playthrough
+      commandQueue.queueOrEmitPlaylistUpdate(deviceNs, d.id, buildPlaylistPayload);  // future loads (no reload of current item)
+    }
+    console.log(`[mute] item ${item.id} (content ${item.content_id || item.widget_id}) -> ${m ? 'MUTED' : 'unmuted'}; snapshot patched + notified ${devices.length} device(s)`);
+  } catch (e) { /* best-effort; playlist_items.muted is still updated for the next full publish */ }
 }

 // Update playlist item
--- a/server/routes/status.js
+++ b/server/routes/status.js
@ -7,6 +7,7 @@ const fs = require('fs');
 const config = require('../config');
 const VERSION = require('../version');
 const { PLATFORM_ROLES } = require('../middleware/auth');
+const loopLag = require('../services/loop-lag');

 // Public status page
 router.get('/', (req, res) => {
@ -24,6 +25,9 @@ router.get('/', (req, res) => {
    version,
    uptime_human: formatUptime(uptime),
    timestamp: new Date().toISOString(),
+    // #142: current event-loop lag snapshot, so site lag is diagnosable from the
+    // health endpoint independent of any throttling. Cheap (in-memory read).
+    loop_lag: loopLag.getLag(),
  });
 });

--- a/server/server.js
+++ b/server/server.js
@ -625,6 +625,10 @@ app.set('io', io);
 const { startHeartbeatChecker } = require('./services/heartbeat');
 startHeartbeatChecker(io);

+// #142: start event-loop lag sampling (feeds /api/status + the reconnect throttle)
+const { startLoopLagMonitor } = require('./services/loop-lag');
+startLoopLagMonitor();
+
 // Start command-queue sweep (prunes expired entries for offline devices)
 const commandQueue = require('./lib/command-queue');
 commandQueue.startSweep();
@ -710,13 +714,22 @@ function resolveApkPath() {
  return null;
 }

+// #139: a device that can't silently install re-downloads the APK every check cycle. Don't
+// word a download as "in progress" (it may be a stuck loop, not progress), and rate-limit the
+// line to once per IP per window so a looping device can't flood the log.
+const otaDownloadLoggedAt = new Map(); // ip -> last-logged ms
+const OTA_DOWNLOAD_LOG_WINDOW_MS = 10 * 60 * 1000;
+
 // Serve APK download
 app.get('/download/apk', (req, res) => {
  const apkPath = resolveApkPath();
  if (apkPath) {
-    // #96: an APK download means a device is actually applying an OTA - log it so the
-    // update is observable end to end (check -> download -> [relaunch]).
-    console.log(`[ota] APK download by ${getClientIp(req)} (${fs.statSync(apkPath).size} bytes) - OTA update in progress`);
+    const ip = getClientIp(req);
+    const now = Date.now();
+    if (now - (otaDownloadLoggedAt.get(ip) || 0) > OTA_DOWNLOAD_LOG_WINDOW_MS) {
+      otaDownloadLoggedAt.set(ip, now);
+      console.log(`[ota] APK served to ${ip} (${fs.statSync(apkPath).size} bytes)`);
+    }
    res.setHeader('Content-Type', 'application/vnd.android.package-archive');
    res.setHeader('Content-Disposition', 'attachment; filename="ScreenTinker.apk"');
    res.setHeader('Cache-Control', 'no-cache');
--- a/server/services/heartbeat.js
+++ b/server/services/heartbeat.js
@ -1,4 +1,4 @@
-const { db } = require('../db/database');
+const { db, pruneStatusLog } = require('../db/database');
 const config = require('../config');
 const { deviceRoom, emitToWorkspace } = require('../lib/socket-rooms');

@ -6,6 +6,10 @@ const { deviceRoom, emitToWorkspace } = require('../lib/socket-rooms');
 const deviceConnections = new Map();

 function startHeartbeatChecker(io) {
+  // #142: sweep stale device_status_log rows once at startup (recovers a bloated
+  // table immediately after a deploy), then again on each interval below.
+  pruneStatusLog();
+
  setInterval(() => {
    const now = Date.now();
    const dashboardNs = io.of('/dashboard');
@ -36,19 +40,18 @@ function startHeartbeatChecker(io) {
      }
    }

-    // Cleanup: delete unclaimed provisioning devices older than 24 hours
-    // Keep imported devices (they have user_id set) so users can re-pair them
-    db.prepare(`
-      DELETE FROM devices WHERE status = 'provisioning'
-      AND user_id IS NULL
-      AND created_at < strftime('%s','now') - (365 * 86400)
-    `).run();
+    // Cleanup: delete unclaimed provisioning devices older than 24 hours.
+    pruneProvisioningDevices();

    // Cleanup: prune play logs older than 90 days
    db.prepare(`
      DELETE FROM play_logs WHERE started_at < strftime('%s','now') - (90 * 86400)
    `).run();

+    // #142: global device_status_log retention sweep (all devices, incl. removed/idle
+    // and the offline_timeout insert path that bypasses the per-device prune).
+    pruneStatusLog();
+
    // Cleanup: expired team invites
    db.prepare(`
      DELETE FROM team_invites WHERE expires_at < strftime('%s','now')
@ -83,11 +86,25 @@ function getAllConnections() {
  return deviceConnections;
 }

+// #142: sweep unclaimed provisioning devices older than 24h. The window previously
+// read `365 * 86400` (a YEAR), contradicting its own "older than 24 hours" comment,
+// so socket-register pairing junk lingered far longer than intended. Imported
+// devices keep a user_id and are preserved so they can be re-paired. Extracted from
+// the interval above so the correctness fix is unit-testable. Returns rows deleted.
+function pruneProvisioningDevices() {
+  return db.prepare(`
+    DELETE FROM devices
+    WHERE status = 'provisioning' AND user_id IS NULL
+    AND created_at < strftime('%s','now') - (24 * 3600)
+  `).run().changes;
+}
+
 module.exports = {
  startHeartbeatChecker,
  registerConnection,
  updateHeartbeat,
  removeConnection,
  getConnection,
-  getAllConnections
+  getAllConnections,
+  pruneProvisioningDevices
 };
--- a/server/services/loop-lag.js
+++ b/server/services/loop-lag.js
@ -0,0 +1,107 @@
+// #142 — Event-loop lag telemetry (the data subsystem; ships before the throttle).
+//
+// Continuously samples event-loop delay via perf_hooks.monitorEventLoopDelay()
+// (a C++-backed histogram — cheap). Each window we read mean/p50/p99/max, persist
+// a row to the bounded `event_loop_lag` table, and recompute a coarse load BAND
+// (normal | elevated | critical) from the window p99.
+//
+// The band is consumed by the reconnect throttle (#142 step 3), but this module
+// has standalone value: getLag() is surfaced on /api/status and band changes are
+// logged, so site connectivity/lag is diagnosable independent of any throttling.
+//
+// Band transitions are deliberately asymmetric (see nextBand): jump UP immediately
+// when an up-threshold is crossed (tighten fast), step DOWN only one level at a
+// time after lagReleaseSamples consecutive calm samples below a deadband (release
+// slow). This avoids band flap from transient blips.
+
+const { monitorEventLoopDelay } = require('perf_hooks');
+const { db } = require('../db/database');
+const config = require('../config');
+
+const NS_PER_MS = 1e6;
+// A band releases only once p99 falls below this fraction of the band's entry
+// threshold — the deadband that stops small fluctuations from flapping the band.
+const DEADBAND = 0.5;
+const LEVEL = { normal: 0, elevated: 1, critical: 2 };
+
+let histogram = null;
+let band = 'normal';
+let calmSamples = 0;
+let current = { mean_ms: 0, p50_ms: 0, p99_ms: 0, max_ms: 0, band: 'normal', sampled_at: 0 };
+
+// Pure band-transition function (exported for deterministic unit tests). Given the
+// current band, the window p99 (ms), and the running calm-sample count, returns the
+// next [band, calmSamples]. Up is immediate (may skip a level); down is one step
+// per release window, gated by a deadband.
+function nextBand(cur, p99, calm) {
+  const level = LEVEL[cur] ?? 0;
+  // UP — immediate, tighten fast (normal can jump straight to critical).
+  if (p99 >= config.lagCriticalMs && level < LEVEL.critical) return ['critical', 0];
+  if (p99 >= config.lagElevatedMs && level < LEVEL.elevated) return ['elevated', 0];
+  // DOWN — slow, one step, only below the current band's deadband.
+  if (level === LEVEL.critical && p99 <= config.lagCriticalMs * DEADBAND) {
+    const c = calm + 1;
+    return c >= config.lagReleaseSamples ? ['elevated', 0] : ['critical', c];
+  }
+  if (level === LEVEL.elevated && p99 <= config.lagElevatedMs * DEADBAND) {
+    const c = calm + 1;
+    return c >= config.lagReleaseSamples ? ['normal', 0] : ['elevated', c];
+  }
+  // Hold (inside deadband, or already normal): reset the calm counter.
+  return [cur, 0];
+}
+
+const round2 = (x) => Math.round(x * 100) / 100;
+
+function sample() {
+  const p99 = histogram.percentile(99) / NS_PER_MS;
+  const snap = {
+    mean_ms: round2(histogram.mean / NS_PER_MS),
+    p50_ms: round2(histogram.percentile(50) / NS_PER_MS),
+    p99_ms: round2(p99),
+    max_ms: round2(histogram.max / NS_PER_MS),
+  };
+  histogram.reset();
+
+  const prev = band;
+  [band, calmSamples] = nextBand(band, snap.p99_ms, calmSamples);
+  current = { ...snap, band, sampled_at: Math.floor(Date.now() / 1000) };
+
+  try {
+    db.prepare(
+      'INSERT INTO event_loop_lag (sampled_at, mean_ms, p50_ms, p99_ms, max_ms, band) VALUES (?, ?, ?, ?, ?, ?)'
+    ).run(current.sampled_at, snap.mean_ms, snap.p50_ms, snap.p99_ms, snap.max_ms, band);
+  } catch (_) { /* table may not exist on a partially-migrated DB */ }
+
+  // Observable: log whenever we're loaded or when the band changes (incl. back to
+  // normal). Healthy steady state stays quiet.
+  if (band !== 'normal' || prev !== 'normal') {
+    const tag = band !== prev ? ` (was ${prev})` : '';
+    console.log(`[loop-lag] band=${band}${tag} mean=${snap.mean_ms}ms p99=${snap.p99_ms}ms max=${snap.max_ms}ms`);
+  }
+}
+
+function pruneLag() {
+  try {
+    const cutoff = Math.floor(Date.now() / 1000) - Math.round(config.lagTelemetryRetentionDays * 86400);
+    const n = db.prepare('DELETE FROM event_loop_lag WHERE sampled_at < ?').run(cutoff).changes;
+    if (n > 0) console.log(`[loop-lag] pruned ${n} sample(s) older than ${config.lagTelemetryRetentionDays}d`);
+  } catch (_) { /* ignore */ }
+}
+
+function startLoopLagMonitor() {
+  if (histogram) return; // idempotent
+  histogram = monitorEventLoopDelay({ resolution: config.lagResolutionMs });
+  histogram.enable();
+  const t1 = setInterval(sample, config.lagSampleIntervalMs);
+  pruneLag(); // sweep stale rows on boot
+  const t2 = setInterval(pruneLag, config.lagPruneIntervalMs);
+  // Don't keep the process alive on these timers (matters for tests / clean exit).
+  if (t1.unref) t1.unref();
+  if (t2.unref) t2.unref();
+}
+
+function getBand() { return band; }
+function getLag() { return { ...current }; }
+
+module.exports = { startLoopLagMonitor, getBand, getLag, nextBand };
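
Reviewer note on services/loop-lag.js: the asymmetric band transition is easier to follow on a concrete trace. The sketch below copies the nextBand logic into a standalone function and feeds it a sequence of window p99 values. The thresholds in `cfg` (100 ms, 250 ms, 3 release samples) are assumptions for illustration only; the real lagElevatedMs, lagCriticalMs, and lagReleaseSamples defaults are defined in server/config.js and are not shown in this diff. `trace` is a hypothetical helper, not part of the module.

```javascript
const DEADBAND = 0.5;
const LEVEL = { normal: 0, elevated: 1, critical: 2 };
// Assumed thresholds, for illustration only.
const cfg = { lagElevatedMs: 100, lagCriticalMs: 250, lagReleaseSamples: 3 };

// Same logic as nextBand in the diff, reading thresholds from the local cfg.
function nextBand(cur, p99, calm) {
  const level = LEVEL[cur] ?? 0;
  // UP: immediate, and may skip a level.
  if (p99 >= cfg.lagCriticalMs && level < LEVEL.critical) return ['critical', 0];
  if (p99 >= cfg.lagElevatedMs && level < LEVEL.elevated) return ['elevated', 0];
  // DOWN: one step, only after enough consecutive samples below the deadband.
  if (level === LEVEL.critical && p99 <= cfg.lagCriticalMs * DEADBAND) {
    const c = calm + 1;
    return c >= cfg.lagReleaseSamples ? ['elevated', 0] : ['critical', c];
  }
  if (level === LEVEL.elevated && p99 <= cfg.lagElevatedMs * DEADBAND) {
    const c = calm + 1;
    return c >= cfg.lagReleaseSamples ? ['normal', 0] : ['elevated', c];
  }
  // Hold: inside the deadband or already normal, so the calm counter resets.
  return [cur, 0];
}

// Hypothetical helper: run a list of window p99 values through the state machine.
function trace(p99s) {
  let band = 'normal';
  let calm = 0;
  return p99s.map((p99) => {
    [band, calm] = nextBand(band, p99, calm);
    return band;
  });
}

const bands = trace([20, 300, 200, 120, 110, 100, 40, 40, 40]);
console.log(bands.join(' -> '));
```

In this trace a single 300 ms window jumps straight from normal to critical, the 200 ms window holds critical because it is above the 125 ms deadband, and each step down needs three consecutive calm samples.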
--- a/server/test/api.test.js
+++ b/server/test/api.test.js
@ -259,6 +259,32 @@ test('device WS: wrong device_token is rejected (auth-error, never registered)',
  assert.ok(!got.registered, 'wrong token must not register');
 });

+// #139 Phase 2 (Option B): event-driven OTA status. Registers (which, with no ota fields in
+// device_info, persists ota_status='none' via the backstop), then emits a valid ota-status and
+// a foreign-id one in order on the authenticated socket.
+function deviceOtaSeq(payload, otaEvents, timeoutMs = 4000) {
+  return new Promise((resolve) => {
+    const sock = ioClient(`${BASE}/device`, { transports: ['websocket'], reconnection: false, forceNew: true });
+    const finish = () => { try { sock.close(); } catch { /* */ } resolve(); };
+    sock.on('connect', () => sock.emit('device:register', payload));
+    sock.on('device:registered', () => { for (const e of otaEvents) sock.emit('device:ota-status', e); setTimeout(finish, 500); });
+    sock.on('device:auth-error', finish);
+    setTimeout(finish, timeoutMs);
+  });
+}
+test('device WS: device:ota-status persists the fields; a foreign device_id is a safe no-op (#139)', async () => {
+  await deviceOtaSeq(
+    { device_id: S.deviceId, device_token: S.deviceToken, device_info: { app_version: 'test' } },
+    [
+      { device_id: S.deviceId, ota_status: 'manual_update_required', ota_target_version: '1.9.1-beta6', ota_attempts: 3 },
+      { device_id: 'nope-not-a-device', ota_status: 'none', ota_target_version: null, ota_attempts: 0 }, // foreign id -> no-op, no throw
+    ]);
+  const dev = await jfetch(`/api/devices/${S.deviceId}`, auth(S.jwt));
+  assert.equal(dev.body.ota_status, 'manual_update_required', 'valid ota-status persisted');
+  assert.equal(dev.body.ota_target_version, '1.9.1-beta6');
+  assert.equal(dev.body.ota_attempts, 3, 'and the foreign-id event did not overwrite it');
+});
+
 // ───────────────────────── TIER 4: #92 FOLLOW-UP COVERAGE ─────────────────────────
 // The non-security gaps named in the self-review (issue #92): the gap-fix fields + the
 // cross-tenant guard (the security-relevant one), docs serving, and the token lifecycle
--- a/server/test/content-ack-dedup.test.js
+++ b/server/test/content-ack-dedup.test.js
@ -0,0 +1,85 @@
+'use strict';
+
+// #142 step 5 — content-ack dedup. Repeated identical (device_id, content_id, status)
+// reports are suppressed within config.contentAckDedupMs; a status change or a report
+// after the window passes. Observed via the server log (the handler logs+emits only
+// when it does NOT dedup). Unique PORT (3984) so it cannot collide with the other spawned-server tests.
+
+const { test, before, after } = require('node:test');
+const assert = require('node:assert/strict');
+const { spawn } = require('node:child_process');
+const path = require('node:path');
+const os = require('node:os');
+const fs = require('node:fs');
+const crypto = require('node:crypto');
+const ioClient = require('socket.io-client');
+
+const PORT = 3984;
+const BASE = `http://127.0.0.1:${PORT}`;
+const DATA_DIR = path.join(os.tmpdir(), 'st-ack-' + crypto.randomBytes(4).toString('hex'));
+const LOG = path.join(os.tmpdir(), 'st-ack-' + crypto.randomBytes(4).toString('hex') + '.log');
+const DEDUP_MS = 600;
+let proc;
+
+const sleep = (ms) => new Promise(r => setTimeout(r, ms));
+
+before(async () => {
+  const logFd = fs.openSync(LOG, 'w');
+  proc = spawn('node', ['server.js'], {
+    cwd: path.join(__dirname, '..'),
+    env: { ...process.env, DATA_DIR, SELF_HOSTED: 'true', PORT: String(PORT), NODE_ENV: 'test', CONTENT_ACK_DEDUP_MS: String(DEDUP_MS) },
+    stdio: ['ignore', logFd, logFd],
+  });
+  let up = false;
+  for (let i = 0; i < 80; i++) {
+    try { const r = await fetch(BASE + '/api/status'); if (r.ok) { up = true; break; } } catch { /* */ }
+    await sleep(250);
+  }
+  if (!up) throw new Error('server did not boot:\n' + fs.readFileSync(LOG, 'utf8').slice(-2000));
+});
+
+after(() => { try { proc.kill('SIGKILL'); } catch { /* */ } });
+
+function provision() {
+  const code = String(crypto.randomInt(100000, 1000000));
+  return new Promise((resolve) => {
+    const sock = ioClient(`${BASE}/device`, { transports: ['websocket'], reconnection: false, forceNew: true });
+    sock.on('connect', () => sock.emit('device:register', { pairing_code: code }));
+    sock.on('device:registered', (d) => { try { sock.close(); } catch { /* */ } resolve({ id: d.device_id, token: d.device_token }); });
+    setTimeout(() => { try { sock.close(); } catch { /* */ } resolve(null); }, 4000); // close the socket on timeout too
+  });
+}
+
+function openRegistered(dev) {
+  return new Promise((resolve, reject) => {
+    const sock = ioClient(`${BASE}/device`, { transports: ['websocket'], reconnection: false, forceNew: true });
+    sock.on('connect', () => sock.emit('device:register', { device_id: dev.id, device_token: dev.token, device_info: { app_version: 'test' } }));
+    sock.on('device:registered', () => resolve(sock));
+    sock.on('device:auth-error', () => reject(new Error('auth-error')));
+    setTimeout(() => reject(new Error('register timeout')), 4000);
+  });
+}
+
+test('repeated identical content-acks are deduped; window-expiry and status-change pass', async () => {
+  const dev = await provision();
+  assert.ok(dev, 'device provisioned');
+  const sock = await openRegistered(dev);
+  const cid = 'cid-' + crypto.randomBytes(3).toString('hex');
+
+  // 5 rapid identical "ready" within the dedup window -> only ONE should log/emit
+  for (let i = 0; i < 5; i++) { sock.emit('device:content-ack', { device_id: dev.id, content_id: cid, status: 'ready' }); await sleep(40); }
+  // wait past the window, then "ready" again -> passes (a fresh report)
+  await sleep(DEDUP_MS + 250);
+  sock.emit('device:content-ack', { device_id: dev.id, content_id: cid, status: 'ready' });
+  // a status CHANGE has a different key -> passes immediately
+  await sleep(60);
+  sock.emit('device:content-ack', { device_id: dev.id, content_id: cid, status: 'error' });
+  await sleep(400);
+  try { sock.close(); } catch { /* */ }
+
+  const log = fs.readFileSync(LOG, 'utf8');
+  const ready = (log.match(new RegExp(`content ${cid}: ready`, 'g')) || []).length;
+  const err = (log.match(new RegExp(`content ${cid}: error`, 'g')) || []).length;
+  assert.equal(ready, 2, 'a burst of identical "ready" collapses to one; a second after the window passes -> 2 total');
+  assert.equal(err, 1, 'a status change is not deduped');
+});
--- a/server/test/loop-lag-integration.test.js
+++ b/server/test/loop-lag-integration.test.js
@ -0,0 +1,64 @@
+'use strict';
+
+// #142 step 2 — integration: the lag monitor samples, persists to a BOUNDED table,
+// and surfaces current lag on /api/status. Boots the real server with fast sampling
+// and a tiny (fractional-day) retention so the prune is observable within the test.
+
+const { test, before, after } = require('node:test');
+const assert = require('node:assert/strict');
+const { spawn } = require('node:child_process');
+const path = require('node:path');
+const os = require('node:os');
+const fs = require('node:fs');
+const crypto = require('node:crypto');
+const Database = require('better-sqlite3');
+
+const PORT = 3982;
+const BASE = `http://127.0.0.1:${PORT}`;
+const DATA_DIR = path.join(os.tmpdir(), 'st-lag-int-' + crypto.randomBytes(4).toString('hex'));
+const LOG = path.join(os.tmpdir(), 'st-lag-int-' + crypto.randomBytes(4).toString('hex') + '.log');
+let proc;
+
+before(async () => {
+  const logFd = fs.openSync(LOG, 'w');
+  proc = spawn('node', ['server.js'], {
+    cwd: path.join(__dirname, '..'),
+    env: {
+      ...process.env, DATA_DIR, SELF_HOSTED: 'true', PORT: String(PORT), NODE_ENV: 'test',
+      LAG_SAMPLE_INTERVAL_MS: '200',          // sample fast
+      LAG_TELEMETRY_RETENTION_DAYS: '0.00001', // ~0.86s retention
+      LAG_PRUNE_INTERVAL_MS: '400',           // prune often
+    },
+    stdio: ['ignore', logFd, logFd],
+  });
+  let up = false;
+  for (let i = 0; i < 80; i++) {
+    try { const r = await fetch(BASE + '/api/status'); if (r.ok) { up = true; break; } } catch { /* not yet */ }
+    await new Promise(r => setTimeout(r, 250));
+  }
+  if (!up) throw new Error('server did not boot:\n' + fs.readFileSync(LOG, 'utf8').slice(-2000));
+});
+
+after(() => { try { proc.kill('SIGKILL'); } catch { /* */ } });
+
+test('/api/status exposes a current loop_lag snapshot', async () => {
+  const r = await fetch(BASE + '/api/status');
+  const body = await r.json();
+  assert.ok(body.loop_lag, 'loop_lag present on /api/status');
+  assert.ok(['normal', 'elevated', 'critical'].includes(body.loop_lag.band), 'band is a valid level');
+  assert.equal(typeof body.loop_lag.p99_ms, 'number', 'p99_ms is numeric');
+  assert.equal(typeof body.loop_lag.mean_ms, 'number', 'mean_ms is numeric');
+});
+
+test('lag samples are persisted AND bounded by retention prune (not unbounded)', async () => {
+  // Wait a further 1.8s; with boot and the previous test that is roughly 3s of
+  // sampling, or ~15 inserts at 200ms/sample. With ~0.86s retention pruned every
+  // 400ms the table must stay small, so it cannot become a second unbounded table.
+  await new Promise(r => setTimeout(r, 1800));
+  const dbPath = path.join(DATA_DIR, 'db', 'remote_display.db');
+  const db = new Database(dbPath, { readonly: true });
+  const count = db.prepare('SELECT COUNT(*) c FROM event_loop_lag').get().c;
+  db.close();
+  assert.ok(count >= 1, 'lag samples are being persisted');
+  assert.ok(count < 15, `table is bounded by the prune (held ${count} rows over ~3s of 200ms sampling)`);
+});
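The numbers in this test follow from simple arithmetic, sketched below. The helper names are hypothetical, and the row bound assumes each prune pass deletes every row older than the retention window, which is an assumption about the prune rather than something this diff shows.

```javascript
const DAY_MS = 86_400_000;

// Convert a (possibly fractional) retention in days to milliseconds.
function retentionMs(days) {
  return days * DAY_MS;
}

// Rough upper bound on rows held at any moment: a row can outlive its retention
// by at most one prune interval before the next pass removes it (assumption).
function steadyStateRows({ retentionDays, pruneIntervalMs, sampleIntervalMs }) {
  return Math.ceil((retentionMs(retentionDays) + pruneIntervalMs) / sampleIntervalMs);
}

const retention = Math.round(retentionMs(0.00001)); // the test's LAG_TELEMETRY_RETENTION_DAYS
const bound = steadyStateRows({ retentionDays: 0.00001, pruneIntervalMs: 400, sampleIntervalMs: 200 });
console.log(retention, bound);
```

Under those assumptions the retention works out to about 864 ms and the table should hold around 7 rows, comfortably below the `count < 15` assertion.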
--- a/server/test/loop-lag.test.js
+++ b/server/test/loop-lag.test.js
@ -0,0 +1,57 @@
+'use strict';
+
+// #142 step 2 — deterministic unit tests for the event-loop-lag band transitions.
+// Pure function, no sockets/timing. Isolate the DB to a temp dir BEFORE requiring
+// the module (requiring it pulls in db/database, which initialises a DB on load).
+
+const os = require('node:os');
+const path = require('node:path');
+const crypto = require('node:crypto');
+process.env.DATA_DIR = path.join(os.tmpdir(), 'st-lag-unit-' + crypto.randomBytes(4).toString('hex'));
+
+const { test } = require('node:test');
+const assert = require('node:assert/strict');
+const { nextBand } = require('../services/loop-lag');
+
+// config defaults exercised here: elevated=100ms, critical=250ms, releaseSamples=5,
+// deadband=0.5 -> release-below thresholds: elevated@50ms, critical@125ms.
+
+test('UP is immediate and can skip a level (tighten fast)', () => {
+  assert.deepEqual(nextBand('normal', 50, 0), ['normal', 0], 'below elevated stays normal');
+  assert.deepEqual(nextBand('normal', 100, 0), ['elevated', 0], 'crossing elevated up-threshold jumps immediately');
+  assert.deepEqual(nextBand('normal', 250, 0), ['critical', 0], 'a big spike jumps normal->critical in one sample');
+  assert.deepEqual(nextBand('elevated', 250, 0), ['critical', 0]);
+});
+
+test('deadband holds the band for small fluctuations (no flap)', () => {
+  // elevated, p99 between release(50) and up(100) -> hold elevated, calm reset
+  assert.deepEqual(nextBand('elevated', 80, 3), ['elevated', 0]);
+  // critical, p99 between release(125) and up(250) -> hold critical
+  assert.deepEqual(nextBand('critical', 200, 4), ['critical', 0]);
+});
+
+test('DOWN is slow: requires lagReleaseSamples calm samples below the deadband', () => {
+  // elevated -> normal only after 5 consecutive calm samples
+  let band = 'elevated', calm = 0;
+  for (let i = 0; i < 4; i++) {
+    [band, calm] = nextBand(band, 20, calm);
+    assert.equal(band, 'elevated', `still elevated after ${i + 1} calm sample(s)`);
+  }
+  [band, calm] = nextBand(band, 20, calm); // 5th
+  assert.deepEqual([band, calm], ['normal', 0], 'drops to normal on the 5th calm sample');
+});
+
+test('DOWN releases one level at a time: critical -> elevated -> normal', () => {
+  let band = 'critical', calm = 0;
+  for (let i = 0; i < 5; i++) [band, calm] = nextBand(band, 10, calm);
+  assert.equal(band, 'elevated', 'critical releases to elevated, never straight to normal');
+  for (let i = 0; i < 5; i++) [band, calm] = nextBand(band, 10, calm);
+  assert.equal(band, 'normal', 'then elevated releases to normal');
+});
+
+test('a single calm sample does not release (calm counter resets on a non-calm sample)', () => {
+  let [band, calm] = nextBand('elevated', 20, 0); // calm=1
+  assert.deepEqual([band, calm], ['elevated', 1]);
+  [band, calm] = nextBand(band, 80, calm); // back inside deadband -> reset
+  assert.deepEqual([band, calm], ['elevated', 0], 'one blip resets the release counter');
+});
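The transitions these tests pin down can be reproduced with a small pure function. This is a sketch reconstructed from the assertions above, not the contents of `services/loop-lag`: the `CFG` shape is hypothetical, and the exact comparison operators at the thresholds (`>=` to go up, strictly below to count as calm) are assumptions chosen to satisfy the tests.

```javascript
const LEVELS = ['normal', 'elevated', 'critical'];

// Hypothetical config shape; the values mirror the defaults quoted in the test file.
const CFG = { up: { elevated: 100, critical: 250 }, deadband: 0.5, releaseSamples: 5 };

// Returns [nextBand, nextCalmCount] for one p99 sample.
function nextBand(band, p99Ms, calm, cfg = CFG) {
  const cur = LEVELS.indexOf(band);

  // UP is immediate and may skip a level.
  let target = 0;
  if (p99Ms >= cfg.up.critical) target = 2;
  else if (p99Ms >= cfg.up.elevated) target = 1;
  if (target > cur) return [LEVELS[target], 0];
  if (cur === 0) return ['normal', 0];

  // Inside the deadband (at or above the release threshold): hold, reset the calm counter.
  const releaseBelow = cfg.up[band] * cfg.deadband;
  if (p99Ms >= releaseBelow) return [band, 0];

  // DOWN is slow: one level at a time, only after enough consecutive calm samples.
  const nextCalm = calm + 1;
  if (nextCalm >= cfg.releaseSamples) return [LEVELS[cur - 1], 0];
  return [band, nextCalm];
}

// Walk critical -> elevated -> normal with ten calm samples.
let state = ['critical', 0];
const trace = [];
for (let i = 0; i < 10; i++) {
  state = nextBand(state[0], 10, state[1]);
  trace.push(state[0]);
}
console.log(trace.join(' '));
```

Keeping the function pure, with the calm counter passed in and returned, is what lets the unit tests run without timers or sockets.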
--- a/server/test/mute.test.js
+++ b/server/test/mute.test.js
@ -91,6 +91,24 @@ test('muted reaches the device via the published snapshot (buildSnapshotItems)',
  assert.equal(item.muted, 1, 'snapshot (device payload) carries muted=1');
 });

+test('mute toggle patches the published snapshot WITHOUT a manual republish (the beta7 bug)', async () => {
+  // Baseline: publish once so the device has a snapshot carrying muted=0.
+  await jfetch(`/api/assignments/${S.itemId}`, put(S.jwt, { muted: false }));
+  await jfetch(`/api/playlists/${S.playlistId}/publish`, post(S.jwt, {}));
+  const read = () => JSON.parse(db.prepare('SELECT published_snapshot FROM playlists WHERE id = ?').get(S.playlistId).published_snapshot)
+    .find((i) => i.content_id === S.contentId).muted;
+  assert.equal(read(), 0, 'baseline: snapshot the device plays carries muted=0');
+
+  // The actual bug: a mute toggle ALONE (no /publish) must reach the played snapshot.
+  // On beta7 this stayed 0 (markDraft only) so every loop re-applied full volume.
+  await jfetch(`/api/assignments/${S.itemId}`, put(S.jwt, { muted: true }));
+  assert.equal(read(), 1, 'mute toggle patched the snapshot the device plays — no manual republish needed');
+
+  // Unmute toggle reverts the snapshot too.
+  await jfetch(`/api/assignments/${S.itemId}`, put(S.jwt, { muted: false }));
+  assert.equal(read(), 0, 'unmute toggle patched the snapshot back to 0');
+});
+
 test('PUT ignoring muted (other field) leaves muted untouched', async () => {
  await jfetch(`/api/assignments/${S.itemId}`, put(S.jwt, { muted: true }));
  const r = await jfetch(`/api/assignments/${S.itemId}`, put(S.jwt, { duration_sec: 15 }));
--- a/server/test/provisioning-cleanup.test.js
+++ b/server/test/provisioning-cleanup.test.js
@ -0,0 +1,41 @@
+'use strict';
+
+// #142 (cut 2) — provisioning-row cleanup window correctness. The sweep deletes
+// UNCLAIMED provisioning devices older than 24h (it previously used 365*86400 — a
+// year — contradicting its own comment). Imported devices (user_id set) and
+// non-provisioning devices are preserved. Deterministic, in-process (no server).
+
+const os = require('node:os');
+const path = require('node:path');
+const crypto = require('node:crypto');
+process.env.DATA_DIR = path.join(os.tmpdir(), 'st-provclean-' + crypto.randomBytes(4).toString('hex'));
+
+const { test } = require('node:test');
+const assert = require('node:assert/strict');
+const { db } = require('../db/database');
+const { pruneProvisioningDevices } = require('../services/heartbeat');
+
+test('sweeps unclaimed provisioning devices older than 24h, keeps the rest', () => {
+  db.pragma('foreign_keys = OFF'); // seed user_id without a real users row
+  db.exec('DELETE FROM devices');
+  const ins = db.prepare("INSERT INTO devices (id, status, user_id, created_at) VALUES (?, ?, ?, strftime('%s','now') - ?)");
+  ins.run('old-unclaimed', 'provisioning', null, 25 * 3600);   // >24h, unclaimed  -> SWEPT
+  ins.run('new-unclaimed', 'provisioning', null, 1 * 3600);    // <24h, unclaimed  -> kept
+  ins.run('old-imported', 'provisioning', 'u-imported', 25 * 3600); // >24h but imported (user_id) -> kept
+  ins.run('old-online', 'online', null, 25 * 3600);           // >24h but not provisioning -> kept
+  db.pragma('foreign_keys = ON');
+
+  assert.equal(db.prepare('SELECT COUNT(*) c FROM devices').get().c, 4, 'seeded 4');
+
+  const deleted = pruneProvisioningDevices();
+  assert.equal(deleted, 1, 'only the >24h unclaimed provisioning device is swept');
+
+  const ids = db.prepare('SELECT id FROM devices ORDER BY id').all().map(r => r.id);
+  assert.deepEqual(ids, ['new-unclaimed', 'old-imported', 'old-online']);
+  // regression guard: a 25h-old row sits well inside the OLD 365-day window, so this
+  // would have survived before the fix.
+});
+
+test('idempotent: a second sweep with nothing stale deletes nothing', () => {
+  assert.equal(pruneProvisioningDevices(), 0);
+});
--- a/server/test/reconnect-throttle-integration.test.js
+++ b/server/test/reconnect-throttle-integration.test.js
@ -0,0 +1,113 @@
+'use strict';
+
+// #142 step 3 — REQUIRED GATE TEST + storm + neighbor, over real sockets.
+//
+// Boots the real server with warm-up ACTIVE (default) so the whole suite runs in
+// the cold-start window — the exact "right after a deploy" scenario. Hard ceiling
+// and window are tightened so the storm trips quickly without thousands of connects;
+// fleet devices stay well under the ceiling.
+
+const { test, before, after } = require('node:test');
+const assert = require('node:assert/strict');
+const { spawn } = require('node:child_process');
+const path = require('node:path');
+const os = require('node:os');
+const fs = require('node:fs');
+const crypto = require('node:crypto');
+const ioClient = require('socket.io-client');
+
+const PORT = 3983;
+const BASE = `http://127.0.0.1:${PORT}`;
+const DATA_DIR = path.join(os.tmpdir(), 'st-thr-int-' + crypto.randomBytes(4).toString('hex'));
+const LOG = path.join(os.tmpdir(), 'st-thr-int-' + crypto.randomBytes(4).toString('hex') + '.log');
+let proc;
+
+before(async () => {
+  const logFd = fs.openSync(LOG, 'w');
+  proc = spawn('node', ['server.js'], {
+    cwd: path.join(__dirname, '..'),
+    env: {
+      ...process.env, DATA_DIR, SELF_HOSTED: 'true', PORT: String(PORT), NODE_ENV: 'test',
+      // warm-up left at default (30s) so the whole test runs in the cold-start window
+      RECONNECT_HARD_CEILING: '8',
+      RECONNECT_WINDOW_MS: '5000',
+      RECONNECT_BASE_MAX: '3',
+    },
+    stdio: ['ignore', logFd, logFd],
+  });
+  let up = false;
+  for (let i = 0; i < 80; i++) {
+    try { const r = await fetch(BASE + '/api/status'); if (r.ok) { up = true; break; } } catch { /* */ }
+    await new Promise(r => setTimeout(r, 250));
+  }
+  if (!up) throw new Error('server did not boot:\n' + fs.readFileSync(LOG, 'utf8').slice(-2000));
+});
+
+after(() => { try { proc.kill('SIGKILL'); } catch { /* */ } });
+
+// Provision a brand-new device via a UNIQUE pairing code -> returns {device_id, device_token}.
+function provision() {
+  const code = String(crypto.randomInt(100000, 1000000));
+  return new Promise((resolve) => {
+    const sock = ioClient(`${BASE}/device`, { transports: ['websocket'], reconnection: false, forceNew: true });
+    sock.on('connect', () => sock.emit('device:register', { pairing_code: code }));
+    sock.on('device:registered', (d) => { try { sock.close(); } catch { /* */ } resolve({ id: d.device_id, token: d.device_token }); });
+    setTimeout(() => { try { sock.close(); } catch { /* */ } resolve(null); }, 4000);
+  });
+}
+
+// One genuine reconnect (new socket). Resolves {registered, throttled}.
+function reconnect(dev) {
+  return new Promise((resolve) => {
+    const sock = ioClient(`${BASE}/device`, { transports: ['websocket'], reconnection: false, forceNew: true });
+    let done = false;
+    const finish = (r) => { if (done) return; done = true; try { sock.close(); } catch { /* */ } resolve(r); };
+    sock.on('connect', () => sock.emit('device:register', { device_id: dev.id, device_token: dev.token, device_info: { app_version: 'test' } }));
+    sock.on('device:registered', () => finish({ registered: true, throttled: false }));
+    sock.on('device:throttled', () => finish({ registered: false, throttled: true }));
+    setTimeout(() => finish({ registered: false, throttled: false }), 1500);
+  });
+}
+
+test('GATE: full-fleet reconnect right after restart throttles NO healthy device', async () => {
+  // 12 distinct devices, each reconnecting twice in quick succession — a deploy-time
+  // herd. The loop is transiently busy, but per-device keying means none is flagged.
+  const fleet = [];
+  for (let i = 0; i < 12; i++) { const d = await provision(); assert.ok(d, 'device provisioned'); fleet.push(d); }
+
+  let registered = 0, throttled = 0;
+  // two reconnect rounds across the whole fleet
+  for (let round = 0; round < 2; round++) {
+    const results = await Promise.all(fleet.map(reconnect));
+    for (const r of results) { if (r.registered) registered++; if (r.throttled) throttled++; }
+  }
+  assert.equal(throttled, 0, 'NO healthy fleet device may be throttled at cold start');
+  assert.equal(registered, 24, 'every fleet reconnect registered');
+});
+
+test('a single device storming IS throttled (backoff engages)', async () => {
+  const dev = await provision();
+  assert.ok(dev);
+  let registered = 0, throttled = 0;
+  // 12 sequential reconnects within the 5s window -> exceeds the hard ceiling (8)
+  for (let i = 0; i < 12; i++) {
+    const r = await reconnect(dev);
+    if (r.registered) registered++;
+    if (r.throttled) throttled++;
+  }
+  assert.ok(throttled >= 1, `storming device must be throttled (got ${throttled} throttle(s))`);
+  assert.ok(registered < 12, `not all storm reconnects should succeed (got ${registered}/12)`);
+});
+
+test('neighbor isolation: a healthy device is unaffected while another storms', async () => {
+  const stormer = await provision();
+  const neighbor = await provision();
+  assert.ok(stormer && neighbor);
+  // storm the stormer hard
+  for (let i = 0; i < 12; i++) await reconnect(stormer);
+  // neighbor reconnects normally a couple of times -> must still register
+  const a = await reconnect(neighbor);
+  const b = await reconnect(neighbor);
+  assert.ok(a.registered && b.registered, 'neighbor must register normally while another device storms');
+  assert.ok(!a.throttled && !b.throttled, 'neighbor must not be throttled by another device');
+});
--- a/server/test/reconnect-throttle.test.js
+++ b/server/test/reconnect-throttle.test.js
@ -0,0 +1,98 @@
+'use strict';
+
+// #142 step 3 — deterministic unit tests for the per-device reconnect throttle.
+// Pure logic with injected `now` / band; isolate the DB before require (the module
+// pulls in services/loop-lag -> db/database which initialises a DB on load).
+
+const os = require('node:os');
+const path = require('node:path');
+const crypto = require('node:crypto');
+process.env.DATA_DIR = path.join(os.tmpdir(), 'st-thr-unit-' + crypto.randomBytes(4).toString('hex'));
+
+const { test, beforeEach } = require('node:test');
+const assert = require('node:assert/strict');
+const throttle = require('../lib/reconnect-throttle');
+
+// config defaults: window=10000, baseMax=5, hardCeiling=20, baseBackoff=1000,
+// maxBackoff=60000, releaseMs=30000, warmup=30000, elevMult=2, critMult=4.
+const T0 = 1_000_000;            // arbitrary epoch-ms origin for the warm-up clock
+const POST = T0 + 40_000;        // safely past the 30s warm-up
+const WARM = T0 + 1_000;         // inside the warm-up window
+
+beforeEach(() => throttle.__resetForTest({ startedAt: T0 }));
+
+test('healthy device is never throttled (<= baseMax genuine reconnects)', () => {
+  for (let i = 0; i < 5; i++) {
+    const v = throttle.check('A', POST + i, 'normal');
+    assert.ok(v.allow, `reconnect ${i + 1} (<=baseMax) must be allowed`);
+  }
+});
+
+test('a per-device storm IS throttled and the backoff GROWS (tighten fast)', () => {
+  let v;
+  for (let i = 0; i < 5; i++) v = throttle.check('B', POST + i, 'normal'); // 5 allowed
+  v = throttle.check('B', POST + 5, 'normal'); // 6th -> flagged
+  assert.equal(v.allow, false);
+  assert.equal(v.reason, 'rate');
+  assert.equal(v.observed, 6);
+  assert.equal(v.allowed, 5);
+  const b1 = v.retryAfterMs;
+  // keep hammering while blocked -> escalate, longer backoff each time
+  const b2 = throttle.check('B', POST + 6, 'normal').retryAfterMs;
+  const b3 = throttle.check('B', POST + 7, 'normal').retryAfterMs;
+  assert.ok(b2 > b1 && b3 > b2, `backoff must grow: ${b1} < ${b2} < ${b3}`);
+});
+
+test('lag band multiplies an already-flagged device\'s backoff (critical > normal)', () => {
+  let v;
+  for (let i = 0; i < 5; i++) throttle.check('N', POST + i, 'normal');
+  v = throttle.check('N', POST + 5, 'normal');
+  const normalBackoff = v.retryAfterMs;
+
+  throttle.__resetForTest({ startedAt: T0 });
+  for (let i = 0; i < 5; i++) throttle.check('C', POST + i, 'critical');
+  v = throttle.check('C', POST + 5, 'critical');
+  assert.ok(v.retryAfterMs > normalBackoff, `critical backoff ${v.retryAfterMs} > normal ${normalBackoff}`);
+});
+
+test('a healthy device is NOT throttled even when the band is critical (lag never gates the healthy)', () => {
+  for (let i = 0; i < 5; i++) {
+    const v = throttle.check('H', POST + i, 'critical');
+    assert.ok(v.allow, 'healthy device stays allowed regardless of band');
+  }
+});
+
+test('COLD START: during warm-up, moderate flapping (>baseMax, <ceiling) is NOT throttled', () => {
+  for (let i = 0; i < 12; i++) { // 12 > baseMax(5) but < hardCeiling(20)
+    const v = throttle.check('W', WARM + i, 'critical'); // band forced normal in warm-up anyway
+    assert.ok(v.allow, `warm-up reconnect ${i + 1} must be lenient`);
+  }
+});
+
+test('HARD CEILING is enforced even during warm-up (slow-ramp cannot train through)', () => {
+  let v;
+  for (let i = 0; i < 20; i++) {
+    v = throttle.check('K', WARM + i, 'normal');
+    assert.ok(v.allow, `warm-up reconnect ${i + 1} (<=ceiling) allowed`);
+  }
+  v = throttle.check('K', WARM + 20, 'normal'); // 21st -> over ceiling(20)
+  assert.equal(v.allow, false);
+  assert.equal(v.reason, 'hard-ceiling');
+});
+
+test('neighbor isolation: one device storming does not throttle another', () => {
+  for (let i = 0; i < 10; i++) throttle.check('STORM', POST + i, 'normal'); // STORM gets throttled
+  const v = throttle.check('NEIGHBOR', POST + 11, 'normal');
+  assert.ok(v.allow, 'a different device must be unaffected');
+});
+
+test('release slow: escalation level decays after a calm period', () => {
+  let v;
+  for (let i = 0; i < 6; i++) v = throttle.check('R', POST + i, 'normal'); // flagged, level 1
+  assert.ok(v.level >= 1);
+  const peak = v.level;
+  // a calm reconnect well past the window AND past releaseMs(30000)
+  v = throttle.check('R', POST + 6 + 40_000, 'normal');
+  assert.ok(v.allow, 'calm reconnect after the storm is allowed');
+  assert.ok(v.level < peak, `level decays after calm: ${v.level} < ${peak}`);
+});
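For readers without `lib/reconnect-throttle.js` at hand, the behaviour these unit tests describe can be approximated as below. This is a hedged sketch under stated assumptions, not the module itself: `createThrottle`, the state shape, the doubling backoff, and the decay bookkeeping are all hypothetical, chosen only to be consistent with the assertions above (per-device window, band as a backoff multiplier, hard ceiling, warm-up leniency, slow release).

```javascript
// Hypothetical sketch; names and structure are assumptions, not the project's module.
const DEFAULTS = {
  windowMs: 10_000, baseMax: 5, hardCeiling: 20,
  baseBackoffMs: 1_000, maxBackoffMs: 60_000,
  releaseMs: 30_000, warmupMs: 30_000,
  bandMult: { normal: 1, elevated: 2, critical: 4 },
};

function createThrottle(startedAt, cfg = DEFAULTS) {
  const devices = new Map(); // device id -> { hits, level, lastFlagAt }

  return function check(deviceId, now, band) {
    let s = devices.get(deviceId);
    if (!s) { s = { hits: [], level: 0, lastFlagAt: 0 }; devices.set(deviceId, s); }

    // Release slow: drop one escalation level per releaseMs elapsed since the last flag.
    if (s.level > 0) {
      const steps = Math.floor((now - s.lastFlagAt) / cfg.releaseMs);
      if (steps > 0) {
        s.level = Math.max(0, s.level - steps);
        s.lastFlagAt += steps * cfg.releaseMs;
      }
    }

    // Sliding window of THIS device's reconnect timestamps (denied attempts count too).
    s.hits = s.hits.filter((t) => now - t < cfg.windowMs);
    s.hits.push(now);
    const observed = s.hits.length;

    // Cold start: force the normal band and apply only the hard ceiling.
    const warming = now - startedAt < cfg.warmupMs;
    const effBand = warming ? 'normal' : band;
    const allowed = warming ? cfg.hardCeiling : cfg.baseMax;

    let reason = null;
    if (observed > cfg.hardCeiling) reason = 'hard-ceiling';
    else if (observed > allowed) reason = 'rate';
    if (!reason) return { allow: true, band: effBand, observed, allowed, level: s.level };

    // Tighten fast: escalate on every flagged attempt. The band only scales the backoff
    // of an already-flagged device; it never decides whether a device is flagged.
    s.level += 1;
    s.lastFlagAt = now;
    const raw = cfg.baseBackoffMs * 2 ** (s.level - 1) * (cfg.bandMult[effBand] ?? 1);
    return {
      allow: false, reason, band: effBand, observed, allowed, level: s.level,
      retryAfterMs: Math.min(cfg.maxBackoffMs, raw),
    };
  };
}

// Usage: one device storms after warm-up while a neighbor reconnects once.
const T0 = 1_000_000;
const check = createThrottle(T0);
const storm = [];
for (let i = 0; i < 8; i++) storm.push(check('STORM', T0 + 40_000 + i, 'normal'));
const neighbor = check('NEIGHBOR', T0 + 40_010, 'normal');
console.log(storm.map((v) => v.allow), neighbor.allow);
```

Because the band enters only through the backoff multiplier, a healthy device is allowed in any band, which is the property the "lag never gates the healthy" test checks.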
--- a/server/test/status-log-prune.test.js
+++ b/server/test/status-log-prune.test.js
@ -0,0 +1,48 @@
+'use strict';
+
+// #142 step 4 — global device_status_log retention sweep. Deterministic, in-process
+// (no server/port). Isolate the DB and set retention BEFORE requiring the module
+// (config reads env at load; database.js initialises a DB on load).
+
+const os = require('node:os');
+const path = require('node:path');
+const crypto = require('node:crypto');
+process.env.DATA_DIR = path.join(os.tmpdir(), 'st-statusprune-' + crypto.randomBytes(4).toString('hex'));
+process.env.STATUS_LOG_RETENTION_DAYS = '2';
+
+const { test } = require('node:test');
+const assert = require('node:assert/strict');
+const { db, pruneStatusLog } = require('../db/database');
+
+test('global sweep deletes rows older than retention across ALL devices, keeps recent', () => {
+  db.exec('DELETE FROM device_status_log'); // clean slate
+  const old = db.prepare("INSERT INTO device_status_log (device_id, status, timestamp) VALUES (?, ?, strftime('%s','now') - ?)");
+
+  // 5 days old (> 2d retention): an active device, a device NOT in the devices
+  // table (removed/idle — what the per-device insert-time prune never revisits),
+  // and the heartbeat offline_timeout status that bypasses logDeviceStatus.
+  old.run('live-dev', 'online', 5 * 86400);
+  old.run('removed-idle-dev', 'offline', 5 * 86400);
+  old.run('hb-dev', 'offline_timeout', 5 * 86400);
+  // recent (< retention): must survive, regardless of device existence / status.
+  old.run('live-dev', 'online', 0);
+  old.run('hb-dev', 'offline_timeout', 3600);
+
+  assert.equal(db.prepare('SELECT COUNT(*) c FROM device_status_log').get().c, 5, 'seeded 5 rows');
+
+  const deleted = pruneStatusLog();
+  assert.equal(deleted, 3, 'the 3 over-retention rows pruned (incl. removed-idle + offline_timeout paths)');
+
+  const remaining = db.prepare('SELECT device_id, status FROM device_status_log ORDER BY device_id').all();
+  assert.equal(remaining.length, 2);
+  // both survivors are the recent rows; no old row of any device/status survived
+  assert.deepEqual(remaining.map(r => r.device_id).sort(), ['hb-dev', 'live-dev']);
+  const oldestNow = db.prepare("SELECT MIN(timestamp) m FROM device_status_log").get().m;
+  const cutoff = Math.floor(Date.now() / 1000) - 2 * 86400;
+  assert.ok(oldestNow >= cutoff, 'no surviving row is older than the retention cutoff');
+});
+
+test('sweep is safe and idempotent on an empty/already-clean table', () => {
+  db.exec('DELETE FROM device_status_log');
+  assert.equal(pruneStatusLog(), 0, 'nothing to delete -> 0, no throw');
+});
--- a/server/ws/deviceSocket.js
+++ b/server/ws/deviceSocket.js
@ -6,6 +6,7 @@ const { db, pruneTelemetry, pruneScreenshots } = require('../db/database');
 const config = require('../config');
 const heartbeat = require('../services/heartbeat');
 const commandQueue = require('../lib/command-queue');
+const reconnectThrottle = require('../lib/reconnect-throttle');

 // Debounce window for marking a device offline on socket disconnect. Brief
 // flap (Wi-Fi blip, Engine.IO ping miss, server-side eviction-then-reconnect)
@ -27,6 +28,12 @@ const OFFLINE_DEBOUNCE_MS = 5000;
 // event is still forwarded every time, so the UI is unaffected. In-memory only.
 const lastPlayLogAt = new Map();
 const PLAY_LOG_MIN_GAP_MS = 2000;
+
+// #142 content-ack dedup. An older app can spam "content <id>: ready" for the same
+// item; each was logged + emitted individually (secondary load). Suppress identical
+// (device_id, content_id, status) reports within config.contentAckDedupMs. A status
+// CHANGE has a different key and passes immediately. In-memory; resets on restart.
+const lastContentAck = new Map();
 const { getUserPlan, getUserDeviceCount } = require('../middleware/subscription');
 // Phase 2.3: deviceRoom() resolves a device_id to its workspace room so
 // dashboardNs.emit can be scoped instead of broadcast platform-wide.
@ -353,6 +360,23 @@ module.exports = function setupDeviceSocket(io) {
            return;
          }

+          // #142: per-device reconnect throttle. Only GENUINE reconnects (a new
+          // socket) count — same-socket playlist refreshes (isPlaylistRefresh) are
+          // exempt. This runs BEFORE the heavy register work (DB writes, playlist
+          // build) so a single flapping device cannot saturate the event loop. The
+          // verdict is per-device; global lag only scales an already-flagged
+          // device's backoff, never gates a healthy one.
+          if (!isPlaylistRefresh) {
+            const verdict = reconnectThrottle.check(device_id);
+            if (!verdict.allow) {
+              console.warn(`[throttle] device ${device_id} reconnect throttled: reason=${verdict.reason} band=${verdict.band} observed=${verdict.observed}/${verdict.allowed} per ${config.reconnectWindowMs}ms -> backoff ${verdict.retryAfterMs}ms (level ${verdict.level})`);
+              socket.emit('device:throttled', { retry_after_ms: verdict.retryAfterMs, reason: 'reconnect_rate' });
+              // nextTick disconnect so the throttle notice flushes first.
+              process.nextTick(() => { try { socket.disconnect(true); } catch (_) { /* */ } });
+              return;
+            }
+          }
+
          currentDeviceId = device_id;
          authenticated = true;
          // Cancel any pending offline timer - device is back in the grace window
@ -372,8 +396,12 @@ module.exports = function setupDeviceSocket(io) {
          }

          if (device_info) {
-            db.prepare('UPDATE devices SET android_version = ?, app_version = ?, screen_width = ?, screen_height = ?, render_width = ?, render_height = ? WHERE id = ?')
-              .run(device_info.android_version, device_info.app_version, device_info.screen_width, device_info.screen_height, device_info.render_width ?? null, device_info.render_height ?? null, device_id);
+            db.prepare(`UPDATE devices SET android_version = ?, app_version = ?, screen_width = ?, screen_height = ?, render_width = ?, render_height = ?,
+              ota_status = ?, ota_target_version = ?, ota_attempts = ?, ota_updated_at = strftime('%s','now') WHERE id = ?`)
+              .run(device_info.android_version, device_info.app_version, device_info.screen_width, device_info.screen_height, device_info.render_width ?? null, device_info.render_height ?? null,
+                // #139 Phase 2: older APKs don't send these — default to a clean 'none' state.
+                device_info.ota_status ?? 'none', device_info.ota_target_version ?? null, device_info.ota_attempts ?? 0,
+                device_id);
          }

          heartbeat.registerConnection(device_id, socket.id);
@ -557,6 +585,13 @@ module.exports = function setupDeviceSocket(io) {
      if (!requireDeviceAuth()) return;
      const { device_id, content_id, status } = data;
      if (device_id !== currentDeviceId) return;
+      // #142: drop repeats of the same (device, content, status) within the dedup
+      // window. Only a change (new content/status) or a report after the window
+      // logs+emits, so a device spamming the same "ready" can't add load.
+      const ackKey = `${device_id}|${content_id}|${status}`;
+      const nowAck = Date.now();
+      if (nowAck - (lastContentAck.get(ackKey) || 0) < config.contentAckDedupMs) return;
+      lastContentAck.set(ackKey, nowAck);
      console.log(`Device ${device_id} content ${content_id}: ${status}`);
      emitToDeviceWorkspace(dashboardNs, device_id, 'dashboard:content-ack', { device_id, content_id, status });
    });
@ -585,6 +620,20 @@ module.exports = function setupDeviceSocket(io) {
      });
    });

+    // #139 Phase 2 (Option B): event-driven OTA status. The device announces a status TRANSITION
+    // ('manual_update_required' on enter-backoff, 'none' on clear) so the dashboard badge updates
+    // promptly without waiting for a reconnect. The register path still persists these fields too
+    // (the reconnect backstop if a transition event is missed). Same columns + ?? defaults.
+    socket.on('device:ota-status', (data) => {
+      if (!requireDeviceAuth()) return;
+      const { device_id, ota_status, ota_target_version, ota_attempts } = data || {};
+      // Unknown / forged / mismatched id -> no-op. WHERE id = ? also makes an unregistered id a
+      // 0-row update (never throws), so a stray event can't error the socket.
+      if (!device_id || device_id !== currentDeviceId) return;
+      db.prepare("UPDATE devices SET ota_status = ?, ota_target_version = ?, ota_attempts = ?, ota_updated_at = strftime('%s','now') WHERE id = ?")
+        .run(ota_status ?? 'none', ota_target_version ?? null, ota_attempts ?? 0, device_id);
+    });
+
    // Play event logging (proof-of-play)
    socket.on('device:play-event', (data) => {
      if (!requireDeviceAuth()) return;
--- a/tizen/config.xml
+++ b/tizen/config.xml
@ -1,6 +1,6 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <widget xmlns="http://www.w3.org/ns/widgets" xmlns:tizen="http://tizen.org/ns/widgets"
-        id="http://screentinker.com/player" version="1.9.1" viewmodes="maximized">
+        id="http://screentinker.com/player" version="1.9.2" viewmodes="maximized">
    <tizen:application id="ScrnTinkr1.ScreenTinker" package="ScrnTinkr1" required_version="2.4"/>
    <tizen:profile name="tv"/>
    <name>ScreenTinker</name>
Author	SHA1	Message	Date
ScreenTinker	d9fb914b9e	chore(release): v1.9.2-beta1	2026-06-27 19:59:34 -05:00
ScreenTinker	ce78d0dde4	docs(#142 ): 1.9.2-beta1 changelog + device_status_log VACUUM maintenance note Documents the #142 changes and tells operators with an already-bloated device_status_log to reclaim space with a one-time manual VACUUM in a maintenance window (retention now bounds further growth). Explains why auto-VACUUM is not enabled. New doc: docs/maintenance-device-status-log.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 19:59:17 -05:00
ScreenTinker	f206537fed	Merge #142 (reconnect-storm hardening) into main for 1.9.2-beta1 Brings the full #142 stack onto main on top of the 1.9.1 stable cut: - device_status_log index + de-dupe - event-loop lag telemetry (bounded) - load-aware per-device reconnect throttle (the outage fix) - global device_status_log retention sweep (STATUS_LOG_RETENTION_DAYS) - content-ack dedup - provisioning-row cleanup window 365d -> 24h	2026-06-27 19:56:46 -05:00
ScreenTinker	139d7d09fa	fix(#142 ): provisioning-row cleanup window 365d -> 24h (matches its own comment) services/heartbeat.js deleted unclaimed provisioning devices with created_at < now - (365 * 86400) — a YEAR — while its own comment said "older than 24 hours". So socket-register pairing junk lingered ~365x longer than intended. Change the window to 24 * 3600 to match the comment. Correctness fix only — does NOT touch the pre-auth register path or add a rate limiter (that pre-auth hardening is a separate security issue, out of this cut). Extracted the sweep into pruneProvisioningDevices() (still in heartbeat.js, called from the same interval) so it is unit-testable. Test asserts a >24h unclaimed provisioning row is swept while a <24h row, an imported row (user_id set), and a non-provisioning row are kept. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 19:56:32 -05:00
ScreenTinker	852219cb45	chore(release): v1.9.1	2026-06-27 19:50:09 -05:00
ScreenTinker	15448d1c5d	fix(#142 ): dedup repeated content-ack reports (secondary load) device:content-ack logged + emitted every message, so a device repeatedly reporting the same "content <id>: ready" (observed from an older app version) added avoidable load per message. - Suppress identical (device_id, content_id, status) reports within config.contentAckDedupMs (default 10s), modeled on the lastPlayLogAt throttle. A status change has a different key and passes immediately; a fresh report after the window passes too. In-memory, resets on restart. The handler does no DB writes, so this is purely shedding redundant log+emit work. test: integration over a real authenticated device socket — a burst of identical "ready" collapses to one log/emit, a "ready" after the window passes, and a status change is never deduped. Unique PORT (3984). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 19:35:04 -05:00
ScreenTinker	29a8896aa8	fix(#142): global device_status_log retention sweep + STATUS_LOG_RETENTION_DAYS The per-device insert-time prune (deviceSocket.js) only ever touches a device that is actively inserting, so it misses two paths: removed/idle devices whose rows linger forever, and heartbeat.js's offline_timeout insert that bypasses logDeviceStatus entirely. The reporter's 1.2M-row bloat accumulated UNDER a 7-day per-device prune for exactly this reason. - pruneStatusLog() (db/database.js): a GLOBAL time-range sweep across ALL devices, modeled on the play_logs prune. Run once on startup (recovers a bloated table right after deploy) and on the heartbeat interval (services/heartbeat.js). - STATUS_LOG_RETENTION_DAYS env, default 3 (lower than the old hardcoded 7d; the dashboard only shows a 24h uptime window, so 2-3d is ample for diagnostics). - Deliberately NO per-device row cap: Step 3's throttle already bounds how fast a storming device can generate status rows, so a cap would add sweep complexity for little gain (noted for later if needed). - NO VACUUM / auto_vacuum here (kept off the hot path); space reclaim is left as a separate decision (see report). test: deterministic in-process unit test proves the sweep deletes over-retention rows across all devices (including a device absent from the devices table and an offline_timeout row) while keeping recent rows; idempotent on an empty table. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 19:34:19 -05:00
ScreenTinker	101f086204	fix(#142): load-aware per-device reconnect throttle (the outage fix) Gates genuine reconnects PER DEVICE before the heavy register work (DB writes + playlist build) runs, so a single flapping device can no longer saturate the event loop and take down the server. - Actuator is per-device, keyed on device_id (modeled on lastPlayLogAt). A device is flagged only when it exceeds reconnectBaseMax genuine reconnects per window. Same-socket playlist refreshes (isPlaylistRefresh) are exempt. - Load-awareness is BANDED (normal/elevated/critical from the step-2 lag signal), not a continuous controller. The band only MULTIPLIES an already-flagged device's backoff; global lag never gates a healthy device. - Hysteresis: escalate immediately while storming (tighten fast); decay one level per reconnectReleaseMs of calm (release slow). - HARD CEILING per device, independent of band and warm-up; a slow-ramp attacker can't train through it. - COLD START: for reconnectWarmupMs after boot, force the normal band and apply only the hard ceiling, so a full-fleet reconnect after a deploy doesn't throttle healthy screens. State is in-memory, resets on restart. - Observability: every throttle engagement logs device, band, observed vs allowed rate, and backoff. Throttled device gets device:throttled + a deferred disconnect. Tests (api.test.js style): - unit: healthy-never-throttled, storm-throttled-with-growing-backoff, band multiplies backoff, hard-ceiling-even-in-warmup, warm-up leniency, neighbor isolation, slow release. - integration GATE (the required one): full-fleet reconnect right after restart throttles NO healthy device; a single device storming IS throttled; a neighbor stays unaffected while another storms. - also fixes pre-existing test PORT collisions (my new integration files clashed with totp.test.js:3979 and totp-keyrotation.test.js:3980 -> moved to 3982/3983); full suite now green serially AND in parallel.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 19:18:00 -05:00
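The per-device gating in the commit above can be approximated by a small in-memory counter. This is a simplified sketch under stated assumptions, not `lib/reconnect-throttle.js`: the thresholds, the doubling backoff formula, the band multipliers, and the names `createThrottle` and `onReconnect` are all hypothetical, and the slow-release decay (`reconnectReleaseMs`) is left out for brevity.

```javascript
// Hypothetical sketch; thresholds, names, and the backoff formula are assumptions.
const BAND_MULTIPLIER = { normal: 1, elevated: 2, critical: 4 };

function createThrottle(opts = {}) {
  const { windowMs = 60000, baseMax = 5, hardCeiling = 20, baseBackoffMs = 1000 } = opts;
  const state = new Map(); // device_id -> { windowStart, count, level }

  return function onReconnect(deviceId, nowMs, band = 'normal', inWarmup = false) {
    const prev = state.get(deviceId);
    let s = prev;
    if (!s || nowMs - s.windowStart >= windowMs) {
      // New counting window; the escalation level carries over (no decay in this sketch).
      s = { windowStart: nowMs, count: 0, level: prev ? prev.level : 0 };
      state.set(deviceId, s);
    }
    s.count += 1;

    // During warm-up only the hard ceiling applies; otherwise the per-window base limit.
    const limit = inWarmup ? hardCeiling : baseMax;
    if (s.count <= limit) return { allowed: true, backoffMs: 0 };

    // Flagged: escalate immediately. The band only scales an already-flagged device.
    s.level += 1;
    const multiplier = inWarmup ? 1 : BAND_MULTIPLIER[band];
    return { allowed: false, backoffMs: baseBackoffMs * 2 ** (s.level - 1) * multiplier };
  };
}

const onReconnect = createThrottle();
const storm = [];
for (let i = 0; i < 7; i++) storm.push(onReconnect('storming', i * 100));
const neighbor = onReconnect('healthy', 700, 'critical');
console.log(storm[5], storm[6], neighbor);
```

With these illustrative defaults, the storming device's sixth and seventh reconnects are refused with backoffs of 1000 ms and 2000 ms, while the neighbor is admitted even though the global band is critical, because the band is only consulted after a device has exceeded its own limit.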
ScreenTinker	ed3cf72b82	feat(#142): event-loop lag telemetry (perf_hooks) + bounded storage Continuously samples event-loop delay via perf_hooks.monitorEventLoopDelay() (C++-backed histogram; cheap). Each window persists mean/p50/p99/max to a new event_loop_lag table and recomputes a coarse load band (normal/elevated/critical) from the window p99. Standalone value: current lag is exposed on /api/status and band changes are logged, so site lag is diagnosable independent of throttling. The band feeds the #142 reconnect throttle (next commit) but ships first as its own subsystem. - event_loop_lag is bounded from day one: indexed on sampled_at + scheduled prune (LAG_TELEMETRY_RETENTION_DAYS, small default) modeled on the play_logs prune. Deliberately NOT another unbounded-growth table. - Band transitions are asymmetric: jump up immediately (tighten fast), release one level at a time after N calm samples below a deadband (release slow, no flap). Pure nextBand() function, unit-tested deterministically. - config: LAG_SAMPLE_INTERVAL_MS, LAG_RESOLUTION_MS, LAG_TELEMETRY_RETENTION_DAYS, LAG_PRUNE_INTERVAL_MS, LAG_ELEVATED_MS, LAG_CRITICAL_MS, LAG_RELEASE_SAMPLES. - tests: band-transition unit tests; integration proves sampling persists, stays bounded under the prune, and surfaces on /api/status. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 19:01:08 -05:00
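The asymmetric band transition described above (jump up at once, step down one level after a run of calm samples) can be sketched as a pure function. This is an illustration only: the commit names a pure `nextBand()` but does not show its signature, so the state shape, the `deadbandMs` parameter, and the threshold values below are assumptions.

```javascript
// Hypothetical sketch; the real nextBand() signature and defaults are not shown in the commit.
const BANDS = ['normal', 'elevated', 'critical'];

function nextBand(state, p99Ms, cfg) {
  const { elevatedMs, criticalMs, releaseSamples, deadbandMs } = cfg;
  const target = p99Ms >= criticalMs ? 2 : p99Ms >= elevatedMs ? 1 : 0;
  const current = BANDS.indexOf(state.band);

  if (target > current) return { band: BANDS[target], calm: 0 }; // tighten fast
  if (current === 0) return { band: 'normal', calm: 0 };

  // "Calm" means clearly below the threshold that admitted the current band.
  const entryMs = current === 2 ? criticalMs : elevatedMs;
  if (p99Ms < entryMs - deadbandMs) {
    const calm = state.calm + 1;
    if (calm >= releaseSamples) return { band: BANDS[current - 1], calm: 0 }; // release one level
    return { band: state.band, calm };
  }
  return { band: state.band, calm: 0 }; // inside the deadband or above: reset the calm streak
}

// Illustrative values only; they stand in for LAG_ELEVATED_MS, LAG_CRITICAL_MS, LAG_RELEASE_SAMPLES.
const cfg = { elevatedMs: 50, criticalMs: 200, releaseSamples: 3, deadbandMs: 10 };
let state = { band: 'normal', calm: 0 };
const bands = [];
for (const p99 of [250, 20, 20, 20, 20, 20, 20]) {
  state = nextBand(state, p99, cfg);
  bands.push(state.band);
}
console.log(bands);
```

One spike moves the band straight to critical, and it then takes three calm samples per level to step back down, which is the "release slow, no flap" behaviour the commit describes.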
ScreenTinker	d90cfb3986	fix(#142): index device_status_log + de-dupe its CREATE TABLE The dashboard uptime query (WHERE device_id=? AND timestamp>?) and the per-device retention prune (WHERE device_id=? AND timestamp<?) were both full table scans. At 1M+ rows (the outage report) this was the dashboard-degradation cause that persisted even after the reconnect storm stopped. - schema.sql: add idx_device_status_log_device_ts(device_id, timestamp); both queries now SEARCH ... USING INDEX instead of SCAN (verified via EXPLAIN). - database.js: same index as a migration for existing DBs (idempotent). - schema.sql defined device_status_log twice; drop the duplicate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 18:54:57 -05:00
ScreenTinker	f96b65576f	chore(release): guard bump-version.sh against a diverged origin/main Add a pre-push fast-forward check: fetch origin/main and abort if it has commits not in local HEAD, BEFORE the annotated tag is created. Prevents the beta9 incident where origin/main had advanced by one commit so 'git push origin main' was rejected, but the tag pushed anyway and fired release.yml from a commit not on main. Best-effort fetch: warns and proceeds when offline (the push stays the backstop). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 12:26:23 -05:00
ScreenTinker	ed164647b8	Merge origin/main (Update SECURITY.md) into beta9 cut	2026-06-25 12:16:47 -05:00
ScreenTinker	ae018b8eea	chore(release): v1.9.1-beta9	2026-06-25 12:06:44 -05:00
ScreenTinker	071d7cc9c3	fix(server): persist per-item mute into the published snapshot (#129) A mute toggle wrote the draft playlist_items + emitted a live device:mute-changed but only markDraft()'d; it never updated playlists.published_snapshot, the copy the device actually plays. So the device's item.muted stayed 0 and every loop/reload re-applied full volume: dashboard icon red but audio kept playing (Android; web's native <video> loop masked it). emitMuteChanged now surgically patches the matching item's muted (0/1) inside the published_snapshot and re-pushes the playlist, so loops re-apply the correct flag. Surgical patch (not publishPlaylist) so a mute toggle can't prematurely publish other draft edits or flip publish state. Adds a regression test that fails without the patch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 12:06:29 -05:00
screentinker	1e1ed7e29a	Update SECURITY.md	2026-06-24 12:09:25 -05:00
ScreenTinker	36c4bf523f	chore(release): v1.9.1-beta8	2026-06-24 11:43:31 -05:00
ScreenTinker	16c381254b	fix(android): lower minSdk 26 -> 24 to support Android 7.0/7.1 panels (#141) Covers API 24 (7.0) + 25 (7.1.2); all 26+ APIs were already guarded with graceful else branches; no dependency bumps. Validated on API 24 + 25 emulators: install, foreground service, #139 OTA verify on the legacy GET_SIGNATURES path (incl. tampered-refuse), EncryptedSharedPreferences, and playback. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 11:38:56 -05:00
Christopher Cookman	01e5b10f53	feat(setup): Debian 13 player/server install script (#137) Community contribution from @ChrisChrome (tested on Debian 13 headless). Adds scripts/debian-13-setup.sh (server/player/both modes, systemd units, kiosk autologin, and management scripts for status/update/logs), modeled on the Raspberry Pi setup. Also fixes Chromium fullscreen by detecting screen resolution at runtime (replacing --start-fullscreen), applied to both the Debian and Pi scripts, plus a README entry. Maintainer review fix: the kiosk wait-loop now polls /api/status (the server's real readiness endpoint) instead of the non-existent /api/health, which had been silently burning the ~120s timeout on every all-in-one boot (bug inherited from the Pi script, fixed in both).	2026-06-23 23:47:22 -05:00
ScreenTinker	9c990ff91f	chore(release): v1.9.1-beta7	2026-06-23 23:23:00 -05:00
ScreenTinker	a6fe849c67	Merge fix/ota-redownload-loop (#140): stop OTA re-download loop on devices that can't silently install (#139)	2026-06-23 23:22:29 -05:00
ScreenTinker	0c0a8dd68a	fix(ota): surface stuck OTA on dashboard + read APK signer correctly on API 28/29 (#139) Follow-up to the cache/backoff loop fix (`aa23cf0`): make a device that can't self-install visible to operators, and fix the signature-verify bug that kept the whole #139 fix from engaging on the actual Fire OS target. Dashboard surface (Phase 2): - devices gains ota_status / ota_target_version / ota_attempts / ota_updated_at via the idempotent ALTER TABLE ADD COLUMN migration (non-destructive, default-backfilled, idempotent on re-run). - The device reports ota_status (OtaThrottle.statusFor -> none | pending | manual_update_required) in device_info; the server persists it on register (the reconnect backstop). devices d.* already surfaces it to the dashboard. - Dashboard shows a non-blocking amber badge when manual_update_required ("Update available (vX) - install failed N times, manual update required"); i18n key in en.js (non-en inherits via the en fallback). Server suite +1 test. Event-driven status (Option B): - New device:ota-status WS message, emitted on STATE TRANSITIONS only (enter-backoff -> manual_update_required, clear -> none), so the badge updates promptly without waiting for a reconnect and without per-poll/heartbeat chatter. Server handler persists the same fields; an unknown/forged device_id is a safe no-op. The register-path persist stays as the reconnect backstop. Signature-verify fix (the critical piece): verifyApkSignature read the downloaded APK's signer via getPackageArchiveInfo(GET_SIGNING_CERTIFICATES).signingInfo, but that field is null for ARCHIVE files on API 28/29 (populated only from API 30). On Fire OS 8 (Android 9 / API 28) - the actual deployment target - this returned 0 certs from a correctly-signed APK, so every OTA was refused as "tampered," the cache was deleted, and the full APK re-downloaded every check cycle.
This was the real cause of the #139 re-download loop, NOT a silent-install failure: the cache and backoff added in this branch sit behind this verify gate and never engaged on the target. Fix: below API 30, read the archive's signer via the legacy GET_SIGNATURES + .signatures (its v1/JAR cert, which IS populated on 28/29). Keep GET_SIGNING_CERTIFICATES + signingInfo for API >= 30 and for the installed-app read (which works on 28+). The archive's signer is still extracted and compared to the installed app's signer; a mismatch or zero-cert APK is still rejected. This reads the cert correctly on old APIs - it does not weaken verification. Verified on emulators: - API 28: verify now passes for a legit APK (was: 0 certs, refused). Full backoff then engages - 8.5MB pulled once, cache-hit on retries, backoff after 3, manual_update_required emitted once; clears on successful update. - API 28 negative: a re-signed (different-key) APK is still refused on cert MISMATCH - no hole opened. - API 30: unchanged path still passes (no regression). - server suite 173/173, OtaThrottleTest 7/7. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 22:49:01 -05:00
ScreenTinker	aa23cf02dd	fix(ota): stop OTA re-download loop on devices that cannot silently install (#139) Devices that download an OTA APK but cannot silently install it (Fire TV: no device-owner path) re-downloaded the full APK every check cycle indefinitely - install never completes, version never advances, next check re-triggers. Client (UpdateChecker.kt, ServerConfig.kt, OtaThrottle.kt): - Reuse a cached, signature-verified APK instead of re-downloading every cycle; delete leftover invalid files; keep the verified APK on disk as the manual-install artifact. - Persisted per-version attempt budget (EncryptedSharedPreferences) so it survives the Fire OS app restarts that drive the loop. An attempt is counted only when an install is launched - a download/verify failure does not consume the budget, so a transient network problem cannot park a healthy device in backoff. After 3 failed installs, back off to one retry per 24h. - Clear OTA state and caches when a check returns update_available=false while state is pending (app relaunched as the new version). - Report OTA status to the dashboard via device:log (tag ota) on state transitions only (enter-backoff, clear) to avoid flooding the channel. - Extract throttle decision logic into a pure OtaThrottle object (no Android deps) with JUnit coverage (OtaThrottleTest) for the state transitions. Server (server.js): - Reword /download/apk log from "OTA update in progress" to "APK served" and rate-limit to once per IP / 10 min so a looping device cannot flood the log. Note: client-cooperative fix - prevents the loop in cohorts running this APK. Currently-stuck beta4 devices still require a one-time manual update. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 19:53:55 -05:00