# SSH Subsystem (Operator Runbook)

The orchestrator can run shell commands on remote servers (`SshExec`) and
move files between the workspace and remote hosts (`SshUpload`/`SshDownload`)
through a dedicated, audited SSH subsystem. Like the MCP integration,
**the feature is off by default** and requires a key + config flip to enable.

This document is the **operator runbook** for setting up, granting access,
verifying host keys, rotating the master key, and troubleshooting. For the
LLM-facing tool semantics see
[docs/tools/ssh-tools.md](./tools/ssh-tools.md). For the internal design
(threat model, risk register, schema, 12-step orchestration flow) see the
plan doc.

## At a glance

| Aspect | Behavior |
|---|---|
| Default | `ssh.enabled: false`: tools hidden, panels hidden, API returns 503 |
| Tools exposed when enabled | `SshExec`, `SshUpload`, `SshDownload` |
| Authentication | Public-key only; passwords are **not** supported |
| Host key trust | TOFU (Trust-On-First-Use) with explicit verify; mismatch fails closed |
| Connection ownership | User-owned (private) **or** Global (admin-managed, shared via grants) |
| Encryption at rest | AES-256-GCM (per-row DEK, master key = `MCP_ENCRYPTION_KEY`) |
| Audit | Dedicated `ssh_audit_log` table, `pending → success/failed/denied/aborted` lifecycle |
| Abuse defense | 3-scope counters (`user` / `host:user` / `host`) with auto-lock |
| Network policy | SSRF strict by default; per-connection opt-in for private IPs |
| Algorithm policy | Strict allowlist (no SHA1-RSA, no weak DH/HMAC) |

## Prerequisites

### 1. `MCP_ENCRYPTION_KEY`

The SSH subsystem **shares the same master key as MCP**: there is only one
key per orchestrator. All private keys, passphrases, and global-connection
DEKs are encrypted with AES-256-GCM under a per-row DEK, and each DEK is
wrapped by this master key.

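
To make the wrapping scheme concrete, here is a minimal sketch of envelope encryption with Node's built-in `crypto` module. The `Sealed` shape and the `seal`, `unseal`, `encryptRow`, and `decryptRow` helpers are hypothetical names chosen for illustration; they are not the orchestrator's actual code or column layout.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical sealed-blob shape; the real column layout may differ.
interface Sealed {
  iv: Buffer;
  tag: Buffer;
  data: Buffer;
}

// AES-256-GCM encrypt under `key` with a fresh 12-byte IV.
function seal(key: Buffer, plaintext: Buffer): Sealed {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), data };
}

// AES-256-GCM decrypt; throws if the key is wrong or the blob was altered.
function unseal(key: Buffer, sealed: Sealed): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, sealed.iv);
  decipher.setAuthTag(sealed.tag);
  return Buffer.concat([decipher.update(sealed.data), decipher.final()]);
}

// Envelope encryption: the secret is sealed under a fresh per-row DEK,
// and the DEK itself is sealed ("wrapped") under the master key.
function encryptRow(masterKey: Buffer, secret: Buffer) {
  const dek = randomBytes(32);
  const row = { secretEnc: seal(dek, secret), dekEnc: seal(masterKey, dek) };
  dek.fill(0); // best-effort wipe of our own copy
  return row;
}

function decryptRow(masterKey: Buffer, row: ReturnType<typeof encryptRow>): Buffer {
  const dek = unseal(masterKey, row.dekEnc);
  try {
    return unseal(dek, row.secretEnc);
  } finally {
    dek.fill(0);
  }
}
```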
Generate the key once (32 bytes = 64 hex chars):

```bash
openssl rand -hex 32
```

Export it before starting the server:

```bash
# Set this to the 64-hex value generated above.
export MCP_ENCRYPTION_KEY=
scripts/server.sh start
```

If `MCP_ENCRYPTION_KEY` is **not** set when `ssh.enabled: true`, the SSH
subsystem boots **fail-soft**: a warning is logged, all SSH endpoints return
503, the tools are hidden from the LLM, and the UI panels show a
configuration error banner. Other features (MCP excepted) continue normally.

> ⚠ **Key rotation invalidates existing encrypted material.** There is a
> built-in [master key rotation flow](#master-key-rotation) that rewraps
> every row in maintenance mode. Do **not** swap the env var manually
> without using that flow: half-rotated state breaks every connection.

### 2. `ssh.enabled: true`

Flip the flag in `config.yaml`:

```yaml
ssh:
  enabled: true
```

This is the master switch. With it `false`:

- HTTP endpoints (`/api/ssh/*`, `/api/ssh/admin/*`) return 503
- Tool defs are not exposed to the LLM (the dispatcher returns null)
- UI panels render an "SSH is disabled" empty state
- Database tables remain present (no destructive change)

Restart is **not** required: `ConfigManager` reload picks up the change and
rebuilds the SSH router.

### 3. `system_deks` bootstrap

The first time the orchestrator boots with `ssh.enabled: true` AND a valid
`MCP_ENCRYPTION_KEY`, it provisions a single row in `system_deks` (via
`INSERT OR IGNORE` inside a transaction, `CHECK(id=1)`). This DEK encrypts
**global connections** (those without an owner).

On every subsequent boot, `verifySystemDek` decrypts the stored DEK to prove
the master key still works. If it fails (key rotated outside the rotation
flow, or env var differs from when the DEK was wrapped), SSH **fails closed
for the session** and a `system_dek_verify_failed` error is logged.
User-owned connections may still partially work (their DEKs are wrapped per
user), but global connections will all error.

### 4. Optional: `allow_private_addresses`

By default, SSH connections are routed through the SSRF strict-check, which
blocks loopback (127.0.0.0/8, ::1) and private (10/8, 172.16/12, 192.168/16,
fc00::/7, 169.254/16) addresses. For **LAN targets** you must opt in. There
are two scopes:

```yaml
ssh:
  enabled: true
  allow_private_addresses: true   # global default
```

```sql
-- per-connection opt-in (admin-only flag, audited)
UPDATE ssh_connections SET allow_private_addresses=1 WHERE id=?;
```

The per-connection flag is preferred because it narrows the blast radius.
The global flag exists for trusted dev networks (homelab, isolated VPC). The
per-connection flag can only be set on **global** (admin-managed)
connections; for user-owned connections, the global flag applies.

## Quickstart

```bash
# 1. Set the key
openssl rand -hex 32 > ~/.mcp_encryption_key
export MCP_ENCRYPTION_KEY=$(cat ~/.mcp_encryption_key)

# 2. Enable SSH + allow LAN
cat >> config.yaml <<'YAML'
ssh:
  enabled: true
  allow_private_addresses: true   # only if you're targeting LAN
YAML

# 3. Restart
scripts/server.sh restart
```

Then in the UI:

1. **Settings → User Folder → SSH Connections → Add**
2. Fill `label`, `host`, `port` (default 22), `username`, paste private key
   (OpenSSH PEM)
3. Optionally set `remote_path_prefix` (default `/`), which restricts
   upload/download paths
4. Click **Test** → first call returns `host_key_first_observe` with a
   fingerprint
5. **Verify** in the dialog (compare fingerprint with what you expect from
   `ssh-keyscan`)
6. Add the connection's UUID to a piece's `allowed_ssh_connections`:

   ```yaml
   # pieces/example.yaml
   name: ssh-example
   movements:
     - name: deploy
       allowed_tools: [SshExec, SshUpload]
       allowed_ssh_connections: ["abcd1234-..."]
       rules:
         - condition: done
           next: COMPLETE
       instruction: |
         Use SshExec to ...
   ```

7. Test the piece via the normal task UI.

## `config.yaml` Reference

Full SSH section with defaults:

```yaml
ssh:
  # master switch
  enabled: false

  # SSRF policy: when true, allow private/loopback addresses (global)
  allow_private_addresses: false

  # wall-clock timeout for connect + handshake + exec/transfer (seconds)
  call_timeout_seconds: 30

  # stdout/stderr byte cap for SshExec (bytes)
  max_output_bytes: 32768   # 32 KiB

  # SFTP transfer size caps (MB)
  max_upload_size_mb: 100
  max_download_size_mb: 100

  # ssh_audit_log retention (days). Admin can prune via UI.
  audit_retention_days: 90

  # When true (default), admins can use any connection without an explicit
  # grant. Audited regardless. Set false for stricter least-privilege.
  admin_bypasses_grants: true

  # Abuse counters
  abuse_window_minutes: 10      # rolling window for failure counting
  abuse_failure_threshold: 5    # failures within window → lock
  abuse_lock_minutes: 30        # lock duration on threshold breach
```

All keys translate to camelCase in `SshRuntimeConfig` (`src/ssh/config.ts`).
The `transformKeys` helper in `src/config.ts` handles the conversion.

## Connection Model

### Owner

Each row in `ssh_connections` has an `owner_id`:

| Owner | Visibility | Who creates |
|---|---|---|
| **User-owned** (`owner_id = userId`) | Only the owner; admin can also list but not edit | Any authenticated user (`POST /api/ssh/connections`) |
| **Global** (`owner_id IS NULL`) | All users see it in the picker (subject to grants) | Admin only (`POST /api/ssh/admin/globals`) |

Global connections solve the "team-shared infra account" use case: a single
set of credentials that multiple users invoke under their own identity,
audited per user, gated by grants.

### Encryption

For each connection:

1. Generate a fresh 32-byte DEK
2. Encrypt `private_key_pem` (and optionally `passphrase`) with the DEK
3. Wrap the DEK with the master key (`MCP_ENCRYPTION_KEY`)
4. Store: `private_key_enc`, `private_key_dek_enc`, `key_version`

`key_version` allows progressive rewrap during master key rotation (each row
tracks which generation of master key its DEK is wrapped under).

Global connections use the single `system_deks` row (id=1) rather than a
per-row DEK.

> ⚠ ssh2's internal key handling is opaque: once the PEM is loaded into
> the library, it lives in JS heap memory for the lifetime of the
> connection. We zero our own copies (`buf.fill(0, 0)`) in the `finally`
> block but cannot reach into ssh2 internals. This is an acknowledged
> limitation; see the plan doc's "Acknowledged limitations" section.

## UI Walkthrough

### User: Settings → User Folder → SSH Connections

The **SshConnectionsPanel** lists the user's connections and any global
connections they have a grant for. Each row shows:

- Label + host:port + username
- Host key fingerprint + verify state (verified / pending / first_observe /
  mismatch)
- Lock state (if abuse counter triggered)
- Actions: **Test**, **Verify host key**, **Replace host key** (with
  reason), **Edit**, **Delete**

The "Add Connection" form (`SshConnectionForm`) collects:

- Label
- Host, port, username
- Private key (textarea, PEM format; passphrase optional)
- Remote path prefix (default `/`)
- Custom deny/allow regex patterns (newline-separated, validated at
  save-time)

`SshHostKeyDialog` opens on first_observe / mismatch and shows the observed
fingerprint side-by-side with the previously-stored one (if any). "Trust
this key" requires typing the fingerprint to confirm.

### Admin: Settings → SSH

Four sub-panels under `SshForm`:

| Panel | Component | Purpose |
|---|---|---|
| **Global Connections** | `SshGlobalConnectionsForm` | CRUD on global connections (`owner_id IS NULL`). Includes the `allow_remote_unrestricted` and per-connection `allow_private_addresses` flags |
| **Grants** | `SshGrantsForm` | List/create/delete grants. Per-piece or `applies_to_all_pieces`. Subject: user or org. Reason required |
| **Audit Log** | `SshAuditLog` | All-tenant audit view. Filter by action / outcome / connection / time range. Pagination |
| **Master Key Rotation** | `SshMasterKeyRotationForm` | Start a rotation job (provides new key, enters maintenance, rewraps rows). Polls status |

Admin can also force-unlock abuse counters from the per-connection page
(requires reason; rate-limited to 10/hour total).

## Host Key TOFU Flow

SSH security depends on knowing the **right** host key. We use
Trust-On-First-Use: the first time a connection is exercised, we record the
observed key and require explicit user verification before treating it as
trusted.

### States

`ssh_connections` carries four host-key columns:

| Column | Meaning |
|---|---|
| `host_key_b64` | The observed public key in OpenSSH base64 form. NULL = never observed. |
| `host_key_fingerprint` | SHA-256 fingerprint for UI display (`SHA256:...`). |
| `host_key_verified_at` | ISO8601 timestamp of the user's explicit "trust this key" action. NULL = pending. |
| `host_key_pending_token` | UUID issued at first_observe / mismatch; consumed atomically by `/verify-host-key`. |

A connection is **trusted** iff `host_key_verified_at IS NOT NULL` AND the
observed key during connect matches `host_key_b64`.

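
The trust rule and the fingerprint format can be sketched as below. This assumes the displayed fingerprint follows the usual OpenSSH convention (unpadded base64 of the SHA-256 digest of the key blob); the `HostKeyState` shape, the `classify` helper, and the order in which mismatch and not-verified are checked are illustrative assumptions rather than the code in `src/ssh/session.ts`.

```typescript
import { createHash } from "node:crypto";

// Hypothetical row shape mirroring two of the host-key columns above.
interface HostKeyState {
  host_key_b64: string | null;
  host_key_verified_at: string | null;
}

// OpenSSH-style fingerprint: "SHA256:" + unpadded base64 of SHA-256(key blob).
function fingerprint(hostKeyB64: string): string {
  const blob = Buffer.from(hostKeyB64, "base64");
  const digest = createHash("sha256").update(blob).digest("base64");
  return "SHA256:" + digest.replace(/=+$/, "");
}

type TrustResult =
  | "trusted"
  | "host_key_first_observe"
  | "host_key_not_verified"
  | "host_key_mismatch";

// Trusted only when a key is stored, it equals the observed key,
// and the user has explicitly verified it.
function classify(row: HostKeyState, observedB64: string): TrustResult {
  if (row.host_key_b64 === null) return "host_key_first_observe";
  if (row.host_key_b64 !== observedB64) return "host_key_mismatch";
  if (row.host_key_verified_at === null) return "host_key_not_verified";
  return "trusted";
}
```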
### Lifecycle

```
new connection
  │
  ├─ user clicks Test (or LLM calls SshExec)
  │     │
  │     ▼
  │   sshTest() observes the host key
  │     │
  │     ▼
  │   onFirstObserve hook fires
  │     - writes host_key_b64, host_key_fingerprint, host_key_pending_token
  │     - audit row: ssh.connection.host_key.first_observe
  │     - returns SshSessionError('host_key_first_observe')
  │
  │   (UI shows the fingerprint + pending token)
  │
  ├─ user clicks Verify (typing fingerprint to confirm)
  │     │
  │     ▼
  │   POST /api/ssh/connections/:id/verify-host-key
  │     {token, fingerprint}
  │     - atomic compare-and-set: token + fingerprint match → set host_key_verified_at
  │     - audit row: ssh.connection.host_key.verify
  │
  ▼
verified: Exec/Upload/Download now work
```

On `host_key_mismatch` (server rebuilt, key rotated, or MITM):

```
  ├─ Exec/Upload/Download calls sshExec/sshUpload/sshDownload
  │     │
  │     ▼
  │   ssh2 observes key ≠ host_key_b64
  │     │
  │     ▼
  │   onMismatch hook fires
  │     - writes new host_key_pending_token (DOES NOT overwrite host_key_b64 yet)
  │     - audit row: ssh.connection.host_key.mismatch
  │     - returns SshSessionError('host_key_mismatch')
  │
  │   (UI shows OLD vs NEW fingerprint side-by-side)
  │
  ├─ user investigates externally (ssh-keyscan, IT team, etc.)
  │
  ├─ user clicks "Replace key" with reason
  │     │
  │     ▼
  │   POST /api/ssh/connections/:id/replace-host-key
  │     {token, fingerprint, reason}
  │     - atomic compare-and-set
  │     - writes new host_key_b64, host_key_fingerprint, host_key_verified_at
  │     - audit row: ssh.connection.host_key.replace
```

The pending token mechanism prevents a "verify swap" race: if a second TOFU
observation happens between the user's verify request and its arrival, the
old token is overwritten and the verify endpoint returns `409 stale_token`.

### Banned algorithms

Even before TOFU completes, the host-key algorithm is checked against an
allowlist. SHA1-RSA and other weak algorithms are rejected before the key is
recorded (`host_key_alg_not_allowed`). This is hard-coded in
`src/ssh/session.ts` to avoid misconfiguration.

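
The pending-token race described in the lifecycle above can be illustrated with a small in-memory model. Everything here (`Row`, `observe`, `verifyHostKey`) is a hypothetical stand-in; in a real database the compare-and-set would more likely be a single conditional `UPDATE`, and a fingerprint mismatch may well be reported with a different error than `stale_token`.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical in-memory stand-in for one ssh_connections row.
interface Row {
  host_key_fingerprint: string | null;
  host_key_pending_token: string | null;
  host_key_verified_at: string | null;
}

// First observation: record the fingerprint and issue a fresh token,
// overwriting any token issued by an earlier observation.
function observe(row: Row, fingerprint: string): string {
  const token = randomUUID();
  row.host_key_fingerprint = fingerprint;
  row.host_key_pending_token = token;
  return token;
}

type VerifyResult = { status: 200 } | { status: 409; error: "stale_token" };

// Compare-and-set: succeed only if the caller's token and fingerprint still
// match what is stored, then consume the token so it cannot be replayed.
function verifyHostKey(row: Row, token: string, fingerprint: string): VerifyResult {
  const matches =
    row.host_key_pending_token !== null &&
    row.host_key_pending_token === token &&
    row.host_key_fingerprint === fingerprint;
  if (!matches) return { status: 409, error: "stale_token" };
  row.host_key_pending_token = null;
  row.host_key_verified_at = new Date().toISOString();
  return { status: 200 };
}
```

In this model, a token issued before a second observation is rejected with 409, while the most recent token succeeds exactly once.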
## Per-piece `allowed_ssh_connections`

A piece's movement must explicitly opt in to SSH usage. The piece-runner
enforces three invariants:

1. If a movement's `allowed_tools` contains any SSH tool name
   (`SshExec`/`SshUpload`/`SshDownload`), `allowed_ssh_connections` **must
   be declared** on that movement (even if empty)
2. The field must be an array of strings
3. Each entry must be `*` or a lowercase hex+hyphen UUID (≥ 8 chars)

Lint failures abort piece load.

### Forms

```yaml
# Explicit allowlist (most common)
allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]

# Wildcard (admin-style; use sparingly)
allowed_ssh_connections: ["*"]

# Deny-all (still allows SSH tool in allowed_tools but refuses every UUID)
allowed_ssh_connections: []
```

The `*` form skips the per-piece check but **does not** skip the
[access grant check](#access-grants). A user without a grant for a given
connection still cannot use it even when the piece says `*`.

### Example

```yaml
name: backup-rotation
description: Daily backup rotation on prod servers
movements:
  - name: list
    allowed_tools: [SshExec]
    allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]
    instruction: |
      List the existing backup files on each server.
    rules:
      - condition: ready to rotate
        next: rotate
  - name: rotate
    allowed_tools: [SshExec, SshUpload]
    allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]
    instruction: |
      Rotate the oldest backup ...
    rules:
      - condition: done
        next: COMPLETE
```

## Access Grants

Grants connect a **subject** (user or org) to a **connection**, scoped to a
**piece** (or all pieces, admin-only).

### Schema

```sql
CREATE TABLE ssh_connection_grants (
  id                    TEXT PRIMARY KEY,
  connection_id         TEXT NOT NULL,
  subject_type          TEXT NOT NULL,      -- 'user' | 'org'
  subject_id            TEXT NOT NULL,
  piece_name            TEXT,               -- NULL iff applies_to_all_pieces=1
  applies_to_all_pieces INTEGER NOT NULL DEFAULT 0,
  granted_by_user_id    TEXT NOT NULL,
  reason                TEXT NOT NULL,      -- required, ≥ 8 chars
  expires_at            TEXT,               -- ISO8601 or NULL
  created_at            TEXT NOT NULL
);
```

### Decision tree

For a given `(userId, orgIds, connectionId, pieceName)`:

1. **Owner check**: if `connection.owner_id == userId` → access granted
   (owner of own connection)
2. **Admin bypass**: if user is admin AND `ssh.admin_bypasses_grants: true`
   → granted (audited)
3. **Grant lookup**:
   - find rows where `connection_id = ?`
   - subject matches (`subject_type='user' AND subject_id=userId`) OR
     (`subject_type='org' AND subject_id IN orgIds`)
   - piece matches (`applies_to_all_pieces=1` OR `piece_name = ?`)
   - not expired (`expires_at IS NULL OR expires_at > now()`)
   - **any** matching row → granted
4. Otherwise → denied (`access denied (no_grant)`)

### Creating grants

Admin-only via `POST /api/ssh/admin/grants`:

```json
{
  "connection_id": "abcd1234-...",
  "subject_type": "user",
  "subject_id": "alice",
  "piece_name": "backup-rotation",
  "applies_to_all_pieces": false,
  "reason": "Alice owns backups for prod-east cluster",
  "expires_at": null
}
```

For `applies_to_all_pieces: true`:

- `piece_name` **must be null**
- the admin endpoint requires explicit `reason` containing scope
  justification
- audit row records `action: ssh.grant.create` with
  `detail.applies_to_all=true`
- this is the highest-privilege grant; review carefully

### Org grants

Same schema with `subject_type: "org"` and `subject_id` set to the org's
identifier. Membership comes from `user_gitea_orgs` (populated at login via
Gitea OAuth). A user with multiple org memberships matches grants for any of
those orgs.

### Expiration

`expires_at` is checked at decision time (no background sweep).
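
As a rough illustration of the decision tree, including the decision-time expiry check, the sketch below evaluates a request against an in-memory list of grants. The `Grant` fields follow the schema above; `AccessRequest` and `decide` are hypothetical names, and the real lookup presumably runs as a SQL query rather than an array scan.

```typescript
// Subset of the ssh_connection_grants columns relevant to the decision.
interface Grant {
  connection_id: string;
  subject_type: "user" | "org";
  subject_id: string;
  piece_name: string | null;
  applies_to_all_pieces: 0 | 1;
  expires_at: string | null; // ISO8601 or null
}

// Hypothetical request shape for illustration.
interface AccessRequest {
  userId: string;
  orgIds: string[];
  isAdmin: boolean;
  connectionId: string;
  connectionOwnerId: string | null; // null = global connection
  pieceName: string;
}

function decide(
  req: AccessRequest,
  grants: Grant[],
  adminBypassesGrants: boolean,
  now: Date = new Date(),
): "granted" | "no_grant" {
  // 1. Owner check
  if (req.connectionOwnerId === req.userId) return "granted";
  // 2. Admin bypass
  if (req.isAdmin && adminBypassesGrants) return "granted";
  // 3. Grant lookup: any matching, unexpired row is enough
  const matched = grants.some(
    (g) =>
      g.connection_id === req.connectionId &&
      ((g.subject_type === "user" && g.subject_id === req.userId) ||
        (g.subject_type === "org" && req.orgIds.includes(g.subject_id))) &&
      (g.applies_to_all_pieces === 1 || g.piece_name === req.pieceName) &&
      (g.expires_at === null || new Date(g.expires_at).getTime() > now.getTime()),
  );
  // 4. Otherwise denied
  return matched ? "granted" : "no_grant";
}
```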
Expired rows remain in the table for audit purposes. Admin can delete them
via `DELETE /api/ssh/admin/grants/:id`.

## Path Policy

### Local path (workspace)

For `SshUpload.local_path` and `SshDownload.local_path`:

- Resolved against `ctx.workspacePath` (the job's workspace root)
- `..` traversal → reject
- Symlinks: open with `O_NOFOLLOW`, lstat every parent → reject if any
  parent is a symlink leaving the workspace
- For download: parent directory must exist; target file must NOT exist
  (`O_CREAT | O_EXCL`)

### Remote path

For `remote_path` on upload/download:

- Must be **absolute** (starts with `/`)
- After POSIX normalization (`path.posix.normalize`), must start with the
  connection's `remote_path_prefix`
- `..` segments are collapsed by normalize; the post-normalize check catches
  escape attempts
- No glob expansion: exact path only

Example: connection has `remote_path_prefix = '/srv/agent'`

| Input | Normalized | Result |
|---|---|---|
| `/srv/agent/file.txt` | `/srv/agent/file.txt` | ✅ |
| `/srv/agent/sub/file.txt` | `/srv/agent/sub/file.txt` | ✅ |
| `/srv/agent/../etc/passwd` | `/etc/passwd` | ❌ outside prefix |
| `/srv/agentish/file` | `/srv/agentish/file` | ❌ prefix mismatch (not `/srv/agent/...`) |
| `file.txt` (relative) | n/a | ❌ not absolute |

## Command Filtering

`SshExec.command` runs through a two-stage filter.

### Stage 1: built-in deny-list

Hard-coded patterns in `src/ssh/deny-list.ts`. Examples (not exhaustive):

- `rm -rf /` and variants
- fork bombs (`:(){:|:&};:`)
- `mkfs.*`, `dd if=/dev/zero ...`
- shutdown / reboot / poweroff
- `:>/dev/sda` style block-device writes

If matched, the call is rejected with
`command rejected by built-in deny-list (matched pattern: ...).` and audited
as `outcome=denied`.

The built-in list is **not** a comprehensive sandbox; it is a tripwire
against the most catastrophic typos and worst-case prompt injection
payloads. Production deployments should also configure connection-level
patterns.

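
The sketch below shows the general shape such a tripwire might take. The three regexes are illustrative assumptions only and are **not** the contents of `src/ssh/deny-list.ts`; `stage1` and `FilterResult` are hypothetical names.

```typescript
// Illustrative patterns only: NOT the real list in src/ssh/deny-list.ts.
const ILLUSTRATIVE_DENY: RegExp[] = [
  /\brm\s+-rf\s+\/(\s|$)/, // "rm -rf /" exactly, not "rm -rf /srv/tmp"
  /\bmkfs(\.\w+)?\b/, // mkfs, mkfs.ext4, ...
  /\b(shutdown|reboot|poweroff)\b/, // power-state commands
];

type FilterResult =
  | { ok: true }
  | { ok: false; outcome: "denied"; message: string };

// Reject the command if any deny pattern matches; otherwise let it through
// to the next stage.
function stage1(command: string): FilterResult {
  for (const pattern of ILLUSTRATIVE_DENY) {
    if (pattern.test(command)) {
      return {
        ok: false,
        outcome: "denied",
        message: `command rejected by built-in deny-list (matched pattern: ${pattern.source}).`,
      };
    }
  }
  return { ok: true };
}
```

Note that `stage1("rm -rf /srv/tmp")` passes in this sketch, which is exactly why a pattern list like this works as a tripwire rather than a sandbox.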
### Stage 2: per-connection regex (optional)

Each connection can carry:

- `deny_patterns`: newline-separated regex list. Match → reject.
- `allow_patterns`: newline-separated regex list. If set, every command must
  match at least one allow pattern (after passing both deny stages).

Both are validated at save-time by `validateCustomPatterns`:

- Each pattern must compile
- Each must pass the `safe-regex` ReDoS check
- Aggregate length capped (no megabyte-blobs of regex)

Example:

```
# deny_patterns
sudo
^\s*rm\s+
nc\s+-l

# allow_patterns
^(ls|cat|grep|tail|head|systemctl|journalctl)\s
^/srv/agent/scripts/
```

ReDoS-safe regex is enforced because user-supplied patterns run
synchronously on the command string before each call.

## SSRF + Algorithms

### SSRF (host resolution)

Every connection target goes through `ssrfStrict(host, allowPrivate)`:

1. DNS resolve host → list of A/AAAA records
2. For each address, check against the IP-policy:
   - Reject 0.0.0.0, ::/0
   - Reject 127.0.0.0/8, ::1 (loopback)
   - Reject 10/8, 172.16/12, 192.168/16, fc00::/7 (private)
   - Reject 169.254/16 (link-local, including AWS metadata)
3. **DNS pinning**: the resolved address is captured before connect; ssh2
   connects to the pinned IP, not to the hostname. This prevents DNS
   rebinding (round 1: public IP passes check; round 2: returns loopback
   during connect).

`allowPrivate` short-circuits step 2. Two opt-in flags compose:

- Global: `ssh.allow_private_addresses: true` in config.yaml
- Per-connection: `allow_private_addresses=1` on the row (admin sets via
  `/api/ssh/admin/globals` or `/api/ssh/admin/connections/:id`)

Either being true allows private/loopback. Both default false.

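
Step 2 of the check (the IP policy) can be sketched as follows. This is a simplified illustration, not `ssrfStrict` itself: `isForbidden`, `inCidr4`, and `ipv4ToInt` are hypothetical helpers, DNS resolution and pinning are left out, and the IPv6 branch conservatively rejects every address whose first group is zero (which covers `::`, `::1`, and IPv4-mapped forms) instead of parsing them fully.

```typescript
import { isIP } from "node:net";

// Convert a dotted-quad IPv4 string to an unsigned 32-bit number.
function ipv4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

// True if `ip` falls inside base/prefixLen.
function inCidr4(ip: string, base: string, prefixLen: number): boolean {
  const size = 2 ** (32 - prefixLen);
  return Math.floor(ipv4ToInt(ip) / size) === Math.floor(ipv4ToInt(base) / size);
}

const BLOCKED_V4: Array<[string, number]> = [
  ["0.0.0.0", 32], // unspecified
  ["127.0.0.0", 8], // loopback
  ["10.0.0.0", 8], // private
  ["172.16.0.0", 12], // private
  ["192.168.0.0", 16], // private
  ["169.254.0.0", 16], // link-local (cloud metadata)
];

function isForbidden(ip: string, allowPrivate: boolean): boolean {
  if (allowPrivate) return false; // opt-in short-circuits the policy
  const family = isIP(ip);
  if (family === 4) {
    return BLOCKED_V4.some(([base, len]) => inCidr4(ip, base, len));
  }
  if (family === 6) {
    const firstGroup = parseInt(ip.toLowerCase().split(":")[0] || "0", 16);
    if (firstGroup === 0) return true; // ::, ::1, IPv4-mapped, reserved 0000::/8
    return (firstGroup & 0xfe00) === 0xfc00; // fc00::/7
  }
  return true; // not an IP literal: fail closed
}
```

A caller would run this against every resolved address and then connect to the checked IP rather than the hostname, so the address that was vetted is the one actually dialed.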
### Algorithm allowlist

Hard-coded in `src/ssh/session.ts`:

| Category | Allowed |
|---|---|
| Key exchange | `curve25519-sha256`, `curve25519-sha256@libssh.org`, `ecdh-sha2-nistp256/384/521`, `diffie-hellman-group14/16/18-sha256/512` |
| Server host key | `ssh-ed25519`, `rsa-sha2-256`, `rsa-sha2-512`, `ecdsa-sha2-nistp256/384/521` |
| Cipher | `aes256-gcm@openssh.com`, `aes128-gcm@openssh.com`, `aes256-ctr`, `aes192-ctr`, `aes128-ctr` |
| HMAC | `hmac-sha2-512-etm@openssh.com`, `hmac-sha2-256-etm@openssh.com`, `hmac-sha2-512`, `hmac-sha2-256` |

Notably banned: `ssh-rsa` (SHA1), `ssh-dss`, all `arcfour*`, `hmac-md5*`,
`hmac-sha1*`. Mismatch returns `host_key_alg_not_allowed` or `auth_failed`
depending on which stage caught it.

## Audit Log

Single table: `ssh_audit_log`. Every SSH operation writes here.

### Lifecycle

```
begin (outcome=pending)                            [commits before remote call]
   ↓
remote call (ssh2 connect / exec / sftp)
   ↓
complete (outcome=success|failed|denied|aborted)   [updates same row]
```

If the orchestrator crashes between `begin` and `complete`, the row stays
`pending`. On next boot, the recovery sweep (`src/ssh/recovery.ts`) updates
pending rows older than 10 minutes to `aborted` with `detail.recovered_at`
set.

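
A minimal in-memory sketch of that sweep is shown below. The `AuditRow` shape (in particular the `created_at` field name) and the `recoverySweep` function are assumptions made for illustration; the actual logic lives in `src/ssh/recovery.ts` and operates on the database.

```typescript
type Outcome = "pending" | "success" | "failed" | "denied" | "aborted";

// Hypothetical row shape; `created_at` is an assumed column name.
interface AuditRow {
  id: string;
  outcome: Outcome;
  created_at: string; // ISO8601
  detail: Record<string, unknown>;
}

const STALE_AFTER_MS = 10 * 60 * 1000; // 10 minutes

// Mark stale pending rows as aborted and stamp detail.recovered_at.
// Returns the number of rows that were recovered.
function recoverySweep(rows: AuditRow[], now: Date = new Date()): number {
  let recovered = 0;
  for (const row of rows) {
    const ageMs = now.getTime() - new Date(row.created_at).getTime();
    if (row.outcome === "pending" && ageMs > STALE_AFTER_MS) {
      row.outcome = "aborted";
      row.detail = { ...row.detail, recovered_at: now.toISOString() };
      recovered++;
    }
  }
  return recovered;
}
```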
### Actions

| Action | Triggered by |
|---|---|
| `ssh.exec` | `SshExec` |
| `ssh.upload` | `SshUpload` |
| `ssh.download` | `SshDownload` |
| `ssh.connection.upsert` | User/admin connection create/edit |
| `ssh.connection.delete` | User/admin connection delete |
| `ssh.connection.host_key.first_observe` | TOFU first observation |
| `ssh.connection.host_key.mismatch` | TOFU mismatch |
| `ssh.connection.host_key.tofu_record` | Internal helper write |
| `ssh.connection.host_key.verify` | User `/verify-host-key` |
| `ssh.connection.host_key.replace` | User `/replace-host-key` |
| `ssh.connection.disable` | Admin disable |
| `ssh.connection.enable` | Admin enable |
| `ssh.abuse.unlock_manual` | Admin force-unlock |
| `ssh.grant.create` | Admin grant create |
| `ssh.grant.delete` | Admin grant delete |
| `ssh.master_key.rotate.start` | Admin rotation start |

### Detail column

JSON blob with action-specific fields:

- `ssh.exec`: `{command_hash: "abc...", exit_code: 0, stdout_bytes: 123, stderr_bytes: 0, truncated: false}`
- `ssh.upload`: `{local_path: "...", remote_path: "...", bytes: 4096}`
- `ssh.download`: same shape
- `ssh.connection.host_key.first_observe`: `{fingerprint: "SHA256:...", pending_token: "uuid"}`
- All denied: `{error: "no_grant" | "abuse_locked" | "disabled" | ...}`

The `ssh.exec` action does **not** record the command string; it records
only its SHA-256 hash (16-char hex prefix) to avoid leaking secrets / PII.
If you need to investigate a specific exec, correlate the hash with stdin
logs from the LLM activity log (workspace `logs/activity.log`).

### Retention

`ssh.audit_retention_days` (default 90) controls a lazy sweep. Admin can
trigger pruning manually from the UI. There is no hard cap on table size;
disk-fill is mitigated by the hashing + truncation strategy above, plus
admin-driven cleanup.

## Abuse Counters & Lock

Defends against credential spraying, mistyped scripts in loops, and
brute-force scans.

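
As a rough sketch of how a single scope's rolling-window counter with auto-lock could behave, the class below uses the defaults from the `config.yaml` reference (5 failures within 10 minutes, 30-minute lock). `ScopeCounter` and its in-memory timestamps are hypothetical; the orchestrator persists its counters in the database.

```typescript
interface AbuseConfig {
  abuseWindowMinutes: number;
  abuseFailureThreshold: number;
  abuseLockMinutes: number;
}

// Defaults taken from the config.yaml reference.
const DEFAULT_ABUSE_CONFIG: AbuseConfig = {
  abuseWindowMinutes: 10,
  abuseFailureThreshold: 5,
  abuseLockMinutes: 30,
};

// Hypothetical in-memory counter for one scope key.
class ScopeCounter {
  private failures: number[] = []; // failure timestamps in ms
  private lockedUntil = 0;
  private cfg: AbuseConfig;

  constructor(cfg: AbuseConfig = DEFAULT_ABUSE_CONFIG) {
    this.cfg = cfg;
  }

  // Record a failure; lock once the rolling window holds enough of them.
  recordFailure(nowMs: number): void {
    const windowStart = nowMs - this.cfg.abuseWindowMinutes * 60_000;
    this.failures = this.failures.filter((t) => t > windowStart);
    this.failures.push(nowMs);
    if (this.failures.length >= this.cfg.abuseFailureThreshold) {
      this.lockedUntil = nowMs + this.cfg.abuseLockMinutes * 60_000;
    }
  }

  // A success clears the failure history for this scope.
  recordSuccess(): void {
    this.failures = [];
  }

  isLocked(nowMs: number): boolean {
    return nowMs < this.lockedUntil;
  }
}
```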
### Three scopes

| Scope | Key | When |
|---|---|---|
| `user` | `(user_id,)` | Any SSH failure by this user |
| `host:user` | `(host, username)` | Failure on this (host, username) tuple |
| `host` | `(host,)` | Failure on this host (global connections only) |

The `host` scope intentionally **only counts failures on global
connections** to prevent cross-user DoS: a user repeatedly failing on their
own connection cannot lock out other users from a shared host. For
user-owned connections, the `host` counter is updated for admin-notification
only; no lock applies.

### Algorithm

```
on failure:
  for each scope:
    increment counter
    if count(within abuseWindowMinutes) >= abuseFailureThreshold:
      lock until now + abuseLockMinutes

on success (user scope only):
  reset user counter
  other scopes age out naturally with the window
```

Counters are stored in `ssh_abuse_counters`, with separate columns per
scope. All updates are transactional (no UPSERT race).

### Force-unlock

Admin can force-unlock from `SshGlobalConnectionsForm` or per-connection
admin page:

```
POST /api/ssh/admin/connections/:id/force-unlock
{reason: "Confirmed credentials rotated; user retried with old key"}
```

Rate-limited to 10/hour total across all admins (`admin-rate-limit.ts`,
token bucket). Audited as `ssh.abuse.unlock_manual`.

## Master Key Rotation

Replaces `MCP_ENCRYPTION_KEY` and rewraps every row's DEK under the new key.
This is **the** way to rotate the master key; do not edit the env var
manually.

### Flow

1. **Admin starts** via `POST /api/ssh/admin/rotate-master-key`:

   ```json
   {"new_key_hex": "<64-hex>", "reason": "Annual rotation"}
   ```

2. **Maintenance mode engages**: after `sshMaintenance.enter()`, all SSH
   write endpoints return 503 (read endpoints stay alive). The LLM sees
   `SSH subsystem is in maintenance` errors for tool calls.
3. **Per-row rewrap**:
   - For each `ssh_connections` row: decrypt DEK under old key, re-encrypt
     under new key, bump `key_version`, commit (one tx per row)
   - For each `system_deks` row: same
4. **New key validated** by decrypting a test value
5. **Maintenance exits** automatically
6. **Caller polls** `GET /api/ssh/admin/rotate-master-key/:jobId` for status
   (`running` / `succeeded` / `failed`)

### Failure modes

- **Crash mid-rotation**: rows have mixed `key_version`. Next boot detects
  this and stays in maintenance until a follow-up rotation completes. The
  admin must re-issue the rotation with the new key.
- **Wrong old key**: the first row decryption fails → job aborts before any
  change, maintenance exits, audit records `ssh.master_key.rotate.start`
  with `outcome=failed`.
- **Disk write fails mid-row**: that single row is rolled back; rotation
  continues. Operator must re-run.

The rotation job runs in-process (not as a separate worker). For large
fleets (>1000 rows) expect 1-2s per row of decrypt+encrypt+write.

### `MCP_ENCRYPTION_KEY` env var

After successful rotation, **the env var must be updated to the new key**
before the next restart. The orchestrator writes the new key to the audit
log (encrypted under the OLD key) and returns it once in the HTTP response;
there is no second chance. Update your secrets store immediately.

## Operator Runbook

### A. Add a global (admin-managed) connection

```bash
# Via UI: Settings → SSH → Global Connections → Add
# Or via API (requires admin session cookie):
curl -X POST http://localhost:3000/api/ssh/admin/globals \
  -H 'Content-Type: application/json' \
  -d @- <<'JSON'
{
  "label": "prod-east-bastion",
  "host": "bastion.prod-east.example.com",
  "port": 22,
  "username": "deploy",
  "private_key_pem": "-----BEGIN OPENSSH PRIVATE KEY-----\n...\n-----END...",
  "passphrase": null,
  "remote_path_prefix": "/srv/deploy",
  "allow_private_addresses": false,
  "deny_patterns": "sudo\n^\\s*rm\\s+",
  "allow_patterns": "",
  "reason": "Production deploy bastion, owned by SRE"
}
JSON
```

Then verify the host key (next section) and grant access.

### B. Grant org access to a global connection

```bash
curl -X POST http://localhost:3000/api/ssh/admin/grants \
  -H 'Content-Type: application/json' \
  -d '{
    "connection_id": "",
    "subject_type": "org",
    "subject_id": "engineering",
    "piece_name": "prod-deploy",
    "applies_to_all_pieces": false,
    "reason": "Engineering org runs prod-deploy piece"
  }'
```

### C. Verify a TOFU first-observe

1. From the user's side or admin side, click **Test** in the SshConnections
   panel
2. The response is `host_key_first_observe` with a SHA-256 fingerprint and
   pending token
3. **Verify externally** that the fingerprint matches the real server:

   ```bash
   ssh-keyscan -t ed25519 bastion.prod-east.example.com 2>/dev/null \
     | ssh-keygen -lf -
   ```

   Compare the resulting `SHA256:...` with what the UI shows
4. In the dialog, type the fingerprint to confirm and click **Verify**
5. Audit row `ssh.connection.host_key.verify` recorded; subsequent calls
   succeed

### D. Force-unlock a stuck connection

Symptom: user reports "SshExec returns `access denied (abuse_locked)`"

1. Settings → SSH → Global Connections → click the row → "Locks" section
2. Inspect the counter state (which scope is locked, until when)
3. If an early unlock is genuinely needed (e.g.
   user fixed the bad credentials), click **Force unlock**, enter reason
4. If suspicious (unexplained 5+ failures), investigate audit log first

### E. Rotate the master key

```bash
NEW_KEY=$(openssl rand -hex 32)

# Start rotation
JOB=$(curl -s -X POST http://localhost:3000/api/ssh/admin/rotate-master-key \
  -H 'Content-Type: application/json' \
  -d "{\"new_key_hex\":\"$NEW_KEY\",\"reason\":\"Q2 annual rotation\"}" \
  | jq -r .job_id)

# Poll until done
while true; do
  STATUS=$(curl -s "http://localhost:3000/api/ssh/admin/rotate-master-key/$JOB" \
    | jq -r .status)
  echo "$STATUS"
  [ "$STATUS" = "succeeded" ] && break
  [ "$STATUS" = "failed" ] && { echo FAILED; exit 1; }
  sleep 2
done

# Update env var BEFORE next restart.
# Note: >> appends, so remove any older MCP_ENCRYPTION_KEY line from this file.
echo "MCP_ENCRYPTION_KEY=$NEW_KEY" >> /etc/orchestrator/secrets.env
```

### F. Prune old audit logs

Settings → SSH → Audit Log → "Prune older than N days" (defaults to the
config retention value). Or via API:

```bash
curl -X DELETE 'http://localhost:3000/api/ssh/admin/audit?older_than_days=90'
```

## Troubleshooting

### Symptom → cause table

| Error | Common cause | Fix |
|---|---|---|
| `SSH is disabled` (503) | `ssh.enabled: false` | Set true, restart not required |
| `SSH subsystem is in maintenance` | Master key rotation in progress | Wait for job to complete, or check rotation log |
| `access denied (no_grant)` | User lacks grant for connection | Admin creates a grant, or user uses an owned connection |
| `access denied (disabled)` | Admin disabled the connection | Admin re-enables, or use different connection |
| `access denied (abuse_locked)` | Counter triggered | Wait for lock window, or admin force-unlocks |
| `piece "X" does not list connection Y` | `allowed_ssh_connections` missing UUID | Add UUID to the movement's `allowed_ssh_connections` |
| `host_key_first_observe` | First time exercising connection | Verify fingerprint in UI |
| `host_key_not_verified` | Key recorded but never verified | Click Verify in UI |
| `host_key_mismatch` | Server key changed | Investigate (legitimate rotation? MITM?), then Replace via UI |
| `host_key_alg_not_allowed` | Server using SHA1-RSA etc. | Upgrade server to ed25519 / rsa-sha2-256 |
| `auth_failed` | Wrong key, wrong username | Re-check connection settings |
| `connect_timeout` | Network unreachable, firewall | Check from host, check SSRF policy |
| `exec_timeout` | Long-running command | Increase `timeout_ms`, or run in background and `Download` results |
| `output_too_large` | stdout > 32 KiB | Filter the command, or write to file and `Download` |
| `forbidden_address` | Target is private IP, no opt-in | Set `allow_private_addresses` per-connection or globally |
| `system_dek_verify_failed` (log) | `MCP_ENCRYPTION_KEY` changed without rotation flow | Stop server, restore old key OR re-rotate via flow |

### Where to look

| Question | Source |
|---|---|
| What did the LLM try to do? | `logs/activity.log` in the job's workspace |
| What did SSH do? | `ssh_audit_log` (Admin UI or SQL) |
| Was it actually denied at the SSH layer? | Audit row `outcome` = `denied` |
| What was the exit code? | Audit `detail.exit_code` (for `ssh.exec`) |
| Did it crash mid-call? | Audit `outcome` = `aborted` (recovery sweep) |
| Why was the host key flagged? | Audit `ssh.connection.host_key.*` rows |
| Who has access to a connection? | `ssh_connection_grants` filtered by `connection_id` |

## Security Model Summary

Detailed threat model + risk register: see plan doc §"Security Design
Deep-Dive (rev 3)" and §"Risk Register (rev 3)". Key points operators must
understand:

1. **The orchestrator is a credential proxy.** Anyone with admin rights can
   read connection plaintext (via the rotation flow, which decrypts
   server-side). Treat admin access as production-credential-equivalent.
2. **TOFU is the floor, not the ceiling.** First-observe is unauthenticated.
   For high-stakes targets, pre-populate `host_key_b64` from a trusted
   bootstrap (e.g.
   baked into the connection at create time via the `host_key_b64` field)
   rather than relying on the orchestrator's first observation.
3. **The deny-list is not a sandbox.** Built-in patterns catch obvious
   misuse. Real isolation requires connection-level configuration
   (restricted shell account, `remote_path_prefix`, narrow
   `allow_patterns`) and target-side controls.
4. **Audit log is local-only.** No HMAC chain (acknowledged limitation
   R-audit-tamper). For tamper-evidence, ship `ssh_audit_log` rows to an
   external SIEM via SQLite hooks or periodic export.
5. **ssh2 internal key retention** (R-ssh2-leak): the PEM lives in JS
   heap for the connection lifetime. Process compromise reveals plaintext
   credentials. Mitigations: short-lived processes, separate worker per
   high-stakes connection.
6. **Master key compromise = total compromise.** Key rotation does
   **not** invalidate already-leaked encrypted material: if an attacker
   has both the DB and the old master key, all stored creds are theirs.
   Rotate keys immediately on suspected compromise AND rotate every
   stored credential on the target side.

## HTTP API Reference

User router: mounted at `/api/ssh`, requires `requireAuth`.

| Method | Path | Purpose |
|---|---|---|
| GET | `/connections` | List own connections + grant-visible globals |
| POST | `/connections` | Create user-owned connection |
| GET | `/connections/:id` | Read |
| PATCH | `/connections/:id` | Edit (owner only) |
| DELETE | `/connections/:id` | Delete (owner only) |
| POST | `/connections/:id/test` | Trigger TOFU observation / verify path |
| POST | `/connections/:id/verify-host-key` | Atomic verify (token + fingerprint) |
| POST | `/connections/:id/replace-host-key` | Atomic replace (token + fingerprint + reason) |
| GET | `/connections/:id/audit` | Owner's view of audit rows for this connection |
| GET | `/grants/visible-to-me` | List grants visible (subject=user or matching org) |

Admin router: mounted at `/api/ssh/admin`, requires `requireAdmin`.

| Method | Path | Purpose |
|---|---|---|
| GET | `/connections` | All connections (cross-tenant) |
| GET | `/connections/:id` | Admin read |
| PATCH | `/connections/:id/disable` | Soft-disable (audited; reason required) |
| PATCH | `/connections/:id/enable` | Re-enable |
| DELETE | `/connections/:id` | Hard-delete |
| POST | `/connections/:id/force-unlock` | Clear abuse counter (rate-limited; reason required) |
| POST | `/globals` | Create global connection |
| PATCH | `/globals/:id` | Edit global |
| DELETE | `/globals/:id` | Delete global |
| GET | `/grants` | List all grants |
| POST | `/grants` | Create grant |
| DELETE | `/grants/:id` | Delete grant |
| POST | `/rotate-master-key` | Start master key rotation |
| GET | `/rotate-master-key/:jobId` | Poll rotation status |
| GET | `/audit` | All-tenant audit view (paginated) |

All admin write endpoints require:

- `requireAdmin` middleware
- `maintenance503()` guard (rejects writes during rotation)
- `validateReason()` on `body.reason` (≥ 8 chars)
- `auditRepo.beginAndComplete()` for both success and failure

## SSH Console (Interactive)

Enable with `ssh.console.enabled: true`.

- One task = one PTY session; shell state persists across jobs
- An `SSH` tab appears in the task detail view (when the piece allows
  `SshConsole*`)
- WebSocket: `/api/local/tasks/:taskId/console/ws`
- REST status: `GET /api/local/tasks/:taskId/console/status`
- Audit: `ssh.console.{open,send,snapshot,resize,input_rejected,close}`
- Auto-close: idle 30 min / duration 4 h / host disconnect / maintenance /
  admin kill
- At most 3 sessions per connection (oldest evicted first)

Admin: list sessions with `GET /api/admin/ssh/console-sessions`; kill one
with `POST /api/admin/ssh/console-sessions/:taskId/kill` (admin role only).

## See Also

- [docs/tools/ssh-tools.md](./tools/ssh-tools.md): LLM-facing tool semantics
- [docs/tools/ssh-console-tools.md](./tools/ssh-console-tools.md): SSH Console tool semantics (Ensure/Send/Snapshot)
- [docs/mcp.md](./mcp.md): MCP integration (shares `MCP_ENCRYPTION_KEY`)
- [docs/maintenance-checklist.md](./maintenance-checklist.md) §12: checklist for SSH-related code changes
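
As a supplement to playbook E (Rotate the master key): a pre-flight check on the new key can catch a truncated or mistyped value before the rotation job is started. The sketch below relies only on what this runbook states, namely that the master key is 32 bytes written as 64 hex characters. The helper name `is_valid_master_key` is hypothetical and is not part of the orchestrator.

```bash
# Hypothetical helper (an illustration, not shipped with the orchestrator).
# Returns 0 if $1 is exactly 64 hexadecimal characters, 1 otherwise.
is_valid_master_key() {
  key=$1
  # Length check: 32 bytes encoded as hex is 64 characters.
  [ "${#key}" -eq 64 ] || return 1
  # Reject the key if it contains any non-hex character.
  case $key in
    *[!0-9a-fA-F]*) return 1 ;;
  esac
  return 0
}

# Example use before calling the rotation endpoint:
#   NEW_KEY=$(openssl rand -hex 32)
#   is_valid_master_key "$NEW_KEY" || { echo "malformed key" >&2; exit 1; }
```

The function uses only POSIX shell constructs, so it can be pasted into the same script as the rotation commands without requiring bash-specific features.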