maestro/docs/ssh.md

# SSH Subsystem (Operator Runbook)

The orchestrator can run shell commands on remote servers (`SshExec`) and
move files between the workspace and remote hosts (`SshUpload`/`SshDownload`)
through a dedicated, audited SSH subsystem. Like the MCP integration, **the
feature is off by default** and requires a key + config flip to enable.

This document is the **operator runbook** for setting up, granting access,
verifying host keys, rotating the master key, and troubleshooting. For the
LLM-facing tool semantics see [docs/tools/ssh-tools.md](./tools/ssh-tools.md).
For the internal design (threat model, risk register, schema, 12-step
orchestration flow) see the plan doc.

## At a glance

| Aspect | Behavior |
|---|---|
| Default | `ssh.enabled: false` — tools hidden, panels hidden, API returns 503 |
| Tools exposed when enabled | `SshExec`, `SshUpload`, `SshDownload` |
| Authentication | Public-key only; passwords are **not** supported |
| Host key trust | TOFU (Trust-On-First-Use) with explicit verify; mismatch fails closed |
| Connection ownership | User-owned (private) **or** Global (admin-managed, shared via grants) |
| Encryption at rest | AES-256-GCM (per-row DEK, master key = `MCP_ENCRYPTION_KEY`) |
| Audit | Dedicated `ssh_audit_log` table, `pending → success/failed/denied/aborted` lifecycle |
| Abuse defense | 3-scope counters (`user` / `host:user` / `host`) with auto-lock |
| Network policy | SSRF strict by default; per-connection opt-in for private IPs |
| Algorithm policy | Strict allowlist (no SHA1-RSA, no weak DH/HMAC) |

## Prerequisites

### 1. `MCP_ENCRYPTION_KEY`

The SSH subsystem **shares the same master key as MCP** — there is only one
key per orchestrator. All private keys, passphrases, and global-connection
DEKs are encrypted with AES-256-GCM under a per-row DEK, and each DEK is
wrapped by this master key.

Generate it once (32 bytes = 64 hex chars):

```bash
openssl rand -hex 32
```

Export it before starting the server:

```bash
export MCP_ENCRYPTION_KEY=<the 64-hex output>
scripts/server.sh start
```

If `MCP_ENCRYPTION_KEY` is **not** set when `ssh.enabled: true`, the SSH
subsystem boots **fail-soft**: a warning is logged, all SSH endpoints
return 503, the tools are hidden from the LLM, and the UI panels show a
configuration error banner. Other features (MCP excepted) continue
normally.
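
A malformed key (wrong length, stray whitespace, non-hex characters) is easy to catch before boot. The following is a minimal sketch of such a check; `isValidMasterKeyHex` is a hypothetical helper for illustration, not a function from the orchestrator.

```typescript
// Hypothetical pre-flight check; not part of the orchestrator's API.
function isValidMasterKeyHex(value: string | undefined): boolean {
  // 32 bytes encoded as hex is exactly 64 hex characters.
  return typeof value === "string" && /^[0-9a-fA-F]{64}$/.test(value);
}

// Example inputs: a well-formed key and a key that is one character short.
const wellFormed = isValidMasterKeyHex("ab".repeat(32));
const tooShort = isValidMasterKeyHex("ab".repeat(31) + "a");
```

Here `wellFormed` is `true` and `tooShort` is `false`; an unset variable (`undefined`) is also rejected.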

> ⚠ **Changing the key outside the rotation flow invalidates existing
> encrypted material.** The built-in [master key rotation flow](#master-key-rotation)
> rewraps every row in maintenance mode. Do **not** swap the env var manually
> without using that flow: half-rotated state breaks every connection.

### 2. `ssh.enabled: true`

Flip the flag in `config.yaml`:

```yaml
ssh:
  enabled: true
```

This is the master switch. With it `false`:
- HTTP endpoints (`/api/ssh/*`, `/api/ssh/admin/*`) return 503
- Tool defs are not exposed to the LLM (the dispatcher returns null)
- UI panels render an "SSH is disabled" empty state
- Database tables remain present (no destructive change)

Restart is **not** required — `ConfigManager` reload picks up the change
and rebuilds the SSH router.

### 3. `system_deks` bootstrap

The first time the orchestrator boots with `ssh.enabled: true` AND a
valid `MCP_ENCRYPTION_KEY`, it provisions a single row in `system_deks`
(via `INSERT OR IGNORE` inside a transaction, `CHECK(id=1)`). This DEK
encrypts **global connections** (those without an owner).

On every subsequent boot, `verifySystemDek` decrypts the stored DEK to
prove the master key still works. If it fails (key rotated outside the
rotation flow, or env var differs from when the DEK was wrapped), SSH
**fails closed for the session** and a `system_dek_verify_failed` error
is logged. User-owned connections may still partially work (their DEKs
are wrapped per user), but global connections will all error.

### 4. Optional: `allow_private_addresses`

By default, SSH connections are routed through the SSRF strict-check,
which blocks loopback (127.0.0.0/8, ::1), private (10/8, 172.16/12,
192.168/16, fc00::/7), and link-local (169.254/16) addresses. For
**LAN targets** you must opt in.

There are two scopes:

```yaml
ssh:
  enabled: true
  allow_private_addresses: true   # global default
```

```sql
-- per-connection opt-in (admin-only flag, audited)
UPDATE ssh_connections SET allow_private_addresses=1 WHERE id=?;
```

The per-connection flag is preferred because it narrows the blast radius.
The global flag exists for trusted dev networks (homelab, isolated VPC).
The per-connection flag can only be set on **global** (admin-managed)
connections; for user-owned connections, the global flag applies.

## Quickstart

```bash
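# 1. Set the key (umask 077 keeps the key file private to your user)
(umask 077 && openssl rand -hex 32 > ~/.mcp_encryption_key)
export MCP_ENCRYPTION_KEY="$(cat ~/.mcp_encryption_key)"

# 2. Enable SSH + allow LAN (this appends a new section; if config.yaml
#    already has an `ssh:` block, edit that block instead of appending)
cat >> config.yaml <<'YAML'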
ssh:
  enabled: true
  allow_private_addresses: true   # only if you're targeting LAN
YAML

# 3. Restart
scripts/server.sh restart
```

Then in the UI:

1. **Settings → User Folder → SSH Connections → Add**
2. Fill `label`, `host`, `port` (default 22), `username`, paste private key (OpenSSH PEM)
3. Optionally set `remote_path_prefix` (default `/`) — restricts upload/download paths
4. Click **Test** → first call returns `host_key_first_observe` with a fingerprint
5. **Verify** in the dialog (compare fingerprint with what you expect from `ssh-keyscan <host>`)
6. Add the connection's UUID to a piece's `allowed_ssh_connections`:

```yaml
# pieces/example.yaml
name: ssh-example
movements:
  - name: deploy
    allowed_tools: [SshExec, SshUpload]
    allowed_ssh_connections: ["abcd1234-..."]
    rules:
      - condition: done
        next: COMPLETE
    instruction: |
      Use SshExec to ...
```

7. Test the piece via the normal task UI.

## `config.yaml` Reference

Full SSH section with defaults:

```yaml
ssh:
  # master switch
  enabled: false

  # SSRF policy — when true, allow private/loopback addresses (global)
  allow_private_addresses: false

  # wall-clock timeout for connect + handshake + exec/transfer (seconds)
  call_timeout_seconds: 30

  # stdout/stderr byte cap for SshExec (bytes)
  max_output_bytes: 32768          # 32 KiB

  # SFTP transfer size caps (MB)
  max_upload_size_mb: 100
  max_download_size_mb: 100

  # ssh_audit_log retention (days). Admin can prune via UI.
  audit_retention_days: 90

  # When true (default), admins can use any connection without an explicit
  # grant. Audited regardless. Set false for stricter least-privilege.
  admin_bypasses_grants: true

  # Abuse counters
  abuse_window_minutes: 10        # rolling window for failure counting
  abuse_failure_threshold: 5      # failures within window → lock
  abuse_lock_minutes: 30          # lock duration on threshold breach
```

All keys translate to camelCase in `SshRuntimeConfig` (`src/ssh/config.ts`).
The `transformKeys` helper in `src/config.ts` handles the conversion.
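
As an illustration of the snake_case to camelCase mapping, here is a minimal sketch. It is not the actual `transformKeys` implementation; `snakeToCamel` and `camelizeKeys` are hypothetical names, and nested objects are deliberately ignored.

```typescript
// Hypothetical sketch; the real transformKeys in src/config.ts may differ.
function snakeToCamel(key: string): string {
  // Replace each "_x" with "X".
  return key.replace(/_([a-z0-9])/g, (_match: string, ch: string) => ch.toUpperCase());
}

function camelizeKeys(input: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(input)) {
    out[snakeToCamel(key)] = input[key];
  }
  return out;
}

const runtime = camelizeKeys({ call_timeout_seconds: 30, max_output_bytes: 32768 });
// runtime is { callTimeoutSeconds: 30, maxOutputBytes: 32768 }
```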

## Connection Model

### Owner

Each row in `ssh_connections` has an `owner_id`:

| Owner | Visibility | Who creates |
|---|---|---|
| **User-owned** (`owner_id = userId`) | Only the owner; admin can also list but not edit | Any authenticated user (`POST /api/ssh/connections`) |
| **Global** (`owner_id IS NULL`) | All users see it in the picker (subject to grants) | Admin only (`POST /api/ssh/admin/globals`) |

Global connections solve the "team-shared infra account" use case — a
single set of credentials that multiple users invoke under their own
identity, audited per user, gated by grants.

### Encryption

For each connection:

1. Generate a fresh 32-byte DEK
2. Encrypt `private_key_pem` (and optionally `passphrase`) with the DEK
3. Wrap the DEK with the master key (`MCP_ENCRYPTION_KEY`)
4. Store: `private_key_enc`, `private_key_dek_enc`, `key_version`

`key_version` allows progressive rewrap during master key rotation (each
row tracks which generation of master key its DEK is wrapped under).
Global connections use the single `system_deks` row (id=1) rather than a
per-row DEK.
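
Steps 1 to 3 are a standard envelope-encryption pattern. The sketch below shows the idea with Node's built-in `node:crypto` module. The `Sealed` shape and the `seal`/`unseal` helpers are assumptions made for illustration; they do not describe the orchestrator's actual column encoding.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical container for one AES-256-GCM ciphertext.
interface Sealed {
  iv: Buffer;
  tag: Buffer;
  data: Buffer;
}

function seal(key: Buffer, plaintext: Buffer): Sealed {
  const iv = randomBytes(12); // 96-bit nonce, the usual size for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), data };
}

function unseal(key: Buffer, sealed: Sealed): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, sealed.iv);
  decipher.setAuthTag(sealed.tag);
  // final() throws if the key is wrong or the ciphertext was tampered with.
  return Buffer.concat([decipher.update(sealed.data), decipher.final()]);
}

// Encrypt path: fresh DEK, encrypt the PEM with it, wrap the DEK.
const masterKey = randomBytes(32); // stands in for MCP_ENCRYPTION_KEY
const dek = randomBytes(32);
const samplePem = "-----BEGIN OPENSSH PRIVATE KEY-----";
const privateKeyEnc = seal(dek, Buffer.from(samplePem, "utf8"));
const privateKeyDekEnc = seal(masterKey, dek);

// Decrypt path: unwrap the DEK, then decrypt the PEM.
const recoveredPem = unseal(unseal(masterKey, privateKeyDekEnc), privateKeyEnc).toString("utf8");
```

Because only the wrapped DEK depends on the master key, a rotation has to re-encrypt the small DEK blob per row rather than the key material itself.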

> ⚠ ssh2's internal key handling is opaque: once the PEM is loaded into
> the library, it lives in JS heap memory for the lifetime of the
> connection. We zero our own copies (`buf.fill(0)`) in the `finally`
> block but cannot reach into ssh2 internals. This is an acknowledged
> limitation; see the plan doc's "Acknowledged limitations" section.

## UI Walkthrough

### User: Settings → User Folder → SSH Connections

The **SshConnectionsPanel** lists the user's connections and any global
connections they have a grant for. Each row shows:

- Label + host:port + username
- Host key fingerprint + verify state (verified / pending / first_observe / mismatch)
- Lock state (if abuse counter triggered)
- Actions: **Test**, **Verify host key**, **Replace host key** (with reason), **Edit**, **Delete**

The "Add Connection" form (`SshConnectionForm`) collects:

- Label
- Host, port, username
- Private key (textarea — PEM format; passphrase optional)
- Remote path prefix (default `/`)
- Custom deny/allow regex patterns (newline-separated, validated at save-time)

`SshHostKeyDialog` opens on first_observe / mismatch and shows the
observed fingerprint side-by-side with the previously-stored one (if
any). "Trust this key" requires typing the fingerprint to confirm.

### Admin: Settings → SSH

Four sub-panels under `SshForm`:

| Panel | Component | Purpose |
|---|---|---|
| **Global Connections** | `SshGlobalConnectionsForm` | CRUD on global connections (`owner_id IS NULL`). Includes the `allow_remote_unrestricted` and per-connection `allow_private_addresses` flags |
| **Grants** | `SshGrantsForm` | List/create/delete grants. Per-piece or `applies_to_all_pieces`. Subject: user or org. Reason required |
| **Audit Log** | `SshAuditLog` | All-tenant audit view. Filter by action / outcome / connection / time range. Pagination |
| **Master Key Rotation** | `SshMasterKeyRotationForm` | Start a rotation job (provides new key, enters maintenance, rewraps rows). Polls status |

Admin can also force-unlock abuse counters from the per-connection page
(requires reason; rate-limited to 10/hour total).

## Host Key TOFU Flow

SSH security depends on knowing the **right** host key. We use
Trust-On-First-Use: the first time a connection is exercised, we record
the observed key and require explicit user verification before treating
it as trusted.

### States

`ssh_connections` carries four host-key columns:

| Column | Meaning |
|---|---|
| `host_key_b64` | The observed public key in OpenSSH base64 form. NULL = never observed. |
| `host_key_fingerprint` | SHA-256 fingerprint for UI display (`SHA256:...`). |
| `host_key_verified_at` | ISO8601 timestamp of the user's explicit "trust this key" action. NULL = pending. |
| `host_key_pending_token` | UUID issued at first_observe / mismatch; consumed atomically by `/verify-host-key`. |

A connection is **trusted** iff `host_key_verified_at IS NOT NULL` AND
the observed key during connect matches `host_key_b64`.

### Lifecycle

```
new connection
   │
   ├─ user clicks Test (or LLM calls SshExec)
   │     │
   │     ▼
   │   sshTest() observes the host key
   │     │
   │     ▼
   │   onFirstObserve hook fires
   │     - writes host_key_b64, host_key_fingerprint, host_key_pending_token
   │     - audit row: ssh.connection.host_key.first_observe
   │     - returns SshSessionError('host_key_first_observe')
   │
   │   (UI shows the fingerprint + pending token)
   │
   ├─ user clicks Verify (typing fingerprint to confirm)
   │     │
   │     ▼
   │   POST /api/ssh/connections/:id/verify-host-key
   │     {token, fingerprint}
   │     - atomic compare-and-set: token + fingerprint match → set host_key_verified_at
   │     - audit row: ssh.connection.host_key.verify
   │
   ▼
verified — Exec/Upload/Download now work
```

On `host_key_mismatch` (server rebuilt, key rotated, or MITM):

```
   ├─ Exec/Upload/Download calls sshExec/sshUpload/sshDownload
   │     │
   │     ▼
   │   ssh2 observes key ≠ host_key_b64
   │     │
   │     ▼
   │   onMismatch hook fires
   │     - writes new host_key_pending_token (DOES NOT overwrite host_key_b64 yet)
   │     - audit row: ssh.connection.host_key.mismatch
   │     - returns SshSessionError('host_key_mismatch')
   │
   │   (UI shows OLD vs NEW fingerprint side-by-side)
   │
   ├─ user investigates externally (ssh-keyscan, IT team, etc.)
   │
   ├─ user clicks "Replace key" with reason
   │     │
   │     ▼
   │   POST /api/ssh/connections/:id/replace-host-key
   │     {token, fingerprint, reason}
   │     - atomic compare-and-set
   │     - writes new host_key_b64, host_key_fingerprint, host_key_verified_at
   │     - audit row: ssh.connection.host_key.replace
```

The pending token mechanism prevents a "verify swap" race: if a second
TOFU observation happens between the user's verify request and its
arrival, the old token is overwritten and the verify endpoint returns
`409 stale_token`.
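
The compare-and-set behind `/verify-host-key` can be modeled in memory as follows. This is a sketch under assumptions: the `HostKeyState` shape, the `verifyHostKey` function, and the `fingerprint_mismatch` result name are hypothetical, and a real implementation would presumably do the check and the write in a single SQL statement or transaction.

```typescript
// Hypothetical in-memory view of the host-key columns on one row.
interface HostKeyState {
  hostKeyFingerprint: string | null;
  hostKeyPendingToken: string | null;
  hostKeyVerifiedAt: string | null;
}

type VerifyResult = "verified" | "stale_token" | "fingerprint_mismatch";

function verifyHostKey(
  row: HostKeyState,
  token: string,
  fingerprint: string,
  now: Date,
): VerifyResult {
  // A newer observation overwrote the token, or it was already consumed.
  if (row.hostKeyPendingToken === null || row.hostKeyPendingToken !== token) {
    return "stale_token";
  }
  if (row.hostKeyFingerprint !== fingerprint) {
    return "fingerprint_mismatch";
  }
  row.hostKeyVerifiedAt = now.toISOString();
  row.hostKeyPendingToken = null; // consume the token so it cannot be replayed
  return "verified";
}
```

A request carrying an outdated token gets `stale_token` and leaves the row untouched, which is the property that closes the verify-swap race.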

### Banned algorithms

Even before TOFU completes, the host-key algorithm is checked against
an allowlist. SHA1-RSA and other weak algorithms are rejected before
the key is recorded (`host_key_alg_not_allowed`). This is hard-coded
in `src/ssh/session.ts` to avoid misconfiguration.

## Per-piece `allowed_ssh_connections`

A piece's movement must explicitly opt in to SSH usage. The
piece-runner enforces three invariants:

1. If a movement's `allowed_tools` contains any SSH tool name
   (`SshExec`/`SshUpload`/`SshDownload`), `allowed_ssh_connections`
   **must be declared** on that movement (even if empty)
2. The field must be an array of strings
3. Each entry must be `*` or a lowercase hex+hyphen UUID (≥ 8 chars)

Lint failures abort piece load.
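
The three invariants can be sketched as a small lint function. This is an illustration only: `lintMovement`, the `Movement` type, and the error strings are hypothetical, and the regex is one reading of "lowercase hex+hyphen, at least 8 characters".

```typescript
const SSH_TOOLS = ["SshExec", "SshUpload", "SshDownload"];
// Lowercase hex digits and hyphens, at least 8 characters.
const CONNECTION_ID = /^[0-9a-f-]{8,}$/;

// Hypothetical subset of a movement definition.
interface Movement {
  name: string;
  allowed_tools?: string[];
  allowed_ssh_connections?: unknown;
}

function lintMovement(m: Movement): string[] {
  const tools = m.allowed_tools === undefined ? [] : m.allowed_tools;
  const usesSsh = tools.some((t) => SSH_TOOLS.indexOf(t) !== -1);
  if (!usesSsh) return [];

  const conns = m.allowed_ssh_connections;
  // Invariant 1: the field must be declared (an empty array is fine).
  if (conns === undefined) {
    return [`${m.name}: allowed_ssh_connections must be declared`];
  }
  // Invariant 2: it must be an array of strings.
  if (!Array.isArray(conns) || !conns.every((c) => typeof c === "string")) {
    return [`${m.name}: allowed_ssh_connections must be an array of strings`];
  }
  // Invariant 3: each entry is "*" or a UUID-like id.
  return (conns as string[])
    .filter((c) => c !== "*" && !CONNECTION_ID.test(c))
    .map((c) => `${m.name}: invalid connection id "${c}"`);
}
```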

### Forms

```yaml
# Explicit allowlist (most common)
allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]

# Wildcard (admin-style — use sparingly)
allowed_ssh_connections: ["*"]

# Deny-all (still allows SSH tool in allowed_tools but refuses every UUID)
allowed_ssh_connections: []
```

The `*` form skips the per-piece check but **does not** skip the
[access grant check](#access-grants). A user without a grant for a
given connection still cannot use it even when the piece says `*`.

### Example

```yaml
name: backup-rotation
description: Daily backup rotation on prod servers
movements:
  - name: list
    allowed_tools: [SshExec]
    allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]
    instruction: |
      List the existing backup files on each server.
    rules:
      - condition: ready to rotate
        next: rotate

  - name: rotate
    allowed_tools: [SshExec, SshUpload]
    allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]
    instruction: |
      Rotate the oldest backup ...
    rules:
      - condition: done
        next: COMPLETE
```

## Access Grants

Grants connect a **subject** (user or org) to a **connection**, scoped
to a **piece** (or all pieces, admin-only).

### Schema

```sql
CREATE TABLE ssh_connection_grants (
  id TEXT PRIMARY KEY,
  connection_id TEXT NOT NULL,
  subject_type TEXT NOT NULL,        -- 'user' | 'org'
  subject_id TEXT NOT NULL,
  piece_name TEXT,                   -- NULL iff applies_to_all_pieces=1
  applies_to_all_pieces INTEGER NOT NULL DEFAULT 0,
  granted_by_user_id TEXT NOT NULL,
  reason TEXT NOT NULL,              -- required, ≥ 8 chars
  expires_at TEXT,                   -- ISO8601 or NULL
  created_at TEXT NOT NULL
);
```

### Decision tree

For a given `(userId, orgIds, connectionId, pieceName)`:

1. **Owner check**: if `connection.owner_id == userId` → access granted (owner of own connection)
2. **Admin bypass**: if user is admin AND `ssh.admin_bypasses_grants: true` → granted (audited)
3. **Grant lookup**:
   - find rows where `connection_id = ?`
   - subject matches (`subject_type='user' AND subject_id=userId`) OR (`subject_type='org' AND subject_id IN orgIds`)
   - piece matches (`applies_to_all_pieces=1` OR `piece_name = ?`)
   - not expired (`expires_at IS NULL OR expires_at > now()`)
   - **any** matching row → granted
4. Otherwise → denied (`access denied (no_grant)`)
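
The decision tree above can be written as a pure function. This is a sketch under assumptions: the `Grant` and `AccessRequest` types, the `decideAccess` name, and the result labels are hypothetical; only the order of checks and the matching rules come from the steps above.

```typescript
// Hypothetical types; field names mirror the ssh_connection_grants columns.
interface Grant {
  connectionId: string;
  subjectType: "user" | "org";
  subjectId: string;
  pieceName: string | null;
  appliesToAllPieces: boolean;
  expiresAt: string | null; // ISO8601 or null
}

interface AccessRequest {
  userId: string;
  orgIds: string[];
  isAdmin: boolean;
  connectionId: string;
  connectionOwnerId: string | null; // null = global connection
  pieceName: string;
}

type Decision = "owner" | "admin_bypass" | "grant" | "no_grant";

function decideAccess(
  req: AccessRequest,
  grants: Grant[],
  adminBypassesGrants: boolean,
  now: Date,
): Decision {
  if (req.connectionOwnerId === req.userId) return "owner";
  if (req.isAdmin && adminBypassesGrants) return "admin_bypass";

  const matched = grants.some((g) => {
    const subjectOk =
      (g.subjectType === "user" && g.subjectId === req.userId) ||
      (g.subjectType === "org" && req.orgIds.indexOf(g.subjectId) !== -1);
    const pieceOk = g.appliesToAllPieces || g.pieceName === req.pieceName;
    const notExpired = g.expiresAt === null || Date.parse(g.expiresAt) > now.getTime();
    return g.connectionId === req.connectionId && subjectOk && pieceOk && notExpired;
  });
  return matched ? "grant" : "no_grant";
}
```

Expiry is evaluated inside the lookup, matching the "checked at decision time" behavior: an expired row simply never matches.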

### Creating grants

Admin-only via `POST /api/ssh/admin/grants`:

```json
{
  "connection_id": "abcd1234-...",
  "subject_type": "user",
  "subject_id": "alice",
  "piece_name": "backup-rotation",
  "applies_to_all_pieces": false,
  "reason": "Alice owns backups for prod-east cluster",
  "expires_at": null
}
```

For `applies_to_all_pieces: true`:
- `piece_name` **must be null**
- the admin endpoint requires explicit `reason` containing scope justification
- audit row records `action: ssh.grant.create` with `detail.applies_to_all=true`
- this is the highest-privilege grant — review carefully

### Org grants

Same schema with `subject_type: "org"`, `subject_id: <gitea org name>`.
Membership comes from `user_gitea_orgs` (populated at login via Gitea
OAuth). A user with multiple org memberships matches grants for any
of those orgs.

### Expiration

`expires_at` is checked at decision time (no background sweep). Expired
rows remain in the table for audit purposes. Admin can delete them via
`DELETE /api/ssh/admin/grants/:id`.

## Path Policy

### Local path (workspace)

For `SshUpload.local_path` and `SshDownload.local_path`:

- Resolved against `ctx.workspacePath` (the job's workspace root)
- `..` traversal → reject
- Symlinks: open with `O_NOFOLLOW`, lstat every parent → reject if any parent is a symlink leaving the workspace
- For download: parent directory must exist; target file must NOT exist (`O_CREAT | O_EXCL`)

### Remote path

For `remote_path` on upload/download:

- Must be **absolute** (starts with `/`)
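- After POSIX normalization (`path.posix.normalize`), must fall under the connection's `remote_path_prefix` on a path-segment boundary (a bare string-prefix match is not enough: `/srv/agentish` is not under `/srv/agent`)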
- `..` segments are collapsed by normalize; the post-normalize check catches escape attempts
- No glob expansion — exact path only

Example: connection has `remote_path_prefix = '/srv/agent'`

| Input | Normalized | Result |
|---|---|---|
| `/srv/agent/file.txt` | `/srv/agent/file.txt` | ✅ |
| `/srv/agent/sub/file.txt` | `/srv/agent/sub/file.txt` | ✅ |
| `/srv/agent/../../etc/passwd` | `/etc/passwd` | ❌ outside prefix |
| `/srv/agentish/file` | `/srv/agentish/file` | ❌ prefix mismatch (not `/srv/agent/...`) |
| `file.txt` (relative) | n/a | ❌ not absolute |
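
A minimal sketch of the remote-path check, assuming the prefix comparison is done on a path-segment boundary as the `/srv/agentish` row implies. `isAllowedRemotePath` is a hypothetical name; `posix.normalize` is the real Node API mentioned above.

```typescript
import { posix } from "node:path";

// Hypothetical helper; returns true when remotePath is inside prefix.
function isAllowedRemotePath(remotePath: string, prefix: string): boolean {
  // Relative paths are rejected before any normalization.
  if (!remotePath.startsWith("/")) return false;

  const normalized = posix.normalize(remotePath);
  // Drop trailing slashes so "/srv/agent/" and "/srv/agent" behave the same.
  const base = posix.normalize(prefix).replace(/\/+$/, "");

  // A prefix of "/" leaves base empty: every absolute path is inside it.
  if (base === "") return true;

  // Match the prefix itself or anything below it, never a sibling
  // directory that merely shares the leading characters.
  return normalized === base || normalized.startsWith(base + "/");
}
```

With `remote_path_prefix = '/srv/agent'`, this accepts `/srv/agent/sub/file.txt` and rejects `/srv/agentish/file`, `/srv/agent/../../etc/passwd`, and `file.txt`.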

## Command Filtering

`SshExec.command` runs through a two-stage filter.

### Stage 1: built-in deny-list

Hard-coded patterns in `src/ssh/deny-list.ts`. Examples (not exhaustive):

- `rm -rf /` and variants
- fork bombs (`:(){:|:&};:`)
- `mkfs.*`, `dd if=/dev/zero ...`
- shutdown / reboot / poweroff
- `:>/dev/sda` style block-device writes

If matched, the call is rejected with `command rejected by built-in
deny-list (matched pattern: ...).` and audited as `outcome=denied`.

The built-in list is **not** a comprehensive sandbox — it's a tripwire
against the most catastrophic typos and worst-case prompt injection
payloads. Production deployments should also configure connection-level
patterns.

### Stage 2: per-connection regex (optional)

Each connection can carry:

- `deny_patterns`: newline-separated regex list. Match → reject.
- `allow_patterns`: newline-separated regex list. If set, every command
  must match at least one allow pattern (after passing both deny stages).

Both are validated at save-time by `validateCustomPatterns`:

- Each pattern must compile
- Each must pass the `safe-regex` ReDoS check
- Aggregate length capped (no megabyte-blobs of regex)

Example:

```
# deny_patterns
sudo
^\s*rm\s+
nc\s+-l

# allow_patterns
^(ls|cat|grep|tail|head|systemctl|journalctl)\s
^/srv/agent/scripts/
```

ReDoS-safe regex is enforced because user-supplied patterns run
synchronously on the command string before each call.
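
The two-stage filter can be sketched as below. The function names, the `FilterResult` shape, and the single built-in pattern are hypothetical stand-ins (the real list lives in `src/ssh/deny-list.ts`), and the `safe-regex` check that happens at save-time is omitted.

```typescript
type FilterResult = { allowed: true } | { allowed: false; reason: string };

// Parse a newline-separated pattern list; blank lines are ignored.
function parsePatterns(text: string): RegExp[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => new RegExp(line));
}

function filterCommand(
  command: string,
  builtInDeny: RegExp[],
  denyPatterns: RegExp[],
  allowPatterns: RegExp[],
): FilterResult {
  // Stage 1: built-in deny-list.
  if (builtInDeny.some((re) => re.test(command))) {
    return { allowed: false, reason: "built-in deny-list" };
  }
  // Stage 2a: per-connection deny patterns.
  if (denyPatterns.some((re) => re.test(command))) {
    return { allowed: false, reason: "connection deny_patterns" };
  }
  // Stage 2b: if an allow-list is set, at least one pattern must match.
  if (allowPatterns.length > 0 && !allowPatterns.some((re) => re.test(command))) {
    return { allowed: false, reason: "no allow_patterns match" };
  }
  return { allowed: true };
}

// Hypothetical stand-in for one built-in pattern.
const builtIn = [/\brm\s+-rf\s+\/(\s|$)/];
const deny = parsePatterns("sudo\n^\\s*rm\\s+\nnc\\s+-l");
const allow = parsePatterns("^(ls|cat|grep|tail|head|systemctl|journalctl)\\s\n^/srv/agent/scripts/");
```

With these lists, `ls -la /srv/agent` passes, `sudo ls -la` is rejected by the connection deny patterns, and `whoami` is rejected because it matches no allow pattern.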

## SSRF + Algorithms

### SSRF (host resolution)

Every connection target goes through `ssrfStrict(host, allowPrivate)`:

1. DNS resolve host → list of A/AAAA records
2. For each address, check against the IP-policy:
   - Reject 0.0.0.0, :: (unspecified)
   - Reject 127.0.0.0/8, ::1 (loopback)
   - Reject 10/8, 172.16/12, 192.168/16, fc00::/7 (private)
   - Reject 169.254/16 (link-local — including AWS metadata)
3. **DNS pinning**: the resolved address is captured before connect;
   ssh2 connects to the pinned IP, not to the hostname. This prevents
   DNS rebinding (round 1: public IP passes check; round 2: returns
   loopback during connect).

`allowPrivate` short-circuits step 2. Two opt-in flags compose:

- Global: `ssh.allow_private_addresses: true` in config.yaml
- Per-connection: `allow_private_addresses=1` on the row (admin sets via
  `/api/ssh/admin/globals` or `/api/ssh/admin/connections/:id`)

Either being true allows private/loopback. Both default false.
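
To make the IP-policy in step 2 concrete, here is a minimal IPv4-only sketch. It is not the actual `ssrfStrict` implementation: `parseIpv4`, `inCidr`, and `isForbiddenIpv4` are hypothetical names, IPv6 ranges are left out, and treating unparseable input as forbidden is an assumption.

```typescript
// Parse dotted-quad IPv4 into a number, or null if malformed.
function parseIpv4(addr: string): number | null {
  const parts = addr.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// IPv4 ranges from the policy list above: [network, prefix length].
const BLOCKED: Array<[string, number]> = [
  ["0.0.0.0", 32],
  ["127.0.0.0", 8],
  ["10.0.0.0", 8],
  ["172.16.0.0", 12],
  ["192.168.0.0", 16],
  ["169.254.0.0", 16],
];

function inCidr(ip: number, base: string, bits: number): boolean {
  const baseValue = parseIpv4(base);
  if (baseValue === null) return false;
  // Plain arithmetic avoids signed 32-bit surprises from bitwise operators.
  const blockSize = Math.pow(2, 32 - bits);
  return Math.floor(ip / blockSize) === Math.floor(baseValue / blockSize);
}

function isForbiddenIpv4(addr: string, allowPrivate: boolean): boolean {
  const ip = parseIpv4(addr);
  if (ip === null) return true; // assumption: fail closed on bad input
  if (allowPrivate) return false; // opt-in short-circuits the range checks
  return BLOCKED.some(([base, bits]) => inCidr(ip, base, bits));
}
```

In the real flow this check would run on every resolved address, and the connection would then use the pinned IP that passed it.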

### Algorithm allowlist

Hard-coded in `src/ssh/session.ts`:

| Category | Allowed |
|---|---|
| Key exchange | `curve25519-sha256`, `curve25519-sha256@libssh.org`, `ecdh-sha2-nistp256/384/521`, `diffie-hellman-group14/16/18-sha256/512` |
| Server host key | `ssh-ed25519`, `rsa-sha2-256`, `rsa-sha2-512`, `ecdsa-sha2-nistp256/384/521` |
| Cipher | `aes256-gcm@openssh.com`, `aes128-gcm@openssh.com`, `aes256-ctr`, `aes192-ctr`, `aes128-ctr` |
| HMAC | `hmac-sha2-512-etm@openssh.com`, `hmac-sha2-256-etm@openssh.com`, `hmac-sha2-512`, `hmac-sha2-256` |

Notably banned: `ssh-rsa` (SHA1), `ssh-dss`, all `arcfour*`, `hmac-md5*`,
`hmac-sha1*`. Mismatch returns `host_key_alg_not_allowed` or
`auth_failed` depending on which stage caught it.

## Audit Log

Single table: `ssh_audit_log`. Every SSH operation writes here.

### Lifecycle

```
begin (outcome=pending) [commits before remote call]
   ↓
remote call (ssh2 connect / exec / sftp)
   ↓
complete (outcome=success|failed|denied|aborted) [updates same row]
```

If the orchestrator crashes between `begin` and `complete`, the row
stays `pending`. On next boot, the recovery sweep (`src/ssh/recovery.ts`)
updates pending rows older than 10 minutes to `aborted` with
`detail.recovered_at` set.
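
The recovery sweep can be sketched over in-memory rows as follows. The `AuditRow` shape, the `startedAt` field, and the `recoverStalePending` name are hypothetical; only the 10-minute threshold, the `aborted` outcome, and `detail.recovered_at` come from the description above.

```typescript
type Outcome = "pending" | "success" | "failed" | "denied" | "aborted";

// Hypothetical in-memory view of an ssh_audit_log row.
interface AuditRow {
  id: string;
  outcome: Outcome;
  startedAt: string; // ISO8601; column name is an assumption
  detail: Record<string, unknown>;
}

// Marks stale pending rows as aborted and returns how many were changed.
function recoverStalePending(rows: AuditRow[], now: Date, maxAgeMinutes: number = 10): number {
  const cutoff = now.getTime() - maxAgeMinutes * 60 * 1000;
  let recovered = 0;
  for (const row of rows) {
    if (row.outcome === "pending" && Date.parse(row.startedAt) < cutoff) {
      row.outcome = "aborted";
      row.detail = { ...row.detail, recovered_at: now.toISOString() };
      recovered += 1;
    }
  }
  return recovered;
}
```

The age threshold keeps the sweep from aborting rows that belong to calls still legitimately in flight.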

### Actions

| Action | Triggered by |
|---|---|
| `ssh.exec` | `SshExec` |
| `ssh.upload` | `SshUpload` |
| `ssh.download` | `SshDownload` |
| `ssh.connection.upsert` | User/admin connection create/edit |
| `ssh.connection.delete` | User/admin connection delete |
| `ssh.connection.host_key.first_observe` | TOFU first observation |
| `ssh.connection.host_key.mismatch` | TOFU mismatch |
| `ssh.connection.host_key.tofu_record` | Internal helper write |
| `ssh.connection.host_key.verify` | User `/verify-host-key` |
| `ssh.connection.host_key.replace` | User `/replace-host-key` |
| `ssh.connection.disable` | Admin disable |
| `ssh.connection.enable` | Admin enable |
| `ssh.abuse.unlock_manual` | Admin force-unlock |
| `ssh.grant.create` | Admin grant create |
| `ssh.grant.delete` | Admin grant delete |
| `ssh.master_key.rotate.start` | Admin rotation start |

### Detail column

JSON blob with action-specific fields:

- `ssh.exec`: `{command_hash: "abc...", exit_code: 0, stdout_bytes: 123, stderr_bytes: 0, truncated: false}`
- `ssh.upload`: `{local_path: "...", remote_path: "...", bytes: 4096}`
- `ssh.download`: same shape
- `ssh.connection.host_key.first_observe`: `{fingerprint: "SHA256:...", pending_token: "uuid"}`
- All denied: `{error: "no_grant" | "abuse_locked" | "disabled" | ...}`

The `ssh.exec` action does **not** record the command string — only its
SHA-256 hash (16-char hex prefix) to avoid leaking secrets / PII. If you
need to investigate a specific exec, correlate the hash with stdin
logs from the LLM activity log (workspace `logs/activity.log`).
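
When correlating, you can recompute the hash of a candidate command and compare it to `detail.command_hash`. The sketch below assumes the stored value is the first 16 hex characters of the SHA-256 digest of the UTF-8 command string; the exact input encoding used by the orchestrator is not confirmed here, and `commandHash` is a hypothetical name.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: first 16 hex characters of SHA-256(command).
function commandHash(command: string): string {
  return createHash("sha256").update(command, "utf8").digest("hex").slice(0, 16);
}

// Compare a candidate from the activity log against an audit row's hash.
function matchesAuditHash(candidate: string, auditHash: string): boolean {
  return commandHash(candidate) === auditHash;
}
```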

### Retention

`ssh.audit_retention_days` (default 90) controls a lazy sweep. Admin can
trigger pruning manually from the UI. There is no hard cap on table
size — disk-fill is mitigated by the hashing + truncation strategy
above, plus admin-driven cleanup.

## Abuse Counters & Lock

Defends against credential spraying, mistyped scripts in loops, and
brute-force scans.

### Three scopes

| Scope | Key | When |
|---|---|---|
| `user` | `(user_id,)` | Any SSH failure by this user |
| `host:user` | `(host, username)` | Failure on this (host, username) tuple |
| `host` | `(host,)` | Failure on this host (global connections only) |

The `host` scope intentionally **only locks on failures from global
connections** to prevent cross-user DoS: a user repeatedly failing on
their own connection cannot lock out other users from a shared host.
For user-owned connections, the `host` counter is still updated, but
for admin notification only; no lock applies.

### Algorithm

```
on failure:
  for each scope:
    increment counter
    if count(within abuseWindowMinutes) >= abuseFailureThreshold:
      lock until now + abuseLockMinutes

on success (user scope only):
  reset user counter
  other scopes age out naturally with the window
```

Counters are stored in `ssh_abuse_counters`, with separate columns per
scope. All updates are transactional (no UPSERT race).
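
A single scope's counter can be sketched as a rolling window of failure timestamps. This is an in-memory illustration, not the `ssh_abuse_counters` schema: `ScopeCounter` and its methods are hypothetical, and leaving an active lock in place on `reset()` is an assumption.

```typescript
interface AbuseConfig {
  abuseWindowMinutes: number;
  abuseFailureThreshold: number;
  abuseLockMinutes: number;
}

const MINUTE_MS = 60 * 1000;

// Hypothetical counter for one scope key (user, host:user, or host).
class ScopeCounter {
  private failures: number[] = []; // failure times in epoch milliseconds
  private lockedUntil = 0;

  recordFailure(nowMs: number, cfg: AbuseConfig): void {
    const windowStart = nowMs - cfg.abuseWindowMinutes * MINUTE_MS;
    // Drop failures that have aged out of the rolling window.
    this.failures = this.failures.filter((t) => t > windowStart);
    this.failures.push(nowMs);
    if (this.failures.length >= cfg.abuseFailureThreshold) {
      this.lockedUntil = nowMs + cfg.abuseLockMinutes * MINUTE_MS;
    }
  }

  // Called on success for the user scope; an existing lock is not lifted.
  reset(): void {
    this.failures = [];
  }

  isLocked(nowMs: number): boolean {
    return nowMs < this.lockedUntil;
  }
}
```

With the defaults (window 10, threshold 5, lock 30), five failures within ten minutes lock the scope for thirty minutes from the fifth failure.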

### Force-unlock

Admin can force-unlock from `SshGlobalConnectionsForm` or per-connection
admin page:

```
POST /api/ssh/admin/connections/:id/force-unlock
  {reason: "Confirmed credentials rotated; user retried with old key"}
```

Rate-limited to 10/hour total across all admins (`admin-rate-limit.ts`,
token bucket). Audited as `ssh.abuse.unlock_manual`.

## Master Key Rotation

Replaces `MCP_ENCRYPTION_KEY` and rewraps every row's DEK under the new
key. This is **the** way to rotate the master key — do not edit the env
var manually.

### Flow

1. **Admin starts** via `POST /api/ssh/admin/rotate-master-key`:
   ```json
   {"new_key_hex": "<64-hex>", "reason": "Annual rotation"}
   ```
2. **Maintenance mode engages**: after `sshMaintenance.enter()`, all SSH
   write endpoints return 503 (read endpoints stay alive). The LLM
   sees `SSH subsystem is in maintenance` errors for tool calls.
3. **Per-row rewrap**:
   - For each `ssh_connections` row: decrypt DEK under old key, re-encrypt under new key, bump `key_version`, commit (one tx per row)
   - For each `system_deks` row: same
4. **New key validated** by decrypting a test value
5. **Maintenance exits** automatically
6. **Caller polls** `GET /api/ssh/admin/rotate-master-key/:jobId` for status (`running` / `succeeded` / `failed`)

### Failure modes

- **Crash mid-rotation**: rows have mixed `key_version`. Next boot detects this and stays in maintenance until a follow-up rotation completes. The admin must re-issue the rotation with the new key.
- **Wrong old key**: the first row decryption fails → job aborts before any change, maintenance exits, audit records `ssh.master_key.rotate.start` with `outcome=failed`.
- **Disk write fails mid-row**: that single row is rolled back; rotation continues. Operator must re-run.

The rotation job runs in-process (not as a separate worker). For large
fleets (>1000 rows) expect 1-2s per row of decrypt+encrypt+write.

### `MCP_ENCRYPTION_KEY` env var

After successful rotation, **the env var must be updated to the new
key** before the next restart. The orchestrator writes the new key to
the audit log (encrypted under the OLD key) and returns it once in the
HTTP response — there's no second chance. Update your secrets store
immediately.

## Operator Runbook

### A. Add a global (admin-managed) connection

```bash
# Via UI: Settings → SSH → Global Connections → Add
# Or via API (requires admin session cookie):

curl -X POST http://localhost:3000/api/ssh/admin/globals \
  -H 'Content-Type: application/json' \
  -d @- <<'JSON'
{
  "label": "prod-east-bastion",
  "host": "bastion.prod-east.example.com",
  "port": 22,
  "username": "deploy",
  "private_key_pem": "-----BEGIN OPENSSH PRIVATE KEY-----\n...\n-----END...",
  "passphrase": null,
  "remote_path_prefix": "/srv/deploy",
  "allow_private_addresses": false,
  "deny_patterns": "sudo\n^\\s*rm\\s+",
  "allow_patterns": "",
  "reason": "Production deploy bastion — owned by SRE"
}
JSON
```

Then verify the host key (section C) and grant access (section B).

### B. Grant org access to a global connection

```bash
curl -X POST http://localhost:3000/api/ssh/admin/grants \
  -H 'Content-Type: application/json' \
  -d '{
    "connection_id": "<uuid>",
    "subject_type": "org",
    "subject_id": "engineering",
    "piece_name": "prod-deploy",
    "applies_to_all_pieces": false,
    "reason": "Engineering org runs prod-deploy piece"
  }'
```

### C. Verify a TOFU first-observe

1. From the user's side or admin side, click **Test** in the `SshConnectionsPanel`
2. The response is `host_key_first_observe` with a SHA-256 fingerprint and pending token
3. **Verify externally** that the fingerprint matches the real server:
   ```bash
   ssh-keyscan -t ed25519 bastion.prod-east.example.com 2>/dev/null \
     | ssh-keygen -lf -
   ```
   Compare the resulting `SHA256:...` with what the UI shows
4. In the dialog, type the fingerprint to confirm and click **Verify**
5. Audit row `ssh.connection.host_key.verify` recorded; subsequent calls succeed

### D. Force-unlock a stuck connection

Symptom: user reports "SshExec returns `access denied (abuse_locked)`"

1. Settings → SSH → Global Connections → click the row → "Locks" section
2. Inspect the counter state (which scope is locked, until when)
3. If an early unlock is genuinely needed (e.g. the user fixed the bad credentials), click **Force unlock** and enter a reason
4. If suspicious (unexplained 5+ failures), investigate audit log first

### E. Rotate the master key

```bash
NEW_KEY=$(openssl rand -hex 32)

# Start rotation
JOB=$(curl -s -X POST http://localhost:3000/api/ssh/admin/rotate-master-key \
  -H 'Content-Type: application/json' \
  -d "{\"new_key_hex\":\"$NEW_KEY\",\"reason\":\"Q2 annual rotation\"}" \
  | jq -r .job_id)

# Poll until done
while true; do
  STATUS=$(curl -s http://localhost:3000/api/ssh/admin/rotate-master-key/$JOB \
    | jq -r .status)
  echo "$STATUS"
  [ "$STATUS" = "succeeded" ] && break
  [ "$STATUS" = "failed" ] && { echo FAILED; exit 1; }
  sleep 2
done

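# Update env var BEFORE next restart. Replace the existing entry instead of
# appending, so the file never carries both the old and the new key
# (assumes the file already has a MCP_ENCRYPTION_KEY= line; GNU sed syntax).
sed -i "s/^MCP_ENCRYPTION_KEY=.*/MCP_ENCRYPTION_KEY=$NEW_KEY/" /etc/orchestrator/secrets.env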
```

### F. Prune old audit logs

Settings → SSH → Audit Log → "Prune older than N days" (defaults to the
config retention value). Or via API:

```bash
curl -X DELETE 'http://localhost:3000/api/ssh/admin/audit?older_than_days=90'
```

## Troubleshooting

### Symptom → cause table

| Error | Common cause | Fix |
|---|---|---|
| `SSH is disabled` (503) | `ssh.enabled: false` | Set it to `true`; no restart required |
| `SSH subsystem is in maintenance` | Master key rotation in progress | Wait for job to complete, or check rotation log |
| `access denied (no_grant)` | User lacks grant for connection | Admin creates a grant, or user uses an owned connection |
| `access denied (disabled)` | Admin disabled the connection | Admin re-enables, or use different connection |
| `access denied (abuse_locked)` | Counter triggered | Wait for lock window, or admin force-unlocks |
| `piece "X" does not list connection Y` | `allowed_ssh_connections` missing UUID | Add UUID to the movement's `allowed_ssh_connections` |
| `host_key_first_observe` | First time exercising connection | Verify fingerprint in UI |
| `host_key_not_verified` | Key recorded but never verified | Click Verify in UI |
| `host_key_mismatch` | Server key changed | Investigate (legitimate rotation? MITM?), then Replace via UI |
| `host_key_alg_not_allowed` | Server using SHA1-RSA etc. | Upgrade server to ed25519 / rsa-sha2-256 |
| `auth_failed` | Wrong key, wrong username | Re-check connection settings |
| `connect_timeout` | Network unreachable, firewall | Check from host, check SSRF policy |
| `exec_timeout` | Long-running command | Increase `timeout_ms`, or run in background and `Download` results |
| `output_too_large` | stdout > 32 KiB | Filter the command, or write to file and `Download` |
| `forbidden_address` | Target is private IP, no opt-in | Set `allow_private_addresses` per-connection or globally |
| `system_dek_verify_failed` (log) | `MCP_ENCRYPTION_KEY` changed without rotation flow | Stop server, restore old key OR re-rotate via flow |

### Where to look

| Question | Source |
|---|---|
| What did the LLM try to do? | `logs/activity.log` in the job's workspace |
| What did SSH do? | `ssh_audit_log` (Admin UI or SQL) |
| Was it actually denied at the SSH layer? | Audit row `outcome` = `denied` |
| What was the exit code? | Audit `detail.exit_code` (for `ssh.exec`) |
| Did it crash mid-call? | Audit `outcome` = `aborted` (recovery sweep) |
| Why was the host key flagged? | Audit `ssh.connection.host_key.*` rows |
| Who has access to a connection? | `ssh_connection_grants` filtered by `connection_id` |
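
When the Admin UI is not available, the last rows of the table can be answered with direct SQL. This is a sketch only: the table and column names are the ones named in this runbook, while the helper names, the database path variable `MAESTRO_DB`, and the assumption that connection IDs are stored as hyphenated UUID text are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: build read-only queries for the tables named in this runbook.
# Assumption: connection IDs are stored as hyphenated UUID strings.

UUID_RE='^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'

is_uuid() {
  [[ "$1" =~ $UUID_RE ]]
}

# Who has access to a connection?
grants_for_connection_sql() {
  local conn_id="$1"
  # Only a strict UUID is interpolated, so the quoted literal is safe.
  is_uuid "$conn_id" || { echo "not a UUID: $conn_id" >&2; return 2; }
  printf "SELECT * FROM ssh_connection_grants WHERE connection_id = '%s';" "$conn_id"
}

# Was it actually denied at the SSH layer?
denied_rows_sql() {
  printf "SELECT * FROM ssh_audit_log WHERE outcome = 'denied';"
}
```

Either query can then be run with something like `sqlite3 -readonly "$MAESTRO_DB" "$(grants_for_connection_sql "$id")"`, where `MAESTRO_DB` points at the orchestrator's database file.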

## Security Model Summary

Detailed threat model + risk register: see plan doc §"Security Design
Deep-Dive (rev 3)" and §"Risk Register (rev 3)".

Key points operators must understand:

1. **The orchestrator is a credential proxy.** Anyone with admin rights
   can read connection plaintext (via the rotation flow, which decrypts
   server-side). Treat admin access as production-credential-equivalent.

2. **TOFU is the floor, not the ceiling.** First-observe is unauthenticated.
   For high-stakes targets, pre-populate `host_key_b64` from a trusted
   bootstrap (e.g. baked into the connection at create time via the
   `host_key_b64` field) rather than relying on the orchestrator's first
   observation.

3. **The deny-list is not a sandbox.** Built-in patterns catch obvious
   misuse. Real isolation requires connection-level configuration
   (restricted shell account, `remote_path_prefix`, narrow `allow_patterns`)
   and target-side controls.

4. **Audit log is local-only.** No HMAC chain (acknowledged limitation
   R-audit-tamper). For tamper-evidence, ship `ssh_audit_log` rows to an
   external SIEM via SQLite hooks or periodic export.

5. **ssh2 internal key retention** (R-ssh2-leak): the PEM lives in JS
   heap for the connection lifetime. Process compromise reveals plaintext
   credentials. Mitigations: short-lived processes, separate worker
   per high-stakes connection.

6. **Master key compromise = total compromise.** Key rotation does not
   invalidate already-leaked encrypted material: if an attacker has both
   the DB and the old master key, all stored creds are theirs. On suspected
   compromise, rotate the master key immediately AND rotate every stored
   credential on the target side.
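
The periodic export mentioned in point 4 could look roughly like the sketch below, intended to run from cron or a systemd timer. It assumes `ssh_audit_log` is an ordinary SQLite rowid table and that the `sqlite3` CLI is version 3.33 or later (for `-json`). The function names, cursor file, and output directory are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: incremental export of ssh_audit_log rows for shipping to a SIEM.
# Assumptions: ordinary rowid table; sqlite3 >= 3.33 for the -json flag.

# Build the query for rows with from < rowid <= to.
# Both bounds must be plain non-negative integers before interpolation.
build_export_query() {
  local from="$1" to="$2" v
  for v in "$from" "$to"; do
    case "$v" in
      ''|*[!0-9]*) echo "bounds must be non-negative integers" >&2; return 2 ;;
    esac
  done
  printf 'SELECT rowid AS _rowid, * FROM ssh_audit_log WHERE rowid > %s AND rowid <= %s ORDER BY rowid;' "$from" "$to"
}

# Export rows added since the last run, then advance the cursor.
export_audit_rows() {
  local db="$1" cursor_file="$2" out_dir="$3"
  local from to query
  from="$(cat "$cursor_file" 2>/dev/null || echo 0)"
  # Fix the upper bound first so rows inserted mid-export are not skipped.
  to="$(sqlite3 -readonly "$db" 'SELECT COALESCE(MAX(rowid), 0) FROM ssh_audit_log;')" || return $?
  query="$(build_export_query "$from" "$to")" || return $?
  sqlite3 -readonly -json "$db" "$query" > "${out_dir}/ssh_audit_${from}_${to}.json" || return $?
  echo "$to" > "$cursor_file"
}
```

One limitation of this approach: a row exported while still `pending` is not exported again once it reaches its final outcome, so a real pipeline would need to re-read recent rows or export only completed ones.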

## HTTP API Reference

User router: mounted at `/api/ssh` — requires `requireAuth`.

| Method | Path | Purpose |
|---|---|---|
| GET | `/connections` | List own connections + grant-visible globals |
| POST | `/connections` | Create user-owned connection |
| GET | `/connections/:id` | Read |
| PATCH | `/connections/:id` | Edit (owner only) |
| DELETE | `/connections/:id` | Delete (owner only) |
| POST | `/connections/:id/test` | Trigger TOFU observation / verify path |
| POST | `/connections/:id/verify-host-key` | Atomic verify (token + fingerprint) |
| POST | `/connections/:id/replace-host-key` | Atomic replace (token + fingerprint + reason) |
| GET | `/connections/:id/audit` | Owner's view of audit rows for this connection |
| GET | `/grants/visible-to-me` | List grants visible (subject=user or matching org) |

Admin router: mounted at `/api/ssh/admin` — requires `requireAdmin`.

| Method | Path | Purpose |
|---|---|---|
| GET | `/connections` | All connections (cross-tenant) |
| GET | `/connections/:id` | Admin read |
| PATCH | `/connections/:id/disable` | Soft-disable (audited; reason required) |
| PATCH | `/connections/:id/enable` | Re-enable |
| DELETE | `/connections/:id` | Hard-delete |
| POST | `/connections/:id/force-unlock` | Clear abuse counter (rate-limited; reason required) |
| POST | `/globals` | Create global connection |
| PATCH | `/globals/:id` | Edit global |
| DELETE | `/globals/:id` | Delete global |
| GET | `/grants` | List all grants |
| POST | `/grants` | Create grant |
| DELETE | `/grants/:id` | Delete grant |
| POST | `/rotate-master-key` | Start master key rotation |
| GET | `/rotate-master-key/:jobId` | Poll rotation status |
| GET | `/audit` | All-tenant audit view (paginated) |

All admin write endpoints require:
- `requireAdmin` middleware
- `maintenance503()` guard (rejects writes during rotation)
- `validateReason()` on `body.reason` (≥ 8 chars)
- `auditRepo.beginAndComplete()`, which records both success and failure
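
As an illustration of the reason requirement, here is a minimal sketch of a force-unlock call that checks the reason length client-side before sending it. The helper names and the `CURL_AUTH_ARGS` array are hypothetical; how `requireAdmin` authenticates a request is not covered here, so credentials are passed through as caller-supplied curl arguments. Sending the reason as a JSON `reason` field follows from `body.reason` above.

```bash
#!/usr/bin/env bash
# Sketch: call an admin write endpoint with a validated reason.
# Assumption: credentials are supplied by the caller as extra curl args,
# e.g. CURL_AUTH_ARGS=(-b cookies.txt). Empty by default.
BASE_URL="${BASE_URL:-http://localhost:3000}"
CURL_AUTH_ARGS=("${CURL_AUTH_ARGS[@]}")

# Mirror the documented rule: reason must be at least 8 characters.
reason_ok() {
  local reason="$1"
  [ "${#reason}" -ge 8 ]
}

# POST /api/ssh/admin/connections/:id/force-unlock
force_unlock() {
  local conn_id="$1" reason="$2"
  if ! reason_ok "$reason"; then
    echo "reason must be at least 8 characters" >&2
    return 2
  fi
  # The reason is interpolated verbatim: avoid double quotes and
  # backslashes in it, or build the body with a JSON tool instead.
  curl -sS -X POST "${CURL_AUTH_ARGS[@]}" \
    -H 'Content-Type: application/json' \
    --data "{\"reason\": \"${reason}\"}" \
    "${BASE_URL}/api/ssh/admin/connections/${conn_id}/force-unlock"
}
```

A call such as `force_unlock "$CONNECTION_ID" "false positive after key rollout"` would then send the request, while a too-short reason is rejected locally with exit status 2.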

## SSH Console (Interactive)

Enable with `ssh.console.enabled: true`.

- One task = one PTY session. Shell state persists across jobs
- An `SSH` tab appears in the task detail view (only when the piece allows `SshConsole*`)
- WebSocket: `/api/local/tasks/:taskId/console/ws`
- REST status: `GET /api/local/tasks/:taskId/console/status`
- Audit: `ssh.console.{open,send,snapshot,resize,input_rejected,close}`
- Auto-close: idle 30 min / duration 4 h / host disconnect / maintenance / admin kill
- At most 3 sessions per connection (oldest evicted first)

Admin: list sessions with `GET /api/admin/ssh/console-sessions`; kill one with `POST /api/admin/ssh/console-sessions/:taskId/kill` (admin role only).
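
The two admin console endpoints can be wrapped in small shell helpers. This is a sketch under assumptions: the helper names and the `CURL_AUTH_ARGS` array (caller-supplied curl arguments carrying admin credentials) are hypothetical, and whether the kill endpoint expects a request body is not stated here, so none is sent.

```bash
#!/usr/bin/env bash
# Sketch: list and kill SSH console sessions via the admin endpoints.
# Assumption: credentials are passed as extra curl args by the caller.
BASE_URL="${BASE_URL:-http://localhost:3000}"
CURL_AUTH_ARGS=("${CURL_AUTH_ARGS[@]}")

console_sessions_url() {
  printf '%s/api/admin/ssh/console-sessions\n' "$BASE_URL"
}

# Build the kill URL; reject empty IDs and IDs containing a slash.
console_kill_url() {
  local task_id="$1"
  case "$task_id" in
    ''|*/*) echo "invalid task id" >&2; return 2 ;;
  esac
  printf '%s/%s/kill\n' "$(console_sessions_url)" "$task_id"
}

list_console_sessions() {
  curl -sS "${CURL_AUTH_ARGS[@]}" "$(console_sessions_url)"
}

kill_console_session() {
  local url
  url="$(console_kill_url "$1")" || return $?
  curl -sS -X POST "${CURL_AUTH_ARGS[@]}" "$url"
}
```

Running `list_console_sessions` first and then `kill_console_session "$TASK_ID"` keeps the kill targeted at a session that is known to exist.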

## See Also

- [docs/tools/ssh-tools.md](./tools/ssh-tools.md) — LLM-facing tool semantics
- [docs/tools/ssh-console-tools.md](./tools/ssh-console-tools.md) — SSH Console tool semantics (Ensure/Send/Snapshot)
- [docs/mcp.md](./mcp.md) — MCP integration (shares `MCP_ENCRYPTION_KEY`)
- [docs/maintenance-checklist.md](./maintenance-checklist.md) §12 — checklist for SSH-related code changes