oss-sync f5c7666f6b feat: initial public release (MAESTRO)

2026-06-03 05:08:00 +00:00


SSH Subsystem (Operator Runbook)

The orchestrator can run shell commands on remote servers (SshExec) and move files between the workspace and remote hosts (SshUpload/SshDownload) through a dedicated, audited SSH subsystem. Like the MCP integration, the feature is off by default and requires a key + config flip to enable.

This document is the operator runbook for setting up, granting access, verifying host keys, rotating the master key, and troubleshooting. For the LLM-facing tool semantics see docs/tools/ssh-tools.md. For the internal design (threat model, risk register, schema, 12-step orchestration flow) see the plan doc.

At a glance

Aspect	Behavior
Default	`ssh.enabled: false` — tools hidden, panels hidden, API returns 503
Tools exposed when enabled	`SshExec`, `SshUpload`, `SshDownload`
Authentication	Public-key only; passwords are not supported
Host key trust	TOFU (Trust-On-First-Use) with explicit verify; mismatch fails closed
Connection ownership	User-owned (private) or Global (admin-managed, shared via grants)
Encryption at rest	AES-256-GCM (per-row DEK, master key = `MCP_ENCRYPTION_KEY`)
Audit	Dedicated `ssh_audit_log` table, `pending → success/failed/denied/aborted` lifecycle
Abuse defense	3-scope counters (`user` / `host:user` / `host`) with auto-lock
Network policy	SSRF strict by default; per-connection opt-in for private IPs
Algorithm policy	Strict allowlist (no SHA1-RSA, no weak DH/HMAC)

Prerequisites

1. `MCP_ENCRYPTION_KEY`

The SSH subsystem shares the same master key as MCP — there is only one key per orchestrator. All private keys, passphrases, and global-connection DEKs are encrypted with AES-256-GCM under a per-row DEK, and each DEK is wrapped by this master key.

Generate it once (32 bytes = 64 hex chars):

openssl rand -hex 32

Export it before starting the server:

export MCP_ENCRYPTION_KEY=<the 64-hex output>
scripts/server.sh start

If MCP_ENCRYPTION_KEY is not set when ssh.enabled: true, the SSH subsystem boots fail-soft: a warning is logged, all SSH endpoints return 503, the tools are hidden from the LLM, and the UI panels show a configuration error banner. Other features (MCP excepted) continue normally.

⚠ Changing the key by hand invalidates existing encrypted material. Use the built-in master key rotation flow, which rewraps every row in maintenance mode. Do not swap the env var manually outside that flow: a half-rotated state breaks every connection.

2. `ssh.enabled: true`

Flip the flag in config.yaml:

ssh:
  enabled: true

This is the master switch. With it false:

HTTP endpoints (/api/ssh/*, /api/ssh/admin/*) return 503
Tool defs are not exposed to the LLM (the dispatcher returns null)
UI panels render an "SSH is disabled" empty state
Database tables remain present (no destructive change)

Restart is not required — ConfigManager reload picks up the change and rebuilds the SSH router.

3. `system_deks` bootstrap

The first time the orchestrator boots with ssh.enabled: true AND a valid MCP_ENCRYPTION_KEY, it provisions a single row in system_deks (via INSERT OR IGNORE inside a transaction, CHECK(id=1)). This DEK encrypts global connections (those without an owner).

On every subsequent boot, verifySystemDek decrypts the stored DEK to prove the master key still works. If it fails (key rotated outside the rotation flow, or env var differs from when the DEK was wrapped), SSH fails closed for the session and a system_dek_verify_failed error is logged. User-owned connections may still partially work (their DEKs are wrapped per user), but global connections will all error.

4. Optional: `allow_private_addresses`

By default, SSH connections are routed through the SSRF strict-check, which blocks loopback (127.0.0.0/8, ::1), private (10/8, 172.16/12, 192.168/16, fc00::/7), and link-local (169.254/16) addresses. For LAN targets you must opt in.

There are two scopes:

ssh:
  enabled: true
  allow_private_addresses: true   # global default

-- per-connection opt-in (admin-only flag, audited)
UPDATE ssh_connections SET allow_private_addresses=1 WHERE id=?;

The per-connection flag is preferred — narrow the blast radius. The global flag exists for trusted dev networks (homelab, isolated VPC). The per-connection flag can only be set on global (admin-managed) connections; for user-owned connections, the global flag applies.

Quickstart

# 1. Set the key
openssl rand -hex 32 > ~/.mcp_encryption_key
export MCP_ENCRYPTION_KEY=$(cat ~/.mcp_encryption_key)

# 2. Enable SSH + allow LAN
cat >> config.yaml <<'YAML'
ssh:
  enabled: true
  allow_private_addresses: true   # only if you're targeting LAN
YAML

# 3. Restart
scripts/server.sh restart

Then in the UI:

Settings → User Folder → SSH Connections → Add
Fill label, host, port (default 22), username, paste private key (OpenSSH PEM)
Optionally set remote_path_prefix (default /) — restricts upload/download paths
Click Test → first call returns host_key_first_observe with a fingerprint
Verify in the dialog (compare fingerprint with what you expect from ssh-keyscan <host>)
Add the connection's UUID to a piece's allowed_ssh_connections:

# pieces/example.yaml
name: ssh-example
movements:
  - name: deploy
    allowed_tools: [SshExec, SshUpload]
    allowed_ssh_connections: ["abcd1234-..."]
    rules:
      - condition: done
        next: COMPLETE
    instruction: |
      Use SshExec to ...

Test the piece via the normal task UI.

`config.yaml` Reference

Full SSH section with defaults:

ssh:
  # master switch
  enabled: false

  # SSRF policy — when true, allow private/loopback addresses (global)
  allow_private_addresses: false

  # wall-clock timeout for connect + handshake + exec/transfer (seconds)
  call_timeout_seconds: 30

  # stdout/stderr byte cap for SshExec (bytes)
  max_output_bytes: 32768          # 32 KiB

  # SFTP transfer size caps (MB)
  max_upload_size_mb: 100
  max_download_size_mb: 100

  # ssh_audit_log retention (days). Admin can prune via UI.
  audit_retention_days: 90

  # When true (default), admins can use any connection without an explicit
  # grant. Audited regardless. Set false for stricter least-privilege.
  admin_bypasses_grants: true

  # Abuse counters
  abuse_window_minutes: 10        # rolling window for failure counting
  abuse_failure_threshold: 5      # failures within window → lock
  abuse_lock_minutes: 30          # lock duration on threshold breach

All keys translate to camelCase in SshRuntimeConfig (src/ssh/config.ts). The transformKeys helper in src/config.ts handles the conversion.
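As an illustration of the snake_case to camelCase translation described above, here is a minimal sketch. The names snakeToCamel and camelizeKeys are hypothetical stand-ins; the real transformKeys helper in src/config.ts may differ in signature and edge-case handling.

```typescript
// Hypothetical stand-in for transformKeys; not the project's actual code.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Turn "call_timeout_seconds" into "callTimeoutSeconds".
function snakeToCamel(key: string): string {
  return key.replace(/_([a-z0-9])/g, (_match, ch: string) => ch.toUpperCase());
}

// Recursively rename object keys; arrays and scalars pass through unchanged.
function camelizeKeys(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => camelizeKeys(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [k, v] of Object.entries(value)) {
      out[snakeToCamel(k)] = camelizeKeys(v);
    }
    return out;
  }
  return value;
}

const runtime = camelizeKeys({
  call_timeout_seconds: 30,
  max_output_bytes: 32768,
  admin_bypasses_grants: true,
});
console.log(runtime);
```

Under this sketch, abuse_window_minutes becomes abuseWindowMinutes, which matches the names used in the abuse-counter pseudocode later in this document.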

Connection Model

Owner

Each row in ssh_connections has an owner_id:

Owner	Visibility	Who creates
User-owned (`owner_id = userId`)	Only the owner; admin can also list but not edit	Any authenticated user (`POST /api/ssh/connections`)
Global (`owner_id IS NULL`)	All users see it in the picker (subject to grants)	Admin only (`POST /api/ssh/admin/globals`)

Global connections solve the "team-shared infra account" use case — a single set of credentials that multiple users invoke under their own identity, audited per user, gated by grants.

Encryption

For each connection:

Generate a fresh 32-byte DEK
Encrypt private_key_pem (and optionally passphrase) with the DEK
Wrap the DEK with the master key (MCP_ENCRYPTION_KEY)
Store: private_key_enc, private_key_dek_enc, key_version

key_version allows progressive rewrap during master key rotation (each row tracks which generation of master key its DEK is wrapped under). Global connections use the single system_deks row (id=1) rather than a per-row DEK.
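The envelope scheme above (encrypt the data with a DEK, wrap the DEK with the master key) can be sketched with Node's built-in crypto module. This is a minimal illustration under stated assumptions: the Sealed shape, the seal/unseal names, and the 12-byte nonce are choices made for the example, not the orchestrator's actual storage format.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical container for one AES-256-GCM ciphertext.
interface Sealed {
  iv: Buffer;
  tag: Buffer;
  data: Buffer;
}

function seal(key: Buffer, plaintext: Buffer): Sealed {
  const iv = randomBytes(12); // 96-bit nonce, the usual size for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), data };
}

function unseal(key: Buffer, sealed: Sealed): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, sealed.iv);
  decipher.setAuthTag(sealed.tag);
  // final() throws if the key is wrong or the ciphertext was tampered with.
  return Buffer.concat([decipher.update(sealed.data), decipher.final()]);
}

// Write path: fresh DEK per row, DEK wrapped by the master key.
const masterKey = randomBytes(32); // stands in for the decoded MCP_ENCRYPTION_KEY
const dek = randomBytes(32);
const pemPlaceholder = Buffer.from("-----BEGIN OPENSSH PRIVATE KEY-----\n(placeholder)\n", "utf8");
const privateKeyEnc = seal(dek, pemPlaceholder);
const privateKeyDekEnc = seal(masterKey, dek);

// Read path: unwrap the DEK, decrypt the key material, then zero the DEK copies.
const recoveredDek = unseal(masterKey, privateKeyDekEnc);
const pem = unseal(recoveredDek, privateKeyEnc);
recoveredDek.fill(0);
dek.fill(0);
```

Because only the DEK is wrapped by the master key, a rotation needs to rewrap one small value per row rather than re-encrypt the key material itself.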

⚠ ssh2's internal key handling is opaque — once the PEM is loaded into the library, it lives in JS heap memory for the lifetime of the connection. We Buffer.fill(0, 0) our copies in the finally block but cannot reach into ssh2 internals. This is an acknowledged limitation; see the plan doc's "Acknowledged limitations" section.

UI Walkthrough

User: Settings → User Folder → SSH Connections

The SshConnectionsPanel lists the user's connections and any global connections they have a grant for. Each row shows:

Label + host:port + username
Host key fingerprint + verify state (verified / pending / first_observe / mismatch)
Lock state (if abuse counter triggered)
Actions: Test, Verify host key, Replace host key (with reason), Edit, Delete

The "Add Connection" form (SshConnectionForm) collects:

Label
Host, port, username
Private key (textarea — PEM format; passphrase optional)
Remote path prefix (default /)
Custom deny/allow regex patterns (newline-separated, validated at save-time)

SshHostKeyDialog opens on first_observe / mismatch and shows the observed fingerprint side-by-side with the previously-stored one (if any). "Trust this key" requires typing the fingerprint to confirm.

Admin: Settings → SSH

Four sub-panels under SshForm:

Panel	Component	Purpose
Global Connections	`SshGlobalConnectionsForm`	CRUD on global connections (`owner_id IS NULL`). Includes the `allow_remote_unrestricted` and per-connection `allow_private_addresses` flags
Grants	`SshGrantsForm`	List/create/delete grants. Per-piece or `applies_to_all_pieces`. Subject: user or org. Reason required
Audit Log	`SshAuditLog`	All-tenant audit view. Filter by action / outcome / connection / time range. Pagination
Master Key Rotation	`SshMasterKeyRotationForm`	Start a rotation job (provides new key, enters maintenance, rewraps rows). Polls status

Admin can also force-unlock abuse counters from the per-connection page (requires reason; rate-limited to 10/hour total).

Host Key TOFU Flow

SSH security depends on knowing the right host key. We use Trust-On-First-Use: the first time a connection is exercised, we record the observed key and require explicit user verification before treating it as trusted.

States

ssh_connections carries four host-key columns:

Column	Meaning
`host_key_b64`	The observed public key in OpenSSH base64 form. NULL = never observed.
`host_key_fingerprint`	SHA-256 fingerprint for UI display (`SHA256:...`).
`host_key_verified_at`	ISO8601 timestamp of the user's explicit "trust this key" action. NULL = pending.
`host_key_pending_token`	UUID issued at first_observe / mismatch; consumed atomically by `/verify-host-key`.

A connection is trusted iff host_key_verified_at IS NOT NULL AND the observed key during connect matches host_key_b64.

Lifecycle

new connection
   │
   ├─ user clicks Test (or LLM calls SshExec)
   │     │
   │     ▼
   │   sshTest() observes the host key
   │     │
   │     ▼
   │   onFirstObserve hook fires
   │     - writes host_key_b64, host_key_fingerprint, host_key_pending_token
   │     - audit row: ssh.connection.host_key.first_observe
   │     - returns SshSessionError('host_key_first_observe')
   │
   │   (UI shows the fingerprint + pending token)
   │
   ├─ user clicks Verify (typing fingerprint to confirm)
   │     │
   │     ▼
   │   POST /api/ssh/connections/:id/verify-host-key
   │     {token, fingerprint}
   │     - atomic compare-and-set: token + fingerprint match → set host_key_verified_at
   │     - audit row: ssh.connection.host_key.verify
   │
   ▼
verified — Exec/Upload/Download now work

On host_key_mismatch (server rebuilt, key rotated, or MITM):

   ├─ Exec/Upload/Download calls sshExec/sshUpload/sshDownload
   │     │
   │     ▼
   │   ssh2 observes key ≠ host_key_b64
   │     │
   │     ▼
   │   onMismatch hook fires
   │     - writes new host_key_pending_token (DOES NOT overwrite host_key_b64 yet)
   │     - audit row: ssh.connection.host_key.mismatch
   │     - returns SshSessionError('host_key_mismatch')
   │
   │   (UI shows OLD vs NEW fingerprint side-by-side)
   │
   ├─ user investigates externally (ssh-keyscan, IT team, etc.)
   │
   ├─ user clicks "Replace key" with reason
   │     │
   │     ▼
   │   POST /api/ssh/connections/:id/replace-host-key
   │     {token, fingerprint, reason}
   │     - atomic compare-and-set
   │     - writes new host_key_b64, host_key_fingerprint, host_key_verified_at
   │     - audit row: ssh.connection.host_key.replace

The pending token mechanism prevents a "verify swap" race: if a second TOFU observation happens between the user's verify request and its arrival, the old token is overwritten and the verify endpoint returns 409 stale_token.
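The stale-token behavior can be modeled in memory as follows. This is a sketch, not the real handler: the field names, the recordFirstObserve and verifyHostKey functions, and the choice to answer a fingerprint mismatch with the same 409 are assumptions. In the real system the check and the write would need to happen in one atomic database statement.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical in-memory view of the host-key columns.
interface HostKeyState {
  hostKeyFingerprint: string | null;
  hostKeyPendingToken: string | null;
  hostKeyVerifiedAt: string | null;
}

type VerifyResult = { status: 200 } | { status: 409; error: "stale_token" };

// Each observation issues a fresh token, which invalidates any earlier one.
function recordFirstObserve(row: HostKeyState, fingerprint: string): string {
  const token = randomUUID();
  row.hostKeyFingerprint = fingerprint;
  row.hostKeyPendingToken = token;
  row.hostKeyVerifiedAt = null;
  return token;
}

// Compare-and-set: succeed only if token and fingerprint both still match.
function verifyHostKey(row: HostKeyState, token: string, fingerprint: string): VerifyResult {
  if (row.hostKeyPendingToken !== token || row.hostKeyFingerprint !== fingerprint) {
    return { status: 409, error: "stale_token" };
  }
  row.hostKeyPendingToken = null; // the token is consumed exactly once
  row.hostKeyVerifiedAt = new Date().toISOString();
  return { status: 200 };
}

const row: HostKeyState = { hostKeyFingerprint: null, hostKeyPendingToken: null, hostKeyVerifiedAt: null };
const firstToken = recordFirstObserve(row, "SHA256:first-observed");
const secondToken = recordFirstObserve(row, "SHA256:second-observed"); // races with the user's verify
const staleAttempt = verifyHostKey(row, firstToken, "SHA256:first-observed");
const freshAttempt = verifyHostKey(row, secondToken, "SHA256:second-observed");
console.log(staleAttempt.status, freshAttempt.status);
```

The user who confirmed the first fingerprint gets a 409 and must review the newer observation, so a key they never saw cannot become trusted by accident.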

Banned algorithms

Even before TOFU completes, the host-key algorithm is checked against an allowlist. SHA1-RSA and other weak algorithms are rejected before the key is recorded (host_key_alg_not_allowed). This is hard-coded in src/ssh/session.ts to avoid misconfiguration.

Per-piece `allowed_ssh_connections`

A piece's movement must explicitly opt in to SSH usage. The piece-runner enforces three invariants:

If a movement's allowed_tools contains any SSH tool name (SshExec/SshUpload/SshDownload), allowed_ssh_connections must be declared on that movement (even if empty)
The field must be an array of strings
Each entry must be * or a lowercase hex+hyphen UUID (≥ 8 chars)

Lint failures abort piece load.
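The three invariants can be expressed as a small lint function. The sketch below is illustrative only: lintMovement, the Movement shape, and the error strings are hypothetical, and the regex is one reading of "lowercase hex+hyphen, at least 8 chars".

```typescript
const SSH_TOOLS = new Set(["SshExec", "SshUpload", "SshDownload"]);
const CONNECTION_ID = /^[0-9a-f-]{8,}$/; // assumed interpretation of the UUID rule

interface Movement {
  name: string;
  allowed_tools?: string[];
  allowed_ssh_connections?: unknown;
}

// Returns a list of lint errors; an empty list means the movement passes.
function lintMovement(m: Movement): string[] {
  const errors: string[] = [];
  const usesSsh = (m.allowed_tools ?? []).some((tool) => SSH_TOOLS.has(tool));
  const conns = m.allowed_ssh_connections;

  if (conns === undefined) {
    // Invariant 1: an SSH tool requires the field, even if it is empty.
    if (usesSsh) errors.push(`${m.name}: allowed_ssh_connections must be declared`);
    return errors;
  }
  // Invariant 2: the field must be an array of strings.
  if (!Array.isArray(conns) || !conns.every((c) => typeof c === "string")) {
    errors.push(`${m.name}: allowed_ssh_connections must be an array of strings`);
    return errors;
  }
  // Invariant 3: each entry is "*" or a lowercase hex+hyphen id.
  for (const entry of conns as string[]) {
    if (entry !== "*" && !CONNECTION_ID.test(entry)) {
      errors.push(`${m.name}: invalid connection id "${entry}"`);
    }
  }
  return errors;
}

console.log(lintMovement({ name: "deploy", allowed_tools: ["SshExec"] }));
```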

Forms

# Explicit allowlist (most common)
allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]

# Wildcard (admin-style — use sparingly)
allowed_ssh_connections: ["*"]

# Deny-all (still allows SSH tool in allowed_tools but refuses every UUID)
allowed_ssh_connections: []

The * form skips the per-piece check but does not skip the access grant check. A user without a grant for a given connection still cannot use it even when the piece says *.

Example

name: backup-rotation
description: Daily backup rotation on prod servers
movements:
  - name: list
    allowed_tools: [SshExec]
    allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]
    instruction: |
      List the existing backup files on each server.
    rules:
      - condition: ready to rotate
        next: rotate

  - name: rotate
    allowed_tools: [SshExec, SshUpload]
    allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]
    instruction: |
      Rotate the oldest backup ...
    rules:
      - condition: done
        next: COMPLETE

Access Grants

Grants connect a subject (user or org) to a connection, scoped to a piece (or all pieces, admin-only).

Schema

CREATE TABLE ssh_connection_grants (
  id TEXT PRIMARY KEY,
  connection_id TEXT NOT NULL,
  subject_type TEXT NOT NULL,        -- 'user' | 'org'
  subject_id TEXT NOT NULL,
  piece_name TEXT,                   -- NULL iff applies_to_all_pieces=1
  applies_to_all_pieces INTEGER NOT NULL DEFAULT 0,
  granted_by_user_id TEXT NOT NULL,
  reason TEXT NOT NULL,              -- required, ≥ 8 chars
  expires_at TEXT,                   -- ISO8601 or NULL
  created_at TEXT NOT NULL
);

Decision tree

For a given (userId, orgIds, connectionId, pieceName):

Owner check: if connection.owner_id == userId → access granted (owner of own connection)
Admin bypass: if user is admin AND ssh.admin_bypasses_grants: true → granted (audited)
Grant lookup:
- find rows where connection_id = ?
- subject matches (subject_type='user' AND subject_id=userId) OR (subject_type='org' AND subject_id IN orgIds)
- piece matches (applies_to_all_pieces=1 OR piece_name = ?)
- not expired (expires_at IS NULL OR expires_at > now())
- any matching row → granted
Otherwise → denied (access denied (no_grant))
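The decision tree above maps onto a single function. The following is a sketch under assumptions: the Grant and AccessRequest shapes are camelCase mirrors of the schema invented for this example, and decideAccess is a hypothetical name.

```typescript
interface Grant {
  connectionId: string;
  subjectType: "user" | "org";
  subjectId: string;
  pieceName: string | null;
  appliesToAllPieces: boolean;
  expiresAt: string | null; // ISO8601 or null
}

interface AccessRequest {
  userId: string;
  orgIds: string[];
  isAdmin: boolean;
  connection: { id: string; ownerId: string | null };
  pieceName: string;
}

type Decision =
  | { granted: true; via: "owner" | "admin_bypass" | "grant" }
  | { granted: false; error: "no_grant" };

function decideAccess(req: AccessRequest, grants: Grant[], adminBypassesGrants: boolean, now: Date): Decision {
  // 1. Owner check
  if (req.connection.ownerId === req.userId) return { granted: true, via: "owner" };
  // 2. Admin bypass (audited by the caller)
  if (req.isAdmin && adminBypassesGrants) return { granted: true, via: "admin_bypass" };
  // 3. Grant lookup: connection, subject, piece, and expiry must all match
  const matched = grants.some((g) => {
    const subjectOk =
      (g.subjectType === "user" && g.subjectId === req.userId) ||
      (g.subjectType === "org" && req.orgIds.includes(g.subjectId));
    const pieceOk = g.appliesToAllPieces || g.pieceName === req.pieceName;
    const notExpired = g.expiresAt === null || new Date(g.expiresAt).getTime() > now.getTime();
    return g.connectionId === req.connection.id && subjectOk && pieceOk && notExpired;
  });
  // 4. Otherwise denied
  return matched ? { granted: true, via: "grant" } : { granted: false, error: "no_grant" };
}

const exampleGrants: Grant[] = [
  { connectionId: "c1", subjectType: "org", subjectId: "engineering", pieceName: "prod-deploy", appliesToAllPieces: false, expiresAt: null },
  { connectionId: "c1", subjectType: "user", subjectId: "alice", pieceName: null, appliesToAllPieces: true, expiresAt: "2020-01-01T00:00:00Z" },
];
const exampleNow = new Date("2026-01-01T00:00:00Z");
const bob: AccessRequest = { userId: "bob", orgIds: ["engineering"], isAdmin: false, connection: { id: "c1", ownerId: null }, pieceName: "prod-deploy" };
console.log(decideAccess(bob, exampleGrants, true, exampleNow));
```

Passing now as a parameter keeps the expiry check testable and reflects that expiration is evaluated at decision time.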

Creating grants

Admin-only via POST /api/ssh/admin/grants:

{
  "connection_id": "abcd1234-...",
  "subject_type": "user",
  "subject_id": "alice",
  "piece_name": "backup-rotation",
  "applies_to_all_pieces": false,
  "reason": "Alice owns backups for prod-east cluster",
  "expires_at": null
}

For applies_to_all_pieces: true:

piece_name must be null
the admin endpoint requires explicit reason containing scope justification
audit row records action: ssh.grant.create with detail.applies_to_all=true
this is the highest-privilege grant — review carefully

Org grants

Same schema with subject_type: "org", subject_id: <gitea org name>. Membership comes from user_gitea_orgs (populated at login via Gitea OAuth). A user with multiple org memberships matches grants for any of those orgs.

Expiration

expires_at is checked at decision time (no background sweep). Expired rows remain in the table for audit purposes. Admin can delete them via DELETE /api/ssh/admin/grants/:id.

Path Policy

Local path (workspace)

For SshUpload.local_path and SshDownload.local_path:

Resolved against ctx.workspacePath (the job's workspace root)
.. traversal → reject
Symlinks: open with O_NOFOLLOW, lstat every parent → reject if any parent is a symlink leaving the workspace
For download: parent directory must exist; target file must NOT exist (O_CREAT | O_EXCL)

Remote path

For remote_path on upload/download:

Must be absolute (starts with /)
After POSIX normalization (path.posix.normalize), must start with the connection's remote_path_prefix
.. segments are collapsed by normalize; the post-normalize check catches escape attempts
No glob expansion — exact path only

Example: connection has remote_path_prefix = '/srv/agent'

Input	Normalized	Result
`/srv/agent/file.txt`	`/srv/agent/file.txt`	✅
`/srv/agent/sub/file.txt`	`/srv/agent/sub/file.txt`	✅
`/srv/agent/../etc/passwd`	`/etc/passwd`	❌ outside prefix
`/srv/agentish/file`	`/srv/agentish/file`	❌ prefix mismatch (not `/srv/agent/...`)
`file.txt` (relative)	n/a	❌ not absolute
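The table above can be reproduced with a short check built on path.posix.normalize. This is a sketch: checkRemotePath and its reason codes are hypothetical, and it assumes a path equal to the prefix itself is acceptable. Comparing against the prefix plus a trailing slash is what rejects /srv/agentish.

```typescript
import { posix } from "node:path";

type PathCheck =
  | { ok: true; normalized: string }
  | { ok: false; reason: "not_absolute" | "outside_prefix" };

function checkRemotePath(input: string, remotePathPrefix: string): PathCheck {
  if (!input.startsWith("/")) return { ok: false, reason: "not_absolute" };
  const normalized = posix.normalize(input); // collapses ".." and "." segments
  // Strip trailing slashes so "/" becomes "" and "/srv/agent/" becomes "/srv/agent".
  const base = posix.normalize(remotePathPrefix).replace(/\/+$/, "");
  // Match on a segment boundary, never on a raw string prefix.
  if (normalized === base || normalized.startsWith(base + "/")) {
    return { ok: true, normalized };
  }
  return { ok: false, reason: "outside_prefix" };
}

for (const candidate of ["/srv/agent/file.txt", "/srv/agent/../etc/passwd", "/srv/agentish/file", "file.txt"]) {
  console.log(candidate, checkRemotePath(candidate, "/srv/agent"));
}
```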

Command Filtering

SshExec.command runs through a two-stage filter.

Stage 1: built-in deny-list

Hard-coded patterns in src/ssh/deny-list.ts. Examples (not exhaustive):

rm -rf / and variants
fork bombs (:(){:|:&};:)
mkfs.*, dd if=/dev/zero ...
shutdown / reboot / poweroff
:>/dev/sda style block-device writes

If matched, the call is rejected with "command rejected by built-in deny-list (matched pattern: ...)" and audited as outcome=denied.

The built-in list is not a comprehensive sandbox — it's a tripwire against the most catastrophic typos and worst-case prompt injection payloads. Production deployments should also configure connection-level patterns.

Stage 2: per-connection regex (optional)

Each connection can carry:

deny_patterns: newline-separated regex list. Match → reject.
allow_patterns: newline-separated regex list. If set, every command must match at least one allow pattern (after passing both deny stages).

Both are validated at save-time by validateCustomPatterns:

Each pattern must compile
Each must pass the safe-regex ReDoS check
Aggregate length capped (no megabyte-blobs of regex)

Example:

# deny_patterns
sudo
^\s*rm\s+
nc\s+-l

# allow_patterns
^(ls|cat|grep|tail|head|systemctl|journalctl)\s
^/srv/agent/scripts/

ReDoS-safe regex is enforced because user-supplied patterns run synchronously on the command string before each call.
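The two-stage filter can be sketched as below. The built-in patterns here are illustrative stand-ins, not the contents of src/ssh/deny-list.ts, and filterCommand with its stage labels is a hypothetical name. The sketch also omits the save-time ReDoS check, which the real validateCustomPatterns performs.

```typescript
// Illustrative stand-ins for the built-in deny-list.
const BUILT_IN_DENY: RegExp[] = [
  /\brm\s+-rf\s+\/(\s|$)/,
  /\b(shutdown|reboot|poweroff)\b/,
];

type FilterResult =
  | { allowed: true }
  | { allowed: false; stage: "built_in_deny" | "connection_deny" | "allow_miss"; pattern: string | null };

// Newline-separated pattern block -> compiled regexes (blank lines skipped).
function parsePatterns(block: string): RegExp[] {
  return block
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => new RegExp(line));
}

function filterCommand(command: string, denyPatterns: string, allowPatterns: string): FilterResult {
  for (const re of BUILT_IN_DENY) {
    if (re.test(command)) return { allowed: false, stage: "built_in_deny", pattern: re.source };
  }
  for (const re of parsePatterns(denyPatterns)) {
    if (re.test(command)) return { allowed: false, stage: "connection_deny", pattern: re.source };
  }
  // An empty allow list means "no allow restriction".
  const allow = parsePatterns(allowPatterns);
  if (allow.length > 0 && !allow.some((re) => re.test(command))) {
    return { allowed: false, stage: "allow_miss", pattern: null };
  }
  return { allowed: true };
}

const deny = "sudo\n^\\s*rm\\s+\nnc\\s+-l";
const allow = "^(ls|cat|grep|tail|head|systemctl|journalctl)\\s\n^/srv/agent/scripts/";
console.log(filterCommand("ls -la /srv/agent", deny, allow));
console.log(filterCommand("sudo ls", deny, allow));
```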

SSRF + Algorithms

SSRF (host resolution)

Every connection target goes through ssrfStrict(host, allowPrivate):

DNS resolve host → list of A/AAAA records
For each address, check against the IP-policy:
- Reject 0.0.0.0, :: (unspecified)
- Reject 127.0.0.0/8, ::1 (loopback)
- Reject 10/8, 172.16/12, 192.168/16, fc00::/7 (private)
- Reject 169.254/16 (link-local — including AWS metadata)
DNS pinning: the resolved address is captured before connect; ssh2 connects to the pinned IP, not to the hostname. This prevents DNS rebinding (round 1: public IP passes check; round 2: returns loopback during connect).

allowPrivate short-circuits step 2. Two opt-in flags compose:

Global: ssh.allow_private_addresses: true in config.yaml
Per-connection: allow_private_addresses=1 on the row (admin sets via /api/ssh/admin/globals or /api/ssh/admin/connections/:id)

Either being true allows private/loopback. Both default false.
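The per-address policy check can be sketched with Node's net.BlockList. This is not the real ssrfStrict: isForbiddenAddress is a hypothetical helper that covers only the IP-policy step, and it would be applied to every resolved A/AAAA record before the pinned address is handed to ssh2.

```typescript
import { BlockList, isIP } from "node:net";

// Ranges taken from the policy list above.
const blocked = new BlockList();
blocked.addAddress("0.0.0.0");
blocked.addAddress("::", "ipv6");
blocked.addSubnet("127.0.0.0", 8); // loopback
blocked.addAddress("::1", "ipv6");
blocked.addSubnet("10.0.0.0", 8); // private
blocked.addSubnet("172.16.0.0", 12);
blocked.addSubnet("192.168.0.0", 16);
blocked.addSubnet("fc00::", 7, "ipv6");
blocked.addSubnet("169.254.0.0", 16); // link-local, including cloud metadata

function isForbiddenAddress(address: string, allowPrivate: boolean): boolean {
  const family = isIP(address); // 4, 6, or 0 when the input is not an IP literal
  if (family === 0) return true; // fail closed on anything unparseable
  if (allowPrivate) return false; // opt-in skips the policy check
  return blocked.check(address, family === 6 ? "ipv6" : "ipv4");
}

// Either flag being true allows private/loopback targets.
const globalAllow = false;
const perConnectionAllow = true;
console.log(isForbiddenAddress("192.168.1.10", globalAllow || perConnectionAllow));
console.log(isForbiddenAddress("192.168.1.10", false));
```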

Algorithm allowlist

Hard-coded in src/ssh/session.ts:

Category	Allowed
Key exchange	`curve25519-sha256`, `curve25519-sha256@libssh.org`, `ecdh-sha2-nistp256/384/521`, `diffie-hellman-group14/16/18-sha256/512`
Server host key	`ssh-ed25519`, `rsa-sha2-256`, `rsa-sha2-512`, `ecdsa-sha2-nistp256/384/521`
Cipher	`aes256-gcm@openssh.com`, `aes128-gcm@openssh.com`, `aes256-ctr`, `aes192-ctr`, `aes128-ctr`
HMAC	`hmac-sha2-512-etm@openssh.com`, `hmac-sha2-256-etm@openssh.com`, `hmac-sha2-512`, `hmac-sha2-256`

Notably banned: ssh-rsa (SHA1), ssh-dss, all arcfour*, hmac-md5*, hmac-sha1*. Mismatch returns host_key_alg_not_allowed or auth_failed depending on which stage caught it.

Audit Log

Single table: ssh_audit_log. Every SSH operation writes here.

Lifecycle

begin (outcome=pending) [commits before remote call]
   ↓
remote call (ssh2 connect / exec / sftp)
   ↓
complete (outcome=success|failed|denied|aborted) [updates same row]

If the orchestrator crashes between begin and complete, the row stays pending. On next boot, the recovery sweep (src/ssh/recovery.ts) updates pending rows older than 10 minutes to aborted with detail.recovered_at set.
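The recovery sweep amounts to a filter over pending rows. The sketch below models it on an in-memory array; the AuditRow shape, the startedAt field, and sweepPending are assumptions made for illustration, and the real code in src/ssh/recovery.ts operates on the database.

```typescript
interface AuditRow {
  id: string;
  outcome: "pending" | "success" | "failed" | "denied" | "aborted";
  startedAt: string; // ISO8601; hypothetical column name
  detail: Record<string, unknown>;
}

const STALE_AFTER_MS = 10 * 60 * 1000; // "older than 10 minutes"

// Marks stale pending rows as aborted and returns how many were recovered.
function sweepPending(rows: AuditRow[], now: Date): number {
  let recovered = 0;
  for (const row of rows) {
    if (row.outcome !== "pending") continue;
    const ageMs = now.getTime() - new Date(row.startedAt).getTime();
    if (ageMs <= STALE_AFTER_MS) continue; // may still be in flight
    row.outcome = "aborted";
    row.detail = { ...row.detail, recovered_at: now.toISOString() };
    recovered += 1;
  }
  return recovered;
}

const auditRows: AuditRow[] = [
  { id: "a", outcome: "pending", startedAt: "2026-01-01T00:00:00Z", detail: {} },
  { id: "b", outcome: "pending", startedAt: "2026-01-01T00:55:00Z", detail: {} },
  { id: "c", outcome: "success", startedAt: "2026-01-01T00:00:00Z", detail: {} },
];
const recoveredCount = sweepPending(auditRows, new Date("2026-01-01T01:00:00Z"));
console.log(recoveredCount, auditRows.map((r) => r.outcome));
```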

Actions

Action	Triggered by
`ssh.exec`	`SshExec`
`ssh.upload`	`SshUpload`
`ssh.download`	`SshDownload`
`ssh.connection.upsert`	User/admin connection create/edit
`ssh.connection.delete`	User/admin connection delete
`ssh.connection.host_key.first_observe`	TOFU first observation
`ssh.connection.host_key.mismatch`	TOFU mismatch
`ssh.connection.host_key.tofu_record`	Internal helper write
`ssh.connection.host_key.verify`	User `/verify-host-key`
`ssh.connection.host_key.replace`	User `/replace-host-key`
`ssh.connection.disable`	Admin disable
`ssh.connection.enable`	Admin enable
`ssh.abuse.unlock_manual`	Admin force-unlock
`ssh.grant.create`	Admin grant create
`ssh.grant.delete`	Admin grant delete
`ssh.master_key.rotate.start`	Admin rotation start

Detail column

JSON blob with action-specific fields:

ssh.exec: {command_hash: "abc...", exit_code: 0, stdout_bytes: 123, stderr_bytes: 0, truncated: false}
ssh.upload: {local_path: "...", remote_path: "...", bytes: 4096}
ssh.download: same shape
ssh.connection.host_key.first_observe: {fingerprint: "SHA256:...", pending_token: "uuid"}
All denied: {error: "no_grant" | "abuse_locked" | "disabled" | ...}

The ssh.exec action does not record the command string — only its SHA-256 hash (16-char hex prefix) to avoid leaking secrets / PII. If you need to investigate a specific exec, correlate the hash with stdin logs from the LLM activity log (workspace logs/activity.log).
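To correlate an audit row with an activity-log entry, the hash can be recomputed from a candidate command string. The sketch below assumes the command is hashed as UTF-8 with no trimming or normalization; if the orchestrator preprocesses the command before hashing, the prefixes will not match. The name commandHash is hypothetical.

```typescript
import { createHash } from "node:crypto";

// First 16 hex characters of the SHA-256 digest of the command string.
function commandHash(command: string): string {
  return createHash("sha256").update(command, "utf8").digest("hex").slice(0, 16);
}

const candidate = "systemctl status nginx";
console.log(commandHash(candidate)); // compare with detail.command_hash in ssh_audit_log
```

A 16-character prefix carries 64 bits, which is enough to match a handful of candidates but is not a uniqueness guarantee.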

Retention

ssh.audit_retention_days (default 90) controls a lazy sweep. Admin can trigger pruning manually from the UI. There is no hard cap on table size — disk-fill is mitigated by the hashing + truncation strategy above, plus admin-driven cleanup.

Abuse Counters & Lock

Defends against credential spraying, mistyped scripts in loops, and brute-force scans.

Three scopes

Scope	Key	When
`user`	`(user_id,)`	Any SSH failure by this user
`host:user`	`(host, username)`	Failure on this (host, username) tuple
`host`	`(host,)`	Failure on this host (global connections only)

The host scope intentionally only counts failures on global connections to prevent cross-user DoS: a user repeatedly failing on their own connection cannot lock out other users from a shared host. For user-owned connections, the host counter is updated for admin-notification only — no lock applies.

Algorithm

on failure:
  for each scope:
    increment counter
    if count(within abuseWindowMinutes) >= abuseFailureThreshold:
      lock until now + abuseLockMinutes

on success (user scope only):
  reset user counter
  other scopes age out naturally with the window

Counters are stored in ssh_abuse_counters, with separate columns per scope. All updates are transactional (no UPSERT race).
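The pseudocode above can be turned into a small in-memory model. This is a sketch, not the ssh_abuse_counters implementation: the AbuseCounters class, the string scope keys, and the decision to leave an existing lock in place on success are assumptions. The caller decides which scope keys to pass, so the "no host lock for user-owned connections" rule would be applied there.

```typescript
interface AbuseConfig {
  abuseWindowMinutes: number;
  abuseFailureThreshold: number;
  abuseLockMinutes: number;
}

interface Counter {
  failures: number[]; // failure timestamps, epoch ms
  lockedUntil: number | null;
}

class AbuseCounters {
  private readonly cfg: AbuseConfig;
  private readonly counters = new Map<string, Counter>();

  constructor(cfg: AbuseConfig) {
    this.cfg = cfg;
  }

  private counterFor(key: string): Counter {
    let counter = this.counters.get(key);
    if (counter === undefined) {
      counter = { failures: [], lockedUntil: null };
      this.counters.set(key, counter);
    }
    return counter;
  }

  // Record one failure against every scope key, locking any that hit the threshold.
  recordFailure(scopeKeys: string[], nowMs: number): void {
    const windowStart = nowMs - this.cfg.abuseWindowMinutes * 60_000;
    for (const key of scopeKeys) {
      const counter = this.counterFor(key);
      counter.failures = counter.failures.filter((t) => t > windowStart);
      counter.failures.push(nowMs);
      if (counter.failures.length >= this.cfg.abuseFailureThreshold) {
        counter.lockedUntil = nowMs + this.cfg.abuseLockMinutes * 60_000;
      }
    }
  }

  // Success resets only the user scope; other scopes age out with the window.
  recordSuccess(userKey: string): void {
    this.counterFor(userKey).failures = [];
  }

  isLocked(key: string, nowMs: number): boolean {
    const counter = this.counters.get(key);
    return counter !== undefined && counter.lockedUntil !== null && counter.lockedUntil > nowMs;
  }
}

const counters = new AbuseCounters({ abuseWindowMinutes: 10, abuseFailureThreshold: 5, abuseLockMinutes: 30 });
for (let i = 0; i < 5; i++) {
  counters.recordFailure(["user:alice", "host:user:bastion:deploy"], i * 60_000);
}
console.log(counters.isLocked("user:alice", 5 * 60_000));
```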

Force-unlock

Admin can force-unlock from SshGlobalConnectionsForm or per-connection admin page:

POST /api/ssh/admin/connections/:id/force-unlock
  {reason: "Confirmed credentials rotated; user retried with old key"}

Rate-limited to 10/hour total across all admins (admin-rate-limit.ts, token bucket). Audited as ssh.abuse.unlock_manual.
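A token bucket for the 10/hour limit can be sketched as follows. The TokenBucket class and its continuous-refill behavior are assumptions for illustration; the actual admin-rate-limit.ts may refill in discrete steps or persist its state.

```typescript
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerMs: number;
  private tokens: number;
  private lastRefillMs: number;

  // capacity tokens are replenished evenly over periodMs.
  constructor(capacity: number, periodMs: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerMs = capacity / periodMs;
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  // Returns true and consumes a token if one is available.
  tryTake(nowMs: number): boolean {
    const elapsed = Math.max(0, nowMs - this.lastRefillMs);
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerMs);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// One shared bucket models "10/hour total across all admins".
const forceUnlockBucket = new TokenBucket(10, 60 * 60 * 1000, 0);
const burst: boolean[] = [];
for (let i = 0; i < 11; i++) burst.push(forceUnlockBucket.tryTake(0));
console.log(burst);
```

With continuous refill, a new token becomes available roughly every six minutes once the initial burst is spent.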

Master Key Rotation

Replaces MCP_ENCRYPTION_KEY and rewraps every row's DEK under the new key. This is the way to rotate the master key — do not edit the env var manually.

Flow

Admin starts via POST /api/ssh/admin/rotate-master-key:

{"new_key_hex": "<64-hex>", "reason": "Annual rotation"}

Maintenance mode engages: sshMaintenance.enter() makes all SSH write endpoints return 503 (read endpoints stay alive). The LLM sees "SSH subsystem is in maintenance" errors for tool calls.
Per-row rewrap:
- For each ssh_connections row: decrypt DEK under old key, re-encrypt under new key, bump key_version, commit (one tx per row)
- For each system_deks row: same
New key validated by decrypting a test value
Maintenance exits automatically
Caller polls GET /api/ssh/admin/rotate-master-key/:jobId for status (running / succeeded / failed)

Failure modes

Crash mid-rotation: rows have mixed key_version. Next boot detects this and stays in maintenance until a follow-up rotation completes. The admin must re-issue the rotation with the new key.
Wrong old key: the first row decryption fails → job aborts before any change, maintenance exits, audit records ssh.master_key.rotate.start with outcome=failed.
Disk write fails mid-row: that single row is rolled back; rotation continues. Operator must re-run.

The rotation job runs in-process (not as a separate worker). For large fleets (>1000 rows) expect 1-2s per row of decrypt+encrypt+write.

`MCP_ENCRYPTION_KEY` env var

After successful rotation, the env var must be updated to the new key before the next restart. The orchestrator writes the new key to the audit log (encrypted under the OLD key) and returns it once in the HTTP response — there's no second chance. Update your secrets store immediately.

Operator Runbook

A. Add a global (admin-managed) connection

# Via UI: Settings → SSH → Global Connections → Add
# Or via API (requires admin session cookie):

curl -X POST http://localhost:3000/api/ssh/admin/globals \
  -H 'Content-Type: application/json' \
  -d @- <<'JSON'
{
  "label": "prod-east-bastion",
  "host": "bastion.prod-east.example.com",
  "port": 22,
  "username": "deploy",
  "private_key_pem": "-----BEGIN OPENSSH PRIVATE KEY-----\n...\n-----END...",
  "passphrase": null,
  "remote_path_prefix": "/srv/deploy",
  "allow_private_addresses": false,
  "deny_patterns": "sudo\n^\\s*rm\\s+",
  "allow_patterns": "",
  "reason": "Production deploy bastion — owned by SRE"
}
JSON

Then verify the host key (next section) and grant access.

B. Grant org access to a global connection

curl -X POST http://localhost:3000/api/ssh/admin/grants \
  -H 'Content-Type: application/json' \
  -d '{
    "connection_id": "<uuid>",
    "subject_type": "org",
    "subject_id": "engineering",
    "piece_name": "prod-deploy",
    "applies_to_all_pieces": false,
    "reason": "Engineering org runs prod-deploy piece"
  }'

C. Verify a TOFU first-observe

From the user's side or admin side, click Test in the SshConnections panel
The response is host_key_first_observe with a SHA-256 fingerprint and pending token
Verify externally that the fingerprint matches the real server:
```
ssh-keyscan -t ed25519 bastion.prod-east.example.com 2>/dev/null \
  | ssh-keygen -lf -
```
Compare the resulting SHA256:... with what the UI shows
In the dialog, type the fingerprint to confirm and click Verify
Audit row ssh.connection.host_key.verify recorded; subsequent calls succeed

D. Force-unlock a stuck connection

Symptom: user reports "SshExec returns access denied (abuse_locked)"

Settings → SSH → Global Connections → click the row → "Locks" section
Inspect the counter state (which scope is locked, until when)
If an early unlock is genuinely needed (e.g. the user fixed the bad credentials), click Force unlock and enter a reason
If suspicious (unexplained 5+ failures), investigate audit log first

E. Rotate the master key

NEW_KEY=$(openssl rand -hex 32)

# Start rotation
JOB=$(curl -s -X POST http://localhost:3000/api/ssh/admin/rotate-master-key \
  -H 'Content-Type: application/json' \
  -d "{\"new_key_hex\":\"$NEW_KEY\",\"reason\":\"Q2 annual rotation\"}" \
  | jq -r .job_id)

# Poll until done
while true; do
  STATUS=$(curl -s http://localhost:3000/api/ssh/admin/rotate-master-key/$JOB \
    | jq -r .status)
  echo "$STATUS"
  [ "$STATUS" = "succeeded" ] && break
  [ "$STATUS" = "failed" ] && { echo FAILED; exit 1; }
  sleep 2
done

# Update env var BEFORE next restart
echo "MCP_ENCRYPTION_KEY=$NEW_KEY" >> /etc/orchestrator/secrets.env

F. Prune old audit logs

Settings → SSH → Audit Log → "Prune older than N days" (defaults to the config retention value). Or via API:

curl -X DELETE 'http://localhost:3000/api/ssh/admin/audit?older_than_days=90'

Troubleshooting

Symptom → cause table

Error	Common cause	Fix
`SSH is disabled` (503)	`ssh.enabled: false`	Set true, restart not required
`SSH subsystem is in maintenance`	Master key rotation in progress	Wait for job to complete, or check rotation log
`access denied (no_grant)`	User lacks grant for connection	Admin creates a grant, or user uses an owned connection
`access denied (disabled)`	Admin disabled the connection	Admin re-enables, or use different connection
`access denied (abuse_locked)`	Counter triggered	Wait for lock window, or admin force-unlocks
`piece "X" does not list connection Y`	`allowed_ssh_connections` missing UUID	Add UUID to the movement's `allowed_ssh_connections`
`host_key_first_observe`	First time exercising connection	Verify fingerprint in UI
`host_key_not_verified`	Key recorded but never verified	Click Verify in UI
`host_key_mismatch`	Server key changed	Investigate (legitimate rotation? MITM?), then Replace via UI
`host_key_alg_not_allowed`	Server using SHA1-RSA etc.	Upgrade server to ed25519 / rsa-sha2-256
`auth_failed`	Wrong key, wrong username	Re-check connection settings
`connect_timeout`	Network unreachable, firewall	Check from host, check SSRF policy
`exec_timeout`	Long-running command	Increase `timeout_ms`, or run in background and `Download` results
`output_too_large`	stdout > 32 KiB	Filter the command, or write to file and `Download`
`forbidden_address`	Target is private IP, no opt-in	Set `allow_private_addresses` per-connection or globally
`system_dek_verify_failed` (log)	`MCP_ENCRYPTION_KEY` changed without rotation flow	Stop server, restore old key OR re-rotate via flow

Where to look

Question	Source
What did the LLM try to do?	`logs/activity.log` in the job's workspace
What did SSH do?	`ssh_audit_log` (Admin UI or SQL)
Was it actually denied at the SSH layer?	Audit row `outcome` = `denied`
What was the exit code?	Audit `detail.exit_code` (for `ssh.exec`)
Did it crash mid-call?	Audit `outcome` = `aborted` (recovery sweep)
Why was the host key flagged?	Audit `ssh.connection.host_key.*` rows
Who has access to a connection?	`ssh_connection_grants` filtered by `connection_id`

Security Model Summary

Detailed threat model + risk register: see plan doc §"Security Design Deep-Dive (rev 3)" and §"Risk Register (rev 3)".

Key points operators must understand:

The orchestrator is a credential proxy. Anyone with admin rights can read connection plaintext (via the rotation flow, which decrypts server-side). Treat admin access as production-credential-equivalent.
TOFU is the floor, not the ceiling. First-observe is unauthenticated. For high-stakes targets, pre-populate host_key_b64 from a trusted bootstrap (e.g. baked into the connection at create time via the host_key_b64 field) rather than relying on the orchestrator's first observation.
The deny-list is not a sandbox. Built-in patterns catch obvious misuse. Real isolation requires connection-level configuration (restricted shell account, remote_path_prefix, narrow allow_patterns) and target-side controls.
Audit log is local-only. No HMAC chain (acknowledged limitation R-audit-tamper). For tamper-evidence, ship ssh_audit_log rows to an external SIEM via SQLite hooks or periodic export.
ssh2 internal key retention (R-ssh2-leak): the PEM lives in JS heap for the connection lifetime. Process compromise reveals plaintext credentials. Mitigations: short-lived processes, separate worker per high-stakes connection.
Master key compromise = total compromise. Key rotation does not protect already-leaked encrypted material: if an attacker has both a copy of the DB and the old master key, all stored creds are theirs. Rotate keys immediately on suspected compromise AND rotate every stored credential on the target side.

HTTP API Reference

User router: mounted at /api/ssh — requires requireAuth.

Method	Path	Purpose
GET	`/connections`	List own connections + grant-visible globals
POST	`/connections`	Create user-owned connection
GET	`/connections/:id`	Read
PATCH	`/connections/:id`	Edit (owner only)
DELETE	`/connections/:id`	Delete (owner only)
POST	`/connections/:id/test`	Trigger TOFU observation / verify path
POST	`/connections/:id/verify-host-key`	Atomic verify (token + fingerprint)
POST	`/connections/:id/replace-host-key`	Atomic replace (token + fingerprint + reason)
GET	`/connections/:id/audit`	Owner's view of audit rows for this connection
GET	`/grants/visible-to-me`	List grants visible (subject=user or matching org)

Admin router: mounted at /api/ssh/admin — requires requireAdmin.

Method	Path	Purpose
GET	`/connections`	All connections (cross-tenant)
GET	`/connections/:id`	Admin read
PATCH	`/connections/:id/disable`	Soft-disable (audited; reason required)
PATCH	`/connections/:id/enable`	Re-enable
DELETE	`/connections/:id`	Hard-delete
POST	`/connections/:id/force-unlock`	Clear abuse counter (rate-limited; reason required)
POST	`/globals`	Create global connection
PATCH	`/globals/:id`	Edit global
DELETE	`/globals/:id`	Delete global
GET	`/grants`	List all grants
POST	`/grants`	Create grant
DELETE	`/grants/:id`	Delete grant
POST	`/rotate-master-key`	Start master key rotation
GET	`/rotate-master-key/:jobId`	Poll rotation status
GET	`/audit`	All-tenant audit view (paginated)

All admin write endpoints require:

requireAdmin middleware
maintenance503() guard (rejects writes during rotation)
validateReason() on body.reason (≥ 8 chars)
auditRepo.beginAndComplete() for both success and failure
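
A client can check the reason locally before calling an admin write endpoint, which avoids a round trip that would be rejected anyway. This is a minimal sketch that builds (but does not send) a force-unlock request. The path and the `reason` body field come from the tables above; the base URL, the bearer-token header, and the function name are assumptions for illustration, since this runbook does not describe how admin sessions authenticate.

```python
import json
import urllib.request

MIN_REASON_CHARS = 8  # mirrors the validateReason() rule listed above


def build_force_unlock(base_url, connection_id, reason, token):
    """Build a POST request for the force-unlock endpoint without sending it."""
    # Whether the server trims whitespace is not documented here;
    # trimming locally is the stricter choice.
    reason = reason.strip()
    if len(reason) < MIN_REASON_CHARS:
        raise ValueError(f"reason must be at least {MIN_REASON_CHARS} characters")
    body = json.dumps({"reason": reason}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/ssh/admin/connections/{connection_id}/force-unlock",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Hypothetical auth scheme; replace with whatever the deployment uses.
            "Authorization": f"Bearer {token}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. During a master key rotation the maintenance guard is expected to reject the write, so callers should be prepared for a 503.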

SSH Console (Interactive)

Enabled with ssh.console.enabled: true.

1 task = 1 PTY session. Shell state is preserved across jobs
An SSH tab appears in the task detail view (when the piece allows SshConsole*)
WebSocket: /api/local/tasks/:taskId/console/ws
REST status: GET /api/local/tasks/:taskId/console/status
Audit: ssh.console.{open,send,snapshot,resize,input_rejected,close}
Auto-close: idle 30 min / duration 4 h / host disconnect / maintenance / admin kill
At most 3 sessions per connection (oldest evicted first)

Admin: GET /api/admin/ssh/console-sessions lists sessions; POST /api/admin/ssh/console-sessions/:taskId/kill kills one (admin role only).
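
The session limits above can be sketched as a small policy model. This is an illustration only, not the orchestrator's code: the 30 minute idle limit, 4 hour duration limit, and cap of 3 sessions per connection come from this section, while the `Session` type, the function names, and the choice of open time as the measure of "oldest" are assumptions.

```python
from dataclasses import dataclass

IDLE_LIMIT_S = 30 * 60          # idle 30 min
MAX_DURATION_S = 4 * 60 * 60    # duration 4 h
MAX_PER_CONNECTION = 3          # sessions per connection


@dataclass
class Session:
    task_id: str
    connection_id: str
    opened_at: float       # seconds, any monotonic clock
    last_activity: float


def close_reason(session, now):
    """Return 'duration', 'idle', or None if the session may stay open."""
    if now - session.opened_at >= MAX_DURATION_S:
        return "duration"
    if now - session.last_activity >= IDLE_LIMIT_S:
        return "idle"
    return None


def open_session(sessions, task_id, connection_id, now):
    """Register a new session and evict the oldest sessions of the same
    connection beyond the cap. Returns the evicted task ids."""
    sessions.append(Session(task_id, connection_id, now, now))
    same = sorted(
        (s for s in sessions if s.connection_id == connection_id),
        key=lambda s: s.opened_at,  # assumption: "oldest" means opened first
    )
    evicted = same[:-MAX_PER_CONNECTION]  # empty while at or under the cap
    for s in evicted:
        sessions.remove(s)
    return [s.task_id for s in evicted]
```

Opening a fourth session on the same connection returns the task id of the first one, which is the point at which the real subsystem would close that PTY and write an `ssh.console.close` audit row.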
