maestro/docs/ssh.md
2026-06-03 05:08:00 +00:00

SSH Subsystem (Operator Runbook)

The orchestrator can run shell commands on remote servers (SshExec) and move files between the workspace and remote hosts (SshUpload/SshDownload) through a dedicated, audited SSH subsystem. Like the MCP integration, the feature is off by default and requires a key + config flip to enable.

This document is the operator runbook for setting up, granting access, verifying host keys, rotating the master key, and troubleshooting. For the LLM-facing tool semantics see docs/tools/ssh-tools.md. For the internal design (threat model, risk register, schema, 12-step orchestration flow) see the plan doc.

At a glance

| Aspect | Behavior |
|---|---|
| Default | ssh.enabled: false (tools hidden, panels hidden, API returns 503) |
| Tools exposed when enabled | SshExec, SshUpload, SshDownload |
| Authentication | Public-key only; passwords are not supported |
| Host key trust | TOFU (Trust-On-First-Use) with explicit verify; mismatch fails closed |
| Connection ownership | User-owned (private) or Global (admin-managed, shared via grants) |
| Encryption at rest | AES-256-GCM (per-row DEK, master key = MCP_ENCRYPTION_KEY) |
| Audit | Dedicated ssh_audit_log table, pending → success/failed/denied/aborted lifecycle |
| Abuse defense | 3-scope counters (user / host:user / host) with auto-lock |
| Network policy | SSRF strict by default; per-connection opt-in for private IPs |
| Algorithm policy | Strict allowlist (no SHA1-RSA, no weak DH/HMAC) |

Prerequisites

1. MCP_ENCRYPTION_KEY

The SSH subsystem shares the same master key as MCP — there is only one key per orchestrator. All private keys, passphrases, and global-connection DEKs are encrypted with AES-256-GCM under a per-row DEK, and each DEK is wrapped by this master key.

Generate it once (32 bytes = 64 hex chars):

openssl rand -hex 32

Export it before starting the server:

export MCP_ENCRYPTION_KEY=<the 64-hex output>
scripts/server.sh start

If MCP_ENCRYPTION_KEY is not set when ssh.enabled: true, the SSH subsystem boots fail-soft: a warning is logged, all SSH endpoints return 503, the tools are hidden from the LLM, and the UI panels show a configuration error banner. Other features (except MCP) continue normally.

Changing the key invalidates existing encrypted material. The built-in master key rotation flow rewraps every row in maintenance mode; do not swap the env var manually outside that flow, because half-rotated state breaks every connection.
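
A pre-flight check can catch a malformed key before the server starts. The following is a minimal sketch, assuming the only format requirement is the one stated above (32 bytes as 64 hex characters); `isValidMasterKeyHex` is a hypothetical helper, not a function from the orchestrator.

```typescript
// Hypothetical pre-flight helper; not part of the orchestrator's code.
// Accepts exactly 64 hex characters (32 bytes), as produced by `openssl rand -hex 32`.
function isValidMasterKeyHex(value: string | undefined): boolean {
  return typeof value === "string" && /^[0-9a-fA-F]{64}$/.test(value);
}
```

A deployment script could call this on the value of MCP_ENCRYPTION_KEY and refuse to start when it returns false, instead of discovering the fail-soft 503 state later.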

2. ssh.enabled: true

Flip the flag in config.yaml:

ssh:
  enabled: true

This is the master switch. With it false:

  • HTTP endpoints (/api/ssh/*, /api/ssh/admin/*) return 503
  • Tool defs are not exposed to the LLM (the dispatcher returns null)
  • UI panels render an "SSH is disabled" empty state
  • Database tables remain present (no destructive change)

Restart is not required — ConfigManager reload picks up the change and rebuilds the SSH router.

3. system_deks bootstrap

The first time the orchestrator boots with ssh.enabled: true AND a valid MCP_ENCRYPTION_KEY, it provisions a single row in system_deks (via INSERT OR IGNORE inside a transaction, CHECK(id=1)). This DEK encrypts global connections (those without an owner).

On every subsequent boot, verifySystemDek decrypts the stored DEK to prove the master key still works. If it fails (key rotated outside the rotation flow, or env var differs from when the DEK was wrapped), SSH fails closed for the session and a system_dek_verify_failed error is logged. User-owned connections may still partially work (their DEKs are wrapped per user), but global connections will all error.

4. Optional: allow_private_addresses

By default, SSH connections are routed through the SSRF strict-check, which blocks loopback (127.0.0.0/8, ::1) and private (10/8, 172.16/12, 192.168/16, fc00::/7, 169.254/16) addresses. For LAN targets you must opt in.

There are two scopes:

ssh:
  enabled: true
  allow_private_addresses: true   # global default

-- per-connection opt-in (admin-only flag, audited)
UPDATE ssh_connections SET allow_private_addresses=1 WHERE id=?;

The per-connection flag is preferred — narrow the blast radius. The global flag exists for trusted dev networks (homelab, isolated VPC). The per-connection flag can only be set on global (admin-managed) connections; for user-owned connections, the global flag applies.

Quickstart

# 1. Set the key
openssl rand -hex 32 > ~/.mcp_encryption_key
export MCP_ENCRYPTION_KEY=$(cat ~/.mcp_encryption_key)

# 2. Enable SSH + allow LAN
cat >> config.yaml <<'YAML'
ssh:
  enabled: true
  allow_private_addresses: true   # only if you're targeting LAN
YAML

# 3. Restart
scripts/server.sh restart

Then in the UI:

  1. Settings → User Folder → SSH Connections → Add
  2. Fill label, host, port (default 22), username, paste private key (OpenSSH PEM)
  3. Optionally set remote_path_prefix (default /) — restricts upload/download paths
  4. Click Test → first call returns host_key_first_observe with a fingerprint
  5. Verify in the dialog (compare fingerprint with what you expect from ssh-keyscan <host>)
  6. Add the connection's UUID to a piece's allowed_ssh_connections:
# pieces/example.yaml
name: ssh-example
movements:
  - name: deploy
    allowed_tools: [SshExec, SshUpload]
    allowed_ssh_connections: ["abcd1234-..."]
    rules:
      - condition: done
        next: COMPLETE
    instruction: |
      Use SshExec to ...
  7. Test the piece via the normal task UI.

config.yaml Reference

Full SSH section with defaults:

ssh:
  # master switch
  enabled: false

  # SSRF policy — when true, allow private/loopback addresses (global)
  allow_private_addresses: false

  # wall-clock timeout for connect + handshake + exec/transfer (seconds)
  call_timeout_seconds: 30

  # stdout/stderr byte cap for SshExec (bytes)
  max_output_bytes: 32768          # 32 KiB

  # SFTP transfer size caps (MB)
  max_upload_size_mb: 100
  max_download_size_mb: 100

  # ssh_audit_log retention (days). Admin can prune via UI.
  audit_retention_days: 90

  # When true (default), admins can use any connection without an explicit
  # grant. Audited regardless. Set false for stricter least-privilege.
  admin_bypasses_grants: true

  # Abuse counters
  abuse_window_minutes: 10        # rolling window for failure counting
  abuse_failure_threshold: 5      # failures within window → lock
  abuse_lock_minutes: 30          # lock duration on threshold breach

All keys translate to camelCase in SshRuntimeConfig (src/ssh/config.ts). The transformKeys helper in src/config.ts handles the conversion.
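
The snake_case to camelCase mapping can be illustrated with a short sketch. This is not the actual transformKeys implementation from src/config.ts; `snakeToCamel` and `camelizeKeys` are hypothetical names used only to show the expected shape of the result.

```typescript
// Illustrative only: converts one snake_case key to camelCase.
function snakeToCamel(key: string): string {
  return key.replace(/_([a-z0-9])/g, (_match: string, ch: string) => ch.toUpperCase());
}

// Recursively rewrites the keys of plain objects; arrays and scalars pass through.
function camelizeKeys(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(camelizeKeys);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    const source = value as Record<string, unknown>;
    for (const k of Object.keys(source)) {
      out[snakeToCamel(k)] = camelizeKeys(source[k]);
    }
    return out;
  }
  return value;
}
```

Under this sketch, call_timeout_seconds becomes callTimeoutSeconds and abuse_lock_minutes becomes abuseLockMinutes, which matches the names used in the abuse-counter pseudocode later in this document.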

Connection Model

Owner

Each row in ssh_connections has an owner_id:

| Owner | Visibility | Who creates |
|---|---|---|
| User-owned (owner_id = userId) | Only the owner; admin can also list but not edit | Any authenticated user (POST /api/ssh/connections) |
| Global (owner_id IS NULL) | All users see it in the picker (subject to grants) | Admin only (POST /api/ssh/admin/globals) |

Global connections solve the "team-shared infra account" use case — a single set of credentials that multiple users invoke under their own identity, audited per user, gated by grants.

Encryption

For each connection:

  1. Generate a fresh 32-byte DEK
  2. Encrypt private_key_pem (and optionally passphrase) with the DEK
  3. Wrap the DEK with the master key (MCP_ENCRYPTION_KEY)
  4. Store: private_key_enc, private_key_dek_enc, key_version

key_version allows progressive rewrap during master key rotation (each row tracks which generation of master key its DEK is wrapped under). Global connections use the single system_deks row (id=1) rather than a per-row DEK.
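
The four steps above are a standard envelope-encryption pattern. The sketch below shows it with Node's built-in crypto module, assuming a 12-byte random IV and a separately stored auth tag; the `Sealed` shape and the function names are hypothetical, and the real column layout (private_key_enc, private_key_dek_enc) may pack these fields differently.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical container for one AES-256-GCM ciphertext.
interface Sealed { iv: Buffer; tag: Buffer; data: Buffer; }

function seal(key: Buffer, plaintext: Buffer): Sealed {
  const iv = randomBytes(12); // fresh nonce per encryption
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), data };
}

function unseal(key: Buffer, sealed: Sealed): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, sealed.iv);
  decipher.setAuthTag(sealed.tag);
  // final() throws if the key is wrong or the ciphertext was tampered with
  return Buffer.concat([decipher.update(sealed.data), decipher.final()]);
}

// Steps 1-3: fresh DEK, encrypt the PEM under it, wrap the DEK with the master key.
function encryptConnectionKey(masterKey: Buffer, privateKeyPem: string) {
  const dek = randomBytes(32);
  const privateKeyEnc = seal(dek, Buffer.from(privateKeyPem, "utf8"));
  const privateKeyDekEnc = seal(masterKey, dek);
  dek.fill(0); // zero our copy of the DEK
  return { privateKeyEnc, privateKeyDekEnc };
}

function decryptConnectionKey(
  masterKey: Buffer,
  row: { privateKeyEnc: Sealed; privateKeyDekEnc: Sealed },
): string {
  const dek = unseal(masterKey, row.privateKeyDekEnc);
  try {
    return unseal(dek, row.privateKeyEnc).toString("utf8");
  } finally {
    dek.fill(0);
  }
}
```

Because only the wrapped DEK depends on the master key, a rotation needs to rewrap the small DEK blob per row rather than re-encrypt every private key.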

⚠ ssh2's internal key handling is opaque — once the PEM is loaded into the library, it lives in JS heap memory for the lifetime of the connection. We Buffer.fill(0, 0) our copies in the finally block but cannot reach into ssh2 internals. This is an acknowledged limitation; see the plan doc's "Acknowledged limitations" section.

UI Walkthrough

User: Settings → User Folder → SSH Connections

The SshConnectionsPanel lists the user's connections and any global connections they have a grant for. Each row shows:

  • Label + host:port + username
  • Host key fingerprint + verify state (verified / pending / first_observe / mismatch)
  • Lock state (if abuse counter triggered)
  • Actions: Test, Verify host key, Replace host key (with reason), Edit, Delete

The "Add Connection" form (SshConnectionForm) collects:

  • Label
  • Host, port, username
  • Private key (textarea — PEM format; passphrase optional)
  • Remote path prefix (default /)
  • Custom deny/allow regex patterns (newline-separated, validated at save-time)

SshHostKeyDialog opens on first_observe / mismatch and shows the observed fingerprint side-by-side with the previously-stored one (if any). "Trust this key" requires typing the fingerprint to confirm.

Admin: Settings → SSH

Four sub-panels under SshForm:

| Panel | Component | Purpose |
|---|---|---|
| Global Connections | SshGlobalConnectionsForm | CRUD on global connections (owner_id IS NULL). Includes the allow_remote_unrestricted and per-connection allow_private_addresses flags |
| Grants | SshGrantsForm | List/create/delete grants. Per-piece or applies_to_all_pieces. Subject: user or org. Reason required |
| Audit Log | SshAuditLog | All-tenant audit view. Filter by action / outcome / connection / time range. Pagination |
| Master Key Rotation | SshMasterKeyRotationForm | Start a rotation job (provides new key, enters maintenance, rewraps rows). Polls status |

Admin can also force-unlock abuse counters from the per-connection page (requires reason; rate-limited to 10/hour total).

Host Key TOFU Flow

SSH security depends on knowing the right host key. We use Trust-On-First-Use: the first time a connection is exercised, we record the observed key and require explicit user verification before treating it as trusted.

States

ssh_connections carries four host-key columns:

| Column | Meaning |
|---|---|
| host_key_b64 | The observed public key in OpenSSH base64 form. NULL = never observed. |
| host_key_fingerprint | SHA-256 fingerprint for UI display (SHA256:...). |
| host_key_verified_at | ISO8601 timestamp of the user's explicit "trust this key" action. NULL = pending. |
| host_key_pending_token | UUID issued at first_observe / mismatch; consumed atomically by /verify-host-key. |

A connection is trusted iff host_key_verified_at IS NOT NULL AND the observed key during connect matches host_key_b64.
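
The trust rule and the fingerprint format can be sketched as follows. This assumes host_key_fingerprint follows OpenSSH's convention (unpadded base64 of the SHA-256 digest of the raw key blob, prefixed with SHA256:), which is what ssh-keygen -lf prints; the `HostKeyState` shape and function names are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory view of the host-key columns.
interface HostKeyState {
  hostKeyB64: string | null;
  hostKeyVerifiedAt: string | null;
}

// OpenSSH-style fingerprint: SHA-256 over the decoded key blob, base64 without padding.
function sha256Fingerprint(hostKeyB64: string): string {
  const blob = Buffer.from(hostKeyB64, "base64");
  const digest = createHash("sha256").update(blob).digest("base64");
  return "SHA256:" + digest.replace(/=+$/, "");
}

// Trusted iff the user verified the key AND the key seen on this connect matches it.
function isTrusted(state: HostKeyState, observedB64: string): boolean {
  return state.hostKeyVerifiedAt !== null && state.hostKeyB64 === observedB64;
}
```

Both conditions matter: a verified row with a different observed key is a mismatch, and a matching key that was never verified is still pending.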

Lifecycle

new connection
   │
   ├─ user clicks Test (or LLM calls SshExec)
   │     │
   │     ▼
   │   sshTest() observes the host key
   │     │
   │     ▼
   │   onFirstObserve hook fires
   │     - writes host_key_b64, host_key_fingerprint, host_key_pending_token
   │     - audit row: ssh.connection.host_key.first_observe
   │     - returns SshSessionError('host_key_first_observe')
   │
   │   (UI shows the fingerprint + pending token)
   │
   ├─ user clicks Verify (typing fingerprint to confirm)
   │     │
   │     ▼
   │   POST /api/ssh/connections/:id/verify-host-key
   │     {token, fingerprint}
   │     - atomic compare-and-set: token + fingerprint match → set host_key_verified_at
   │     - audit row: ssh.connection.host_key.verify
   │
   ▼
verified — Exec/Upload/Download now work

On host_key_mismatch (server rebuilt, key rotated, or MITM):

   ├─ Exec/Upload/Download calls sshExec/sshUpload/sshDownload
   │     │
   │     ▼
   │   ssh2 observes key ≠ host_key_b64
   │     │
   │     ▼
   │   onMismatch hook fires
   │     - writes new host_key_pending_token (DOES NOT overwrite host_key_b64 yet)
   │     - audit row: ssh.connection.host_key.mismatch
   │     - returns SshSessionError('host_key_mismatch')
   │
   │   (UI shows OLD vs NEW fingerprint side-by-side)
   │
   ├─ user investigates externally (ssh-keyscan, IT team, etc.)
   │
   ├─ user clicks "Replace key" with reason
   │     │
   │     ▼
   │   POST /api/ssh/connections/:id/replace-host-key
   │     {token, fingerprint, reason}
   │     - atomic compare-and-set
   │     - writes new host_key_b64, host_key_fingerprint, host_key_verified_at
   │     - audit row: ssh.connection.host_key.replace

The pending token mechanism prevents a "verify swap" race: if a second TOFU observation happens after the user was shown a fingerprint but before their verify request arrives, the old token has already been overwritten, so the verify endpoint returns 409 stale_token instead of trusting a key the user never saw.
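
The compare-and-set on the pending token can be sketched in memory as below. The real endpoint performs this atomically in the database; the `PendingRow` shape, the function name, and the fingerprint_mismatch result label are hypothetical illustrations (only stale_token is named in this document).

```typescript
// Hypothetical in-memory stand-in for the relevant ssh_connections columns.
interface PendingRow {
  fingerprint: string | null;
  pendingToken: string | null;
  verifiedAt: string | null;
}

type VerifyResult = "verified" | "stale_token" | "fingerprint_mismatch";

function verifyHostKey(row: PendingRow, token: string, fingerprint: string, now: Date): VerifyResult {
  // A newer observation replaces the token, so an old token no longer matches.
  if (row.pendingToken === null || row.pendingToken !== token) return "stale_token";
  if (row.fingerprint !== fingerprint) return "fingerprint_mismatch";
  row.verifiedAt = now.toISOString();
  row.pendingToken = null; // consume the token so it cannot be replayed
  return "verified";
}
```

Replaying the same request after a successful verify returns stale_token in this sketch, because the token is single-use.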

Banned algorithms

Even before TOFU completes, the host-key algorithm is checked against an allowlist. SHA1-RSA and other weak algorithms are rejected before the key is recorded (host_key_alg_not_allowed). This is hard-coded in src/ssh/session.ts to avoid misconfiguration.

Per-piece allowed_ssh_connections

A piece's movement must explicitly opt in to SSH usage. The piece-runner enforces three invariants:

  1. If a movement's allowed_tools contains any SSH tool name (SshExec/SshUpload/SshDownload), allowed_ssh_connections must be declared on that movement (even if empty)
  2. The field must be an array of strings
  3. Each entry must be * or a lowercase hex+hyphen UUID (≥ 8 chars)

Lint failures abort piece load.
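
The three invariants can be expressed as a small validator. This is a sketch rather than the piece-runner's actual lint: the `Movement` shape, the `lintMovement` name, the error wording, and the reading of "lowercase hex+hyphen UUID (≥ 8 chars)" as the regex below are all assumptions.

```typescript
const SSH_TOOLS = ["SshExec", "SshUpload", "SshDownload"];
// Assumed interpretation of "lowercase hex+hyphen, at least 8 chars".
const UUID_LIKE = /^[0-9a-f-]{8,}$/;

// Hypothetical subset of a movement definition.
interface Movement {
  name: string;
  allowed_tools?: string[];
  allowed_ssh_connections?: unknown;
}

function lintMovement(m: Movement): string[] {
  const tools = m.allowed_tools || [];
  const usesSsh = tools.some((t) => SSH_TOOLS.indexOf(t) !== -1);
  if (!usesSsh) return [];

  const list = m.allowed_ssh_connections;
  // Invariant 1: the field must be declared (an empty array is fine).
  if (list === undefined) return [`${m.name}: allowed_ssh_connections must be declared`];
  // Invariant 2: it must be an array of strings.
  if (!Array.isArray(list) || !list.every((e) => typeof e === "string")) {
    return [`${m.name}: allowed_ssh_connections must be an array of strings`];
  }
  // Invariant 3: each entry is "*" or UUID-like.
  const errors: string[] = [];
  for (const entry of list as string[]) {
    if (entry !== "*" && !UUID_LIKE.test(entry)) {
      errors.push(`${m.name}: invalid connection id "${entry}"`);
    }
  }
  return errors;
}
```

An empty result means the movement passes; any non-empty result would correspond to a lint failure that aborts piece load.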

Forms

# Explicit allowlist (most common)
allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]

# Wildcard (admin-style — use sparingly)
allowed_ssh_connections: ["*"]

# Deny-all (still allows SSH tool in allowed_tools but refuses every UUID)
allowed_ssh_connections: []

The * form skips the per-piece check but does not skip the access grant check. A user without a grant for a given connection still cannot use it even when the piece says *.

Example

name: backup-rotation
description: Daily backup rotation on prod servers
movements:
  - name: list
    allowed_tools: [SshExec]
    allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]
    instruction: |
      List the existing backup files on each server.
    rules:
      - condition: ready to rotate
        next: rotate

  - name: rotate
    allowed_tools: [SshExec, SshUpload]
    allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]
    instruction: |
      Rotate the oldest backup ...
    rules:
      - condition: done
        next: COMPLETE

Access Grants

Grants connect a subject (user or org) to a connection, scoped to a piece (or all pieces, admin-only).

Schema

CREATE TABLE ssh_connection_grants (
  id TEXT PRIMARY KEY,
  connection_id TEXT NOT NULL,
  subject_type TEXT NOT NULL,        -- 'user' | 'org'
  subject_id TEXT NOT NULL,
  piece_name TEXT,                   -- NULL iff applies_to_all_pieces=1
  applies_to_all_pieces INTEGER NOT NULL DEFAULT 0,
  granted_by_user_id TEXT NOT NULL,
  reason TEXT NOT NULL,              -- required, ≥ 8 chars
  expires_at TEXT,                   -- ISO8601 or NULL
  created_at TEXT NOT NULL
);

Decision tree

For a given (userId, orgIds, connectionId, pieceName):

  1. Owner check: if connection.owner_id == userId → access granted (owner of own connection)
  2. Admin bypass: if user is admin AND ssh.admin_bypasses_grants: true → granted (audited)
  3. Grant lookup:
    • find rows where connection_id = ?
    • subject matches (subject_type='user' AND subject_id=userId) OR (subject_type='org' AND subject_id IN orgIds)
    • piece matches (applies_to_all_pieces=1 OR piece_name = ?)
    • not expired (expires_at IS NULL OR expires_at > now())
    • any matching row → granted
  4. Otherwise → denied (access denied (no_grant))
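
The decision tree above can be sketched as a pure function. The `Grant` and `AccessRequest` shapes mirror the schema columns in camelCase but are hypothetical, as are the function name and the labels it returns; only no_grant is a term from this document.

```typescript
interface Grant {
  connectionId: string;
  subjectType: "user" | "org";
  subjectId: string;
  pieceName: string | null;   // null iff appliesToAllPieces
  appliesToAllPieces: boolean;
  expiresAt: string | null;   // ISO8601 or null
}

interface AccessRequest {
  userId: string;
  orgIds: string[];
  isAdmin: boolean;
  connectionId: string;
  connectionOwnerId: string | null; // null = global connection
  pieceName: string;
}

type Decision = "owner" | "admin_bypass" | "grant" | "no_grant";

function decideAccess(
  req: AccessRequest,
  grants: Grant[],
  adminBypassesGrants: boolean,
  now: Date,
): Decision {
  // 1. Owner check
  if (req.connectionOwnerId !== null && req.connectionOwnerId === req.userId) return "owner";
  // 2. Admin bypass (still audited by the caller)
  if (req.isAdmin && adminBypassesGrants) return "admin_bypass";
  // 3. Grant lookup: connection, subject, piece and expiry must all match
  const matched = grants.some((g) => {
    if (g.connectionId !== req.connectionId) return false;
    const subjectOk =
      (g.subjectType === "user" && g.subjectId === req.userId) ||
      (g.subjectType === "org" && req.orgIds.indexOf(g.subjectId) !== -1);
    const pieceOk = g.appliesToAllPieces || g.pieceName === req.pieceName;
    const notExpired = g.expiresAt === null || new Date(g.expiresAt).getTime() > now.getTime();
    return subjectOk && pieceOk && notExpired;
  });
  // 4. Otherwise denied
  return matched ? "grant" : "no_grant";
}
```

Expiry is evaluated at call time here, matching the "checked at decision time" behavior described under Expiration.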

Creating grants

Admin-only via POST /api/ssh/admin/grants:

{
  "connection_id": "abcd1234-...",
  "subject_type": "user",
  "subject_id": "alice",
  "piece_name": "backup-rotation",
  "applies_to_all_pieces": false,
  "reason": "Alice owns backups for prod-east cluster",
  "expires_at": null
}

For applies_to_all_pieces: true:

  • piece_name must be null
  • the admin endpoint requires explicit reason containing scope justification
  • audit row records action: ssh.grant.create with detail.applies_to_all=true
  • this is the highest-privilege grant — review carefully

Org grants

Same schema with subject_type: "org", subject_id: <gitea org name>. Membership comes from user_gitea_orgs (populated at login via Gitea OAuth). A user with multiple org memberships matches grants for any of those orgs.

Expiration

expires_at is checked at decision time (no background sweep). Expired rows remain in the table for audit purposes. Admin can delete them via DELETE /api/ssh/admin/grants/:id.

Path Policy

Local path (workspace)

For SshUpload.local_path and SshDownload.local_path:

  • Resolved against ctx.workspacePath (the job's workspace root)
  • .. traversal → reject
  • Symlinks: open with O_NOFOLLOW, lstat every parent → reject if any parent is a symlink leaving the workspace
  • For download: parent directory must exist; target file must NOT exist (O_CREAT | O_EXCL)

Remote path

For remote_path on upload/download:

  • Must be absolute (starts with /)
  • After POSIX normalization (path.posix.normalize), must start with the connection's remote_path_prefix
  • .. segments are collapsed by normalize; the post-normalize check catches escape attempts
  • No glob expansion — exact path only

Example: connection has remote_path_prefix = '/srv/agent'

| Input | Normalized | Result |
|---|---|---|
| /srv/agent/file.txt | /srv/agent/file.txt | allowed |
| /srv/agent/sub/file.txt | /srv/agent/sub/file.txt | allowed |
| /srv/agent/../etc/passwd | /etc/passwd | rejected: outside prefix |
| /srv/agentish/file | /srv/agentish/file | rejected: prefix mismatch (not /srv/agent/...) |
| file.txt (relative) | n/a | rejected: not absolute |
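
The remote-path rules can be reproduced with path.posix.normalize from Node's standard library. This is a sketch of the described checks, not the orchestrator's code; `checkRemotePath` is a hypothetical name, and treating the bare prefix directory itself as inside the prefix is an assumption.

```typescript
import * as path from "node:path";

// Returns the normalized path when it is acceptable, or null when it must be rejected.
function checkRemotePath(remotePath: string, prefix: string): string | null {
  if (remotePath.charAt(0) !== "/") return null; // must be absolute
  const normalized = path.posix.normalize(remotePath); // collapses ".." segments
  // Compare against "prefix/" so that /srv/agentish does not match /srv/agent.
  const base = prefix.charAt(prefix.length - 1) === "/" ? prefix : prefix + "/";
  if (normalized === prefix || normalized.indexOf(base) === 0) return normalized;
  return null;
}
```

Comparing against the prefix with a trailing slash is what separates /srv/agent/... from /srv/agentish/...; a plain string-prefix test would accept both.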

Command Filtering

SshExec.command runs through a two-stage filter.

Stage 1: built-in deny-list

Hard-coded patterns in src/ssh/deny-list.ts. Examples (not exhaustive):

  • rm -rf / and variants
  • fork bombs (:(){:|:&};:)
  • mkfs.*, dd if=/dev/zero ...
  • shutdown / reboot / poweroff
  • :>/dev/sda style block-device writes

If matched, the call is rejected with command rejected by built-in deny-list (matched pattern: ...). and audited as outcome=denied.

The built-in list is not a comprehensive sandbox — it's a tripwire against the most catastrophic typos and worst-case prompt injection payloads. Production deployments should also configure connection-level patterns.

Stage 2: per-connection regex (optional)

Each connection can carry:

  • deny_patterns: newline-separated regex list. Match → reject.
  • allow_patterns: newline-separated regex list. If set, every command must match at least one allow pattern (after passing both deny stages).

Both are validated at save-time by validateCustomPatterns:

  • Each pattern must compile
  • Each must pass the safe-regex ReDoS check
  • Aggregate length capped (no megabyte-blobs of regex)

Example:

# deny_patterns
sudo
^\s*rm\s+
nc\s+-l

# allow_patterns
^(ls|cat|grep|tail|head|systemctl|journalctl)\s
^/srv/agent/scripts/

ReDoS-safe regex is enforced because user-supplied patterns run synchronously on the command string before each call.
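
Stage 2 can be sketched as follows. The sketch covers only the per-connection patterns: it omits the built-in deny-list and the save-time ReDoS validation, and the function names and result labels are hypothetical.

```typescript
type FilterResult = "allowed" | "denied" | "not_allowlisted";

// Splits a newline-separated pattern list into compiled regexes, skipping blank lines.
function parsePatterns(text: string): RegExp[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => new RegExp(line));
}

function checkConnectionPatterns(command: string, denyText: string, allowText: string): FilterResult {
  const deny = parsePatterns(denyText);
  const allow = parsePatterns(allowText);
  // Deny wins: any deny match rejects the command.
  if (deny.some((re) => re.test(command))) return "denied";
  // If an allowlist is configured, at least one pattern must match.
  if (allow.length > 0 && !allow.some((re) => re.test(command))) return "not_allowlisted";
  return "allowed";
}
```

With the example patterns above, `sudo ls` is denied, `whoami` is rejected because no allow pattern matches, and `ls -la /srv` passes. Note that the first allow pattern ends in `\s`, so a bare `ls` with no arguments would not match it.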

SSRF + Algorithms

SSRF (host resolution)

Every connection target goes through ssrfStrict(host, allowPrivate):

  1. DNS resolve host → list of A/AAAA records
  2. For each address, check against the IP-policy:
    • Reject 0.0.0.0, :: (unspecified addresses)
    • Reject 127.0.0.0/8, ::1 (loopback)
    • Reject 10/8, 172.16/12, 192.168/16, fc00::/7 (private)
    • Reject 169.254/16 (link-local — including AWS metadata)
  3. DNS pinning: the resolved address is captured before connect; ssh2 connects to the pinned IP, not to the hostname. This prevents DNS rebinding (round 1: public IP passes check; round 2: returns loopback during connect).

allowPrivate short-circuits step 2. Two opt-in flags compose:

  • Global: ssh.allow_private_addresses: true in config.yaml
  • Per-connection: allow_private_addresses=1 on the row (admin sets via /api/ssh/admin/globals or /api/ssh/admin/connections/:id)

Either being true allows private/loopback. Both default false.
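
The IP-policy step can be sketched for IPv4 as below. This is not the ssrfStrict implementation: it skips DNS resolution, pinning and IPv6, the names are hypothetical, and it assumes the opt-in flags lift only the loopback, private and link-local rules while the unspecified address and unparseable input stay rejected.

```typescript
// Parses dotted-quad IPv4 into a number, or null when malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let n = 0;
  for (const p of parts) {
    if (!/^\d{1,3}$/.test(p)) return null;
    const octet = Number(p);
    if (octet > 255) return null;
    n = n * 256 + octet;
  }
  return n;
}

function inCidr(ip: number, base: string, bits: number): boolean {
  const baseInt = ipv4ToInt(base);
  if (baseInt === null) return false;
  const size = Math.pow(2, 32 - bits); // number of addresses in the block
  return Math.floor(ip / size) === Math.floor(baseInt / size);
}

// Ranges from the policy list above (IPv4 only in this sketch).
const OPT_IN_RANGES: Array<[string, number]> = [
  ["127.0.0.0", 8], ["10.0.0.0", 8], ["172.16.0.0", 12],
  ["192.168.0.0", 16], ["169.254.0.0", 16],
];

function isAddressAllowed(ip: string, globalAllowPrivate: boolean, connectionAllowPrivate: boolean): boolean {
  const n = ipv4ToInt(ip);
  if (n === null || n === 0) return false; // fail closed on garbage and 0.0.0.0
  const restricted = OPT_IN_RANGES.some(([base, bits]) => inCidr(n, base, bits));
  if (!restricted) return true;
  // Either opt-in flag is sufficient.
  return globalAllowPrivate || connectionAllowPrivate;
}
```

In a real check this predicate would run against every resolved address, and the connection would then be made to the checked IP rather than the hostname, as step 3 describes.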

Algorithm allowlist

Hard-coded in src/ssh/session.ts:

| Category | Allowed |
|---|---|
| Key exchange | curve25519-sha256, curve25519-sha256@libssh.org, ecdh-sha2-nistp256/384/521, diffie-hellman-group14/16/18-sha256/512 |
| Server host key | ssh-ed25519, rsa-sha2-256, rsa-sha2-512, ecdsa-sha2-nistp256/384/521 |
| Cipher | aes256-gcm@openssh.com, aes128-gcm@openssh.com, aes256-ctr, aes192-ctr, aes128-ctr |
| HMAC | hmac-sha2-512-etm@openssh.com, hmac-sha2-256-etm@openssh.com, hmac-sha2-512, hmac-sha2-256 |

Notably banned: ssh-rsa (SHA1), ssh-dss, all arcfour*, hmac-md5*, hmac-sha1*. Mismatch returns host_key_alg_not_allowed or auth_failed depending on which stage caught it.

Audit Log

Single table: ssh_audit_log. Every SSH operation writes here.

Lifecycle

begin (outcome=pending) [commits before remote call]
   ↓
remote call (ssh2 connect / exec / sftp)
   ↓
complete (outcome=success|failed|denied|aborted) [updates same row]

If the orchestrator crashes between begin and complete, the row stays pending. On next boot, the recovery sweep (src/ssh/recovery.ts) updates pending rows older than 10 minutes to aborted with detail.recovered_at set.

Actions

| Action | Triggered by |
|---|---|
| ssh.exec | SshExec |
| ssh.upload | SshUpload |
| ssh.download | SshDownload |
| ssh.connection.upsert | User/admin connection create/edit |
| ssh.connection.delete | User/admin connection delete |
| ssh.connection.host_key.first_observe | TOFU first observation |
| ssh.connection.host_key.mismatch | TOFU mismatch |
| ssh.connection.host_key.tofu_record | Internal helper write |
| ssh.connection.host_key.verify | User /verify-host-key |
| ssh.connection.host_key.replace | User /replace-host-key |
| ssh.connection.disable | Admin disable |
| ssh.connection.enable | Admin enable |
| ssh.abuse.unlock_manual | Admin force-unlock |
| ssh.grant.create | Admin grant create |
| ssh.grant.delete | Admin grant delete |
| ssh.master_key.rotate.start | Admin rotation start |

Detail column

JSON blob with action-specific fields:

  • ssh.exec: {command_hash: "abc...", exit_code: 0, stdout_bytes: 123, stderr_bytes: 0, truncated: false}
  • ssh.upload: {local_path: "...", remote_path: "...", bytes: 4096}
  • ssh.download: same shape
  • ssh.connection.host_key.first_observe: {fingerprint: "SHA256:...", pending_token: "uuid"}
  • All denied: {error: "no_grant" | "abuse_locked" | "disabled" | ...}

The ssh.exec action does not record the command string — only its SHA-256 hash (16-char hex prefix) to avoid leaking secrets / PII. If you need to investigate a specific exec, correlate the hash with stdin logs from the LLM activity log (workspace logs/activity.log).
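
To correlate an audit row with a command found in the activity log, the hash can be recomputed locally. The sketch below assumes command_hash is the first 16 hex characters of the SHA-256 digest of the UTF-8 command string with no salt or normalization; `commandHash` is a hypothetical name.

```typescript
import { createHash } from "node:crypto";

// First 16 hex characters of SHA-256 over the command text.
function commandHash(command: string): string {
  return createHash("sha256").update(command, "utf8").digest("hex").slice(0, 16);
}
```

If the recomputed value does not match any audit row, the assumption about how the orchestrator derives the hash is probably wrong (for example, it may trim or otherwise normalize the command first).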

Retention

ssh.audit_retention_days (default 90) controls a lazy sweep. Admin can trigger pruning manually from the UI. There is no hard cap on table size — disk-fill is mitigated by the hashing + truncation strategy above, plus admin-driven cleanup.

Abuse Counters & Lock

Defends against credential spraying, mistyped scripts in loops, and brute-force scans.

Three scopes

| Scope | Key | When |
|---|---|---|
| user | (user_id,) | Any SSH failure by this user |
| host:user | (host, username) | Failure on this (host, username) tuple |
| host | (host,) | Failure on this host (global connections only) |

The host scope intentionally only counts failures on global connections to prevent cross-user DoS: a user repeatedly failing on their own connection cannot lock out other users from a shared host. For user-owned connections, the host counter is updated for admin-notification only — no lock applies.

Algorithm

on failure:
  for each scope:
    increment counter
    if count(within abuseWindowMinutes) >= abuseFailureThreshold:
      lock until now + abuseLockMinutes

on success (user scope only):
  reset user counter
  other scopes age out naturally with the window

Counters are stored in ssh_abuse_counters, with separate columns per scope. All updates are transactional (no UPSERT race).
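
The counting rule for a single scope can be sketched in memory as below. The real counters live in ssh_abuse_counters and are updated transactionally; the `ScopeState` shape and function names are hypothetical, and keeping an existing lock in place on success is an assumption.

```typescript
interface AbuseConfig {
  abuseWindowMinutes: number;
  abuseFailureThreshold: number;
  abuseLockMinutes: number;
}

// Hypothetical per-scope state: failure timestamps (ms) and an optional lock deadline.
interface ScopeState {
  failures: number[];
  lockedUntil: number | null;
}

function recordFailure(state: ScopeState, nowMs: number, cfg: AbuseConfig): ScopeState {
  const windowStart = nowMs - cfg.abuseWindowMinutes * 60000;
  // Drop failures that aged out of the rolling window, then add this one.
  const failures = state.failures.filter((t) => t > windowStart).concat([nowMs]);
  const lockedUntil =
    failures.length >= cfg.abuseFailureThreshold
      ? nowMs + cfg.abuseLockMinutes * 60000
      : state.lockedUntil;
  return { failures, lockedUntil };
}

// Success resets the failure count (applied to the user scope only, per the rule above).
function recordSuccess(state: ScopeState): ScopeState {
  return { failures: [], lockedUntil: state.lockedUntil };
}

function isLocked(state: ScopeState, nowMs: number): boolean {
  return state.lockedUntil !== null && nowMs < state.lockedUntil;
}
```

With the default config (window 10, threshold 5, lock 30), five failures one minute apart lock the scope for 30 minutes, while five failures three minutes apart do not, because the earliest one has left the window by the time the fifth arrives.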

Force-unlock

Admin can force-unlock from SshGlobalConnectionsForm or per-connection admin page:

POST /api/ssh/admin/connections/:id/force-unlock
  {reason: "Confirmed credentials rotated; user retried with old key"}

Rate-limited to 10/hour total across all admins (admin-rate-limit.ts, token bucket). Audited as ssh.abuse.unlock_manual.

Master Key Rotation

Replaces MCP_ENCRYPTION_KEY and rewraps every row's DEK under the new key. This is the way to rotate the master key — do not edit the env var manually.

Flow

  1. Admin starts via POST /api/ssh/admin/rotate-master-key:
    {"new_key_hex": "<64-hex>", "reason": "Annual rotation"}
    
  2. Maintenance mode engages: after sshMaintenance.enter(), all SSH write endpoints return 503 (read endpoints stay alive). The LLM sees "SSH subsystem is in maintenance" errors for tool calls.
  3. Per-row rewrap:
    • For each ssh_connections row: decrypt DEK under old key, re-encrypt under new key, bump key_version, commit (one tx per row)
    • For each system_deks row: same
  4. New key validated by decrypting a test value
  5. Maintenance exits automatically
  6. Caller polls GET /api/ssh/admin/rotate-master-key/:jobId for status (running / succeeded / failed)

Failure modes

  • Crash mid-rotation: rows have mixed key_version. Next boot detects this and stays in maintenance until a follow-up rotation completes. The admin must re-issue the rotation with the new key.
  • Wrong old key: the first row decryption fails → job aborts before any change, maintenance exits, audit records ssh.master_key.rotate.start with outcome=failed.
  • Disk write fails mid-row: that single row is rolled back and the rotation continues with the remaining rows. The operator must re-run the rotation so the failed row is rewrapped as well.

The rotation job runs in-process (not as a separate worker). For large fleets (>1000 rows) expect 1-2s per row of decrypt+encrypt+write.

MCP_ENCRYPTION_KEY env var

After successful rotation, the env var must be updated to the new key before the next restart. The orchestrator writes the new key to the audit log (encrypted under the OLD key) and returns it once in the HTTP response — there's no second chance. Update your secrets store immediately.

Operator Runbook

A. Add a global (admin-managed) connection

# Via UI: Settings → SSH → Global Connections → Add
# Or via API (requires admin session cookie):

curl -X POST http://localhost:3000/api/ssh/admin/globals \
  -H 'Content-Type: application/json' \
  -d @- <<'JSON'
{
  "label": "prod-east-bastion",
  "host": "bastion.prod-east.example.com",
  "port": 22,
  "username": "deploy",
  "private_key_pem": "-----BEGIN OPENSSH PRIVATE KEY-----\n...\n-----END...",
  "passphrase": null,
  "remote_path_prefix": "/srv/deploy",
  "allow_private_addresses": false,
  "deny_patterns": "sudo\n^\\s*rm\\s+",
  "allow_patterns": "",
  "reason": "Production deploy bastion — owned by SRE"
}
JSON

Then verify the host key (next section) and grant access.

B. Grant org access to a global connection

curl -X POST http://localhost:3000/api/ssh/admin/grants \
  -H 'Content-Type: application/json' \
  -d '{
    "connection_id": "<uuid>",
    "subject_type": "org",
    "subject_id": "engineering",
    "piece_name": "prod-deploy",
    "applies_to_all_pieces": false,
    "reason": "Engineering org runs prod-deploy piece"
  }'

C. Verify a TOFU first-observe

  1. From the user's side or admin side, click Test in the SshConnections panel
  2. The response is host_key_first_observe with a SHA-256 fingerprint and pending token
  3. Verify externally that the fingerprint matches the real server:
    ssh-keyscan -t ed25519 bastion.prod-east.example.com 2>/dev/null \
      | ssh-keygen -lf -
    
    Compare the resulting SHA256:... with what the UI shows
  4. In the dialog, type the fingerprint to confirm and click Verify
  5. Audit row ssh.connection.host_key.verify recorded; subsequent calls succeed

D. Force-unlock a stuck connection

Symptom: user reports "SshExec returns access denied (abuse_locked)"

  1. Settings → SSH → Global Connections → click the row → "Locks" section
  2. Inspect the counter state (which scope is locked, until when)
  3. If genuinely needs early unlock (e.g. user fixed the bad credentials), click Force unlock, enter reason
  4. If suspicious (unexplained 5+ failures), investigate audit log first

E. Rotate the master key

NEW_KEY=$(openssl rand -hex 32)

# Start rotation
JOB=$(curl -s -X POST http://localhost:3000/api/ssh/admin/rotate-master-key \
  -H 'Content-Type: application/json' \
  -d "{\"new_key_hex\":\"$NEW_KEY\",\"reason\":\"Q2 annual rotation\"}" \
  | jq -r .job_id)

# Poll until done
while true; do
  STATUS=$(curl -s http://localhost:3000/api/ssh/admin/rotate-master-key/$JOB \
    | jq -r .status)
  echo "$STATUS"
  [ "$STATUS" = "succeeded" ] && break
  [ "$STATUS" = "failed" ] && { echo FAILED; exit 1; }
  sleep 2
done

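# Update env var BEFORE next restart. Note that >> appends: remove any older
# MCP_ENCRYPTION_KEY line from the file so that only the new value remains.
echo "MCP_ENCRYPTION_KEY=$NEW_KEY" >> /etc/orchestrator/secrets.env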

F. Prune old audit logs

Settings → SSH → Audit Log → "Prune older than N days" (defaults to the config retention value). Or via API:

curl -X DELETE 'http://localhost:3000/api/ssh/admin/audit?older_than_days=90'

Troubleshooting

Symptom → cause table

| Error | Common cause | Fix |
|---|---|---|
| SSH is disabled (503) | ssh.enabled: false | Set true, restart not required |
| SSH subsystem is in maintenance | Master key rotation in progress | Wait for job to complete, or check rotation log |
| access denied (no_grant) | User lacks grant for connection | Admin creates a grant, or user uses an owned connection |
| access denied (disabled) | Admin disabled the connection | Admin re-enables, or use different connection |
| access denied (abuse_locked) | Counter triggered | Wait for lock window, or admin force-unlocks |
| piece "X" does not list connection Y | allowed_ssh_connections missing UUID | Add UUID to the movement's allowed_ssh_connections |
| host_key_first_observe | First time exercising connection | Verify fingerprint in UI |
| host_key_not_verified | Key recorded but never verified | Click Verify in UI |
| host_key_mismatch | Server key changed | Investigate (legitimate rotation? MITM?), then Replace via UI |
| host_key_alg_not_allowed | Server using SHA1-RSA etc. | Upgrade server to ed25519 / rsa-sha2-256 |
| auth_failed | Wrong key, wrong username | Re-check connection settings |
| connect_timeout | Network unreachable, firewall | Check from host, check SSRF policy |
| exec_timeout | Long-running command | Increase timeout_ms, or run in background and Download results |
| output_too_large | stdout > 32 KiB | Filter the command, or write to file and Download |
| forbidden_address | Target is private IP, no opt-in | Set allow_private_addresses per-connection or globally |
| system_dek_verify_failed (log) | MCP_ENCRYPTION_KEY changed without rotation flow | Stop server, restore old key OR re-rotate via flow |

Where to look

| Question | Source |
|---|---|
| What did the LLM try to do? | logs/activity.log in the job's workspace |
| What did SSH do? | ssh_audit_log (Admin UI or SQL) |
| Was it actually denied at the SSH layer? | Audit row outcome = denied |
| What was the exit code? | Audit detail.exit_code (for ssh.exec) |
| Did it crash mid-call? | Audit outcome = aborted (recovery sweep) |
| Why was the host key flagged? | Audit ssh.connection.host_key.* rows |
| Who has access to a connection? | ssh_connection_grants filtered by connection_id |

Security Model Summary

Detailed threat model + risk register: see plan doc §"Security Design Deep-Dive (rev 3)" and §"Risk Register (rev 3)".

Key points operators must understand:

  1. The orchestrator is a credential proxy. Anyone with admin rights can read connection plaintext (via the rotation flow, which decrypts server-side). Treat admin access as production-credential-equivalent.

  2. TOFU is the floor, not the ceiling. First-observe is unauthenticated. For high-stakes targets, pre-populate host_key_b64 from a trusted bootstrap (e.g. baked into the connection at create time via the host_key_b64 field) rather than relying on the orchestrator's first observation.

  3. The deny-list is not a sandbox. Built-in patterns catch obvious misuse. Real isolation requires connection-level configuration (restricted shell account, remote_path_prefix, narrow allow_patterns) and target-side controls.

  4. Audit log is local-only. No HMAC chain (acknowledged limitation R-audit-tamper). For tamper-evidence, ship ssh_audit_log rows to an external SIEM via SQLite hooks or periodic export.

  5. ssh2 internal key retention (R-ssh2-leak): the PEM lives in JS heap for the connection lifetime. Process compromise reveals plaintext credentials. Mitigations: short-lived processes, separate worker per high-stakes connection.

  6. Master key compromise = total compromise. Key rotation does not invalidate already-leaked encrypted material: if an attacker has both the DB and the old master key, all stored creds are theirs. Rotate keys immediately on suspected compromise AND rotate every stored credential on the target side.

HTTP API Reference

User router: mounted at /api/ssh — requires requireAuth.

| Method | Path | Purpose |
|---|---|---|
| GET | /connections | List own connections + grant-visible globals |
| POST | /connections | Create user-owned connection |
| GET | /connections/:id | Read |
| PATCH | /connections/:id | Edit (owner only) |
| DELETE | /connections/:id | Delete (owner only) |
| POST | /connections/:id/test | Trigger TOFU observation / verify path |
| POST | /connections/:id/verify-host-key | Atomic verify (token + fingerprint) |
| POST | /connections/:id/replace-host-key | Atomic replace (token + fingerprint + reason) |
| GET | /connections/:id/audit | Owner's view of audit rows for this connection |
| GET | /grants/visible-to-me | List grants visible (subject=user or matching org) |

Admin router: mounted at /api/ssh/admin — requires requireAdmin.

| Method | Path | Purpose |
|---|---|---|
| GET | /connections | All connections (cross-tenant) |
| GET | /connections/:id | Admin read |
| PATCH | /connections/:id/disable | Soft-disable (audited; reason required) |
| PATCH | /connections/:id/enable | Re-enable |
| DELETE | /connections/:id | Hard-delete |
| POST | /connections/:id/force-unlock | Clear abuse counter (rate-limited; reason required) |
| POST | /globals | Create global connection |
| PATCH | /globals/:id | Edit global |
| DELETE | /globals/:id | Delete global |
| GET | /grants | List all grants |
| POST | /grants | Create grant |
| DELETE | /grants/:id | Delete grant |
| POST | /rotate-master-key | Start master key rotation |
| GET | /rotate-master-key/:jobId | Poll rotation status |
| GET | /audit | All-tenant audit view (paginated) |

All admin write endpoints require:

  • requireAdmin middleware
  • maintenance503() guard (rejects writes during rotation)
  • validateReason() on body.reason (≥ 8 chars)
  • auditRepo.beginAndComplete() for both success and failure

SSH Console (Interactive)

Enabled with ssh.console.enabled: true.

  • One task = one PTY session. Shell state persists across jobs
  • An SSH tab appears in the task detail view (when the piece allows SshConsole* tools)
  • WebSocket: /api/local/tasks/:taskId/console/ws
  • REST status: GET /api/local/tasks/:taskId/console/status
  • Audit: ssh.console.{open,send,snapshot,resize,input_rejected,close}
  • Auto-close: idle 30 min / duration 4 h / host disconnect / maintenance / admin kill
  • At most 3 sessions per connection (oldest evicted first)

Admin: list sessions with GET /api/admin/ssh/console-sessions, and kill one with POST /api/admin/ssh/console-sessions/:taskId/kill (admin role only).

See Also