Open-source release of MAESTRO, an agent orchestration platform that runs LLM-driven tasks through sandboxed tools, with a web UI. Apache-2.0. See README.md and docs/ (getting-started, configuration, architecture).
SSH Subsystem (Operator Runbook)
The orchestrator can run shell commands on remote servers (SshExec) and
move files between the workspace and remote hosts (SshUpload/SshDownload)
through a dedicated, audited SSH subsystem. Like the MCP integration, the
feature is off by default and requires a key + config flip to enable.
This document is the operator runbook for setup, granting access, verifying host keys, rotating the master key, and troubleshooting. For the LLM-facing tool semantics, see docs/tools/ssh-tools.md. For the internal design (threat model, risk register, schema, 12-step orchestration flow), see the plan doc.
At a glance
| Aspect | Behavior |
|---|---|
| Default | ssh.enabled: false — tools hidden, panels hidden, API returns 503 |
| Tools exposed when enabled | SshExec, SshUpload, SshDownload |
| Authentication | Public-key only; passwords are not supported |
| Host key trust | TOFU (Trust-On-First-Use) with explicit verify; mismatch fails closed |
| Connection ownership | User-owned (private) or Global (admin-managed, shared via grants) |
| Encryption at rest | AES-256-GCM (per-row DEK, master key = MCP_ENCRYPTION_KEY) |
| Audit | Dedicated ssh_audit_log table, pending → success/failed/denied/aborted lifecycle |
| Abuse defense | 3-scope counters (user / host:user / host) with auto-lock |
| Network policy | SSRF strict by default; per-connection opt-in for private IPs |
| Algorithm policy | Strict allowlist (no SHA1-RSA, no weak DH/HMAC) |
Prerequisites
1. MCP_ENCRYPTION_KEY
The SSH subsystem shares the same master key as MCP — there is only one key per orchestrator. All private keys, passphrases, and global-connection DEKs are encrypted with AES-256-GCM under a per-row DEK, and each DEK is wrapped by this master key.
Generate it once (32 bytes = 64 hex chars):
openssl rand -hex 32
Export it before starting the server:
export MCP_ENCRYPTION_KEY=<the 64-hex output>
scripts/server.sh start
If MCP_ENCRYPTION_KEY is not set when ssh.enabled: true, the SSH
subsystem boots fail-soft: a warning is logged, all SSH endpoints
return 503, the tools are hidden from the LLM, and the UI panels show a
configuration error banner. Other features (except MCP) continue
normally.
⚠ Key rotation invalidates existing encrypted material. There is a built-in master key rotation flow that rewraps every row in maintenance mode. Do not swap the env var manually without using that flow — half-rotated state breaks every connection.
2. ssh.enabled: true
Flip the flag in config.yaml:
ssh:
  enabled: true
This is the master switch. With it false:
- HTTP endpoints (/api/ssh/*, /api/ssh/admin/*) return 503
- Tool defs are not exposed to the LLM (the dispatcher returns null)
- UI panels render an "SSH is disabled" empty state
- Database tables remain present (no destructive change)
Restart is not required — ConfigManager reload picks up the change
and rebuilds the SSH router.
3. system_deks bootstrap
The first time the orchestrator boots with ssh.enabled: true AND a
valid MCP_ENCRYPTION_KEY, it provisions a single row in system_deks
(via INSERT OR IGNORE inside a transaction, CHECK(id=1)). This DEK
encrypts global connections (those without an owner).
On every subsequent boot, verifySystemDek decrypts the stored DEK to
prove the master key still works. If it fails (key rotated outside the
rotation flow, or env var differs from when the DEK was wrapped), SSH
fails closed for the session and a system_dek_verify_failed error
is logged. User-owned connections may still partially work (their DEKs
are wrapped per user), but global connections will all error.
4. Optional: allow_private_addresses
By default, SSH connections are routed through the SSRF strict-check, which blocks loopback (127.0.0.0/8, ::1) and private (10/8, 172.16/12, 192.168/16, fc00::/7, 169.254/16) addresses. For LAN targets you must opt in.
There are two scopes:
ssh:
  enabled: true
  allow_private_addresses: true   # global default
-- per-connection opt-in (admin-only flag, audited)
UPDATE ssh_connections SET allow_private_addresses=1 WHERE id=?;
The per-connection flag is preferred — narrow the blast radius. The global flag exists for trusted dev networks (homelab, isolated VPC). The per-connection flag can only be set on global (admin-managed) connections; for user-owned connections, the global flag applies.
Quickstart
# 1. Set the key
openssl rand -hex 32 > ~/.mcp_encryption_key
export MCP_ENCRYPTION_KEY=$(cat ~/.mcp_encryption_key)
# 2. Enable SSH + allow LAN
cat >> config.yaml <<'YAML'
ssh:
  enabled: true
  allow_private_addresses: true   # only if you're targeting LAN
YAML
# 3. Restart
scripts/server.sh restart
Then in the UI:
- Settings → User Folder → SSH Connections → Add
- Fill label, host, port (default 22), username, and paste the private key (OpenSSH PEM)
- Optionally set remote_path_prefix (default /), which restricts upload/download paths
- Click Test → the first call returns host_key_first_observe with a fingerprint
- Verify in the dialog (compare the fingerprint with what you expect from ssh-keyscan <host>)
- Add the connection's UUID to a piece's allowed_ssh_connections:
# pieces/example.yaml
name: ssh-example
movements:
  - name: deploy
    allowed_tools: [SshExec, SshUpload]
    allowed_ssh_connections: ["abcd1234-..."]
    rules:
      - condition: done
        next: COMPLETE
    instruction: |
      Use SshExec to ...
- Test the piece via the normal task UI.
config.yaml Reference
Full SSH section with defaults:
ssh:
  # master switch
  enabled: false
  # SSRF policy: when true, allow private/loopback addresses (global)
  allow_private_addresses: false
  # wall-clock timeout for connect + handshake + exec/transfer (seconds)
  call_timeout_seconds: 30
  # stdout/stderr byte cap for SshExec (bytes)
  max_output_bytes: 32768          # 32 KiB
  # SFTP transfer size caps (MB)
  max_upload_size_mb: 100
  max_download_size_mb: 100
  # ssh_audit_log retention (days). Admin can prune via UI.
  audit_retention_days: 90
  # When true (default), admins can use any connection without an explicit
  # grant. Audited regardless. Set false for stricter least-privilege.
  admin_bypasses_grants: true
  # Abuse counters
  abuse_window_minutes: 10         # rolling window for failure counting
  abuse_failure_threshold: 5       # failures within window → lock
  abuse_lock_minutes: 30           # lock duration on threshold breach
All keys translate to camelCase in SshRuntimeConfig (src/ssh/config.ts).
The transformKeys helper in src/config.ts handles the conversion.
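As a rough illustration of the snake_case to camelCase mapping described above, the sketch below converts a parsed config section recursively. The names `snakeToCamel` and `camelizeKeys` are hypothetical stand-ins; the real `transformKeys` helper in src/config.ts is not shown in this document and may treat arrays or nested maps differently.

```typescript
// Hypothetical stand-in for the key conversion; not the project's transformKeys.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// "call_timeout_seconds" -> "callTimeoutSeconds"
function snakeToCamel(key: string): string {
  return key.replace(/_([a-z0-9])/g, (_match: string, ch: string) => ch.toUpperCase());
}

// Recursively rewrite object keys; scalar values are returned unchanged.
function camelizeKeys(value: Json): Json {
  if (Array.isArray(value)) return value.map(camelizeKeys);
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [k, v] of Object.entries(value)) out[snakeToCamel(k)] = camelizeKeys(v);
    return out;
  }
  return value;
}

// Example: a fragment of the ssh section as it might look after YAML parsing.
const sshSection = camelizeKeys({
  enabled: false,
  call_timeout_seconds: 30,
  max_output_bytes: 32768,
  abuse_failure_threshold: 5,
});
console.log(sshSection);
```

Under this mapping, `abuse_window_minutes` in config.yaml corresponds to `abuseWindowMinutes`, the spelling used in the abuse-counter pseudocode later in this document.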
Connection Model
Owner
Each row in ssh_connections has an owner_id:
| Owner | Visibility | Who creates |
|---|---|---|
| User-owned (owner_id = userId) | Only the owner; admin can also list but not edit | Any authenticated user (POST /api/ssh/connections) |
| Global (owner_id IS NULL) | All users see it in the picker (subject to grants) | Admin only (POST /api/ssh/admin/globals) |
Global connections solve the "team-shared infra account" use case — a single set of credentials that multiple users invoke under their own identity, audited per user, gated by grants.
Encryption
For each connection:
- Generate a fresh 32-byte DEK
- Encrypt private_key_pem (and optionally passphrase) with the DEK
- Wrap the DEK with the master key (MCP_ENCRYPTION_KEY)
- Store: private_key_enc, private_key_dek_enc, key_version
key_version allows progressive rewrap during master key rotation (each
row tracks which generation of master key its DEK is wrapped under).
Global connections use the single system_deks row (id=1) rather than a
per-row DEK.
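The per-row envelope scheme can be sketched with Node's built-in crypto module. This is a minimal illustration, not the project's code: the blob layout (12-byte IV, then 16-byte auth tag, then ciphertext) and the helper names `seal`, `unseal`, `encryptPrivateKey` and `decryptPrivateKey` are assumptions made for the example. Only the column names come from the steps above.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Layout assumed by this sketch only: 12-byte IV || 16-byte auth tag || ciphertext.
function seal(key: Buffer, plaintext: Buffer): Buffer {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const body = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), body]);
}

function unseal(key: Buffer, sealed: Buffer): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, sealed.subarray(0, 12));
  decipher.setAuthTag(sealed.subarray(12, 28));
  // final() throws if the key is wrong or the data was tampered with.
  return Buffer.concat([decipher.update(sealed.subarray(28)), decipher.final()]);
}

interface EncryptedKeyRow {
  private_key_enc: Buffer;     // PEM encrypted under the DEK
  private_key_dek_enc: Buffer; // DEK wrapped under the master key
  key_version: number;         // master-key generation used for the wrap
}

function encryptPrivateKey(masterKey: Buffer, pem: string, keyVersion: number): EncryptedKeyRow {
  const dek = randomBytes(32); // fresh 32-byte DEK per row
  try {
    return {
      private_key_enc: seal(dek, Buffer.from(pem, "utf8")),
      private_key_dek_enc: seal(masterKey, dek),
      key_version: keyVersion,
    };
  } finally {
    dek.fill(0); // zero our copy of the DEK
  }
}

function decryptPrivateKey(masterKey: Buffer, row: EncryptedKeyRow): string {
  const dek = unseal(masterKey, row.private_key_dek_enc);
  try {
    return unseal(dek, row.private_key_enc).toString("utf8");
  } finally {
    dek.fill(0);
  }
}
```

Because only the DEK is wrapped under the master key, a master key rotation needs to rewrite `private_key_dek_enc` and `key_version` but can leave `private_key_enc` untouched.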
⚠ ssh2's internal key handling is opaque: once the PEM is loaded into the library, it lives in JS heap memory for the lifetime of the connection. We Buffer.fill(0, 0) our copies in the finally block but cannot reach into ssh2 internals. This is an acknowledged limitation; see the plan doc's "Acknowledged limitations" section.
UI Walkthrough
User: Settings → User Folder → SSH Connections
The SshConnectionsPanel lists the user's connections and any global connections they have a grant for. Each row shows:
- Label + host:port + username
- Host key fingerprint + verify state (verified / pending / first_observe / mismatch)
- Lock state (if abuse counter triggered)
- Actions: Test, Verify host key, Replace host key (with reason), Edit, Delete
The "Add Connection" form (SshConnectionForm) collects:
- Label
- Host, port, username
- Private key (textarea — PEM format; passphrase optional)
- Remote path prefix (default /)
- Custom deny/allow regex patterns (newline-separated, validated at save-time)
SshHostKeyDialog opens on first_observe / mismatch and shows the
observed fingerprint side-by-side with the previously-stored one (if
any). "Trust this key" requires typing the fingerprint to confirm.
Admin: Settings → SSH
Four sub-panels under SshForm:
| Panel | Component | Purpose |
|---|---|---|
| Global Connections | SshGlobalConnectionsForm | CRUD on global connections (owner_id IS NULL). Includes the allow_remote_unrestricted and per-connection allow_private_addresses flags |
| Grants | SshGrantsForm | List/create/delete grants. Per-piece or applies_to_all_pieces. Subject: user or org. Reason required |
| Audit Log | SshAuditLog | All-tenant audit view. Filter by action / outcome / connection / time range. Pagination |
| Master Key Rotation | SshMasterKeyRotationForm | Start a rotation job (provides new key, enters maintenance, rewraps rows). Polls status |
Admin can also force-unlock abuse counters from the per-connection page (requires reason; rate-limited to 10/hour total).
Host Key TOFU Flow
SSH security depends on knowing the right host key. We use Trust-On-First-Use: the first time a connection is exercised, we record the observed key and require explicit user verification before treating it as trusted.
States
ssh_connections carries four host-key columns:
| Column | Meaning |
|---|---|
| host_key_b64 | The observed public key in OpenSSH base64 form. NULL = never observed. |
| host_key_fingerprint | SHA-256 fingerprint for UI display (SHA256:...). |
| host_key_verified_at | ISO8601 timestamp of the user's explicit "trust this key" action. NULL = pending. |
| host_key_pending_token | UUID issued at first_observe / mismatch; consumed atomically by /verify-host-key. |
A connection is trusted iff host_key_verified_at IS NOT NULL AND
the observed key during connect matches host_key_b64.
Lifecycle
new connection
│
├─ user clicks Test (or LLM calls SshExec)
│ │
│ ▼
│ sshTest() observes the host key
│ │
│ ▼
│ onFirstObserve hook fires
│ - writes host_key_b64, host_key_fingerprint, host_key_pending_token
│ - audit row: ssh.connection.host_key.first_observe
│ - returns SshSessionError('host_key_first_observe')
│
│ (UI shows the fingerprint + pending token)
│
├─ user clicks Verify (typing fingerprint to confirm)
│ │
│ ▼
│ POST /api/ssh/connections/:id/verify-host-key
│ {token, fingerprint}
│ - atomic compare-and-set: token + fingerprint match → set host_key_verified_at
│ - audit row: ssh.connection.host_key.verify
│
▼
verified — Exec/Upload/Download now work
On host_key_mismatch (server rebuilt, key rotated, or MITM):
├─ Exec/Upload/Download calls sshExec/sshUpload/sshDownload
│ │
│ ▼
│ ssh2 observes key ≠ host_key_b64
│ │
│ ▼
│ onMismatch hook fires
│ - writes new host_key_pending_token (DOES NOT overwrite host_key_b64 yet)
│ - audit row: ssh.connection.host_key.mismatch
│ - returns SshSessionError('host_key_mismatch')
│
│ (UI shows OLD vs NEW fingerprint side-by-side)
│
├─ user investigates externally (ssh-keyscan, IT team, etc.)
│
├─ user clicks "Replace key" with reason
│ │
│ ▼
│ POST /api/ssh/connections/:id/replace-host-key
│ {token, fingerprint, reason}
│ - atomic compare-and-set
│ - writes new host_key_b64, host_key_fingerprint, host_key_verified_at
│ - audit row: ssh.connection.host_key.replace
The pending token mechanism prevents a "verify swap" race: if a second
TOFU observation happens between the user's verify request and its
arrival, the old token is overwritten and the verify endpoint returns
409 stale_token.
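The token handling can be modelled in memory as follows. This is a sketch under assumptions: the real endpoint presumably performs the compare-and-set inside the database, and the `fingerprint_mismatch` outcome name and both function names are hypothetical. Only `stale_token` and the column names appear in this document, and only the first-observe verify path is modelled.

```typescript
import { randomUUID } from "node:crypto";

interface HostKeyRow {
  host_key_fingerprint: string | null;
  host_key_pending_token: string | null;
  host_key_verified_at: string | null;
}

// Every observation issues a fresh token, which invalidates any earlier one.
function issuePendingToken(row: HostKeyRow): string {
  const token = randomUUID();
  row.host_key_pending_token = token;
  return token;
}

type VerifyOutcome = "verified" | "stale_token" | "fingerprint_mismatch";

// Compare-and-set: the row is only marked verified when both the token and
// the fingerprint still match what is stored. The token is consumed on success.
function verifyHostKey(row: HostKeyRow, token: string, fingerprint: string, now: Date): VerifyOutcome {
  if (row.host_key_pending_token === null || row.host_key_pending_token !== token) {
    return "stale_token"; // the endpoint maps this to 409
  }
  if (row.host_key_fingerprint !== fingerprint) {
    return "fingerprint_mismatch"; // hypothetical outcome name
  }
  row.host_key_pending_token = null;
  row.host_key_verified_at = now.toISOString();
  return "verified";
}
```

If a second observation lands while the user is still looking at the dialog, the token they submit no longer matches the stored one, so the request is rejected instead of trusting a key they never saw.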
Banned algorithms
Even before TOFU completes, the host-key algorithm is checked against
an allowlist. SHA1-RSA and other weak algorithms are rejected before
the key is recorded (host_key_alg_not_allowed). This is hard-coded
in src/ssh/session.ts to avoid misconfiguration.
Per-piece allowed_ssh_connections
A piece's movement must explicitly opt in to SSH usage. The piece-runner enforces three invariants:
- If a movement's allowed_tools contains any SSH tool name (SshExec / SshUpload / SshDownload), allowed_ssh_connections must be declared on that movement (even if empty)
- The field must be an array of strings
- Each entry must be * or a lowercase hex+hyphen UUID (≥ 8 chars)
Lint failures abort piece load.
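The three invariants translate into a small lint function. The sketch below is illustrative only: `lintMovement`, the `Movement` shape and the error wording are assumptions, and the regex is one plausible reading of "lowercase hex+hyphen, at least 8 chars". Elided placeholders such as "abcd1234-..." used elsewhere in this document would not pass it.

```typescript
const SSH_TOOLS = new Set(["SshExec", "SshUpload", "SshDownload"]);
const UUID_LIKE = /^[0-9a-f-]{8,}$/; // assumed reading of the entry rule

interface Movement {
  name: string;
  allowed_tools?: string[];
  allowed_ssh_connections?: unknown; // unvalidated, straight from YAML
}

// Returns a list of lint errors; an empty list means the movement passes.
function lintMovement(m: Movement): string[] {
  const errors: string[] = [];
  const usesSsh = (m.allowed_tools ?? []).some((t) => SSH_TOOLS.has(t));
  const conns = m.allowed_ssh_connections;

  // Invariant 1: SSH tools require the field to be declared (even if empty).
  if (conns === undefined) {
    if (usesSsh) errors.push(`${m.name}: allowed_ssh_connections must be declared`);
    return errors;
  }
  // Invariant 2: the field must be an array of strings.
  if (!Array.isArray(conns) || !conns.every((c) => typeof c === "string")) {
    errors.push(`${m.name}: allowed_ssh_connections must be an array of strings`);
    return errors;
  }
  // Invariant 3: each entry is "*" or a lowercase hex+hyphen identifier.
  for (const c of conns as string[]) {
    if (c !== "*" && !UUID_LIKE.test(c)) errors.push(`${m.name}: invalid entry "${c}"`);
  }
  return errors;
}
```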
Forms
# Explicit allowlist (most common)
allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]
# Wildcard (admin-style — use sparingly)
allowed_ssh_connections: ["*"]
# Deny-all (still allows SSH tool in allowed_tools but refuses every UUID)
allowed_ssh_connections: []
The * form skips the per-piece check but does not skip the
access grant check. A user without a grant for a
given connection still cannot use it even when the piece says *.
Example
name: backup-rotation
description: Daily backup rotation on prod servers
movements:
  - name: list
    allowed_tools: [SshExec]
    allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]
    instruction: |
      List the existing backup files on each server.
    rules:
      - condition: ready to rotate
        next: rotate
  - name: rotate
    allowed_tools: [SshExec, SshUpload]
    allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]
    instruction: |
      Rotate the oldest backup ...
    rules:
      - condition: done
        next: COMPLETE
Access Grants
Grants connect a subject (user or org) to a connection, scoped to a piece (or all pieces, admin-only).
Schema
CREATE TABLE ssh_connection_grants (
id TEXT PRIMARY KEY,
connection_id TEXT NOT NULL,
subject_type TEXT NOT NULL, -- 'user' | 'org'
subject_id TEXT NOT NULL,
piece_name TEXT, -- NULL iff applies_to_all_pieces=1
applies_to_all_pieces INTEGER NOT NULL DEFAULT 0,
granted_by_user_id TEXT NOT NULL,
reason TEXT NOT NULL, -- required, ≥ 8 chars
expires_at TEXT, -- ISO8601 or NULL
created_at TEXT NOT NULL
);
Decision tree
For a given (userId, orgIds, connectionId, pieceName):
- Owner check: if connection.owner_id == userId → access granted (owner of own connection)
- Admin bypass: if user is admin AND ssh.admin_bypasses_grants: true → granted (audited)
- Grant lookup:
  - find rows where connection_id = ?
  - subject matches (subject_type='user' AND subject_id=userId) OR (subject_type='org' AND subject_id IN orgIds)
  - piece matches (applies_to_all_pieces=1 OR piece_name = ?)
  - not expired (expires_at IS NULL OR expires_at > now())
  - any matching row → granted
- Otherwise → denied (access denied (no_grant))
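The decision tree can be written as a pure function over the grant rows. This is a sketch, assuming grants have already been loaded into memory; `decideAccess`, `AccessRequest` and the `via` field are hypothetical names introduced for illustration, while the `Grant` fields follow the schema above.

```typescript
interface Grant {
  connection_id: string;
  subject_type: "user" | "org";
  subject_id: string;
  piece_name: string | null;
  applies_to_all_pieces: boolean;
  expires_at: string | null; // ISO8601 or null
}

interface AccessRequest {
  userId: string;
  orgIds: string[];
  isAdmin: boolean;
  connectionId: string;
  connectionOwnerId: string | null; // null = global connection
  pieceName: string;
}

type Decision =
  | { granted: true; via: "owner" | "admin_bypass" | "grant" }
  | { granted: false; error: "no_grant" };

function decideAccess(
  req: AccessRequest,
  grants: Grant[],
  adminBypassesGrants: boolean,
  now: Date,
): Decision {
  // 1. Owner check
  if (req.connectionOwnerId !== null && req.connectionOwnerId === req.userId) {
    return { granted: true, via: "owner" };
  }
  // 2. Admin bypass (still audited by the caller)
  if (req.isAdmin && adminBypassesGrants) {
    return { granted: true, via: "admin_bypass" };
  }
  // 3. Grant lookup: connection, subject, piece and expiry must all match
  const match = grants.some((g) => {
    if (g.connection_id !== req.connectionId) return false;
    const subjectOk =
      (g.subject_type === "user" && g.subject_id === req.userId) ||
      (g.subject_type === "org" && req.orgIds.includes(g.subject_id));
    const pieceOk = g.applies_to_all_pieces || g.piece_name === req.pieceName;
    const notExpired =
      g.expires_at === null || new Date(g.expires_at).getTime() > now.getTime();
    return subjectOk && pieceOk && notExpired;
  });
  // 4. Otherwise denied
  return match ? { granted: true, via: "grant" } : { granted: false, error: "no_grant" };
}
```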
Creating grants
Admin-only via POST /api/ssh/admin/grants:
{
"connection_id": "abcd1234-...",
"subject_type": "user",
"subject_id": "alice",
"piece_name": "backup-rotation",
"applies_to_all_pieces": false,
"reason": "Alice owns backups for prod-east cluster",
"expires_at": null
}
For applies_to_all_pieces: true:
- piece_name must be null
- the admin endpoint requires an explicit reason containing scope justification
- the audit row records action: ssh.grant.create with detail.applies_to_all=true
- this is the highest-privilege grant, so review carefully
Org grants
Same schema with subject_type: "org", subject_id: <gitea org name>.
Membership comes from user_gitea_orgs (populated at login via Gitea
OAuth). A user with multiple org memberships matches grants for any
of those orgs.
Expiration
expires_at is checked at decision time (no background sweep). Expired
rows remain in the table for audit purposes. Admin can delete them via
DELETE /api/ssh/admin/grants/:id.
Path Policy
Local path (workspace)
For SshUpload.local_path and SshDownload.local_path:
- Resolved against ctx.workspacePath (the job's workspace root)
- .. traversal → reject
- Symlinks: open with O_NOFOLLOW, lstat every parent → reject if any parent is a symlink leaving the workspace
- For download: parent directory must exist; target file must NOT exist (O_CREAT | O_EXCL)
Remote path
For remote_path on upload/download:
- Must be absolute (starts with /)
- After POSIX normalization (path.posix.normalize), must start with the connection's remote_path_prefix
- .. segments are collapsed by normalize; the post-normalize check catches escape attempts
- No glob expansion: exact path only
Example: connection has remote_path_prefix = '/srv/agent'
| Input | Normalized | Result |
|---|---|---|
| /srv/agent/file.txt | /srv/agent/file.txt | ✅ |
| /srv/agent/sub/file.txt | /srv/agent/sub/file.txt | ✅ |
| /srv/agent/../etc/passwd | /srv/etc/passwd | ❌ outside prefix |
| /srv/agentish/file | /srv/agentish/file | ❌ prefix mismatch (not /srv/agent/...) |
| file.txt (relative) | n/a | ❌ not absolute |
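The remote-path rules can be reproduced with Node's `path.posix.normalize`. The function name `checkRemotePath` and its result shape are assumptions for this sketch; the segment-boundary comparison is one way to get the /srv/agentish rejection shown in the table and may not match the project's exact check.

```typescript
import * as path from "node:path";

type PathCheck =
  | { ok: true; normalized: string }
  | { ok: false; reason: "not absolute" | "outside prefix" };

function checkRemotePath(input: string, remotePathPrefix: string): PathCheck {
  if (!input.startsWith("/")) return { ok: false, reason: "not absolute" };

  // Collapse "." and ".." segments before comparing against the prefix.
  const normalized = path.posix.normalize(input);

  // Strip trailing slashes so "/srv/agent" and "/srv/agent/" behave the same;
  // the default prefix "/" becomes "" and therefore admits every absolute path.
  const root = remotePathPrefix.replace(/\/+$/, "");

  // Compare on a segment boundary so "/srv/agentish" does not pass for "/srv/agent".
  const inside = normalized === root || normalized.startsWith(root + "/");
  return inside ? { ok: true, normalized } : { ok: false, reason: "outside prefix" };
}
```

The check runs on the normalized path, so an input that begins with the prefix but climbs out of it through `..` segments is still rejected.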
Command Filtering
SshExec.command runs through a two-stage filter.
Stage 1: built-in deny-list
Hard-coded patterns in src/ssh/deny-list.ts. Examples (not exhaustive):
- rm -rf / and variants
- fork bombs (:(){:|:&};:)
- mkfs.*, dd if=/dev/zero ...
- shutdown / reboot / poweroff
- :>/dev/sda style block-device writes
If matched, the call is rejected with "command rejected by built-in deny-list (matched pattern: ...)" and audited as outcome=denied.
The built-in list is not a comprehensive sandbox — it's a tripwire against the most catastrophic typos and worst-case prompt injection payloads. Production deployments should also configure connection-level patterns.
Stage 2: per-connection regex (optional)
Each connection can carry:
- deny_patterns: newline-separated regex list. Match → reject.
- allow_patterns: newline-separated regex list. If set, every command must match at least one allow pattern (after passing both deny stages).
Both are validated at save-time by validateCustomPatterns:
- Each pattern must compile
- Each must pass the safe-regex ReDoS check
- Aggregate length capped (no megabyte-blobs of regex)
Example:
# deny_patterns
sudo
^\s*rm\s+
nc\s+-l
# allow_patterns
^(ls|cat|grep|tail|head|systemctl|journalctl)\s
^/srv/agent/scripts/
ReDoS-safe regex is enforced because user-supplied patterns run synchronously on the command string before each call.
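The two stages can be pictured as one function that runs the built-in list, then the connection's deny patterns, then the optional allow patterns. This is a sketch under assumptions: `filterCommand`, `parsePatterns` and the single built-in pattern are illustrative, the real list in src/ssh/deny-list.ts is not reproduced here, and the ReDoS validation step is omitted.

```typescript
interface CommandPolicy {
  builtinDeny: RegExp[];   // stage 1 (hard-coded in the real system)
  denyPatterns: RegExp[];  // stage 2, per connection
  allowPatterns: RegExp[]; // stage 2, per connection; empty = no allowlist
}

type FilterResult =
  | { allowed: true }
  | { allowed: false; stage: "builtin" | "deny" | "allow"; pattern?: string };

// Newline-separated pattern text, as stored on the connection row.
function parsePatterns(text: string): RegExp[] {
  return text
    .split("\n")
    .filter((line) => line.length > 0)
    .map((line) => new RegExp(line));
}

function filterCommand(command: string, policy: CommandPolicy): FilterResult {
  for (const re of policy.builtinDeny) {
    if (re.test(command)) return { allowed: false, stage: "builtin", pattern: re.source };
  }
  for (const re of policy.denyPatterns) {
    if (re.test(command)) return { allowed: false, stage: "deny", pattern: re.source };
  }
  // With an allowlist configured, at least one allow pattern must match.
  if (policy.allowPatterns.length > 0 && !policy.allowPatterns.some((re) => re.test(command))) {
    return { allowed: false, stage: "allow" };
  }
  return { allowed: true };
}

// Example policy built from the patterns shown above.
const examplePolicy: CommandPolicy = {
  builtinDeny: [/\brm\s+-rf\s+\/(\s|$)/], // illustrative only
  denyPatterns: parsePatterns("sudo\n^\\s*rm\\s+\nnc\\s+-l"),
  allowPatterns: parsePatterns(
    "^(ls|cat|grep|tail|head|systemctl|journalctl)\\s\n^/srv/agent/scripts/",
  ),
};
```

Note that the first allow pattern ends in `\s`, so a bare `ls` with no arguments would not match it; whether that is intended is a question for whoever maintains the connection's patterns.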
SSRF + Algorithms
SSRF (host resolution)
Every connection target goes through ssrfStrict(host, allowPrivate):
1. DNS resolve host → list of A/AAAA records
2. For each address, check against the IP-policy:
   - Reject 0.0.0.0, :: (unspecified)
   - Reject 127.0.0.0/8, ::1 (loopback)
   - Reject 10/8, 172.16/12, 192.168/16, fc00::/7 (private)
   - Reject 169.254/16 (link-local, including AWS metadata)
3. DNS pinning: the resolved address is captured before connect; ssh2 connects to the pinned IP, not to the hostname. This prevents DNS rebinding (round 1: public IP passes check; round 2: returns loopback during connect).
allowPrivate short-circuits step 2. Two opt-in flags compose:
- Global: ssh.allow_private_addresses: true in config.yaml
- Per-connection: allow_private_addresses=1 on the row (admin sets via /api/ssh/admin/globals or /api/ssh/admin/connections/:id)
Either being true allows private/loopback. Both default false.
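The address policy and the pinning step can be sketched with Node's `net.BlockList`. This is an illustration, not the project's `ssrfStrict`: the function names are hypothetical, the DNS lookup itself is left to the caller, and the list treats the unspecified addresses as the single addresses 0.0.0.0 and `::`.

```typescript
import { BlockList, isIP } from "node:net";

// Ranges taken from the policy list above.
const forbidden = new BlockList();
forbidden.addAddress("0.0.0.0", "ipv4");        // unspecified
forbidden.addAddress("::", "ipv6");             // unspecified
forbidden.addSubnet("127.0.0.0", 8, "ipv4");    // loopback
forbidden.addAddress("::1", "ipv6");            // loopback
forbidden.addSubnet("10.0.0.0", 8, "ipv4");     // private
forbidden.addSubnet("172.16.0.0", 12, "ipv4");  // private
forbidden.addSubnet("192.168.0.0", 16, "ipv4"); // private
forbidden.addSubnet("fc00::", 7, "ipv6");       // unique local
forbidden.addSubnet("169.254.0.0", 16, "ipv4"); // link-local (cloud metadata)

function isForbiddenAddress(address: string, allowPrivate: boolean): boolean {
  const family = isIP(address);
  if (family === 0) throw new Error(`not an IP literal: ${address}`);
  if (allowPrivate) return false; // opt-in skips the policy check
  return forbidden.check(address, family === 4 ? "ipv4" : "ipv6");
}

// Given the addresses a hostname resolved to, reject the target if ANY of them
// is forbidden, then return the one address the connection will be pinned to.
function pickPinnedAddress(resolved: string[], allowPrivate: boolean): string {
  for (const address of resolved) {
    if (isForbiddenAddress(address, allowPrivate)) throw new Error("forbidden_address");
  }
  const first = resolved[0];
  if (first === undefined) throw new Error("no addresses resolved");
  return first;
}
```

Connecting to the returned literal address, rather than resolving the hostname again at connect time, is what closes the rebinding window described in the pinning step.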
Algorithm allowlist
Hard-coded in src/ssh/session.ts:
| Category | Allowed |
|---|---|
| Key exchange | curve25519-sha256, curve25519-sha256@libssh.org, ecdh-sha2-nistp256/384/521, diffie-hellman-group14/16/18-sha256/512 |
| Server host key | ssh-ed25519, rsa-sha2-256, rsa-sha2-512, ecdsa-sha2-nistp256/384/521 |
| Cipher | aes256-gcm@openssh.com, aes128-gcm@openssh.com, aes256-ctr, aes192-ctr, aes128-ctr |
| HMAC | hmac-sha2-512-etm@openssh.com, hmac-sha2-256-etm@openssh.com, hmac-sha2-512, hmac-sha2-256 |
Notably banned: ssh-rsa (SHA1), ssh-dss, all arcfour*, hmac-md5*,
hmac-sha1*. Mismatch returns host_key_alg_not_allowed or
auth_failed depending on which stage caught it.
Audit Log
Single table: ssh_audit_log. Every SSH operation writes here.
Lifecycle
begin (outcome=pending) [commits before remote call]
↓
remote call (ssh2 connect / exec / sftp)
↓
complete (outcome=success|failed|denied|aborted) [updates same row]
If the orchestrator crashes between begin and complete, the row
stays pending. On next boot, the recovery sweep (src/ssh/recovery.ts)
updates pending rows older than 10 minutes to aborted with
detail.recovered_at set.
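The recovery sweep amounts to a single pass over pending rows. The sketch below works on an in-memory array for illustration; the `started_at` field name, the `AuditRow` shape and `recoverStalePending` are assumptions, and the real sweep in src/ssh/recovery.ts presumably runs as a database update.

```typescript
type Outcome = "pending" | "success" | "failed" | "denied" | "aborted";

interface AuditRow {
  id: string;
  outcome: Outcome;
  started_at: string; // ISO8601; column name assumed for this sketch
  detail: Record<string, unknown>;
}

// Marks pending rows older than maxAgeMinutes as aborted and returns how many
// rows were changed. Rows that are still fresh are left alone, since the
// call may legitimately be in flight.
function recoverStalePending(rows: AuditRow[], now: Date, maxAgeMinutes = 10): number {
  const cutoff = now.getTime() - maxAgeMinutes * 60_000;
  let recovered = 0;
  for (const row of rows) {
    if (row.outcome !== "pending") continue;
    if (new Date(row.started_at).getTime() >= cutoff) continue;
    row.outcome = "aborted";
    row.detail = { ...row.detail, recovered_at: now.toISOString() };
    recovered++;
  }
  return recovered;
}
```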
Actions
| Action | Triggered by |
|---|---|
| ssh.exec | SshExec |
| ssh.upload | SshUpload |
| ssh.download | SshDownload |
| ssh.connection.upsert | User/admin connection create/edit |
| ssh.connection.delete | User/admin connection delete |
| ssh.connection.host_key.first_observe | TOFU first observation |
| ssh.connection.host_key.mismatch | TOFU mismatch |
| ssh.connection.host_key.tofu_record | Internal helper write |
| ssh.connection.host_key.verify | User /verify-host-key |
| ssh.connection.host_key.replace | User /replace-host-key |
| ssh.connection.disable | Admin disable |
| ssh.connection.enable | Admin enable |
| ssh.abuse.unlock_manual | Admin force-unlock |
| ssh.grant.create | Admin grant create |
| ssh.grant.delete | Admin grant delete |
| ssh.master_key.rotate.start | Admin rotation start |
Detail column
JSON blob with action-specific fields:
- ssh.exec: {command_hash: "abc...", exit_code: 0, stdout_bytes: 123, stderr_bytes: 0, truncated: false}
- ssh.upload: {local_path: "...", remote_path: "...", bytes: 4096}
- ssh.download: same shape
- ssh.connection.host_key.first_observe: {fingerprint: "SHA256:...", pending_token: "uuid"}
- All denied: {error: "no_grant" | "abuse_locked" | "disabled" | ...}
The ssh.exec action does not record the command string — only its
SHA-256 hash (16-char hex prefix) to avoid leaking secrets / PII. If you
need to investigate a specific exec, correlate the hash with stdin
logs from the LLM activity log (workspace logs/activity.log).
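To correlate an audit row with a command seen in the activity log, the same hash can be recomputed locally. This is a sketch that assumes the hash is taken over the UTF-8 bytes of the exact command string and truncated to the first 16 hex characters; `commandHash` is a hypothetical name, and any trimming or normalization the orchestrator applies before hashing is not documented here.

```typescript
import { createHash } from "node:crypto";

// SHA-256 of the command, truncated to a 16-character hex prefix.
function commandHash(command: string): string {
  return createHash("sha256").update(command, "utf8").digest("hex").slice(0, 16);
}

// Example: compare a candidate command from logs/activity.log with the
// command_hash value recorded in the audit detail.
function matchesAuditHash(candidate: string, auditHash: string): boolean {
  return commandHash(candidate) === auditHash;
}
```

A 16-character prefix is 64 bits, which is enough for correlation within one deployment but should not be treated as a collision-resistant identifier.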
Retention
ssh.audit_retention_days (default 90) controls a lazy sweep. Admin can
trigger pruning manually from the UI. There is no hard cap on table
size — disk-fill is mitigated by the hashing + truncation strategy
above, plus admin-driven cleanup.
Abuse Counters & Lock
Defends against credential spraying, mistyped scripts in loops, and brute-force scans.
Three scopes
| Scope | Key | When |
|---|---|---|
| user | (user_id,) | Any SSH failure by this user |
| host:user | (host, username) | Failure on this (host, username) tuple |
| host | (host,) | Failure on this host (global connections only) |
The host scope intentionally only counts failures on global
connections to prevent cross-user DoS: a user repeatedly failing on
their own connection cannot lock out other users from a shared host.
For user-owned connections, the host counter is updated for
admin-notification only — no lock applies.
Algorithm
on failure:
  for each scope:
    increment counter
    if count(within abuseWindowMinutes) >= abuseFailureThreshold:
      lock until now + abuseLockMinutes
on success (user scope only):
  reset user counter
  other scopes age out naturally with the window
Counters are stored in ssh_abuse_counters, with separate columns per
scope. All updates are transactional (no UPSERT race).
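The windowed counter for a single scope key can be sketched in memory as below. This is illustrative only: the real counters live in ssh_abuse_counters and are updated transactionally, and the `AbuseCounter` class is a hypothetical name. Whether a reset also clears an active lock is not stated in this document; the sketch leaves the lock in place.

```typescript
interface AbuseConfig {
  abuseWindowMinutes: number;
  abuseFailureThreshold: number;
  abuseLockMinutes: number;
}

// One instance per scope key, e.g. "user:alice" or "host:user:bastion/deploy".
class AbuseCounter {
  private readonly cfg: AbuseConfig;
  private failures: number[] = []; // failure timestamps in ms
  private lockedUntil = 0;

  constructor(cfg: AbuseConfig) {
    this.cfg = cfg;
  }

  recordFailure(nowMs: number): void {
    // Drop failures that have aged out of the rolling window, then count this one.
    const windowStart = nowMs - this.cfg.abuseWindowMinutes * 60_000;
    this.failures = this.failures.filter((t) => t > windowStart);
    this.failures.push(nowMs);
    if (this.failures.length >= this.cfg.abuseFailureThreshold) {
      this.lockedUntil = nowMs + this.cfg.abuseLockMinutes * 60_000;
    }
  }

  // Called on success for the user scope only.
  reset(): void {
    this.failures = [];
  }

  isLocked(nowMs: number): boolean {
    return nowMs < this.lockedUntil;
  }
}
```

With the defaults (10 / 5 / 30), five failures inside ten minutes lock the scope for thirty minutes, while the same five failures spread further apart never reach the threshold.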
Force-unlock
Admin can force-unlock from SshGlobalConnectionsForm or per-connection
admin page:
POST /api/ssh/admin/connections/:id/force-unlock
{reason: "Confirmed credentials rotated; user retried with old key"}
Rate-limited to 10/hour total across all admins (admin-rate-limit.ts,
token bucket). Audited as ssh.abuse.unlock_manual.
Master Key Rotation
Replaces MCP_ENCRYPTION_KEY and rewraps every row's DEK under the new
key. This is the way to rotate the master key — do not edit the env
var manually.
Flow
- Admin starts via POST /api/ssh/admin/rotate-master-key: {"new_key_hex": "<64-hex>", "reason": "Annual rotation"}
- Maintenance mode engages: sshMaintenance.enter() returns 503 for all SSH write endpoints (read endpoints stay alive). The LLM sees "SSH subsystem is in maintenance" errors for tool calls.
- Per-row rewrap:
  - For each ssh_connections row: decrypt DEK under old key, re-encrypt under new key, bump key_version, commit (one tx per row)
  - For each system_deks row: same
- New key validated by decrypting a test value
- Maintenance exits automatically
- Caller polls GET /api/ssh/admin/rotate-master-key/:jobId for status (running / succeeded / failed)
Failure modes
- Crash mid-rotation: rows have mixed key_version. Next boot detects this and stays in maintenance until a follow-up rotation completes. The admin must re-issue the rotation with the new key.
- Wrong old key: the first row decryption fails → job aborts before any change, maintenance exits, audit records ssh.master_key.rotate.start with outcome=failed.
- Disk write fails mid-row: that single row is rolled back; rotation continues. Operator must re-run.
The rotation job runs in-process (not as a separate worker). For large fleets (>1000 rows) expect 1-2s per row of decrypt+encrypt+write.
MCP_ENCRYPTION_KEY env var
After successful rotation, the env var must be updated to the new key before the next restart. The orchestrator writes the new key to the audit log (encrypted under the OLD key) and returns it once in the HTTP response — there's no second chance. Update your secrets store immediately.
Operator Runbook
A. Add a global (admin-managed) connection
# Via UI: Settings → SSH → Global Connections → Add
# Or via API (requires admin session cookie):
curl -X POST http://localhost:3000/api/ssh/admin/globals \
-H 'Content-Type: application/json' \
-d @- <<'JSON'
{
"label": "prod-east-bastion",
"host": "bastion.prod-east.example.com",
"port": 22,
"username": "deploy",
"private_key_pem": "-----BEGIN OPENSSH PRIVATE KEY-----\n...\n-----END...",
"passphrase": null,
"remote_path_prefix": "/srv/deploy",
"allow_private_addresses": false,
"deny_patterns": "sudo\n^\\s*rm\\s+",
"allow_patterns": "",
"reason": "Production deploy bastion — owned by SRE"
}
JSON
Then verify the host key (next section) and grant access.
B. Grant org access to a global connection
curl -X POST http://localhost:3000/api/ssh/admin/grants \
-H 'Content-Type: application/json' \
-d '{
"connection_id": "<uuid>",
"subject_type": "org",
"subject_id": "engineering",
"piece_name": "prod-deploy",
"applies_to_all_pieces": false,
"reason": "Engineering org runs prod-deploy piece"
}'
C. Verify a TOFU first-observe
- From the user's side or admin side, click Test in the SshConnections panel
- The response is host_key_first_observe with a SHA-256 fingerprint and pending token
- Verify externally that the fingerprint matches the real server:
  ssh-keyscan -t ed25519 bastion.prod-east.example.com 2>/dev/null \
    | ssh-keygen -lf -
  Compare the resulting SHA256:... with what the UI shows
- In the dialog, type the fingerprint to confirm and click Verify
- Audit row ssh.connection.host_key.verify recorded; subsequent calls succeed
D. Force-unlock a stuck connection
Symptom: user reports "SshExec returns access denied (abuse_locked)"
- Settings → SSH → Global Connections → click the row → "Locks" section
- Inspect the counter state (which scope is locked, until when)
- If an early unlock is genuinely needed (e.g. the user fixed the bad credentials), click Force unlock and enter a reason
- If suspicious (unexplained 5+ failures), investigate audit log first
E. Rotate the master key
NEW_KEY=$(openssl rand -hex 32)
# Start rotation
JOB=$(curl -s -X POST http://localhost:3000/api/ssh/admin/rotate-master-key \
-H 'Content-Type: application/json' \
-d "{\"new_key_hex\":\"$NEW_KEY\",\"reason\":\"Q2 annual rotation\"}" \
| jq -r .job_id)
# Poll until done
while true; do
STATUS=$(curl -s http://localhost:3000/api/ssh/admin/rotate-master-key/$JOB \
| jq -r .status)
echo "$STATUS"
[ "$STATUS" = "succeeded" ] && break
[ "$STATUS" = "failed" ] && { echo FAILED; exit 1; }
sleep 2
done
# Update env var BEFORE next restart
echo "MCP_ENCRYPTION_KEY=$NEW_KEY" >> /etc/orchestrator/secrets.env
F. Prune old audit logs
Settings → SSH → Audit Log → "Prune older than N days" (defaults to the config retention value). Or via API:
curl -X DELETE 'http://localhost:3000/api/ssh/admin/audit?older_than_days=90'
Troubleshooting
Symptom → cause table
| Error | Common cause | Fix |
|---|---|---|
| SSH is disabled (503) | ssh.enabled: false | Set true, restart not required |
| SSH subsystem is in maintenance | Master key rotation in progress | Wait for job to complete, or check rotation log |
| access denied (no_grant) | User lacks grant for connection | Admin creates a grant, or user uses an owned connection |
| access denied (disabled) | Admin disabled the connection | Admin re-enables, or use different connection |
| access denied (abuse_locked) | Counter triggered | Wait for lock window, or admin force-unlocks |
| piece "X" does not list connection Y | allowed_ssh_connections missing UUID | Add UUID to the movement's allowed_ssh_connections |
| host_key_first_observe | First time exercising connection | Verify fingerprint in UI |
| host_key_not_verified | Key recorded but never verified | Click Verify in UI |
| host_key_mismatch | Server key changed | Investigate (legitimate rotation? MITM?), then Replace via UI |
| host_key_alg_not_allowed | Server using SHA1-RSA etc. | Upgrade server to ed25519 / rsa-sha2-256 |
| auth_failed | Wrong key, wrong username | Re-check connection settings |
| connect_timeout | Network unreachable, firewall | Check from host, check SSRF policy |
| exec_timeout | Long-running command | Increase timeout_ms, or run in background and Download results |
| output_too_large | stdout > 32 KiB | Filter the command, or write to file and Download |
| forbidden_address | Target is private IP, no opt-in | Set allow_private_addresses per-connection or globally |
| system_dek_verify_failed (log) | MCP_ENCRYPTION_KEY changed without rotation flow | Stop server, restore old key OR re-rotate via flow |
Where to look
| Question | Source |
|---|---|
| What did the LLM try to do? | logs/activity.log in the job's workspace |
| What did SSH do? | ssh_audit_log (Admin UI or SQL) |
| Was it actually denied at the SSH layer? | Audit row outcome = denied |
| What was the exit code? | Audit detail.exit_code (for ssh.exec) |
| Did it crash mid-call? | Audit outcome = aborted (recovery sweep) |
| Why was the host key flagged? | Audit ssh.connection.host_key.* rows |
| Who has access to a connection? | ssh_connection_grants filtered by connection_id |
Security Model Summary
Detailed threat model + risk register: see plan doc §"Security Design Deep-Dive (rev 3)" and §"Risk Register (rev 3)".
Key points operators must understand:
- The orchestrator is a credential proxy. Anyone with admin rights can read connection plaintext (via the rotation flow, which decrypts server-side). Treat admin access as production-credential-equivalent.
- TOFU is the floor, not the ceiling. First-observe is unauthenticated. For high-stakes targets, pre-populate host_key_b64 from a trusted bootstrap (e.g. baked into the connection at create time via the host_key_b64 field) rather than relying on the orchestrator's first observation.
- The deny-list is not a sandbox. Built-in patterns catch obvious misuse. Real isolation requires connection-level configuration (restricted shell account, remote_path_prefix, narrow allow_patterns) and target-side controls.
- Audit log is local-only. No HMAC chain (acknowledged limitation R-audit-tamper). For tamper-evidence, ship ssh_audit_log rows to an external SIEM via SQLite hooks or periodic export.
- ssh2 internal key retention (R-ssh2-leak): the PEM lives in JS heap for the connection lifetime. Process compromise reveals plaintext credentials. Mitigations: short-lived processes, separate worker per high-stakes connection.
- Master key compromise = total compromise. Key rotation does not invalidate already-leaked encrypted material: if an attacker has both the DB and the old master key, all stored creds are theirs. Rotate keys immediately on suspected compromise AND rotate every stored credential on the target side.
HTTP API Reference
User router: mounted at /api/ssh — requires requireAuth.
| Method | Path | Purpose |
|---|---|---|
| GET | /connections | List own connections + grant-visible globals |
| POST | /connections | Create user-owned connection |
| GET | /connections/:id | Read |
| PATCH | /connections/:id | Edit (owner only) |
| DELETE | /connections/:id | Delete (owner only) |
| POST | /connections/:id/test | Trigger TOFU observation / verify path |
| POST | /connections/:id/verify-host-key | Atomic verify (token + fingerprint) |
| POST | /connections/:id/replace-host-key | Atomic replace (token + fingerprint + reason) |
| GET | /connections/:id/audit | Owner's view of audit rows for this connection |
| GET | /grants/visible-to-me | List grants visible (subject=user or matching org) |
Admin router: mounted at /api/ssh/admin — requires requireAdmin.
| Method | Path | Purpose |
|---|---|---|
| GET | /connections | All connections (cross-tenant) |
| GET | /connections/:id | Admin read |
| PATCH | /connections/:id/disable | Soft-disable (audited; reason required) |
| PATCH | /connections/:id/enable | Re-enable |
| DELETE | /connections/:id | Hard-delete |
| POST | /connections/:id/force-unlock | Clear abuse counter (rate-limited; reason required) |
| POST | /globals | Create global connection |
| PATCH | /globals/:id | Edit global |
| DELETE | /globals/:id | Delete global |
| GET | /grants | List all grants |
| POST | /grants | Create grant |
| DELETE | /grants/:id | Delete grant |
| POST | /rotate-master-key | Start master key rotation |
| GET | /rotate-master-key/:jobId | Poll rotation status |
| GET | /audit | All-tenant audit view (paginated) |
All admin write endpoints require:
- requireAdmin middleware
- maintenance503() guard (rejects writes during rotation)
- validateReason() on body.reason (≥ 8 chars)
- auditRepo.beginAndComplete() for both success and failure
SSH Console (Interactive)
Enabled with ssh.console.enabled: true.
- One task = one PTY session. Shell state is preserved across jobs
- An SSH tab appears in the task detail view (when the piece allows SshConsole*)
- WebSocket: /api/local/tasks/:taskId/console/ws
- REST status: GET /api/local/tasks/:taskId/console/status
- Audit: ssh.console.{open,send,snapshot,resize,input_rejected,close}
- Auto-close: idle 30 min / duration 4 h / host disconnect / maintenance / admin kill
- At most 3 sessions per connection (oldest evicted first)
Admin: GET /api/admin/ssh/console-sessions lists sessions; POST /api/admin/ssh/console-sessions/:taskId/kill kills one (admin role only).
See Also
- docs/tools/ssh-tools.md: LLM-facing tool semantics
- docs/tools/ssh-console-tools.md: SSH Console tool semantics (Ensure/Send/Snapshot)
- docs/mcp.md: MCP integration (shares MCP_ENCRYPTION_KEY)
- docs/maintenance-checklist.md §12: checklist for SSH-related code changes