Open-source release of MAESTRO, an agent orchestration platform that runs LLM-driven tasks through sandboxed tools, with a web UI. Apache-2.0. See README.md and docs/ (getting-started, configuration, architecture).
# SSH Subsystem (Operator Runbook)

The orchestrator can run shell commands on remote servers (`SshExec`) and
move files between the workspace and remote hosts (`SshUpload`/`SshDownload`)
through a dedicated, audited SSH subsystem. Like the MCP integration, **the
feature is off by default** and requires a key + config flip to enable.

This document is the **operator runbook** for setting up, granting access,
verifying host keys, rotating the master key, and troubleshooting. For the
LLM-facing tool semantics see [docs/tools/ssh-tools.md](./tools/ssh-tools.md).
For the internal design (threat model, risk register, schema, 12-step
orchestration flow) see the plan doc.

## At a glance

| Aspect | Behavior |
|---|---|
| Default | `ssh.enabled: false`: tools hidden, panels hidden, API returns 503 |
| Tools exposed when enabled | `SshExec`, `SshUpload`, `SshDownload` |
| Authentication | Public-key only; passwords are **not** supported |
| Host key trust | TOFU (Trust-On-First-Use) with explicit verify; mismatch fails closed |
| Connection ownership | User-owned (private) **or** Global (admin-managed, shared via grants) |
| Encryption at rest | AES-256-GCM (per-row DEK, master key = `MCP_ENCRYPTION_KEY`) |
| Audit | Dedicated `ssh_audit_log` table, `pending → success/failed/denied/aborted` lifecycle |
| Abuse defense | 3-scope counters (`user` / `host:user` / `host`) with auto-lock |
| Network policy | SSRF strict by default; per-connection opt-in for private IPs |
| Algorithm policy | Strict allowlist (no SHA1-RSA, no weak DH/HMAC) |

## Prerequisites

### 1. `MCP_ENCRYPTION_KEY`

The SSH subsystem **shares the same master key as MCP**: there is only one
key per orchestrator. All private keys, passphrases, and global-connection
DEKs are encrypted with AES-256-GCM under a per-row DEK, and each DEK is
wrapped by this master key.

Generate it once (32 bytes = 64 hex chars):

```bash
openssl rand -hex 32
```

Export it before starting the server:

```bash
export MCP_ENCRYPTION_KEY=<the 64-hex output>
scripts/server.sh start
```

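The "32 bytes = 64 hex chars" requirement can be checked before the value is ever used as a key. The following is a minimal sketch, not the orchestrator's actual boot code; `parseMasterKey` is a hypothetical helper name introduced here for illustration.

```typescript
// Hypothetical helper (not from the MAESTRO source): checks that a candidate
// MCP_ENCRYPTION_KEY value is exactly 64 hex characters and decodes it to 32 bytes.
function parseMasterKey(hex: string | undefined): Uint8Array | null {
  if (hex === undefined) return null;
  const trimmed = hex.trim();
  // 32 bytes encoded as hex is exactly 64 characters from [0-9a-fA-F].
  if (!/^[0-9a-fA-F]{64}$/.test(trimmed)) return null;
  const bytes = new Uint8Array(32);
  for (let i = 0; i < 32; i++) {
    bytes[i] = parseInt(trimmed.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}
```

A caller would typically pass `process.env.MCP_ENCRYPTION_KEY` and treat a `null` result the same way as a missing key.
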
If `MCP_ENCRYPTION_KEY` is **not** set when `ssh.enabled: true`, the SSH
subsystem boots **fail-soft**: a warning is logged, all SSH endpoints
return 503, the tools are hidden from the LLM, and the UI panels show a
configuration error banner. Other features (MCP excepted) continue
normally.

> ⚠ **Changing the key outside the rotation flow invalidates existing
> encrypted material.** There is a built-in
> [master key rotation flow](#master-key-rotation) that rewraps
> every row in maintenance mode. Do **not** swap the env var manually
> without using that flow; half-rotated state breaks every connection.

### 2. `ssh.enabled: true`

Flip the flag in `config.yaml`:

```yaml
ssh:
  enabled: true
```

This is the master switch. With it `false`:
- HTTP endpoints (`/api/ssh/*`, `/api/ssh/admin/*`) return 503
- Tool defs are not exposed to the LLM (the dispatcher returns null)
- UI panels render an "SSH is disabled" empty state
- Database tables remain present (no destructive change)

Restart is **not** required: `ConfigManager` reload picks up the change
and rebuilds the SSH router.

### 3. `system_deks` bootstrap

The first time the orchestrator boots with `ssh.enabled: true` AND a
valid `MCP_ENCRYPTION_KEY`, it provisions a single row in `system_deks`
(via `INSERT OR IGNORE` inside a transaction, `CHECK(id=1)`). This DEK
encrypts **global connections** (those without an owner).

On every subsequent boot, `verifySystemDek` decrypts the stored DEK to
prove the master key still works. If it fails (key rotated outside the
rotation flow, or env var differs from when the DEK was wrapped), SSH
**fails closed for the session** and a `system_dek_verify_failed` error
is logged. User-owned connections may still partially work (their DEKs
are wrapped per user), but global connections will all error.

### 4. Optional: `allow_private_addresses`

By default, SSH connections are routed through the SSRF strict-check,
which blocks loopback (127.0.0.0/8, ::1), private (10/8, 172.16/12,
192.168/16, fc00::/7), and link-local (169.254/16) addresses. For
**LAN targets** you must opt in.

There are two scopes:

```yaml
ssh:
  enabled: true
  allow_private_addresses: true   # global default
```

```sql
-- per-connection opt-in (admin-only flag, audited)
UPDATE ssh_connections SET allow_private_addresses=1 WHERE id=?;
```

The per-connection flag is preferred because it narrows the blast radius.
The global flag exists for trusted dev networks (homelab, isolated VPC).
The per-connection flag can only be set on **global** (admin-managed)
connections; for user-owned connections, the global flag applies.

## Quickstart

```bash
# 1. Set the key (umask keeps the key file readable by the owner only)
(umask 077; openssl rand -hex 32 > ~/.mcp_encryption_key)
export MCP_ENCRYPTION_KEY="$(cat ~/.mcp_encryption_key)"

# 2. Enable SSH + allow LAN
cat >> config.yaml <<'YAML'
ssh:
  enabled: true
  allow_private_addresses: true   # only if you're targeting LAN
YAML

# 3. Restart
scripts/server.sh restart
```

Then in the UI:

1. **Settings → User Folder → SSH Connections → Add**
2. Fill `label`, `host`, `port` (default 22), `username`, paste private key (OpenSSH PEM)
3. Optionally set `remote_path_prefix` (default `/`), which restricts upload/download paths
4. Click **Test** → first call returns `host_key_first_observe` with a fingerprint
5. **Verify** in the dialog (compare the fingerprint with the output of `ssh-keyscan <host> | ssh-keygen -lf -`)
6. Add the connection's UUID to a piece's `allowed_ssh_connections`:

   ```yaml
   # pieces/example.yaml
   name: ssh-example
   movements:
     - name: deploy
       allowed_tools: [SshExec, SshUpload]
       allowed_ssh_connections: ["abcd1234-..."]
       rules:
         - condition: done
           next: COMPLETE
       instruction: |
         Use SshExec to ...
   ```

7. Test the piece via the normal task UI.

## `config.yaml` Reference

Full SSH section with defaults:

```yaml
ssh:
  # master switch
  enabled: false

  # SSRF policy: when true, allow private/loopback addresses (global)
  allow_private_addresses: false

  # wall-clock timeout for connect + handshake + exec/transfer (seconds)
  call_timeout_seconds: 30

  # stdout/stderr byte cap for SshExec (bytes)
  max_output_bytes: 32768        # 32 KiB

  # SFTP transfer size caps (MB)
  max_upload_size_mb: 100
  max_download_size_mb: 100

  # ssh_audit_log retention (days). Admin can prune via UI.
  audit_retention_days: 90

  # When true (default), admins can use any connection without an explicit
  # grant. Audited regardless. Set false for stricter least-privilege.
  admin_bypasses_grants: true

  # Abuse counters
  abuse_window_minutes: 10       # rolling window for failure counting
  abuse_failure_threshold: 5     # failures within window → lock
  abuse_lock_minutes: 30         # lock duration on threshold breach
```

All keys translate to camelCase in `SshRuntimeConfig` (`src/ssh/config.ts`).
The `transformKeys` helper in `src/config.ts` handles the conversion.

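To make the snake_case to camelCase mapping concrete, here is a small sketch of how such a conversion could work. It is an illustration only: the real `transformKeys` in `src/config.ts` may differ, and `transformKeysSketch` and `snakeToCamel` are hypothetical names.

```typescript
// Illustrative only: a recursive snake_case -> camelCase key conversion.
// The actual transformKeys helper may handle more cases (arrays, nulls, etc.).
type Plain = { [key: string]: unknown };

function snakeToCamel(key: string): string {
  // "call_timeout_seconds" -> "callTimeoutSeconds"
  return key.replace(/_([a-z0-9])/g, (_match: string, c: string) => c.toUpperCase());
}

function transformKeysSketch(input: Plain): Plain {
  const out: Plain = {};
  for (const key of Object.keys(input)) {
    const value = input[key];
    const isPlainObject =
      typeof value === "object" && value !== null && !Array.isArray(value);
    out[snakeToCamel(key)] = isPlainObject
      ? transformKeysSketch(value as Plain)
      : value;
  }
  return out;
}
```

Under this sketch, `abuse_failure_threshold` in `config.yaml` would surface as `abuseFailureThreshold`, which matches the names used in the abuse-counter pseudocode later in this document.
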
## Connection Model

### Owner

Each row in `ssh_connections` has an `owner_id`:

| Owner | Visibility | Who creates |
|---|---|---|
| **User-owned** (`owner_id = userId`) | Only the owner; admin can also list but not edit | Any authenticated user (`POST /api/ssh/connections`) |
| **Global** (`owner_id IS NULL`) | All users see it in the picker (subject to grants) | Admin only (`POST /api/ssh/admin/globals`) |

Global connections solve the "team-shared infra account" use case: a
single set of credentials that multiple users invoke under their own
identity, audited per user, gated by grants.

### Encryption

For each connection:

1. Generate a fresh 32-byte DEK
2. Encrypt `private_key_pem` (and optionally `passphrase`) with the DEK
3. Wrap the DEK with the master key (`MCP_ENCRYPTION_KEY`)
4. Store: `private_key_enc`, `private_key_dek_enc`, `key_version`

`key_version` allows progressive rewrap during master key rotation (each
row tracks which generation of master key its DEK is wrapped under).
Global connections use the single `system_deks` row (id=1) rather than a
per-row DEK.
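
The four steps above follow the usual envelope-encryption pattern. The sketch below shows that pattern with Node's built-in `crypto` module and AES-256-GCM. It is an assumption-laden illustration, not the subsystem's code: the `Sealed` layout, the function names, and the fixed `keyVersion: 1` are all hypothetical.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical storage shape: IV, auth tag, and ciphertext kept together.
interface Sealed { iv: Buffer; tag: Buffer; data: Buffer }
interface EncryptedRow { privateKeyEnc: Sealed; privateKeyDekEnc: Sealed; keyVersion: number }

function seal(key: Buffer, plaintext: Buffer): Sealed {
  const iv = randomBytes(12); // 96-bit nonce, the usual size for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), data };
}

function unseal(key: Buffer, sealed: Sealed): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, sealed.iv);
  decipher.setAuthTag(sealed.tag);
  // final() throws if the key is wrong or the ciphertext was tampered with.
  return Buffer.concat([decipher.update(sealed.data), decipher.final()]);
}

function encryptConnectionKey(masterKey: Buffer, privateKeyPem: string): EncryptedRow {
  const dek = randomBytes(32);                                          // step 1
  const privateKeyEnc = seal(dek, Buffer.from(privateKeyPem, "utf8"));  // step 2
  const privateKeyDekEnc = seal(masterKey, dek);                        // step 3
  dek.fill(0); // best-effort wipe of our copy of the DEK
  return { privateKeyEnc, privateKeyDekEnc, keyVersion: 1 };            // step 4
}

function decryptConnectionKey(masterKey: Buffer, row: EncryptedRow): string {
  const dek = unseal(masterKey, row.privateKeyDekEnc);
  try {
    return unseal(dek, row.privateKeyEnc).toString("utf8");
  } finally {
    dek.fill(0);
  }
}
```

With this layout, rotating the master key only needs to unseal and reseal `privateKeyDekEnc`; the encrypted private key itself is untouched, which is what makes a per-row rewrap cheap.
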

> ⚠ ssh2's internal key handling is opaque: once the PEM is loaded into
> the library, it lives in JS heap memory for the lifetime of the
> connection. We zero our own copies (`buf.fill(0)`) in the `finally` block
> but cannot reach into ssh2 internals. This is an acknowledged
> limitation; see the plan doc's "Acknowledged limitations" section.

## UI Walkthrough

### User: Settings → User Folder → SSH Connections

The **SshConnectionsPanel** lists the user's connections and any global
connections they have a grant for. Each row shows:

- Label + host:port + username
- Host key fingerprint + verify state (verified / pending / first_observe / mismatch)
- Lock state (if abuse counter triggered)
- Actions: **Test**, **Verify host key**, **Replace host key** (with reason), **Edit**, **Delete**

The "Add Connection" form (`SshConnectionForm`) collects:

- Label
- Host, port, username
- Private key (textarea, PEM format; passphrase optional)
- Remote path prefix (default `/`)
- Custom deny/allow regex patterns (newline-separated, validated at save-time)

`SshHostKeyDialog` opens on first_observe / mismatch and shows the
observed fingerprint side-by-side with the previously-stored one (if
any). "Trust this key" requires typing the fingerprint to confirm.

### Admin: Settings → SSH

Four sub-panels under `SshForm`:

| Panel | Component | Purpose |
|---|---|---|
| **Global Connections** | `SshGlobalConnectionsForm` | CRUD on global connections (`owner_id IS NULL`). Includes the `allow_remote_unrestricted` and per-connection `allow_private_addresses` flags |
| **Grants** | `SshGrantsForm` | List/create/delete grants. Per-piece or `applies_to_all_pieces`. Subject: user or org. Reason required |
| **Audit Log** | `SshAuditLog` | All-tenant audit view. Filter by action / outcome / connection / time range. Pagination |
| **Master Key Rotation** | `SshMasterKeyRotationForm` | Start a rotation job (provides new key, enters maintenance, rewraps rows). Polls status |

Admin can also force-unlock abuse counters from the per-connection page
(requires reason; rate-limited to 10/hour total).

## Host Key TOFU Flow

SSH security depends on knowing the **right** host key. We use
Trust-On-First-Use: the first time a connection is exercised, we record the
observed key and require explicit user verification before treating
it as trusted.

### States

`ssh_connections` carries four host-key columns:

| Column | Meaning |
|---|---|
| `host_key_b64` | The observed public key in OpenSSH base64 form. NULL = never observed. |
| `host_key_fingerprint` | SHA-256 fingerprint for UI display (`SHA256:...`). |
| `host_key_verified_at` | ISO8601 timestamp of the user's explicit "trust this key" action. NULL = pending. |
| `host_key_pending_token` | UUID issued at first_observe / mismatch; consumed atomically by `/verify-host-key`. |

A connection is **trusted** iff `host_key_verified_at IS NOT NULL` AND
the observed key during connect matches `host_key_b64`.

### Lifecycle

```
new connection
  │
  ├─ user clicks Test (or LLM calls SshExec)
  │     │
  │     ▼
  │   sshTest() observes the host key
  │     │
  │     ▼
  │   onFirstObserve hook fires
  │     - writes host_key_b64, host_key_fingerprint, host_key_pending_token
  │     - audit row: ssh.connection.host_key.first_observe
  │     - returns SshSessionError('host_key_first_observe')
  │
  │   (UI shows the fingerprint + pending token)
  │
  ├─ user clicks Verify (typing fingerprint to confirm)
  │     │
  │     ▼
  │   POST /api/ssh/connections/:id/verify-host-key
  │     {token, fingerprint}
  │     - atomic compare-and-set: token + fingerprint match → set host_key_verified_at
  │     - audit row: ssh.connection.host_key.verify
  │
  ▼
verified: Exec/Upload/Download now work
```

On `host_key_mismatch` (server rebuilt, key rotated, or MITM):

```
  ├─ Exec/Upload/Download calls sshExec/sshUpload/sshDownload
  │     │
  │     ▼
  │   ssh2 observes key ≠ host_key_b64
  │     │
  │     ▼
  │   onMismatch hook fires
  │     - writes new host_key_pending_token (DOES NOT overwrite host_key_b64 yet)
  │     - audit row: ssh.connection.host_key.mismatch
  │     - returns SshSessionError('host_key_mismatch')
  │
  │   (UI shows OLD vs NEW fingerprint side-by-side)
  │
  ├─ user investigates externally (ssh-keyscan, IT team, etc.)
  │
  ├─ user clicks "Replace key" with reason
  │     │
  │     ▼
  │   POST /api/ssh/connections/:id/replace-host-key
  │     {token, fingerprint, reason}
  │     - atomic compare-and-set
  │     - writes new host_key_b64, host_key_fingerprint, host_key_verified_at
  │     - audit row: ssh.connection.host_key.replace
```

The pending token mechanism prevents a "verify swap" race: if a second
TOFU observation happens between the user's verify request and its
arrival, the old token is overwritten and the verify endpoint returns
`409 stale_token`.

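The compare-and-set behind `/verify-host-key` can be sketched in memory as follows. This is a simplified model, not the endpoint's implementation: in the real system the check and the write would presumably be a single conditional `UPDATE`, and the `fingerprint_mismatch` result name is a hypothetical placeholder.

```typescript
// In-memory model of the host-key columns relevant to verification.
interface HostKeyState {
  fingerprint: string | null;
  pendingToken: string | null;
  verifiedAt: string | null;
}

type VerifyResult = "verified" | "stale_token" | "fingerprint_mismatch";

function verifyHostKey(
  row: HostKeyState,
  token: string,
  fingerprint: string,
  now: Date,
): VerifyResult {
  // A newer observation replaces the token, so an old token no longer matches.
  if (row.pendingToken === null || row.pendingToken !== token) return "stale_token";
  // The user must confirm exactly the fingerprint that was recorded.
  if (row.fingerprint !== fingerprint) return "fingerprint_mismatch";
  row.verifiedAt = now.toISOString();
  row.pendingToken = null; // consume the token so it cannot be replayed
  return "verified";
}
```

Because the token is consumed on success and overwritten on every new observation, a verify request can only ever confirm the most recently observed key.
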
### Banned algorithms

Even before TOFU completes, the host-key algorithm is checked against
an allowlist. SHA1-RSA and other weak algorithms are rejected before
the key is recorded (`host_key_alg_not_allowed`). This is hard-coded
in `src/ssh/session.ts` to avoid misconfiguration.

## Per-piece `allowed_ssh_connections`

A piece's movement must explicitly opt in to SSH usage. The
piece-runner enforces three invariants:

1. If a movement's `allowed_tools` contains any SSH tool name
   (`SshExec`/`SshUpload`/`SshDownload`), `allowed_ssh_connections`
   **must be declared** on that movement (even if empty)
2. The field must be an array of strings
3. Each entry must be `*` or a lowercase hex+hyphen UUID (≥ 8 chars)

Lint failures abort piece load.

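The three invariants can be expressed as a small lint function. This is a sketch under assumptions: `lintMovement`, `MovementLike`, and the error strings are hypothetical, and the entry regex is one reading of "lowercase hex+hyphen UUID (≥ 8 chars)".

```typescript
const SSH_TOOLS = ["SshExec", "SshUpload", "SshDownload"];
// Assumed interpretation of invariant 3: lowercase hex digits and hyphens, 8+ chars.
const ENTRY_RE = /^[0-9a-f-]{8,}$/;

interface MovementLike {
  name: string;
  allowed_tools?: string[];
  allowed_ssh_connections?: unknown;
}

function lintMovement(m: MovementLike): string[] {
  const usesSsh = (m.allowed_tools ?? []).some((t) => SSH_TOOLS.indexOf(t) !== -1);
  if (!usesSsh) return [];

  const conns = m.allowed_ssh_connections;
  // Invariant 1: the field must be declared, even if it is an empty array.
  if (conns === undefined) {
    return [`${m.name}: allowed_ssh_connections must be declared`];
  }
  // Invariant 2: array of strings.
  if (!Array.isArray(conns) || !conns.every((c) => typeof c === "string")) {
    return [`${m.name}: allowed_ssh_connections must be an array of strings`];
  }
  // Invariant 3: each entry is "*" or looks like a lowercase UUID.
  const errors: string[] = [];
  for (const c of conns as string[]) {
    if (c !== "*" && !ENTRY_RE.test(c)) {
      errors.push(`${m.name}: invalid connection entry "${c}"`);
    }
  }
  return errors;
}
```

Note that the placeholder values used in this document's examples (such as `"abcd1234-..."`) would fail invariant 3 under this reading; real UUIDs are expected.
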
### Forms

```yaml
# Explicit allowlist (most common)
allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]

# Wildcard (admin-style; use sparingly)
allowed_ssh_connections: ["*"]

# Deny-all (still allows SSH tool in allowed_tools but refuses every UUID)
allowed_ssh_connections: []
```

The `*` form skips the per-piece check but **does not** skip the
[access grant check](#access-grants). A user without a grant for a
given connection still cannot use it even when the piece says `*`.

### Example

```yaml
name: backup-rotation
description: Daily backup rotation on prod servers
movements:
  - name: list
    allowed_tools: [SshExec]
    allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]
    instruction: |
      List the existing backup files on each server.
    rules:
      - condition: ready to rotate
        next: rotate

  - name: rotate
    allowed_tools: [SshExec, SshUpload]
    allowed_ssh_connections: ["abcd1234-...", "ef567890-..."]
    instruction: |
      Rotate the oldest backup ...
    rules:
      - condition: done
        next: COMPLETE
```

## Access Grants

Grants connect a **subject** (user or org) to a **connection**, scoped
to a **piece** (or all pieces, admin-only).

### Schema

```sql
CREATE TABLE ssh_connection_grants (
  id                    TEXT PRIMARY KEY,
  connection_id         TEXT NOT NULL,
  subject_type          TEXT NOT NULL,      -- 'user' | 'org'
  subject_id            TEXT NOT NULL,
  piece_name            TEXT,               -- NULL iff applies_to_all_pieces=1
  applies_to_all_pieces INTEGER NOT NULL DEFAULT 0,
  granted_by_user_id    TEXT NOT NULL,
  reason                TEXT NOT NULL,      -- required, ≥ 8 chars
  expires_at            TEXT,               -- ISO8601 or NULL
  created_at            TEXT NOT NULL
);
```

### Decision tree

For a given `(userId, orgIds, connectionId, pieceName)`:

1. **Owner check**: if `connection.owner_id == userId` → access granted (owner of own connection)
2. **Admin bypass**: if user is admin AND `ssh.admin_bypasses_grants: true` → granted (audited)
3. **Grant lookup**:
   - find rows where `connection_id = ?`
   - subject matches (`subject_type='user' AND subject_id=userId`) OR (`subject_type='org' AND subject_id IN orgIds`)
   - piece matches (`applies_to_all_pieces=1` OR `piece_name = ?`)
   - not expired (`expires_at IS NULL OR expires_at > now()`)
   - **any** matching row → granted
4. Otherwise → denied (`access denied (no_grant)`)

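The decision tree above can be written as one pure function. The following sketch mirrors the four steps over in-memory data; the interfaces, the `decideAccess` name, and the returned labels other than `no_grant` are illustrative assumptions rather than the orchestrator's types.

```typescript
interface Grant {
  connectionId: string;
  subjectType: "user" | "org";
  subjectId: string;
  pieceName: string | null;
  appliesToAllPieces: boolean;
  expiresAt: string | null; // ISO8601 or null
}

interface AccessRequest {
  userId: string;
  orgIds: string[];
  isAdmin: boolean;
  connectionId: string;
  connectionOwnerId: string | null; // null = global connection
  pieceName: string;
}

type Decision = "owner" | "admin_bypass" | "grant" | "no_grant";

function decideAccess(
  req: AccessRequest,
  grants: Grant[],
  adminBypassesGrants: boolean,
  now: Date,
): Decision {
  // 1. Owner check
  if (req.connectionOwnerId !== null && req.connectionOwnerId === req.userId) return "owner";
  // 2. Admin bypass (still audited in the real system)
  if (req.isAdmin && adminBypassesGrants) return "admin_bypass";
  // 3. Grant lookup: any matching, unexpired row is enough
  const hit = grants.some((g) => {
    if (g.connectionId !== req.connectionId) return false;
    const subjectOk =
      g.subjectType === "user"
        ? g.subjectId === req.userId
        : req.orgIds.indexOf(g.subjectId) !== -1;
    const pieceOk = g.appliesToAllPieces || g.pieceName === req.pieceName;
    const notExpired = g.expiresAt === null || Date.parse(g.expiresAt) > now.getTime();
    return subjectOk && pieceOk && notExpired;
  });
  // 4. Otherwise denied
  return hit ? "grant" : "no_grant";
}
```

Keeping the function pure (the current time is passed in) makes the expiry rule easy to unit-test without touching the database.
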
### Creating grants

Admin-only via `POST /api/ssh/admin/grants`:

```json
{
  "connection_id": "abcd1234-...",
  "subject_type": "user",
  "subject_id": "alice",
  "piece_name": "backup-rotation",
  "applies_to_all_pieces": false,
  "reason": "Alice owns backups for prod-east cluster",
  "expires_at": null
}
```

For `applies_to_all_pieces: true`:
- `piece_name` **must be null**
- the admin endpoint requires explicit `reason` containing scope justification
- audit row records `action: ssh.grant.create` with `detail.applies_to_all=true`
- this is the highest-privilege grant, so review it carefully

### Org grants

Same schema with `subject_type: "org"`, `subject_id: <gitea org name>`.
Membership comes from `user_gitea_orgs` (populated at login via Gitea
OAuth). A user with multiple org memberships matches grants for any
of those orgs.

### Expiration

`expires_at` is checked at decision time (no background sweep). Expired
rows remain in the table for audit purposes. Admin can delete them via
`DELETE /api/ssh/admin/grants/:id`.

## Path Policy

### Local path (workspace)

For `SshUpload.local_path` and `SshDownload.local_path`:

- Resolved against `ctx.workspacePath` (the job's workspace root)
- `..` traversal → reject
- Symlinks: open with `O_NOFOLLOW`, lstat every parent → reject if any parent is a symlink leaving the workspace
- For download: parent directory must exist; target file must NOT exist (`O_CREAT | O_EXCL`)

### Remote path

For `remote_path` on upload/download:

- Must be **absolute** (starts with `/`)
- After POSIX normalization (`path.posix.normalize`), must start with the connection's `remote_path_prefix`
- `..` segments are collapsed by normalize; the post-normalize check catches escape attempts
- No glob expansion: exact path only

Example: connection has `remote_path_prefix = '/srv/agent'`

| Input | Normalized | Result |
|---|---|---|
| `/srv/agent/file.txt` | `/srv/agent/file.txt` | ✅ |
| `/srv/agent/sub/file.txt` | `/srv/agent/sub/file.txt` | ✅ |
| `/srv/agent/../etc/passwd` | `/srv/etc/passwd` | ❌ outside prefix |
| `/srv/agentish/file` | `/srv/agentish/file` | ❌ prefix mismatch (not `/srv/agent/...`) |
| `file.txt` (relative) | n/a | ❌ not absolute |

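The `/srv/agentish/file` row shows why a plain string-prefix comparison is not enough: the prefix has to match on a path-segment boundary. A minimal sketch of such a check follows, using Node's `path.posix.normalize`; `checkRemotePath` and its result shape are hypothetical and only illustrate the rule described above.

```typescript
import { posix } from "node:path";

type PathCheck =
  | { ok: true; normalized: string }
  | { ok: false; reason: string };

function checkRemotePath(remotePath: string, remotePathPrefix: string): PathCheck {
  if (!remotePath.startsWith("/")) return { ok: false, reason: "not absolute" };

  // Collapse "." and ".." segments before comparing against the prefix.
  const normalized = posix.normalize(remotePath);
  const base = posix.normalize(remotePathPrefix);
  // Compare against "<prefix>/" so "/srv/agentish" does not match "/srv/agent".
  const baseDir = base.endsWith("/") ? base : base + "/";

  if (normalized === base || normalized.startsWith(baseDir)) {
    return { ok: true, normalized };
  }
  return { ok: false, reason: "outside prefix" };
}
```

With the default prefix `/`, `baseDir` is `/` and every absolute path passes, which matches the "default `/`" behavior described for `remote_path_prefix`.
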
## Command Filtering

`SshExec.command` runs through a two-stage filter.

### Stage 1: built-in deny-list

Hard-coded patterns in `src/ssh/deny-list.ts`. Examples (not exhaustive):

- `rm -rf /` and variants
- fork bombs (`:(){:|:&};:`)
- `mkfs.*`, `dd if=/dev/zero ...`
- shutdown / reboot / poweroff
- `:>/dev/sda` style block-device writes

If matched, the call is rejected with `command rejected by built-in
deny-list (matched pattern: ...).` and audited as `outcome=denied`.

The built-in list is **not** a comprehensive sandbox; it is a tripwire
against the most catastrophic typos and worst-case prompt injection
payloads. Production deployments should also configure connection-level
patterns.

### Stage 2: per-connection regex (optional)

Each connection can carry:

- `deny_patterns`: newline-separated regex list. Match → reject.
- `allow_patterns`: newline-separated regex list. If set, every command
  must match at least one allow pattern (after passing both deny stages).

Both are validated at save-time by `validateCustomPatterns`:

- Each pattern must compile
- Each must pass the `safe-regex` ReDoS check
- Aggregate length capped (no megabyte-blobs of regex)

Example:

```
# deny_patterns
sudo
^\s*rm\s+
nc\s+-l

# allow_patterns
^(ls|cat|grep|tail|head|systemctl|journalctl)\s
^/srv/agent/scripts/
```

ReDoS-safe regex is enforced because user-supplied patterns run
synchronously on the command string before each call.

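The order of evaluation (built-in deny, then connection deny, then connection allow) can be sketched as below. Everything here is illustrative: the single built-in pattern is a stand-in for the real list in `src/ssh/deny-list.ts`, and the ReDoS and length checks performed by `validateCustomPatterns` are deliberately left out.

```typescript
interface CommandPolicy {
  builtinDeny: RegExp[];   // stage 1 (hard-coded in the real system)
  denyPatterns: string;    // stage 2, newline-separated
  allowPatterns: string;   // stage 2, newline-separated; empty = no allowlist
}

interface FilterResult { allowed: boolean; reason?: string }

function compileLines(block: string): RegExp[] {
  return block
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => new RegExp(line));
}

function filterCommand(command: string, policy: CommandPolicy): FilterResult {
  for (const re of policy.builtinDeny) {
    if (re.test(command)) return { allowed: false, reason: "built-in deny-list" };
  }
  for (const re of compileLines(policy.denyPatterns)) {
    if (re.test(command)) return { allowed: false, reason: "connection deny pattern" };
  }
  const allow = compileLines(policy.allowPatterns);
  if (allow.length > 0 && !allow.some((re) => re.test(command))) {
    return { allowed: false, reason: "no allow pattern matched" };
  }
  return { allowed: true };
}

// Hypothetical policy for demonstration; the built-in pattern is an example only.
const examplePolicy: CommandPolicy = {
  builtinDeny: [/\brm\s+-rf\s+\/(\s|$)/],
  denyPatterns: "sudo\n^\\s*rm\\s+",
  allowPatterns: "^(ls|cat|grep)\\s",
};
```

In a real deployment the compiled patterns would be cached per connection rather than recompiled on every call; the sketch recompiles for brevity.
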
## SSRF + Algorithms

### SSRF (host resolution)

Every connection target goes through `ssrfStrict(host, allowPrivate)`:

1. DNS resolve host → list of A/AAAA records
2. For each address, check against the IP-policy:
   - Reject 0.0.0.0, :: (unspecified)
   - Reject 127.0.0.0/8, ::1 (loopback)
   - Reject 10/8, 172.16/12, 192.168/16, fc00::/7 (private)
   - Reject 169.254/16 (link-local, including the AWS metadata endpoint)
3. **DNS pinning**: the resolved address is captured before connect;
   ssh2 connects to the pinned IP, not to the hostname. This prevents
   DNS rebinding (round 1: public IP passes check; round 2: returns
   loopback during connect).

`allowPrivate` short-circuits step 2. Two opt-in flags compose:

- Global: `ssh.allow_private_addresses: true` in config.yaml
- Per-connection: `allow_private_addresses=1` on the row (admin sets via
  `/api/ssh/admin/globals` or `/api/ssh/admin/connections/:id`)

Either being true allows private/loopback. Both default false.

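Step 2 of the SSRF check can be illustrated for IPv4 with a small classifier. This sketch covers only the IPv4 ranges listed above (IPv6 is omitted for brevity) and does not perform DNS resolution or pinning; `classifyIpv4` and `isAddressAllowed` are hypothetical names, not the `ssrfStrict` implementation.

```typescript
type Ipv4Class = "invalid" | "unspecified" | "loopback" | "private" | "link_local" | "public";

function parseIpv4(addr: string): number[] | null {
  const parts = addr.split(".");
  if (parts.length !== 4) return null;
  const octets: number[] = [];
  for (const p of parts) {
    if (!/^\d{1,3}$/.test(p)) return null;
    const n = Number(p);
    if (n > 255) return null;
    octets.push(n);
  }
  return octets;
}

function classifyIpv4(addr: string): Ipv4Class {
  const o = parseIpv4(addr);
  if (o === null) return "invalid";
  const a = o[0] ?? 0;
  const b = o[1] ?? 0;
  if (o.every((n) => n === 0)) return "unspecified";        // 0.0.0.0
  if (a === 127) return "loopback";                          // 127.0.0.0/8
  if (a === 10) return "private";                            // 10/8
  if (a === 172 && b >= 16 && b <= 31) return "private";     // 172.16/12
  if (a === 192 && b === 168) return "private";              // 192.168/16
  if (a === 169 && b === 254) return "link_local";           // 169.254/16
  return "public";
}

// allowPrivate mirrors the "either flag being true" rule: it skips the policy check.
function isAddressAllowed(addr: string, allowPrivate: boolean): boolean {
  const cls = classifyIpv4(addr);
  if (cls === "invalid") return false;
  return allowPrivate || cls === "public";
}
```

A caller would compute `allowPrivate` as the logical OR of the global config flag and the per-connection flag, then apply the check to every resolved address before pinning one for the connection.
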
### Algorithm allowlist

Hard-coded in `src/ssh/session.ts`:

| Category | Allowed |
|---|---|
| Key exchange | `curve25519-sha256`, `curve25519-sha256@libssh.org`, `ecdh-sha2-nistp256/384/521`, `diffie-hellman-group14/16/18-sha256/512` |
| Server host key | `ssh-ed25519`, `rsa-sha2-256`, `rsa-sha2-512`, `ecdsa-sha2-nistp256/384/521` |
| Cipher | `aes256-gcm@openssh.com`, `aes128-gcm@openssh.com`, `aes256-ctr`, `aes192-ctr`, `aes128-ctr` |
| HMAC | `hmac-sha2-512-etm@openssh.com`, `hmac-sha2-256-etm@openssh.com`, `hmac-sha2-512`, `hmac-sha2-256` |

Notably banned: `ssh-rsa` (SHA1), `ssh-dss`, all `arcfour*`, `hmac-md5*`,
`hmac-sha1*`. Mismatch returns `host_key_alg_not_allowed` or
`auth_failed` depending on which stage caught it.

## Audit Log

Single table: `ssh_audit_log`. Every SSH operation writes here.

### Lifecycle

```
begin (outcome=pending)                              [commits before remote call]
  ↓
remote call (ssh2 connect / exec / sftp)
  ↓
complete (outcome=success|failed|denied|aborted)     [updates same row]
```

If the orchestrator crashes between `begin` and `complete`, the row
stays `pending`. On next boot, the recovery sweep (`src/ssh/recovery.ts`)
updates pending rows older than 10 minutes to `aborted` with
`detail.recovered_at` set.

### Actions

| Action | Triggered by |
|---|---|
| `ssh.exec` | `SshExec` |
| `ssh.upload` | `SshUpload` |
| `ssh.download` | `SshDownload` |
| `ssh.connection.upsert` | User/admin connection create/edit |
| `ssh.connection.delete` | User/admin connection delete |
| `ssh.connection.host_key.first_observe` | TOFU first observation |
| `ssh.connection.host_key.mismatch` | TOFU mismatch |
| `ssh.connection.host_key.tofu_record` | Internal helper write |
| `ssh.connection.host_key.verify` | User `/verify-host-key` |
| `ssh.connection.host_key.replace` | User `/replace-host-key` |
| `ssh.connection.disable` | Admin disable |
| `ssh.connection.enable` | Admin enable |
| `ssh.abuse.unlock_manual` | Admin force-unlock |
| `ssh.grant.create` | Admin grant create |
| `ssh.grant.delete` | Admin grant delete |
| `ssh.master_key.rotate.start` | Admin rotation start |

### Detail column

JSON blob with action-specific fields:

- `ssh.exec`: `{command_hash: "abc...", exit_code: 0, stdout_bytes: 123, stderr_bytes: 0, truncated: false}`
- `ssh.upload`: `{local_path: "...", remote_path: "...", bytes: 4096}`
- `ssh.download`: same shape
- `ssh.connection.host_key.first_observe`: `{fingerprint: "SHA256:...", pending_token: "uuid"}`
- All denied: `{error: "no_grant" | "abuse_locked" | "disabled" | ...}`

The `ssh.exec` action does **not** record the command string, only its
SHA-256 hash (16-char hex prefix), to avoid leaking secrets / PII. If you
need to investigate a specific exec, correlate the hash with stdin
logs from the LLM activity log (workspace `logs/activity.log`).

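To correlate an audit row with a command found in the activity log, an operator can recompute the hash locally. The sketch below assumes the hash is taken over the UTF-8 bytes of the exact command string and truncated to the first 16 hex characters; that input encoding is an assumption, and `commandHash` is a hypothetical name.

```typescript
import { createHash } from "node:crypto";

// Assumption: SHA-256 over the UTF-8 command string, first 16 hex chars kept.
function commandHash(command: string): string {
  return createHash("sha256").update(command, "utf8").digest("hex").slice(0, 16);
}
```

If the recomputed value for a candidate command equals `detail.command_hash` in the audit row, the two records very likely refer to the same invocation; a 16-character prefix (64 bits) is short enough that it should be treated as a correlation aid rather than proof.
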
### Retention

`ssh.audit_retention_days` (default 90) controls a lazy sweep. Admin can
trigger pruning manually from the UI. There is no hard cap on table
size; disk-fill is mitigated by the hashing + truncation strategy
above, plus admin-driven cleanup.

## Abuse Counters & Lock

Defends against credential spraying, mistyped scripts in loops, and
brute-force scans.

### Three scopes

| Scope | Key | When |
|---|---|---|
| `user` | `(user_id,)` | Any SSH failure by this user |
| `host:user` | `(host, username)` | Failure on this (host, username) tuple |
| `host` | `(host,)` | Failure on this host (locks for global connections only) |

The `host` scope intentionally **only locks on failures from global
connections** to prevent cross-user DoS: a user repeatedly failing on
their own connection cannot lock out other users from a shared host.
For user-owned connections, the `host` counter is still updated, but for
admin notification only; no lock applies.

### Algorithm

```
on failure:
  for each scope:
    increment counter
    if count(within abuseWindowMinutes) >= abuseFailureThreshold:
      lock until now + abuseLockMinutes

on success (user scope only):
  reset user counter
  other scopes age out naturally with the window
```

Counters are stored in `ssh_abuse_counters`, with separate columns per
scope. All updates are transactional (no UPSERT race).

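The pseudocode above can be modeled for a single scope with a rolling window of failure timestamps. This is an in-memory sketch only: the real counters live in `ssh_abuse_counters` and are updated transactionally, and the `ScopeCounter` class is a hypothetical illustration. Field names follow the camelCase config keys.

```typescript
interface AbuseConfig {
  abuseWindowMinutes: number;
  abuseFailureThreshold: number;
  abuseLockMinutes: number;
}

class ScopeCounter {
  private failures: number[] = []; // failure timestamps in ms
  private lockedUntilMs = 0;
  private readonly cfg: AbuseConfig;

  constructor(cfg: AbuseConfig) {
    this.cfg = cfg;
  }

  recordFailure(nowMs: number): void {
    const windowStart = nowMs - this.cfg.abuseWindowMinutes * 60000;
    // Drop failures that have aged out of the rolling window, then add this one.
    this.failures = this.failures.filter((t) => t > windowStart);
    this.failures.push(nowMs);
    if (this.failures.length >= this.cfg.abuseFailureThreshold) {
      this.lockedUntilMs = nowMs + this.cfg.abuseLockMinutes * 60000;
    }
  }

  // Per the pseudocode, only the user scope is reset on success.
  recordSuccess(): void {
    this.failures = [];
  }

  isLocked(nowMs: number): boolean {
    return nowMs < this.lockedUntilMs;
  }
}
```

With the defaults (10-minute window, threshold 5, 30-minute lock), five failures inside ten minutes lock the scope for thirty minutes from the fifth failure, while five failures spread over an hour never trigger a lock.
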
### Force-unlock

Admin can force-unlock from `SshGlobalConnectionsForm` or per-connection
admin page:

```
POST /api/ssh/admin/connections/:id/force-unlock
{reason: "Confirmed credentials rotated; user retried with old key"}
```

Rate-limited to 10/hour total across all admins (`admin-rate-limit.ts`,
token bucket). Audited as `ssh.abuse.unlock_manual`.

## Master Key Rotation

Replaces `MCP_ENCRYPTION_KEY` and rewraps every row's DEK under the new
key. This is **the** way to rotate the master key; do not edit the env
var manually.

### Flow

1. **Admin starts** via `POST /api/ssh/admin/rotate-master-key`:
   ```json
   {"new_key_hex": "<64-hex>", "reason": "Annual rotation"}
   ```
2. **Maintenance mode engages**: after `sshMaintenance.enter()`, all SSH
   write endpoints return 503 (read endpoints stay alive). The LLM
   sees `SSH subsystem is in maintenance` errors for tool calls.
3. **Per-row rewrap**:
   - For each `ssh_connections` row: decrypt DEK under old key, re-encrypt under new key, bump `key_version`, commit (one tx per row)
   - For each `system_deks` row: same
4. **New key validated** by decrypting a test value
5. **Maintenance exits** automatically
6. **Caller polls** `GET /api/ssh/admin/rotate-master-key/:jobId` for status (`running` / `succeeded` / `failed`)

### Failure modes

- **Crash mid-rotation**: rows have mixed `key_version`. Next boot detects this and stays in maintenance until a follow-up rotation completes. The admin must re-issue the rotation with the new key.
- **Wrong old key**: the first row decryption fails → job aborts before any change, maintenance exits, audit records `ssh.master_key.rotate.start` with `outcome=failed`.
- **Disk write fails mid-row**: that single row is rolled back; rotation continues. Operator must re-run.

The rotation job runs in-process (not as a separate worker). For large
fleets (>1000 rows) expect 1-2s per row of decrypt+encrypt+write.

### `MCP_ENCRYPTION_KEY` env var

After successful rotation, **the env var must be updated to the new
key** before the next restart. The orchestrator writes the new key to
the audit log (encrypted under the OLD key) and returns it once in the
HTTP response; there is no second chance. Update your secrets store
immediately.

## Operator Runbook

### A. Add a global (admin-managed) connection

```bash
# Via UI: Settings → SSH → Global Connections → Add
# Or via API (requires admin session cookie):

curl -X POST http://localhost:3000/api/ssh/admin/globals \
  -H 'Content-Type: application/json' \
  -d @- <<'JSON'
{
  "label": "prod-east-bastion",
  "host": "bastion.prod-east.example.com",
  "port": 22,
  "username": "deploy",
  "private_key_pem": "-----BEGIN OPENSSH PRIVATE KEY-----\n...\n-----END...",
  "passphrase": null,
  "remote_path_prefix": "/srv/deploy",
  "allow_private_addresses": false,
  "deny_patterns": "sudo\n^\\s*rm\\s+",
  "allow_patterns": "",
  "reason": "Production deploy bastion, owned by SRE"
}
JSON
```

Then verify the host key (section C) and grant access (section B).

### B. Grant org access to a global connection

```bash
curl -X POST http://localhost:3000/api/ssh/admin/grants \
  -H 'Content-Type: application/json' \
  -d '{
    "connection_id": "<uuid>",
    "subject_type": "org",
    "subject_id": "engineering",
    "piece_name": "prod-deploy",
    "applies_to_all_pieces": false,
    "reason": "Engineering org runs prod-deploy piece"
  }'
```

### C. Verify a TOFU first-observe

1. From the user's side or admin side, click **Test** in the SshConnections panel
2. The response is `host_key_first_observe` with a SHA-256 fingerprint and pending token
3. **Verify externally** that the fingerprint matches the real server:
   ```bash
   ssh-keyscan -t ed25519 bastion.prod-east.example.com 2>/dev/null \
     | ssh-keygen -lf -
   ```
   Compare the resulting `SHA256:...` with what the UI shows
4. In the dialog, type the fingerprint to confirm and click **Verify**
5. Audit row `ssh.connection.host_key.verify` recorded; subsequent calls succeed

### D. Force-unlock a stuck connection

Symptom: a user reports that `SshExec` returns `access denied (abuse_locked)`.

1. Settings → SSH → Global Connections → click the row → "Locks" section
2. Inspect the counter state (which scope is locked, until when)
3. If an early unlock is genuinely needed (e.g. the user fixed the bad credentials), click **Force unlock** and enter a reason
4. If the failures look suspicious (unexplained 5+ failures), investigate the audit log first

### E. Rotate the master key

```bash
NEW_KEY=$(openssl rand -hex 32)

# Start rotation
JOB=$(curl -s -X POST http://localhost:3000/api/ssh/admin/rotate-master-key \
  -H 'Content-Type: application/json' \
  -d "{\"new_key_hex\":\"$NEW_KEY\",\"reason\":\"Q2 annual rotation\"}" \
  | jq -r .job_id)

# Poll until done
while true; do
  STATUS=$(curl -s http://localhost:3000/api/ssh/admin/rotate-master-key/$JOB \
    | jq -r .status)
  echo "$STATUS"
  [ "$STATUS" = "succeeded" ] && break
  [ "$STATUS" = "failed" ] && { echo FAILED; exit 1; }
  sleep 2
done

# Update env var BEFORE next restart
echo "MCP_ENCRYPTION_KEY=$NEW_KEY" >> /etc/orchestrator/secrets.env
```

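The final `>>` append leaves any earlier `MCP_ENCRYPTION_KEY=` line in the file, and which definition wins depends on whatever loads the env file. If that is a concern, the line can be replaced instead of appended. This is a minimal sketch: `set_env_key` is a hypothetical helper, and it assumes a plain `KEY=value` file with one definition per line.

```shell
#!/bin/sh
# Hypothetical helper: set KEY=value in an env file, replacing old definitions.

# set_env_key FILE KEY VALUE
set_env_key() {
  file=$1
  key=$2
  value=$3
  [ -f "$file" ] || : > "$file"
  # mktemp creates the file with mode 0600, so the secret is not world-readable.
  tmp=$(mktemp) || return 1
  # Copy every line except existing definitions of KEY.
  # grep exits 1 when nothing is left to print, which is not an error here.
  grep -v "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Overwrite in place so the target keeps its owner and mode.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example: set_env_key /etc/orchestrator/secrets.env MCP_ENCRYPTION_KEY "$NEW_KEY"
```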
### F. Prune old audit logs

Settings → SSH → Audit Log → "Prune older than N days" (defaults to the
config retention value). Or via API:

```bash
curl -X DELETE 'http://localhost:3000/api/ssh/admin/audit?older_than_days=90'
```

## Troubleshooting

### Symptom → cause table

| Error | Common cause | Fix |
|---|---|---|
| `SSH is disabled` (503) | `ssh.enabled: false` | Set `ssh.enabled: true` (no restart required) |
| `SSH subsystem is in maintenance` | Master key rotation in progress | Wait for the job to complete, or check the rotation log |
| `access denied (no_grant)` | User lacks a grant for the connection | Admin creates a grant, or the user uses an owned connection |
| `access denied (disabled)` | Admin disabled the connection | Admin re-enables it, or use a different connection |
| `access denied (abuse_locked)` | Abuse counter triggered | Wait for the lock window to expire, or admin force-unlocks |
| `piece "X" does not list connection Y` | `allowed_ssh_connections` missing UUID | Add the UUID to the movement's `allowed_ssh_connections` |
| `host_key_first_observe` | First time exercising the connection | Verify the fingerprint in the UI |
| `host_key_not_verified` | Key recorded but never verified | Click Verify in the UI |
| `host_key_mismatch` | Server key changed | Investigate (legitimate rotation? MITM?), then Replace via the UI |
| `host_key_alg_not_allowed` | Server using SHA1-RSA etc. | Upgrade the server to ed25519 / rsa-sha2-256 |
| `auth_failed` | Wrong key, wrong username | Re-check the connection settings |
| `connect_timeout` | Network unreachable, firewall | Check from the host, check the SSRF policy |
| `exec_timeout` | Long-running command | Increase `timeout_ms`, or run in the background and fetch results with `SshDownload` |
| `output_too_large` | stdout > 32 KiB | Filter the command, or write to a file and fetch it with `SshDownload` |
| `forbidden_address` | Target is a private IP, no opt-in | Set `allow_private_addresses` per-connection or globally |
| `system_dek_verify_failed` (log) | `MCP_ENCRYPTION_KEY` changed without the rotation flow | Stop the server, restore the old key OR re-rotate via the flow |

### Where to look

| Question | Source |
|---|---|
| What did the LLM try to do? | `logs/activity.log` in the job's workspace |
| What did SSH do? | `ssh_audit_log` (Admin UI or SQL) |
| Was it actually denied at the SSH layer? | Audit row `outcome` = `denied` |
| What was the exit code? | Audit `detail.exit_code` (for `ssh.exec`) |
| Did it crash mid-call? | Audit `outcome` = `aborted` (recovery sweep) |
| Why was the host key flagged? | Audit `ssh.connection.host_key.*` rows |
| Who has access to a connection? | `ssh_connection_grants` filtered by `connection_id` |

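For the SQL route, two of the lookups in the table can be sketched as query builders. Assumptions are worth stating plainly: only the table names, `connection_id`, and `outcome` come from this runbook, so the queries use `SELECT *` rather than guessing other columns. The function names are hypothetical, and so is any database path you pipe the output into.

```shell
#!/bin/sh
# Hypothetical query builders for the lookups in the table above.

# Accept only canonical 8-4-4-4-12 hex UUIDs, so the value is safe to
# embed in a SQL string literal.
is_uuid() {
  case $1 in
    *[!0-9a-fA-F-]*) return 1 ;;  # any character outside hex digits and '-'
  esac
  printf '%s\n' "$1" \
    | grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'
}

# Who has access to a connection?
grants_query() {
  is_uuid "$1" || { echo "not a UUID: $1" >&2; return 2; }
  printf "SELECT * FROM ssh_connection_grants WHERE connection_id = '%s';\n" "$1"
}

# Which calls were denied at the SSH layer?
denied_query() {
  printf "SELECT * FROM ssh_audit_log WHERE outcome = 'denied';\n"
}
```

If the `sqlite3` CLI is installed, the output can be piped into it, for example `grants_query "$CONN_ID" | sqlite3 "$DB_PATH"`, where `DB_PATH` stands for wherever your deployment keeps its database.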
## Security Model Summary

Detailed threat model + risk register: see plan doc §"Security Design
Deep-Dive (rev 3)" and §"Risk Register (rev 3)".

Key points operators must understand:

1. **The orchestrator is a credential proxy.** Anyone with admin rights
   can read connection plaintext (via the rotation flow, which decrypts
   server-side). Treat admin access as production-credential-equivalent.

2. **TOFU is the floor, not the ceiling.** First-observe is unauthenticated.
   For high-stakes targets, pre-populate `host_key_b64` from a trusted
   bootstrap (e.g. baked into the connection at create time via the
   `host_key_b64` field) rather than relying on the orchestrator's first
   observation.

3. **The deny-list is not a sandbox.** Built-in patterns catch obvious
   misuse. Real isolation requires connection-level configuration
   (restricted shell account, `remote_path_prefix`, narrow `allow_patterns`)
   and target-side controls.

4. **Audit log is local-only.** No HMAC chain (acknowledged limitation
   R-audit-tamper). For tamper-evidence, ship `ssh_audit_log` rows to an
   external SIEM via SQLite hooks or periodic export.

5. **ssh2 internal key retention** (R-ssh2-leak): the PEM lives in JS
   heap for the connection lifetime. Process compromise reveals plaintext
   credentials. Mitigations: short-lived processes, separate worker
   per high-stakes connection.

6. **Master key compromise = total compromise.** Key rotation does **not**
   invalidate already-leaked encrypted material: if an attacker has both
   the DB and the old master key, all stored creds are theirs. Rotate keys
   immediately on suspected compromise AND rotate every stored credential
   on the target side.

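For point 2, the key material for a trusted bootstrap can be pulled from a known_hosts-style line obtained over a channel you already trust (console access, configuration management). The sketch below rests on an assumption this runbook does not confirm: that `host_key_b64` expects the base64 public-key field exactly as it appears in that line. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: extract the base64 key from known_hosts-style lines.

# Read lines of the form "host keytype base64key" on stdin and print the
# base64 key of the first non-comment line whose type matches
# (default: ssh-ed25519).
host_key_b64_from_known_hosts() {
  want_type=${1:-ssh-ed25519}
  awk -v t="$want_type" '$1 !~ /^#/ && $2 == t { print $3; exit }'
}

# Example, run on the target itself or from a trusted copy of its public key:
#   printf 'bastion %s\n' "$(cut -d' ' -f1,2 /etc/ssh/ssh_host_ed25519_key.pub)" \
#     | host_key_b64_from_known_hosts
```

Fetching the key with `ssh-keyscan` over the same network path would only repeat the unauthenticated first observation, so the input should come from the server side.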
## HTTP API Reference

User router: mounted at `/api/ssh`, requires `requireAuth`.

| Method | Path | Purpose |
|---|---|---|
| GET | `/connections` | List own connections + grant-visible globals |
| POST | `/connections` | Create user-owned connection |
| GET | `/connections/:id` | Read |
| PATCH | `/connections/:id` | Edit (owner only) |
| DELETE | `/connections/:id` | Delete (owner only) |
| POST | `/connections/:id/test` | Trigger TOFU observation / verify path |
| POST | `/connections/:id/verify-host-key` | Atomic verify (token + fingerprint) |
| POST | `/connections/:id/replace-host-key` | Atomic replace (token + fingerprint + reason) |
| GET | `/connections/:id/audit` | Owner's view of audit rows for this connection |
| GET | `/grants/visible-to-me` | List grants visible (subject=user or matching org) |

Admin router: mounted at `/api/ssh/admin`, requires `requireAdmin`.

| Method | Path | Purpose |
|---|---|---|
| GET | `/connections` | All connections (cross-tenant) |
| GET | `/connections/:id` | Admin read |
| PATCH | `/connections/:id/disable` | Soft-disable (audited; reason required) |
| PATCH | `/connections/:id/enable` | Re-enable |
| DELETE | `/connections/:id` | Hard-delete |
| POST | `/connections/:id/force-unlock` | Clear abuse counter (rate-limited; reason required) |
| POST | `/globals` | Create global connection |
| PATCH | `/globals/:id` | Edit global |
| DELETE | `/globals/:id` | Delete global |
| GET | `/grants` | List all grants |
| POST | `/grants` | Create grant |
| DELETE | `/grants/:id` | Delete grant |
| POST | `/rotate-master-key` | Start master key rotation |
| GET | `/rotate-master-key/:jobId` | Poll rotation status |
| GET | `/audit` | All-tenant audit view (paginated) |

All admin write endpoints require:
- `requireAdmin` middleware
- `maintenance503()` guard (rejects writes during rotation)
- `validateReason()` on `body.reason` (≥ 8 chars)
- `auditRepo.beginAndComplete()` for both success and failure

## SSH Console (Interactive)

Enabled with `ssh.console.enabled: true`.

- One task = one PTY session. Shell state persists across jobs.
- An `SSH` tab appears in the task detail view (when the piece allows `SshConsole*` tools)
- WebSocket: `/api/local/tasks/:taskId/console/ws`
- REST status: `GET /api/local/tasks/:taskId/console/status`
- Audit: `ssh.console.{open,send,snapshot,resize,input_rejected,close}`
- Auto-close: idle 30 min / duration 4 h / host disconnect / maintenance / admin kill
- At most 3 sessions per connection (oldest evicted first)

Admin: list sessions with `GET /api/admin/ssh/console-sessions` and kill one with `POST /api/admin/ssh/console-sessions/:taskId/kill` (admin role only).

## See Also

- [docs/tools/ssh-tools.md](./tools/ssh-tools.md): LLM-facing tool semantics
- [docs/tools/ssh-console-tools.md](./tools/ssh-console-tools.md): SSH Console tool semantics (Ensure/Send/Snapshot)
- [docs/mcp.md](./mcp.md): MCP integration (shares `MCP_ENCRYPTION_KEY`)
- [docs/maintenance-checklist.md](./maintenance-checklist.md) §12: checklist for SSH-related code changes