
# Benchmark (`npm run bench`)

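A framework that measures an agent's **tool calling / instruction following / reasoning quality / checklist usage / efficiency** along multiple axes from a single integrated task.

It is meant for detecting quality regressions: run the same task before and after a model change, a piece revision, or a tool addition, and compare the scores.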

---

## 1. Prerequisites

| Item | Required |
|------|----------|
| The orchestrator is running, started with `scripts/server.sh start` (or an equivalent) | ✅ |
| `provider` in `config.yaml` points to a working LLM (used for both task execution and the judge) | ✅ |
| First run only: generate `bench/fixtures/sales.xlsx` with `npm run bench:fixtures` | ✅ |

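The bench runner does not depend on external network access. `fixtures/web/*` is served by a **localhost HTTP server** built into the runner (it takes a random port at startup, and the port is injected into the prompt through the `{WEB_PORT}` token).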
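
The token injection can be pictured as a plain placeholder substitution. The following is a minimal sketch, not the runner's actual code: `fillPromptTokens` is a hypothetical helper, the port value is made up, and leaving unknown placeholders untouched is an assumption.

```typescript
// Hypothetical helper (not taken from the runner): replaces {KEY} placeholders
// in a prompt with runtime values such as the fixture server's port.
function fillPromptTokens(prompt: string, tokens: Record<string, string>): string {
  return prompt.replace(/\{([A-Z0-9_]+)\}/g, (match: string, key: string): string => {
    const value = tokens[key];
    // Assumption: placeholders without a matching token are left as-is.
    return value !== undefined ? value : match;
  });
}

// Example: in the real runner the port would come from the fixture server.
const filledPrompt = fillPromptTokens(
  "Read http://127.0.0.1:{WEB_PORT}/page.html and summarize it in {STYLE} style.",
  { WEB_PORT: "49152" },
);
console.log(filledPrompt);
```

Keeping unknown placeholders intact makes a missing token easy to spot in the submitted prompt, although the real runner may well choose to fail instead.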

---

## 2. Usage

```bash
# All tasks
npm run bench

# A single task
npm run bench -- --task=composite-mini-report

# Target an orchestrator on another host/port
npm run bench -- --server=http://127.0.0.1:9876

# Skip the LLM judge (axis D is fixed at 1.0, so only the programmatic axes affect the score)
BENCH_JUDGE=off npm run bench

# Use a different endpoint and model for the judge
BENCH_JUDGE_ENDPOINT=https://api.example.com/v1 \
BENCH_JUDGE_MODEL=gpt-oss:20b \
npm run bench
```

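When the run finishes, results are written to `bench/results/<run_id>/` (`run_id` is an ISO 8601 timestamp with `:` replaced by `-`, for example `2026-05-01T03-22-11Z`).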
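
One way such a `run_id` could be derived from the current time is sketched below. `makeRunId` is a hypothetical name, and the exact transformation (dropping milliseconds, replacing colons so the value is safe as a directory name) is an assumption inferred from the example directory name, not confirmed runner behavior.

```typescript
// Hypothetical helper: turns a Date into a directory-safe run id.
// Assumption: milliseconds are dropped and ":" becomes "-".
function makeRunId(now: Date): string {
  return now
    .toISOString()               // e.g. "2026-05-01T03:22:11.000Z"
    .replace(/\.\d{3}Z$/, "Z")   // drop the millisecond part
    .replace(/:/g, "-");         // ":" is not allowed in paths on some systems
}

const exampleRunId = makeRunId(new Date("2026-05-01T03:22:11.000Z"));
console.log(exampleRunId); // prints "2026-05-01T03-22-11Z"
```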

---

## 3. Reading the output

```
bench/results/2026-05-01T03-22-11Z/
  summary.md                        # ← look here first
  composite-mini-report/
    result.json                     # full grading + raw data
    workspace/
      logs/activity.log             # log written by the agent
      output/report.md              # the agent's deliverable
```

The top of `summary.md`:

```
# Bench run @ 2026-05-01T03:22:11.000Z

**Overall: 78 / 100**

| Task                   | Status     | Total |   A |   B |   C |   D |
| ---------------------- | ---------- | ----: | --: | --: | --: | --: |
| composite-mini-report  | succeeded  |   78  | 90% | 100%| 70% | 60% |
```

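In each task's detail section, the ✓/✗ breakdown per axis and the full tool-call sequence are available in collapsible blocks.

`result.json` is machine-readable, so it can be consumed from CI.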

---

## 4. Scoring axes (weights sum to 100 points)

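| Axis | Weight | What it checks | Judged by |
|------|-------:|----------------|-----------|
| **A. Tools** | 30 | Whether `must_use_tools` were called / `forbidden_tools` were avoided / `forbidden_tool_for_ext` was respected (e.g. `Read` is forbidden for `.xlsx`) | Programmatic |
| **B. Checklist** | 15 | Use of `CreateChecklist` / `CheckItem×N` / `GetChecklist` | Programmatic |
| **C. Instructions** | 30 | Output file name, fixed first line, section order, line counts, character counts, forbidden patterns, and so on | Programmatic |
| **D. Reasoning** | 25 | Validity of the content, quality of the synthesis, concreteness of the "next actions", and so on | LLM judge rubrics |
| _Auxiliary: Efficiency_ | – | duration / prompt tokens (shown as numbers in the summary only) | – |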

`Total = A×30 + B×15 + C×30 + D×25`, normalized to 0..100, where each axis score is a ratio in 0..1 (shown as a percentage in `summary.md`).
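
The weighted total can be sketched as follows. This is an illustration under stated assumptions rather than the grader's actual code: `totalScore` and `AXIS_WEIGHTS` are hypothetical names, rounding to an integer is assumed, and so is the handling of a missing axis (it is dropped and the remaining weights are rescaled to 100).

```typescript
type Axis = "A" | "B" | "C" | "D";

// Weights from the table above.
const AXIS_WEIGHTS: Record<Axis, number> = { A: 30, B: 15, C: 30, D: 25 };
const AXES: Axis[] = ["A", "B", "C", "D"];

// Hypothetical helper: each score is a ratio in 0..1.
// Assumption: an axis that was not scored is left out and the
// remaining weights are rescaled so the result stays in 0..100.
function totalScore(scores: Partial<Record<Axis, number>>): number {
  let weighted = 0;
  let weightSum = 0;
  for (const axis of AXES) {
    const score = scores[axis];
    if (score === undefined) continue;
    const clamped = Math.min(1, Math.max(0, score));
    weighted += clamped * AXIS_WEIGHTS[axis];
    weightSum += AXIS_WEIGHTS[axis];
  }
  // Assumption: the total is rounded to the nearest integer.
  return weightSum === 0 ? 0 : Math.round((100 * weighted) / weightSum);
}

// 0.8*30 + 1.0*15 + 0.5*30 + 0.6*25 = 24 + 15 + 15 + 15 = 69
console.log(totalScore({ A: 0.8, B: 1.0, C: 0.5, D: 0.6 })); // prints 69
```

When all four axes are present the weights already sum to 100, so the rescaling step has no effect and the function reduces to the formula above.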

`completion_status: [succeeded, waiting_human, failed, aborted, cancelled]` specifies which terminal states are accepted. The default is `[succeeded]` only. **The grader still runs when a task fails, so partial scores are reported**.

---

## 5. Existing tasks

### `composite-mini-report`

A task that has the agent integrate three sources (Excel / Web / Markdown) into a mini report. This single task exercises every axis.

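- Required tools: `ReadExcel` / `WebFetch` / `Read` / `Write` / `CreateChecklist` / `CheckItem` (≥3 calls) / `GetChecklist`
- Forbidden: opening `.xlsx` with `Read` (prevents binary content from being mixed in; the same trap as issue #189)
- Strict format constraints on the output `output/report.md` (fixed first line, section order, at most 5 lines per section, 3 items of at most 40 characters each under "次アクション" (next actions), no images or HTML)
- Judge rubrics: `factual_grounding` / `actions_quality` / `synthesis`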

`bench/tasks/composite-mini-report.yaml` can be used as-is as a reference implementation.

---

## 6. Adding a task

Any `bench/tasks/*.yaml` file you create is picked up automatically. For the schema, see `BenchTask` in `src/bench/types.ts`. A minimal example:

```yaml
id: my-task
title: Short task description
piece_hint: chat              # piece name (defaults to chat when omitted)
timeout_minutes: 5

fixtures:                      # optional
  - source: fixtures/data.txt  # relative to the bench/ root
    dest: input/data.txt       # files under input/ are uploaded as attachments
  - source: fixtures/web/page.html
    dest: web/page.html        # files under web/ are served by the fixture HTTP server

prompt_tokens:                 # optional. {KEY} in the prompt is replaced at run time
  CUSTOM_KEY: foo
prompt: |
  Read http://127.0.0.1:{WEB_PORT}/page.html and … {CUSTOM_KEY} …

expected:
  must_use_tools: [WebFetch, Write]
  forbidden_tools: [Bash]
  forbidden_tool_for_ext:
    Read: ['.xlsx']
  must_produce_files: [output/answer.md]
  completion_status: [succeeded]

checklist:                     # optional. Specifying it enables axis B
  required_tools: [CreateChecklist, CheckItem, GetChecklist]
  min_check_item_calls: 3

grading:
  programmatic:
    constraints:
      - { type: file_first_line_equals, file: output/answer.md, line: '# Title' }
      - { type: file_must_contain_in_order, file: output/answer.md, sections: ['## A', '## B'] }
      - { type: file_section_max_lines, file: output/answer.md, section: A, max: 5 }
      - { type: file_line_starts_with, file: output/answer.md, prefix: '-', min_lines: 3, section: B }
      - { type: file_line_max_chars, file: output/answer.md, max: 40, section: B }
      - { type: file_no_pattern, file: output/answer.md, pattern: '!\[' }

  llm_judge:                   # optional. Without it, axis D is fixed at 1.0
    rubrics:
      - name: relevance
        prompt: Whether the output is consistent with the intent of the prompt
        max_score: 10
```

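### Programmatic constraint types

| `type` | Meaning |
|--------|---------|
| `file_first_line_equals` | The first line of the file matches exactly |
| `file_must_contain_in_order` | The given strings appear in the given order |
| `file_line_starts_with` | (Optionally within a section) at least `min_lines` lines start with the given prefix |
| `file_line_max_chars` | (Optionally within a section) every line has at most `max` characters |
| `file_section_max_lines` | The given section has at most `max` non-empty lines |
| `file_no_pattern` | The regular expression (multiline) does not match |

`section` is the `Header` part of a `## Header` line (without the `##`). If omitted, the whole file is checked.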
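
To make the `section` semantics concrete, here is a rough sketch of how two of these constraints might be evaluated. All function names are hypothetical and the real logic in `src/bench/grader.ts` may differ. In particular, ending a section at the next `## ` heading and counting characters as UTF-16 code units are assumptions.

```typescript
// Hypothetical helper: returns the lines under "## <section>" up to the
// next "## " heading. Without a section, the whole file is returned.
function sectionLines(markdown: string, section?: string): string[] {
  const lines = markdown.split(/\r?\n/);
  if (section === undefined) return lines;
  const result: string[] = [];
  let inside = false;
  for (const line of lines) {
    if (/^## /.test(line)) {
      if (inside) break; // assumption: the next level-2 heading ends the section
      inside = line.trim() === "## " + section;
      continue;
    }
    if (inside) result.push(line);
  }
  return result;
}

// file_section_max_lines: counts non-empty lines only.
function checkSectionMaxLines(markdown: string, section: string, max: number): boolean {
  const nonEmpty = sectionLines(markdown, section).filter((l) => l.trim() !== "");
  return nonEmpty.length <= max;
}

// file_line_max_chars: every line must be at most `max` characters long.
// Assumption: length is measured in UTF-16 code units, prefix included.
function checkLineMaxChars(markdown: string, max: number, section?: string): boolean {
  return sectionLines(markdown, section).every((l) => l.length <= max);
}

const sampleReport = "# Title\n## A\nline1\n\nline2\n## B\n- one\n- two\n- three\n";
console.log(checkSectionMaxLines(sampleReport, "A", 2)); // prints true
console.log(checkLineMaxChars(sampleReport, 5, "B"));    // prints false ("- three" has 7 characters)
```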

---

## 7. Internal structure

```
bench/
  fixtures/
  tasks/
  results/                                 # gitignored
src/bench/
  types.ts          # BenchTask / BenchResult / constraint schemas
  fixture-server.ts # localhost HTTP fixture server
  runner.ts         # submits to /api/local/tasks + polling + log collection
  grader.ts         # programmatic grading for axes A/B/C
  judge.ts          # LLM judge call for axis D + JSON parsing
  summary.ts        # writes bench/results/<run_id>/summary.md
  grader.test.ts
scripts/
  bench-run.ts             # CLI entry point (`npm run bench`)
  build-bench-fixtures.ts  # generates sales.xlsx (`npm run bench:fixtures`)
```

The bench runner only uses the existing `/api/local/tasks` API, so it is loosely coupled to the orchestrator internals. Adding a new piece or tool generally requires no changes on the bench side.

---

## 8. Troubleshooting

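| Symptom | Cause / fix |
|---------|-------------|
| `runner failed for ...: fetch failed` | The orchestrator is not running (`scripts/server.sh start`), or the port given via `--server=` is wrong |
| Axis D is 0.0 for every task | The judge LLM's response failed JSON parsing. Check the reasoning details in `bench/results/<run>/summary.md`. If the model only returns short responses because of OS-level constraints, switch `BENCH_JUDGE_MODEL` to another model |
| The status hangs until the timeout | Increase `timeout_minutes` in the task YAML. Also check `provider.timeoutMinutes` on the provider side |
| Scores fluctuate between runs of the same task | The LLM judge is stochastic. Looking only at the programmatic axes, or averaging over several runs, is the safer practice |
| `bench/results/` ended up in a commit | It is already in `.gitignore`, but if it was tracked in the past, run `git rm -r --cached bench/results` |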
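
For the score-fluctuation case, averaging over several runs could look like the sketch below. `summarizeTotals` is a hypothetical helper and the totals are made-up numbers; how the per-run totals are collected (for example from each run's `result.json`) is left open because the file's schema is not described here.

```typescript
// Hypothetical helper: mean and population standard deviation of run totals.
function summarizeTotals(totals: number[]): { mean: number; stdev: number } {
  if (totals.length === 0) {
    throw new Error("at least one run is required");
  }
  const mean = totals.reduce((sum, t) => sum + t, 0) / totals.length;
  const variance =
    totals.reduce((sum, t) => sum + (t - mean) * (t - mean), 0) / totals.length;
  return { mean, stdev: Math.sqrt(variance) };
}

// Made-up totals from three runs of the same task.
const runStats = summarizeTotals([70, 74, 78]);
console.log(runStats.mean); // prints 74
```

A large standard deviation relative to the score difference being investigated suggests that more runs are needed before calling something a regression.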

---

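## 9. Reference: related issues / features

- #156: this benchmark itself
- #189: the rule that `Read` must not open xlsx files (directly tied to the `forbidden_tool_for_ext` trap in composite-mini-report)
- #190: cleanup of preflight log output (the grader benefits from an easier-to-read activity.log)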