swallow/maestro

oss-sync f5c7666f6b feat: initial public release (MAESTRO)

2026-06-03 05:08:00 +00:00

Benchmark (`npm run bench`)

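A framework for measuring an agent's tool-calling ability, instruction following, reasoning quality, checklist usage, and efficiency along multiple axes from a single integrated task.

It is intended for running the same task before and after a change (a model swap, a piece revision, a new tool, and so on) to detect quality regressions.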

1. Prerequisites

Item	Required
The orchestrator is running via `scripts/server.sh start` (or an equivalent)	✅
`provider` in `config.yaml` points to a working LLM (used for both task execution and the judge)	✅
First time only: generate `bench/fixtures/sales.xlsx` with `npm run bench:fixtures`	✅

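The bench runner does not depend on external network access. fixtures/web/* is served by a localhost HTTP server built into the runner (it takes a random port at startup and injects it into the prompt through the {WEB_PORT} token).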
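The token injection described above can be pictured as a plain string substitution. The following is a minimal sketch, not the runner's actual code: `injectTokens` is a hypothetical helper, the port `41873` is a made-up stand-in for the random port, and leaving unknown tokens untouched is an assumption.

```typescript
// Hypothetical helper (not from src/bench): replaces every {KEY} in a prompt
// with tokens[KEY]. Unknown keys are left untouched so a typo stays visible.
function injectTokens(prompt: string, tokens: Record<string, string>): string {
  return prompt.replace(/\{([A-Z0-9_]+)\}/g, (match: string, key: string) => {
    const value = Object.prototype.hasOwnProperty.call(tokens, key)
      ? tokens[key]
      : undefined;
    return value !== undefined ? value : match;
  });
}

// 41873 stands in for whatever random port the fixture server was given.
const renderedPrompt = injectTokens(
  "Read http://127.0.0.1:{WEB_PORT}/page.html and mention {CUSTOM_KEY}.",
  { WEB_PORT: "41873", CUSTOM_KEY: "foo" },
);
console.log(renderedPrompt);
```

Because the port is only known after the server starts, substitution has to happen at run time rather than in the task file itself.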

2. Usage

# All tasks
npm run bench

# A single task
npm run bench -- --task=composite-mini-report

# Target an orchestrator on another host/port
npm run bench -- --server=http://127.0.0.1:9876

# Skip the LLM judge (axis D is fixed at 1.0 and only programmatic checks are scored)
BENCH_JUDGE=off npm run bench

# Use a different endpoint and model for the judge
BENCH_JUDGE_ENDPOINT=https://api.example.com/v1 \
BENCH_JUDGE_MODEL=gpt-oss:20b \
npm run bench

When the run finishes, results are written to bench/results/<run_id>/ (run_id is an ISO timestamp).
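As an illustration of how an ISO timestamp can become a directory name of this shape, here is a small sketch. `toRunId` is a hypothetical function, and the exact transformation (dropping milliseconds, replacing `:` with `-`) is an assumption inferred from the example directory name in this document, not a statement about the runner's implementation.

```typescript
// Assumption: run_id is the UTC ISO timestamp with milliseconds dropped and
// ":" replaced by "-" so that it is usable as a directory name everywhere.
function toRunId(date: Date): string {
  return date
    .toISOString()              // e.g. 2026-05-01T03:22:11.000Z
    .replace(/\.\d{3}Z$/, "Z")  // drop milliseconds
    .replace(/:/g, "-");        // ":" is not allowed in Windows paths
}

const exampleRunId = toRunId(new Date("2026-05-01T03:22:11.000Z"));
console.log(`bench/results/${exampleRunId}/summary.md`);
// → bench/results/2026-05-01T03-22-11Z/summary.md
```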

3. Reading the output

bench/results/2026-05-01T03-22-11Z/
  summary.md                        # ← look here first
  composite-mini-report/
    result.json                     # full grading + raw data
    workspace/
      logs/activity.log             # log written by the agent
      output/report.md              # the agent's deliverable

The beginning of summary.md:

# Bench run @ 2026-05-01T03:22:11.000Z

**Overall: 78 / 100**

| Task                   | Status     | Total |   A |   B |   C |   D |
| ---------------------- | ---------- | ----: | --: | --: | --: | --: |
| composite-mini-report  | succeeded  |   78  | 90% | 100%| 70% | 60% |

Each task's detail section shows a per-axis ✓/✗ breakdown and the full sequence of tool calls in collapsible blocks.

result.json is in a machine-readable format that CI can consume.

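4. Scoring axes (weights, out of 100 points)

Axis	Weight	What it checks	Judged by
A. Tools	30	Whether `must_use_tools` were called / `forbidden_tools` were avoided / `forbidden_tool_for_ext` was respected (e.g. Read is forbidden for .xlsx)	Programmatic
B. Checklist	15	Use of `CreateChecklist` / `CheckItem×N` / `GetChecklist`	Programmatic
C. Instructions	30	Output file name, fixed first line, section order, line count, character count, forbidden patterns, etc.	Programmatic
D. Reasoning	25	Validity of the content, quality of the synthesis, concreteness of the "next actions", etc.	LLM judge rubric
Supplementary: Efficiency	–	duration / prompt tokens (shown only as numbers in the summary)	–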
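To make the axis A rules in the table above concrete, the sketch below checks a list of tool calls against the three kinds of expectation. The `ToolCall` and `ToolExpectation` shapes and the `checkTools` function are hypothetical and are not taken from src/bench/types.ts; how the real grader turns violations into a ratio is not shown here.

```typescript
// Hypothetical shapes for illustration only.
interface ToolCall { tool: string; path?: string }
interface ToolExpectation {
  must_use_tools: string[];
  forbidden_tools: string[];
  forbidden_tool_for_ext: Record<string, string[]>;
}

// Returns one readable failure per violated rule (empty array = all passed).
function checkTools(calls: ToolCall[], expected: ToolExpectation): string[] {
  const failures: string[] = [];
  const used = new Set(calls.map((c) => c.tool));
  for (const tool of expected.must_use_tools) {
    if (!used.has(tool)) failures.push(`missing required tool: ${tool}`);
  }
  for (const tool of expected.forbidden_tools) {
    if (used.has(tool)) failures.push(`forbidden tool used: ${tool}`);
  }
  for (const call of calls) {
    const exts = expected.forbidden_tool_for_ext[call.tool] ?? [];
    const target = (call.path ?? "").toLowerCase();
    for (const ext of exts) {
      if (target.endsWith(ext)) failures.push(`${call.tool} used on ${ext}: ${target}`);
    }
  }
  return failures;
}

const toolFailures = checkTools(
  [{ tool: "WebFetch" }, { tool: "Read", path: "input/sales.xlsx" }],
  {
    must_use_tools: ["WebFetch", "Write"],
    forbidden_tools: ["Bash"],
    forbidden_tool_for_ext: { Read: [".xlsx"] },
  },
);
console.log(toolFailures);
```

In this example the agent never called Write and opened an .xlsx file with Read, so two failures are reported.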

Total = A×30 + B×15 + C×30 + D×25, normalized to 0..100 (each axis score is a ratio in 0..1).
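The weighted sum can be written out as follows. This is a sketch under the assumption that each axis score is a ratio in 0..1; `totalScore` and `WEIGHTS` are illustrative names, and clamping out-of-range inputs is an added safety choice, not something the source specifies.

```typescript
type Axis = "A" | "B" | "C" | "D";

// Weights as listed in the scoring table; they sum to 100.
const WEIGHTS: Record<Axis, number> = { A: 30, B: 15, C: 30, D: 25 };

function totalScore(scores: Record<Axis, number>): number {
  const axes: Axis[] = ["A", "B", "C", "D"];
  let total = 0;
  for (const axis of axes) {
    // Clamp to 0..1 so a malformed score cannot push the total out of range.
    const clamped = Math.min(1, Math.max(0, scores[axis]));
    total += clamped * WEIGHTS[axis];
  }
  return total;
}

const exampleTotal = totalScore({ A: 1.0, B: 0.0, C: 0.5, D: 1.0 });
console.log(exampleTotal); // 30 + 0 + 15 + 25 = 70
```

Since the weights already sum to 100, no further scaling is needed when all four axes are active.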

completion_status: [succeeded, waiting_human, failed, aborted, cancelled] specifies which terminal states are accepted. The default is [succeeded] only. Even when the task fails, the grader still runs and produces a partial score.
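The default behaviour of completion_status can be sketched like this. `isAcceptedStatus` is a hypothetical helper; only the list of states and the default come from the text above.

```typescript
type TerminalStatus = "succeeded" | "waiting_human" | "failed" | "aborted" | "cancelled";

// Default mirrors the text above: only "succeeded" is accepted unless the
// task lists other states explicitly.
function isAcceptedStatus(
  actual: TerminalStatus,
  accepted: TerminalStatus[] = ["succeeded"],
): boolean {
  return accepted.indexOf(actual) !== -1;
}

console.log(isAcceptedStatus("failed"));                                      // false
console.log(isAcceptedStatus("waiting_human", ["succeeded", "waiting_human"])); // true
```

A task that legitimately ends by asking a human for input would list waiting_human here so that the run is not counted as a status mismatch.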

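5. Existing tasks

`composite-mini-report`

A task that has the agent integrate three sources (Excel / Web / Markdown) into a mini report. This single task exercises every axis.

Required tools: ReadExcel / WebFetch / Read / Write / CreateChecklist / CheckItem (≥3 calls) / GetChecklist
Forbidden: opening .xlsx with Read (prevents binary content from leaking in; the same trap as issue #189)
Output: strict format constraints on output/report.md (fixed first line, section order, at most 5 lines per section, 3 "next actions" of at most 40 characters each, no images or HTML)
Judge rubrics: factual_grounding / actions_quality / synthesis

bench/tasks/composite-mini-report.yaml can be used as-is as a reference implementation.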

6. Adding a task

Creating bench/tasks/*.yaml is enough; the file is picked up automatically. For the schema, see BenchTask in src/bench/types.ts. A minimal example:

id: my-task
title: Short task description
piece_hint: chat              # piece name (defaults to chat when omitted)
timeout_minutes: 5

fixtures:                      # optional
  - source: fixtures/data.txt  # relative to the bench/ root
    dest: input/data.txt       # files under input/ are uploaded as attachments
  - source: fixtures/web/page.html
    dest: web/page.html        # files under web/ are served by the fixture HTTP server

prompt_tokens:                 # optional; {KEY} in the prompt is replaced at run time
  CUSTOM_KEY: foo
prompt: |
  Read http://127.0.0.1:{WEB_PORT}/page.html, … {CUSTOM_KEY} …

expected:
  must_use_tools: [WebFetch, Write]
  forbidden_tools: [Bash]
  forbidden_tool_for_ext:
    Read: ['.xlsx']
  must_produce_files: [output/answer.md]
  completion_status: [succeeded]

checklist:                     # optional; enables axis B when specified
  required_tools: [CreateChecklist, CheckItem, GetChecklist]
  min_check_item_calls: 3

grading:
  programmatic:
    constraints:
      - { type: file_first_line_equals, file: output/answer.md, line: '# Title' }
      - { type: file_must_contain_in_order, file: output/answer.md, sections: ['## A', '## B'] }
      - { type: file_section_max_lines, file: output/answer.md, section: A, max: 5 }
      - { type: file_line_starts_with, file: output/answer.md, prefix: '-', min_lines: 3, section: B }
      - { type: file_line_max_chars, file: output/answer.md, max: 40, section: B }
      - { type: file_no_pattern, file: output/answer.md, pattern: '!\[' }

  llm_judge:                   # optional; axis D is fixed at 1.0 when omitted
    rubrics:
      - name: relevance
        prompt: Whether the output is consistent with the intent of the prompt
        max_score: 10

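Programmatic constraint types

`type`	Meaning
`file_first_line_equals`	Whether the first line of the file matches exactly
`file_must_contain_in_order`	Whether the given strings appear in the given order
`file_line_starts_with`	(Optionally within a section) whether at least `min_lines` lines start with the given prefix
`file_line_max_chars`	(Optionally within a section) whether every line has at most `max` characters
`file_section_max_lines`	Whether the given section has at most `max` non-empty lines
`file_no_pattern`	Whether the regular expression (multiline) has no match

section is the header text of a "## Header" line (without the leading ##). If omitted, the whole file is checked.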
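The section-scoped constraints can be illustrated with a small sketch. The functions below are hypothetical and are not the code in src/bench/grader.ts; in particular, treating a section as ending at the next "## " line, returning no lines for a missing section, and counting characters by code point are all assumptions.

```typescript
// Returns the lines under "## <section>" up to the next "## " header,
// or every line of the file when no section is given.
function sectionLines(markdown: string, section?: string): string[] {
  const lines = markdown.split(/\r?\n/);
  if (section === undefined) return lines;
  const start = lines.findIndex((l) => l.trim() === `## ${section}`);
  if (start === -1) return [];
  const rest = lines.slice(start + 1);
  const end = rest.findIndex((l) => l.startsWith("## "));
  return end === -1 ? rest : rest.slice(0, end);
}

// file_section_max_lines: non-empty lines in the section must not exceed max.
function sectionMaxLines(markdown: string, section: string, max: number): boolean {
  return sectionLines(markdown, section).filter((l) => l.trim() !== "").length <= max;
}

// file_line_max_chars: every line in scope must be at most max characters.
function lineMaxChars(markdown: string, max: number, section?: string): boolean {
  return sectionLines(markdown, section).every((l) => Array.from(l).length <= max);
}

const sampleReport = [
  "# Title", "## A", "one", "", "two", "## B", "- short", "- also short",
].join("\n");
console.log(sectionMaxLines(sampleReport, "A", 5)); // true: 2 non-empty lines
console.log(lineMaxChars(sampleReport, 10, "B"));   // false: "- also short" has 12 characters
```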

7. Internal structure

bench/
  fixtures/
  tasks/
  results/                                 # gitignored
src/bench/
  types.ts          # BenchTask / BenchResult / constraint schema
  fixture-server.ts # localhost HTTP fixture server
  runner.ts         # submits to /api/local/tasks + polling + log collection
  grader.ts         # programmatic grading for axes A/B/C
  judge.ts          # LLM judge call for axis D + JSON parse
  summary.ts        # writes bench/results/<run_id>/summary.md
  grader.test.ts
scripts/
  bench-run.ts             # CLI entry (`npm run bench`)
  build-bench-fixtures.ts  # generates sales.xlsx (`npm run bench:fixtures`)

The bench runner only uses the existing /api/local/tasks API, so it is loosely coupled to the orchestrator internals. Adding a new piece or tool generally requires no changes on the bench side.

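8. Troubleshooting

Symptom	Cause / fix
`runner failed for ...: fetch failed`	The server was not started with `scripts/server.sh start`, or the port given in `--server=` is wrong
Axis D is 0.0 for every task	The judge LLM's response failed JSON parsing. Check the reasoning details in `bench/results/<run>/summary.md`. If the model only returns short responses for OS-specific reasons, switch `BENCH_JUDGE_MODEL` to another model
Status hangs on timeout	Increase `timeout_minutes` in the task YAML. Also check `provider.timeoutMinutes` on the provider side
Score fluctuates between runs of the same task	The LLM judge is probabilistic. A safe practice is to look only at the programmatic axes or to average over multiple runs
`bench/results/` ended up in a commit	It is already in `.gitignore`, but if it was tracked in the past, run `git rm -r --cached bench/results`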
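For the score-fluctuation case, the two mitigations mentioned in the table (looking only at the programmatic axes, or averaging over runs) could be computed as sketched below. `AxisScores`, `programmaticScore`, and `mean` are hypothetical names; re-normalizing axes A/B/C to 0..100 by dividing by their combined weight of 75 is an assumption about how one might report a judge-free score, not something the bench does on its own.

```typescript
// Hypothetical shape: per-axis ratios in 0..1 for one run of one task.
interface AxisScores { A: number; B: number; C: number; D: number }

// Programmatic-only score: axes A/B/C (weights 30/15/30) re-normalized to
// 0..100, ignoring the probabilistic judge axis D.
function programmaticScore(s: AxisScores): number {
  return ((s.A * 30 + s.B * 15 + s.C * 30) / 75) * 100;
}

function mean(values: number[]): number {
  if (values.length === 0) throw new Error("mean of empty list");
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Three runs of the same task where only the judge axis moved.
const repeatedRuns: AxisScores[] = [
  { A: 1, B: 1, C: 0.5, D: 0.2 },
  { A: 1, B: 1, C: 0.5, D: 0.6 },
  { A: 1, B: 1, C: 0.5, D: 1.0 },
];
console.log(repeatedRuns.map(programmaticScore)); // the same value for every run
console.log(mean([70, 80, 75]));                  // 75
```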

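9. Reference: related issues / features

#156: this benchmark itself
#189: the rule that Read must not open xlsx files (directly tied to the forbidden_tool_for_ext trap in composite-mini-report)
#190: cleanup of preflight log display (which benefits the grader by making activity.log easier to read)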
