LLM Evaluation Agent

Describe what you want to evaluate in natural language. The expert AI agent handles dataset generation, judge configuration, execution, and analysis end-to-end, and hands you back a PDF report.

Features

Expert agent interface: The agent knows evaluation best practices, recommends criteria, and validates configurations before execution. No config files or CLI expertise needed.

Jury system: Multiple judges from different model families (e.g. Claude Sonnet, Nova Pro, Nemotron) each evaluate distinct aspects of every response, such as correctness, reasoning, and completeness. Combining diverse judge families reduces self-preference bias, and aggregating weak signals from diverse judges and criteria produces stronger results than any single judge (Verma et al., 2025; Frick et al., 2025).

Adaptable binary scoring: Binary pass/fail per criterion rather than subjective numeric scales, an approach shown to produce more reliable results across judges (Chiang et al., 2025). Criteria are tailored by the agent to what you're evaluating.

Document-grounded synthetic data: Upload PDFs, knowledge bases, or product docs and generate QA pairs grounded in your actual content, reflecting real customer scenarios.

Agentic eval support: Evaluate any agent calling Bedrock (Strands, LangChain, custom boto3) with zero code modification via OpenTelemetry instrumentation.
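To make the jury and binary-scoring ideas under Features concrete, here is a minimal sketch of combining pass/fail verdicts from several judges. It is an illustration only, not this tool's actual aggregation logic: the function name `aggregate_jury`, the tuple layout, the strict-majority rule, and the assumption that every judge votes on every criterion are all hypothetical.

```python
from collections import defaultdict


def aggregate_jury(verdicts, pass_threshold=0.5):
    """Combine binary verdicts for ONE response.

    verdicts: iterable of (judge, criterion, passed) tuples (hypothetical layout).
    Returns (per-criterion pass/fail, fraction of criteria passed).
    """
    votes = defaultdict(list)
    for _judge, criterion, passed in verdicts:
        votes[criterion].append(bool(passed))

    per_criterion = {}
    for criterion, ballots in votes.items():
        pass_rate = sum(ballots) / len(ballots)
        # Assumed rule: strict majority passes; ties count as a fail.
        per_criterion[criterion] = pass_rate > pass_threshold

    overall = sum(per_criterion.values()) / len(per_criterion)
    return per_criterion, overall


# Example: three judges from different families vote on three criteria.
example = [
    ("claude-sonnet", "correctness", True),
    ("nova-pro", "correctness", True),
    ("nemotron", "correctness", False),
    ("claude-sonnet", "reasoning", False),
    ("nova-pro", "reasoning", False),
    ("nemotron", "reasoning", True),
    ("claude-sonnet", "completeness", True),
    ("nova-pro", "completeness", True),
    ("nemotron", "completeness", True),
]
per_criterion, overall = aggregate_jury(example)
print(per_criterion, round(overall, 2))
```

In this example one dissenting judge does not flip the correctness verdict, which is the sense in which a jury is less sensitive to any single judge's bias.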

Installation

Prerequisites

AWS credentials with Bedrock model access
uv installed
Claude Code, Cursor, Kiro, VS Code, or any MCP-compatible IDE

Install

Pick your IDE and paste / click.

Claude Code – one CLI command:

claude mcp add eval -s user -- uvx --from llm-evaluation-system eval-mcp

Cursor – one-click deep link: Install eval-mcp in Cursor

Kiro – add to ~/.kiro/settings/mcp.json:

{
  "mcpServers": {
    "eval": {
      "command": "uvx",
      "args": ["--from", "llm-evaluation-system", "eval-mcp"]
    }
  }
}
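If ~/.kiro/settings/mcp.json already lists other servers, pasting the snippet over the file would drop them. The sketch below merges the eval entry into an existing file instead. The helper name `add_eval_server` is hypothetical, and the only thing assumed about the file is the `mcpServers` layout shown in the snippet.

```python
import json
from pathlib import Path

# The server entry from the Kiro snippet in this README.
EVAL_SERVER = {
    "command": "uvx",
    "args": ["--from", "llm-evaluation-system", "eval-mcp"],
}


def add_eval_server(config_path: Path) -> dict:
    """Add or overwrite the "eval" server, leaving other servers untouched."""
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        if text.strip():  # tolerate an empty file
            config = json.loads(text)

    config.setdefault("mcpServers", {})["eval"] = EVAL_SERVER

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

A typical call would be `add_eval_server(Path.home() / ".kiro" / "settings" / "mcp.json")`, followed by restarting the IDE so it rereads the file.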

Codex CLI – add to ~/.codex/config.toml and restart Codex:

[mcp_servers.eval]
command = "uvx"
args = ["--from", "llm-evaluation-system", "eval-mcp"]

VS Code (using GitHub Copilot MCP) – one CLI command:

code --add-mcp '{"name":"eval","command":"uvx","args":["--from","llm-evaluation-system","eval-mcp"]}'

If you are installing through a coding agent, point it at INSTALL.md; it handles the config edit and the prompts about optional S3 team sharing.

Upgrading

uvx caches the resolved version of each package. To pull a new release, invalidate the cache:

uv cache clean llm-evaluation-system

Then restart your IDE. On the next launch, the latest published version is resolved and cached.

Usage

Ask your AI assistant to evaluate agents, models, or prompts, using a dataset you provide or one generated from documents or context:

"Evaluate my agent at ./my_agent.py"
"Compare Claude Sonnet vs Nova Pro on this dataset"
"Test these three prompt templates against my golden QA set"
"Generate a dataset from this PDF and run an eval"

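The agent picks the right mode, auto-generates whatever is missing (dataset, judges, criteria), runs the evaluation, opens the results viewer in your browser, and delivers a PDF report.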