LLM Evaluation Agent

Describe what you want to evaluate in natural language. The expert AI agent handles dataset generation, judge configuration, execution, and analysis end-to-end, and hands you back a PDF report.

Features

Expert agent interface: The agent knows evaluation best practices, recommends criteria, and validates configurations before execution. No config files or CLI expertise needed.

Jury system: Multiple judges from different model families (e.g. Claude Sonnet, Nova Pro, Nemotron) each evaluate distinct aspects of every response: correctness, reasoning, completeness. Combining diverse judge families reduces self-preference bias, and aggregating weak signals from diverse judges and criteria produces stronger results than any single judge (Verma et al., 2025; Frick et al., 2025).

Adaptable binary scoring: Binary pass/fail per criterion rather than subjective numeric scales, an approach shown to produce more reliable results across judges (Chiang et al., 2025). Criteria are tailored by the agent to what you're evaluating.

Document-grounded synthetic data: Upload PDFs, knowledge bases, or product docs and generate QA pairs grounded in your actual content, reflecting real customer scenarios.

Agentic eval support: Evaluate any agent calling Bedrock (Strands, LangChain, custom boto3) with zero code modification via OpenTelemetry instrumentation.
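To make the jury and binary-scoring ideas concrete, here is a minimal sketch of how per-judge pass/fail verdicts could be combined. This README does not say how the tool aggregates verdicts, so the strict majority vote, the function names, and the judge labels below are all illustrative assumptions, not the project's implementation.

```python
from collections import defaultdict


def aggregate_jury(verdicts):
    """Combine binary verdicts into one pass/fail result per criterion.

    verdicts: iterable of (judge, criterion, passed) tuples.
    Assumption: strict majority vote; a tie counts as a fail.
    """
    votes = defaultdict(list)
    for _judge, criterion, passed in verdicts:
        votes[criterion].append(bool(passed))
    return {c: 2 * sum(v) > len(v) for c, v in votes.items()}


def overall_pass_rate(per_criterion):
    """Fraction of criteria that passed after aggregation."""
    if not per_criterion:
        return 0.0
    return sum(per_criterion.values()) / len(per_criterion)


# Hypothetical judges and verdicts, for illustration only.
verdicts = [
    ("judge_a", "correctness", True),
    ("judge_b", "correctness", True),
    ("judge_c", "correctness", False),
    ("judge_a", "completeness", False),
    ("judge_b", "completeness", True),
    ("judge_c", "completeness", False),
]
result = aggregate_jury(verdicts)
print(result)
print(overall_pass_rate(result))
```

Binary verdicts keep the aggregation simple: each criterion reduces to a vote count, with no need to reconcile numeric scales that different judges may calibrate differently.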

Installation Instructions

Prerequisites

AWS credentials with Bedrock model access
uv installed
Claude Code, Cursor, Kiro, VS Code, or any MCP-compatible IDE
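A quick local check for the first two prerequisites might look like the sketch below. It is a heuristic only: it looks for the `uv` and `uvx` executables on `PATH` and for common AWS environment variables, and it does not verify that the credentials are valid or that Bedrock model access is enabled. Credentials configured in other ways (for example a default profile in `~/.aws`) will not be detected. The function name and result keys are hypothetical.

```python
import os
import shutil


def check_prerequisites(env=None):
    """Return a dict of best-effort prerequisite checks.

    env: mapping of environment variables (defaults to os.environ).
    """
    env = os.environ if env is None else env
    return {
        # uv ships the uvx launcher used by the install commands.
        "uv": shutil.which("uv") is not None,
        "uvx": shutil.which("uvx") is not None,
        # Hint only: credentials may also come from ~/.aws or SSO.
        "aws_env_hint": any(
            key in env for key in ("AWS_PROFILE", "AWS_ACCESS_KEY_ID")
        ),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'not found'}")
```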

Install

Pick your IDE and paste or click.

Claude Code (single CLI command):

claude mcp add eval -s user -- uvx --from llm-evaluation-system eval-mcp

Cursor (one-click link): Install eval-mcp in Cursor

Kiro (add to ~/.kiro/settings/mcp.json):

{
  "mcpServers": {
    "eval": {
      "command": "uvx",
      "args": ["--from", "llm-evaluation-system", "eval-mcp"]
    }
  }
}
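If the Kiro config file already lists other MCP servers, pasting the snippet over it would drop them. The sketch below merges the `eval` entry into an existing file instead. It assumes the file is plain JSON with a top-level `mcpServers` object, as in the snippet above; the helper name `add_eval_server` is hypothetical and not part of the package.

```python
import json
from pathlib import Path

# Same server entry as in the Kiro snippet above.
EVAL_SERVER = {
    "command": "uvx",
    "args": ["--from", "llm-evaluation-system", "eval-mcp"],
}


def add_eval_server(config_path):
    """Add or update the "eval" MCP server, keeping other entries intact."""
    path = Path(config_path).expanduser()
    text = path.read_text().strip() if path.exists() else ""
    config = json.loads(text) if text else {}
    config.setdefault("mcpServers", {})["eval"] = EVAL_SERVER
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For example, `add_eval_server("~/.kiro/settings/mcp.json")` would update the file named above. Back up the file first if it contains settings you care about.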

Codex CLI (add to ~/.codex/config.toml, then restart Codex):

[mcp_servers.eval]
command = "uvx"
args = ["--from", "llm-evaluation-system", "eval-mcp"]

VS Code with GitHub Copilot MCP (single CLI command):

code --add-mcp '{"name":"eval","command":"uvx","args":["--from","llm-evaluation-system","eval-mcp"]}'

Installing with a coding agent? Point it at INSTALL.md. It handles the config edits and asks about optional S3 team sharing.

Upgrading

uvx caches the resolved version of each package. To pick up the latest release, invalidate the cache:

uv cache clean llm-evaluation-system

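Then restart your IDE. The next launch fetches the latest published version and caches it.

Usage

Ask your AI assistant to evaluate an agent, model, or prompt, using a dataset you provide or one generated from documents or context.

"Evaluate my agent at ./my_agent.py"
"Compare Claude Sonnet and Nova Pro on this dataset"
"Test these three prompt templates against my QA set"
"Generate a dataset from this PDF and run an evaluation"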

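The agent picks the appropriate mode, automatically generates anything missing (dataset, judges, evaluation criteria), runs the evaluation, opens the results viewer in your browser, and delivers a PDF report.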