# OpenJudge Built-in Graders
This tutorial is based on the official OpenJudge documentation page:
OpenJudge Built-in Graders Overview: https://modelscope.github.io/OpenJudge/built_in_graders/overview/
OpenJudge provides 50+ pre-built graders to evaluate AI responses across multiple dimensions (general quality, agent behaviors, multimodal, code & math, text matching, and format validation).
## 1. What is a “Grader” in OpenJudge?
In OpenJudge, a Grader is a standardized evaluation module that produces a structured score output (with explanations/metadata when available).
Key design features:
- Unified API design: graders follow a consistent interface with an `aevaluate()` method
- Standard return object includes: `score`, `reason`, and `metadata`
This makes it easy to compose graders into pipelines and compare results across tasks.
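As a minimal sketch of that interface, assuming a hypothetical import path and constructor arguments (the overview page does not spell these out):

```python
import asyncio

# Hypothetical import path and constructor options; check the OpenJudge docs
# for the actual module layout and configuration.
from openjudge.graders import RelevanceGrader

async def main() -> None:
    grader = RelevanceGrader(model="qwen-max")  # judge-model name is an assumed parameter

    # aevaluate() is the unified async entry point described above;
    # exact keyword arguments can differ between graders.
    result = await grader.aevaluate(
        query="What is the capital of France?",
        response="The capital of France is Paris.",
    )

    # Standard result fields per the docs: score, reason, metadata.
    print(result.score, result.reason, result.metadata)

asyncio.run(main())
```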
## 2. Two Implementation Types of Built-in Graders
OpenJudge organizes graders into two implementation styles:
### 2.1 LLM-Based graders
Best for:
- subjective evaluation
- nuanced response quality judgments
- safety or hallucination detection
### 2.2 Code-Based graders
Best for:
- deterministic evaluation
- zero-cost (no judge-model calls)
- high-speed batch scoring
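Because code-based graders make no judge-model calls, they are cheap enough to fan out over large batches. A rough sketch, using `StringMatchGrader` from the text graders listed later on this page (import path, constructor defaults, and keyword names are assumptions):

```python
import asyncio

from openjudge.graders import StringMatchGrader  # assumed import path

async def score_batch(pairs: list[tuple[str, str]]) -> list[float]:
    grader = StringMatchGrader()  # assumed: exact matching by default

    # Code-based graders are deterministic and fast, so concurrent batch
    # scoring with asyncio.gather is practical.
    results = await asyncio.gather(
        *(grader.aevaluate(response=resp, reference=ref) for resp, ref in pairs)
    )
    return [r.score for r in results]

pairs = [("Paris", "Paris"), ("Lyon", "Paris")]
print(asyncio.run(score_batch(pairs)))  # e.g. [1, 0] under exact matching
```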
## 3. Full List of Pre-built Graders (From the Official Page)
Below is the complete list of graders shown on the Built-in Graders Overview page, grouped by category.
### 3.1 General Graders
These graders evaluate fundamental response quality such as relevance, hallucination, harmfulness, instruction following, and correctness.

| Grader | What it evaluates | Type | Score Range |
|---|---|---|---|
| RelevanceGrader | How relevant a response is to the user query | LLM-Based | 1–5 |
| HallucinationGrader | Whether the response contains fabricated info unsupported by context | LLM-Based | 1–5 |
| HarmfulnessGrader | Harmful, offensive, or inappropriate content | LLM-Based | 1–5 |
| InstructionFollowingGrader | Whether the response follows the given instructions | LLM-Based | 1–5 |
| CorrectnessGrader | Whether the response matches the reference answer | LLM-Based | 1–5 |
#### Example prompt (RelevanceGrader)
```python
import textwrap

# English prompt template for RelevanceGrader
RELEVANCE_PROMPT_EN = textwrap.dedent(
"""
You are a professional data annotator responsible for evaluating how relevant the model response is to the user's query. Your task is to score according to the following criteria:
<Scoring Criteria>
A highly relevant response should:
- Directly address the user's question or request.
- Provide information that is on-topic and pertinent to the query.
- Include sufficient detail to satisfy the user's information needs.
- Stay focused without drifting to unrelated topics.
- For multi-turn conversations, maintain context awareness from previous exchanges.
Points should be deducted for:
- Completely off-topic or unrelated responses.
- Vague or superficial answers that lack specific information.
- Partial responses that omit key information requested.
- Responses that acknowledge the query but fail to provide useful content.
- Generic statements that don't specifically address the question.
</Scoring Criteria>
<Guidance>
- Carefully read the query (or conversation history) and model response.
- Determine if the response directly addresses what the user is asking.
- Check if the information provided is complete, partial, or missing.
- Assess whether the response stays on-topic or includes irrelevant content.
- For conversations, consider whether the response maintains context from earlier turns.
- The score should reflect how well the response aligns with the user's information needs.
</Guidance>
<Reminder>
The goal is to evaluate relevance to the query, not overall quality.
A score of 5 means the response is highly relevant and comprehensive.
A score of 1 means the response is completely irrelevant to the query.
</Reminder>
<query>
{query}
</query>
<response>
{response}
</response>
Additional context (ignore if empty):
<context>
{context}
</context>
The following is the correct response for your reference (ignore if empty):
<reference_response>
{reference_response}
</reference_response>
# Output Instructions
**Note**: If a reference response is provided, you may use it as a baseline for comparison to better assess the quality and relevance of the evaluated response.
Provide your evaluation in the following structured JSON format:
{
"score": <integer between 1 and 5, where 5 means highly relevant and 1 means completely irrelevant>,
"reason": "<brief explanation for the assigned score, specifically mentioning how the response addresses or fails to address the query>"
}
Scoring Scale:
- 5: Perfectly relevant: the response completely fulfills the user's search intent, accurately answering the question or providing the required information.
- 4: Highly relevant: the response largely meets the search requirements, possibly lacking some details or having minor inaccuracies, but still a high-quality and directly relevant result.
- 3: Partially relevant: the response has some connection to the query but does not fully meet the requirements; the user may need to further filter or supplement the information.
- 2: Weakly relevant: the response has only a weak connection to the query, possibly covering the same topic but deviating from the core intent, and has low practical value.
- 1: Irrelevant: the response is completely unrelated to the query, or contains misleading or incorrect information.
JSON:
"""
).strip()
```
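The prompt instructs the judge model to emit structured JSON with a `score` and a `reason`, which is what the grader's standard output is built from. For illustration only (the raw string below is made-up example output, not a real model response), a returned judgment parses like this:

```python
import json

# Illustrative judge output following the JSON schema defined in the prompt above.
raw_judgment = """
{
  "score": 4,
  "reason": "The response directly addresses the query but omits one requested detail."
}
"""

judgment = json.loads(raw_judgment)
assert 1 <= judgment["score"] <= 5  # scores follow the 1-5 scale from the prompt
print(judgment["score"], judgment["reason"])
```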
### 3.2 Agent Graders
Agent graders evaluate an AI agent’s lifecycle, not only final answers.
OpenJudge’s agent evaluation can cover:
- actions and alignment
- tool usage and tool-call correctness
- memory write/retrieval behavior
- planning feasibility
- reflection quality
- trajectory-level reasoning
#### 3.2.1 Action Graders

| Grader | What it evaluates | Type | Score Range |
|---|---|---|---|
| | Whether agent actions align with goals | LLM-Based | {0, 1} |
| | Detects repetitive action loops | Code-Based | {0, 1} |
#### 3.2.2 Tool Graders

| Grader | What it evaluates | Type | Score Range |
|---|---|---|---|
| | Whether tool choice is appropriate | LLM-Based | 1–5 |
| | Whether tool call is correct | LLM-Based | 1–5 |
| | Whether tool call sequence matches expectation | Code-Based | {0, 1} |
| | Whether tool calls succeeded | LLM-Based | {0, 1} |
| | Whether tool parameters are valid | LLM-Based | {0, 1} |
#### 3.2.3 Memory Graders

| Grader | What it evaluates | Type | Score Range |
|---|---|---|---|
| | Accuracy of stored memories | LLM-Based | {0, 1} |
| | Whether key details are preserved | LLM-Based | {0, 1} |
| | Quality of memory retrieval behavior | LLM-Based | {0, 1} |
#### 3.2.4 Plan & Reflection Graders

| Grader | What it evaluates | Type | Score Range |
|---|---|---|---|
| | Whether plans are executable | LLM-Based | {0, 1} |
| | Whether reflections are accurate | LLM-Based | {0, 1} |
| | Understanding of outcomes | LLM-Based | {0, 1} |
| | Awareness of task progress | LLM-Based | {0, 1} |
#### 3.2.5 Observation Graders

| Grader | What it evaluates | Type | Score Range |
|---|---|---|---|
| | Measures information gain from observations | Code-Based | [0, 1] |
#### 3.2.6 Trajectory Graders

| Grader | What it evaluates | Type | Score Range |
|---|---|---|---|
| | Comprehensive trajectory evaluation | LLM-Based | {0, 1} |
### 3.3 Text Graders
Text graders are fast, algorithm-based grading utilities, useful for similarity, string matching, and numeric comparisons.

| Grader | What it evaluates | Type | Score Range |
|---|---|---|---|
| SimilarityGrader | Text similarity with 15+ algorithms (BLEU, ROUGE, F1, etc.) | Code-Based | [0, 1] |
| StringMatchGrader | String matching (exact/prefix/suffix/regex, etc.) | Code-Based | {0, 1} |
| | Numeric comparison with tolerance | Code-Based | {0, 1} |
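As an example, `SimilarityGrader` (listed above) might be used as follows; the `algorithm` option and keyword argument names are assumptions, since the overview page only lists the supported algorithms:

```python
import asyncio

from openjudge.graders import SimilarityGrader  # assumed import path

async def main() -> None:
    # The overview lists 15+ algorithms (BLEU, ROUGE, F1, ...); selecting one
    # via an `algorithm` constructor argument is an assumption.
    grader = SimilarityGrader(algorithm="rouge")

    result = await grader.aevaluate(
        response="The cat sat on the mat.",
        reference="A cat was sitting on the mat.",
    )
    print(result.score)  # a float in [0, 1] per the table above

asyncio.run(main())
```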
### 3.4 Code Graders
These graders evaluate code correctness via execution, syntax checks, and style.

| Grader | What it evaluates | Type | Score Range |
|---|---|---|---|
| CodeExecutionGrader | Executes code against test cases | Code-Based | [0, 1] |
| | Validates Python syntax using AST | Code-Based | {0, 1} |
| | Checks indentation and naming conventions | Code-Based | [0, 1] |
| | Compares code patches using … | Code-Based | [0, 1] |
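For instance, `CodeExecutionGrader` (listed above) might be driven like this; the input shape (a code string plus a list of test cases) and the keyword names are assumptions:

```python
import asyncio

from openjudge.graders import CodeExecutionGrader  # assumed import path

async def main() -> None:
    grader = CodeExecutionGrader()  # sandbox/runtime options are not shown on the overview page

    code = "def add(a, b):\n    return a + b"
    tests = [
        "assert add(1, 2) == 3",
        "assert add(-1, 1) == 0",
    ]

    # Assumed keyword names: the grader executes the code against the test cases.
    result = await grader.aevaluate(code=code, test_cases=tests)
    print(result.score)  # fraction of passing tests, in [0, 1] per the table above

asyncio.run(main())
```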
### 3.5 Math Graders
These graders verify mathematical expressions and computations.

| Grader | What it evaluates | Type | Score Range |
|---|---|---|---|
| | Verifies math expressions (LaTeX & plain text) | Code-Based | {0, 1} |
### 3.6 Format Graders
Format graders validate structural constraints and penalize invalid formatting.

| Grader | What it evaluates | Type | Score Range |
|---|---|---|---|
| JsonValidatorGrader | Validates JSON syntax | Code-Based | {0, 1} |
| | Deep comparison of JSON structures | Code-Based | {0, 1} |
| | Penalizes responses that are too short or too long | Code-Based | ≤0 (penalty) |
| | Penalizes repetitive n-grams | Code-Based | ≤0 (penalty) |
| | Checks … | Code-Based | {0, 1} |
| | Validates tool call format with JSON | Code-Based | {0, 1} |
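A quick sketch with `JsonValidatorGrader` (listed above); the keyword name is an assumption:

```python
import asyncio

from openjudge.graders import JsonValidatorGrader  # assumed import path

async def main() -> None:
    grader = JsonValidatorGrader()

    valid = await grader.aevaluate(response='{"status": "ok"}')
    broken = await grader.aevaluate(response='{"status": ok}')  # not valid JSON

    # Expected 1 and 0 respectively, matching the {0, 1} range above.
    print(valid.score, broken.score)

asyncio.run(main())
```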
### 3.7 Multimodal Graders
Multimodal graders evaluate vision-language alignment and image-related outputs.

| Grader | What it evaluates | Type | Score Range |
|---|---|---|---|
| | Image-text coherence | LLM-Based | {0, 1} |
| | Whether images help understanding | LLM-Based | {0, 1} |
| | Text-to-image generation quality | LLM-Based | {0, 1} |
| | Image editing quality | LLM-Based | {0, 1} |
## 4. How to Choose the Right Built-in Grader
A quick selection guide:
### If you are evaluating “answer quality”
Start with:
- RelevanceGrader
- InstructionFollowingGrader
- CorrectnessGrader
- HallucinationGrader
- HarmfulnessGrader
### If you are evaluating an “agent system”
Use the Agent graders for:
- tool correctness
- action alignment
- trajectory quality
- memory behavior
### If you are evaluating deterministic tasks
Use code-based graders for speed and reproducibility:
- SimilarityGrader
- StringMatchGrader
- CodeExecutionGrader
- JsonValidatorGrader
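Since all graders share the same `aevaluate()` interface, mixing an LLM-based and a code-based grader into one small pipeline is straightforward. A sketch under the same assumptions as the earlier examples (import paths, constructor arguments, and keyword names are not confirmed by the overview page):

```python
import asyncio

from openjudge.graders import RelevanceGrader, StringMatchGrader  # assumed import paths

async def grade_sample(query: str, response: str, reference: str) -> dict[str, float]:
    relevance = RelevanceGrader(model="qwen-max")  # LLM-based, assumed judge-model parameter
    exact = StringMatchGrader()                    # code-based, assumed defaults

    rel_res, exact_res = await asyncio.gather(
        relevance.aevaluate(query=query, response=response),
        exact.aevaluate(response=response, reference=reference),
    )
    return {"relevance": rel_res.score, "exact_match": exact_res.score}

scores = asyncio.run(grade_sample("Capital of France?", "Paris", "Paris"))
print(scores)
```

For anything beyond small experiments, prefer the evaluation-task mechanism mentioned in the Next Steps below over a hand-rolled loop like this.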
## 5. Next Steps (Suggested by OpenJudge)
After reviewing built-in graders, the docs recommend exploring:
- Running graders at scale with evaluation tasks
- Creating custom graders when built-ins don’t cover your requirements
## 6. Summary
OpenJudge’s built-in graders provide a robust library spanning:
- General response quality (relevance, hallucination, safety, instruction following, correctness)
- Agent lifecycle evaluation (actions, tools, memory, planning, reflection, trajectory)
- Text similarity and matching
- Code & math correctness checks
- Format validation and penalties
- Multimodal coherence/helpfulness/generation/editing quality

All graders share a unified interface and are designed for reliable evaluation workflows.