OpenJudge Built-in Graders#

This tutorial is based on the official OpenJudge “Built-in Graders Overview” documentation page.

OpenJudge provides 50+ pre-built graders to evaluate AI responses across multiple dimensions (general quality, agent behaviors, multimodal, code & math, text matching, and format validation).


1. What is a “Grader” in OpenJudge?#

In OpenJudge, a Grader is a standardized evaluation module that produces a structured score output (with explanations/metadata when available).

Key design features:

  • Unified API design: graders follow a consistent interface with an aevaluate() method

  • Standard return object includes:

    • score

    • reason

    • metadata

This makes it easy to compose graders into pipelines and compare results across tasks.
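
As a rough usage sketch, the snippet below shows that unified interface in action. The import path, constructor arguments, and the keyword names passed to aevaluate() are assumptions for illustration; check your installed OpenJudge version for the exact signatures.

import asyncio

# NOTE: the import path and call signature below are assumptions for
# illustration; consult your OpenJudge installation for the exact API.
from openjudge.graders import RelevanceGrader

async def main():
    grader = RelevanceGrader()  # an LLM judge model is typically configured here

    # Every built-in grader exposes the same async entry point.
    result = await grader.aevaluate(
        query="What is the capital of France?",
        response="The capital of France is Paris.",
    )

    # The standard return object carries a score, a reason, and metadata.
    print(result.score, result.reason, result.metadata)

asyncio.run(main())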


2. Two Implementation Types of Built-in Graders#

OpenJudge organizes graders into two implementation styles (a short sketch contrasting the two follows these lists):

2.1 LLM-Based graders#

Best for:

  • subjective evaluation

  • nuanced response quality judgments

  • safety or hallucination detection

2.2 Code-Based graders#

Best for:

  • deterministic evaluation

  • zero-cost (no judge-model calls)

  • high-speed batch scoring
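
To make the two styles concrete, here is a small illustrative sketch. These helpers only mimic the two implementation styles and are not OpenJudge classes: a code-based check is pure deterministic Python, while an LLM-based grader prompts a judge model and parses a structured verdict.

import json
import re

def code_based_exact_match(response: str, reference: str) -> float:
    """Deterministic, zero-cost check in the spirit of a code-based grader."""
    return 1.0 if response.strip() == reference.strip() else 0.0

def parse_llm_judgment(raw_judge_output: str) -> dict:
    """LLM-based graders send a prompt to a judge model and parse a structured
    verdict; only the parsing step is shown here, on a canned judge reply."""
    match = re.search(r"\{.*\}", raw_judge_output, re.DOTALL)
    return json.loads(match.group(0)) if match else {"score": None, "reason": "unparsable"}

print(code_based_exact_match("Paris", "Paris"))                  # 1.0
print(parse_llm_judgment('{"score": 5, "reason": "relevant"}'))  # {'score': 5, ...}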


3. Full List of Pre-built Graders (From the Official Page)#

Below is the complete list of graders shown on the Built-in Graders Overview page, grouped by category.


3.1 General Graders#

These graders evaluate fundamental response quality such as relevance, hallucination, harmfulness, instruction following, and correctness.

| Grader | What it evaluates | Type | Score Range |
| --- | --- | --- | --- |
| RelevanceGrader | How relevant a response is to the user query | LLM-Based | 1–5 |
| HallucinationGrader | Whether response contains fabricated info unsupported by context | LLM-Based | 1–5 |
| HarmfulnessGrader | Harmful, offensive, inappropriate content | LLM-Based | 1–5 |
| InstructionFollowingGrader | Whether response follows given instructions | LLM-Based | 1–5 |
| CorrectnessGrader | Whether response matches reference answer | LLM-Based | 1–5 |


Example prompt (RelevanceGrader)#

import textwrap

# English Prompt
RELEVANCE_PROMPT_EN = textwrap.dedent(
    """
You are a professional data annotator responsible for evaluating how relevant the model response is to the user's query. Your task is to score according to the following criteria:

<Scoring Criteria>
A highly relevant response should:
- Directly address the user's question or request.
- Provide information that is on-topic and pertinent to the query.
- Include sufficient detail to satisfy the user's information needs.
- Stay focused without drifting to unrelated topics.
- For multi-turn conversations, maintain context awareness from previous exchanges.

Points should be deducted for:
- Completely off-topic or unrelated responses.
- Vague or superficial answers that lack specific information.
- Partial responses that omit key information requested.
- Responses that acknowledge the query but fail to provide useful content.
- Generic statements that don't specifically address the question.
</Scoring Criteria>

<Guidance>
- Carefully read the query (or conversation history) and model response.
- Determine if the response directly addresses what the user is asking.
- Check if the information provided is complete, partial, or missing.
- Assess whether the response stays on-topic or includes irrelevant content.
- For conversations, consider whether the response maintains context from earlier turns.
- The score should reflect how well the response aligns with the user's information needs.
</Guidance>

<Reminder>
The goal is to evaluate relevance to the query, not overall quality.
A score of 5 means the response is highly relevant and comprehensive.
A score of 1 means the response is completely irrelevant to the query.
</Reminder>
<query>
{query}
</query>

<response>
{response}
</response>

Additional context (ignore if empty):
<context>
{context}
</context>

The following is the correct response for your reference (ignore if empty):
<reference_response>
{reference_response}
</reference_response>

# Output Instructions
**Note**: If a reference response is provided, you may use it as a baseline for comparison to better assess the quality and relevance of the evaluated response.

Provide your evaluation in the following structured JSON format:
{
    "score": <integer between 1 and 5, where 5 means highly relevant and 1 means completely irrelevant>,
    "reason": "<brief explanation for the assigned score, specifically mentioning how the response addresses or fails to address the query>"
}

Scoring Scale:
- 5: Perfectly relevant: the response completely fulfills the user's search intent, accurately answering the question or providing the required information.
- 4: Highly relevant: the response largely meets the search requirements, possibly lacking some details or having minor inaccuracies, but still a high-quality and directly relevant result.
- 3: Partially relevant: the response has some connection to the query but does not fully meet the requirements; the user may need to further filter or supplement the information.
- 2: Weakly relevant: the response has only a weak connection to the query, possibly covering the same topic but deviating from the core intent, and has low practical value.
- 1: Irrelevant: the response is completely unrelated to the query, or contains misleading or incorrect information.

JSON:
"""
).strip()
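
If you want to experiment with this template outside the library, note that plain str.format() would fail on the literal braces in the JSON output example, so a direct placeholder substitution is safer. The helper below is our own illustration of the filling step (reusing the RELEVANCE_PROMPT_EN constant above), not how OpenJudge necessarily renders the prompt.

def render_relevance_prompt(query: str, response: str,
                            context: str = "", reference_response: str = "") -> str:
    # Substitute each named placeholder directly; str.format() would choke on
    # the literal {...} JSON example embedded in the template.
    prompt = RELEVANCE_PROMPT_EN
    for key, value in {
        "query": query,
        "response": response,
        "context": context,
        "reference_response": reference_response,
    }.items():
        prompt = prompt.replace("{" + key + "}", value)
    return prompt

print(render_relevance_prompt(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
)[:200])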

3.2 Agent Graders#

Agent graders evaluate an AI agent’s lifecycle, not only final answers.

OpenJudge’s agent evaluation can cover:

  • actions and alignment

  • tool usage and tool-call correctness

  • memory write/retrieval behavior

  • planning feasibility

  • reflection quality

  • trajectory-level reasoning

3.2.1 Action Graders#

| Grader | What it evaluates | Type | Score Range |
| --- | --- | --- | --- |
| ActionAlignmentGrader | Whether agent actions align with goals | LLM-Based | {0, 1} |
| ActionLoopDetectionGrader | Detects repetitive action loops | Code-Based | {0, 1} |
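
ActionLoopDetectionGrader is code-based, so its core idea can be illustrated with plain Python. The sketch below captures only the idea (flag a trajectory when the same action repeats too many times in a row); the actual grader may use a different heuristic and input schema.

from typing import List

def has_no_action_loop(actions: List[str], max_repeats: int = 3) -> int:
    """Return 1 if no action repeats max_repeats times in a row, else 0."""
    run = 1
    for prev, curr in zip(actions, actions[1:]):
        run = run + 1 if curr == prev else 1
        if run >= max_repeats:
            return 0  # looping detected -> fail in a {0, 1} scheme
    return 1

print(has_no_action_loop(["search", "search", "search", "click"]))  # 0
print(has_no_action_loop(["search", "read", "summarize"]))          # 1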

3.2.2 Tool Graders#

| Grader | What it evaluates | Type | Score Range |
| --- | --- | --- | --- |
| ToolSelectionGrader | Whether tool choice is appropriate | LLM-Based | 1–5 |
| ToolCallAccuracyGrader | Whether tool call is correct | LLM-Based | 1–5 |
| ToolCallSequenceMatchGrader | Whether tool call sequence matches expectation | Code-Based | {0, 1} |
| ToolCallSuccessGrader | Whether tool calls succeeded | LLM-Based | {0, 1} |
| ToolParameterCheckGrader | Whether tool parameters are valid | LLM-Based | {0, 1} |
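
ToolCallSequenceMatchGrader is likewise code-based. A minimal sketch of that kind of check follows; the real grader's input format and matching rules are not documented here, so treat the field names as assumptions.

from typing import Dict, List

def tool_sequence_matches(expected: List[str], actual_calls: List[Dict]) -> int:
    """Return 1 if the observed tool-name sequence equals the expected one."""
    actual_names = [call.get("name") for call in actual_calls]
    return 1 if actual_names == expected else 0

calls = [
    {"name": "search_web", "arguments": {"q": "weather in Paris"}},
    {"name": "get_forecast", "arguments": {"city": "Paris"}},
]
print(tool_sequence_matches(["search_web", "get_forecast"], calls))  # 1
print(tool_sequence_matches(["get_forecast"], calls))                # 0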

3.2.3 Memory Graders#

| Grader | What it evaluates | Type | Score Range |
| --- | --- | --- | --- |
| MemoryAccuracyGrader | Accuracy of stored memories | LLM-Based | {0, 1} |
| MemoryDetailPreservationGrader | Whether key details are preserved | LLM-Based | {0, 1} |
| MemoryRetrievalEffectivenessGrader | Quality of memory retrieval behavior | LLM-Based | {0, 1} |

3.2.4 Plan & Reflection Graders#

| Grader | What it evaluates | Type | Score Range |
| --- | --- | --- | --- |
| PlanFeasibilityGrader | Whether plans are executable | LLM-Based | {0, 1} |
| ReflectionAccuracyGrader | Whether reflections are accurate | LLM-Based | {0, 1} |
| ReflectionOutcomeUnderstandingGrader | Understanding of outcomes | LLM-Based | {0, 1} |
| ReflectionProgressAwarenessGrader | Awareness of task progress | LLM-Based | {0, 1} |

3.2.5 Observation Graders#

| Grader | What it evaluates | Type | Score Range |
| --- | --- | --- | --- |
| ObservationInformationGainGrader | Measures information gain from observations | Code-Based | [0, 1] |

3.2.6 Trajectory Graders#

| Grader | What it evaluates | Type | Score Range |
| --- | --- | --- | --- |
| TrajectoryComprehensiveGrader | Comprehensive trajectory evaluation | LLM-Based | {0, 1} |


3.3 Text Graders#

Text graders are fast, algorithm-based grading utilities, useful for similarity, string matching, and numeric comparisons.

| Grader | What it evaluates | Type | Score Range |
| --- | --- | --- | --- |
| SimilarityGrader | Text similarity with 15+ algorithms (BLEU, ROUGE, F1, etc.) | Code-Based | [0, 1] |
| StringMatchGrader | String matching (exact/prefix/suffix/regex, etc.) | Code-Based | {0, 1} |
| NumberAccuracyGrader | Numeric comparison with tolerance | Code-Based | {0, 1} |
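
As a concrete example of what a code-based similarity score computes, here is a standalone token-level F1, one of the classic measures in the BLEU/ROUGE/F1 family that SimilarityGrader supports. It is an illustrative re-implementation, not OpenJudge's code.

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, in [0, 1]."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the cat sat on the mat", "a cat sat on a mat"), 3))  # ~0.667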


3.4 Code Graders#

These graders evaluate code via execution against test cases, syntax validation, and style checks.

| Grader | What it evaluates | Type | Score Range |
| --- | --- | --- | --- |
| CodeExecutionGrader | Executes code against test cases | Code-Based | [0, 1] |
| SyntaxCheckGrader | Validates Python syntax using AST | Code-Based | {0, 1} |
| CodeStyleGrader | Checks indentation and naming conventions | Code-Based | [0, 1] |
| PatchSimilarityGrader | Compares code patches using SequenceMatcher | Code-Based | [0, 1] |
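
Two of these checks map directly onto the Python standard library, which makes them easy to reason about. The sketch below shows an AST-based syntax check and a SequenceMatcher-based patch similarity in the style of SyntaxCheckGrader and PatchSimilarityGrader; it is illustrative only, not the library's internals.

import ast
import difflib

def syntax_ok(source: str) -> int:
    """Binary Python syntax check via the ast module."""
    try:
        ast.parse(source)
        return 1
    except SyntaxError:
        return 0

def patch_similarity(patch_a: str, patch_b: str) -> float:
    """[0, 1] similarity of two patches via difflib.SequenceMatcher."""
    return difflib.SequenceMatcher(None, patch_a, patch_b).ratio()

print(syntax_ok("def add(a, b):\n    return a + b"))    # 1
print(syntax_ok("def add(a, b) return a + b"))          # 0
print(round(patch_similarity("-x = 1\n+x = 2", "-x = 1\n+x = 3"), 2))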


3.5 Math Graders#

These graders verify mathematical expressions and computations.

| Grader | What it evaluates | Type | Score Range |
| --- | --- | --- | --- |
| MathExpressionVerifyGrader | Verifies math expressions (LaTeX & plain text) | Code-Based | {0, 1} |
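
To see what symbolic verification involves, here is a plain-text-only sketch using SymPy. MathExpressionVerifyGrader also handles LaTeX and may use a different backend entirely; SymPy is our own assumption for illustration.

from sympy import simplify
from sympy.parsing.sympy_parser import parse_expr

def expressions_equivalent(candidate: str, reference: str) -> int:
    """Return 1 if the two plain-text expressions simplify to the same thing."""
    try:
        difference = simplify(parse_expr(candidate) - parse_expr(reference))
        return 1 if difference == 0 else 0
    except Exception:
        return 0  # unparsable input counts as a failed verification

print(expressions_equivalent("(x + 1)**2", "x**2 + 2*x + 1"))  # 1
print(expressions_equivalent("2*x", "x + 3"))                  # 0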


3.6 Format Graders#

Format graders validate structural constraints and penalize invalid formatting.

| Grader | What it evaluates | Type | Score Range |
| --- | --- | --- | --- |
| JsonValidatorGrader | Validates JSON syntax | Code-Based | {0, 1} |
| JsonMatchGrader | Deep comparison of JSON structures | Code-Based | {0, 1} |
| LengthPenaltyGrader | Penalizes too short/long responses | Code-Based | ≤0 (penalty) |
| NgramRepetitionPenaltyGrader | Penalizes repetitive n-grams | Code-Based | ≤0 (penalty) |
| ReasoningFormatGrader | Checks `<think>` and `<answer>` tags | Code-Based | {0, 1} |
| ReasoningToolCallFormatGrader | Validates tool call format with JSON | Code-Based | {0, 1} |
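
Two of these validations are easy to mirror with the standard library. The sketch below shows a JSON-validity check and a `<think>`/`<answer>` tag check in the spirit of JsonValidatorGrader and ReasoningFormatGrader; the real graders' exact rules (tag ordering, strictness, partial credit) may differ.

import json
import re

def json_valid(text: str) -> int:
    """Binary JSON syntax check."""
    try:
        json.loads(text)
        return 1
    except json.JSONDecodeError:
        return 0

def reasoning_format_ok(text: str) -> int:
    """Check for <think>...</think> followed by <answer>...</answer>."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1 if re.search(pattern, text, re.DOTALL) else 0

print(json_valid('{"score": 5}'))                                               # 1
print(reasoning_format_ok("<think>reasoning here</think><answer>42</answer>"))  # 1
print(reasoning_format_ok("just an answer with no tags"))                       # 0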


3.7 Multimodal Graders#

Multimodal graders evaluate vision-language alignment and image-related outputs.

| Grader | What it evaluates | Type | Score Range |
| --- | --- | --- | --- |
| ImageCoherenceGrader | Image-text coherence | LLM-Based | {0, 1} |
| ImageHelpfulnessGrader | Whether images help understanding | LLM-Based | {0, 1} |
| TextToImageGrader | Text-to-image generation quality | LLM-Based | {0, 1} |
| ImageEditingGrader | Image editing quality | LLM-Based | {0, 1} |


4. How to Choose the Right Built-in Grader#

A quick selection guide (a sketch that composes several of these graders into one pipeline appears at the end of this section):

If you are evaluating “answer quality”#

Start with:

  • RelevanceGrader

  • InstructionFollowingGrader

  • CorrectnessGrader

  • HallucinationGrader

  • HarmfulnessGrader

If you are evaluating an “agent system”#

Use the Agent graders for:

  • tool correctness

  • action alignment

  • trajectory quality

  • memory behavior

If you are evaluating deterministic tasks#

Use code-based graders for speed and reproducibility:

  • SimilarityGrader

  • StringMatchGrader

  • CodeExecutionGrader

  • JsonValidatorGrader
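
Because every grader shares the same aevaluate() interface, different choices from this guide can be run side by side over the same sample. The sketch below does that with asyncio.gather; the import path, constructor arguments, and keyword names are assumptions to verify against your OpenJudge version.

import asyncio

# NOTE: import path and call signatures are assumptions for illustration.
from openjudge.graders import (
    HallucinationGrader,
    InstructionFollowingGrader,
    RelevanceGrader,
)

async def grade_sample(query: str, response: str) -> dict:
    graders = {
        "relevance": RelevanceGrader(),
        "instruction_following": InstructionFollowingGrader(),
        "hallucination": HallucinationGrader(),
    }
    # Run all graders concurrently and collect their scores by name.
    results = await asyncio.gather(
        *(g.aevaluate(query=query, response=response) for g in graders.values())
    )
    return {name: result.score for name, result in zip(graders, results)}

scores = asyncio.run(grade_sample(
    "Summarize the return policy.",
    "Items can be returned within 30 days with a receipt.",
))
print(scores)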


5. Next Steps (Suggested by OpenJudge)#

After reviewing built-in graders, the docs recommend exploring:

  • Running graders at scale with evaluation tasks

  • Creating custom graders when built-ins don’t cover your requirements


6. Summary#

OpenJudge’s built-in graders provide a robust library spanning:

  • General response quality (relevance, hallucination, safety, instruction following, correctness)

  • Agent lifecycle evaluation (actions, tools, memory, planning, reflection, trajectory)

  • Text similarity and matching

  • Code & math correctness checks

  • Format validation and penalties

  • Multimodal coherence/helpfulness/generation/editing quality

All graders share a unified interface and are designed for reliable evaluation workflows.