Evaluations provide a structured framework for testing, scoring, and analyzing agent behavior before production deployment. You can simulate conversations using different user personas and test scenarios, evaluate agent responses using automated evaluators, and track evaluation results over time. Evaluations help you:
  • Test agents across different user behaviors and conversation flows.
  • Validate the quality, safety, and tool usage of the response.
  • Benchmark results against expected outcomes.
  • Identify gaps before production deployment.
  • Improve overall agent reliability and trustworthiness.
The evaluation workflow consists of the following components:
| Component | Description |
| --- | --- |
| Personas | Simulated user profiles with configurable communication styles, goals, behaviors, and constraints. |
| Scenarios | Conversation flows and test cases used to evaluate agent behavior. |
| Evaluators | Scoring mechanisms that assess conversation quality, safety, efficiency, tool usage, and other metrics. |
| Eval Sets | Reusable evaluation configurations that combine personas, scenarios, and evaluators. |
| Runs | Executed evaluation sessions and results. |
Navigation: Go to your project and select Evaluate > Evals.
Personas, scenarios, evaluators, and eval sets are reusable across multiple evaluations within the same project.

Evaluation Workflow

You can evaluate agents in two ways:
| Evaluation Type | Description |
| --- | --- |
| Quick Eval | Automatically generates personas, scenarios, evaluators, and runs using AI for rapid testing and iteration. |
| Manual Eval | You manually create personas, scenarios, evaluators, and eval sets for controlled, reusable evaluation workflows. |

Quick Eval

Use Quick Eval for:
  • Rapid testing
  • Early-stage validation
  • Smoke testing
  • Fast iteration during development
The platform automatically generates the required personas, scenarios, evaluators, and evaluation runs.

Manual Evaluation Workflow

  1. Create Personas: Define simulated user profiles with specific communication styles, goals, behaviors, and constraints.
  2. Create Scenarios: Define conversation flows, expected outcomes, milestones, and user intents used to test agent behavior.
  3. Create Evaluators: Configure evaluators to measure response quality, safety, efficiency, tool usage, and other evaluation criteria.
  4. Create Eval Sets: Combine personas, scenarios, and evaluators into reusable evaluation configurations.
  5. Run Evaluations: Execute evaluation runs to simulate conversations and generate evaluator scores and transcripts.
  6. Analyze Results: Review scores, evaluator reasoning, regressions, transcripts, traces, and execution metrics to improve agent performance.

Personas

Personas represent different types of users who interact with your agent. Each persona simulates unique communication styles, domain expertise, goals, behaviors, and constraints to help test how the agent performs across varied user interactions.

Create a Persona

  1. Go to Evaluate > Evals > Personas.
  2. Click Create Persona.
  3. Specify persona details such as:
  • Communication style
  • Domain knowledge
  • Behavioral traits
  • Goals and constraints
  • Optional session variables
| Field | Description | Options |
| --- | --- | --- |
| Name | Unique name within the project | For example, Impatient Business Traveler |
| Communication Style | Defines how the persona phrases messages | Casual, formal, technical, terse, verbose |
| Domain Knowledge | Defines how much the persona knows about the topic | Beginner, intermediate, expert |
| Behavior Traits | Specific behaviors the persona exhibits | Free-text tags, for example, "asks follow-ups", "impatient" |
| Goals | Defines what the persona is trying to accomplish | Free text |
| Constraints | Rules the persona follows during conversation | Free text |
  4. Select an adversarial behavior type if you want to simulate edge cases or malicious interactions.
  5. Click Create.

Example Persona

PERSONA:
  name: "Impatient Business Traveler"
  communication_style: terse
  domain_knowledge: expert
  behavior_traits:
    - impatient
    - asks_follow_up_questions
  goal: "Rebook a cancelled flight quickly"
  constraint: "Avoid unnecessary conversation"

Adversarial Persona Types

You can simulate adversarial or edge-case user behaviors using the Adversarial Type field. To test agent safety and robustness:
  1. Enable Adversarial while creating a persona.
  2. Select the adversarial type.
| Type | Purpose |
| --- | --- |
| Prompt Injection | Attempts to override agent instructions |
| Social Engineering | Attempts to extract sensitive information |
| Off-topic Derailer | Redirects conversations away from the intended agent goal |
| Abusive User | Uses hostile or inappropriate language |
| Edge Case Explorer | Sends unusual or unexpected inputs (empty, very long messages) |

Additional Options

  • Edit, duplicate, or delete personas from the Personas page.
  • Create multiple personas for different user behaviors and communication patterns.
  • Reuse personas across multiple eval sets within the same project.

AI-Generated Personas

Instead of defining personas manually, use the Generate with AI option to automatically create personas based on your agent’s domain and objectives. Generated personas typically include a mix of:
  • Communication styles
  • Knowledge levels
  • Behavioral patterns
  • User goals
You can review and edit the generated personas after they’re created.

Troubleshooting Personas

| Issue | Recommendation |
| --- | --- |
| The persona's behavior is inconsistent | Refine the goals and constraints fields |
| The persona's responses are unrealistic | Add more specific behavioral traits and communication styles |
| The persona is too passive or aggressive | Adjust goals, constraints, and adversarial settings |

Scenarios

Scenarios define the conversation flow, user intent, and expected outcomes used during evaluations. Each scenario tests how the agent handles a specific task, behavior, or outcome.

Create a Test Scenario

  1. Go to Evaluate > Evals > Scenarios.
  2. Click Create Scenario.
  3. Specify the following scenario details.
  4. Click Create.
| Field | Description |
| --- | --- |
| Name | Unique name within the project. |
| Category | Grouping label, for example, booking, returns, or auth. |
| Difficulty | Defines the scenario's complexity level: easy, medium, or hard. |
| Entry Agent | The agent that starts the conversation (optional). |
| Initial Message | The first user message that starts the scenario. |
| Expected Outcome | Description of what a successful conversation should achieve. |
| Max Turns | Maximum number of conversation turns before timeout. |
| Expected Milestones | Key checkpoints the conversation is expected to reach. |
| Agent Path | Expected sequence of agents for multi-agent projects. |

Example Scenario

SCENARIO:
  name: "Flight Rebooking After Cancellation"
  category: booking
  difficulty: medium

  initial_message: >
    My flight was cancelled and I need to rebook for tomorrow.

  expected_outcome: >
    Agent identifies the cancelled booking, offers alternatives,
    and confirms a new flight.

  max_turns: 15

  expected_milestones:
    - "Identify cancelled flight"
    - "Present rebooking options"
    - "Confirm new booking"

  agent_path:
    - "Supervisor"
    - "Booking_Manager"

Bulk Import Personas and Scenarios

Use the API to programmatically create multiple personas or scenarios.
curl -X POST /api/projects/:projectId/eval-personas \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Confused First-Time User",
    "communicationStyle": "verbose",
    "domainKnowledge": "beginner",
    "behaviorTraits": [
      "asks for clarification",
      "repeats questions"
    ],
    "goals": "Complete a simple booking",
    "constraints": "Never provides information upfront"
  }'
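
The request above creates a single persona. To import several, one option is to loop over a file and send one request per entry. The following is a minimal sketch, not a platform feature: the JSON Lines input file, the function names, and the BASE_URL and PROJECT_ID variables are assumptions made for illustration. Only the endpoint path and headers come from the example above.

```shell
# Sketch only. Assumes one persona JSON object per line (JSON Lines).
# BASE_URL, PROJECT_ID, and TOKEN are placeholders you must set yourself.

# Sends one persona payload to the eval-personas endpoint shown above.
post_persona() {
  curl -sf -X POST "$BASE_URL/api/projects/$PROJECT_ID/eval-personas" \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d "$1"
}

# Calls the given command once per non-empty line and prints how many
# payloads were sent. Stops at the first failure.
import_personas() {
  file="$1"
  post_cmd="$2"
  count=0
  while IFS= read -r line || [ -n "$line" ]; do
    [ -z "$line" ] && continue
    "$post_cmd" "$line" >/dev/null || return 1
    count=$((count + 1))
  done < "$file"
  echo "$count"
}

# Usage (hypothetical file name):
#   import_personas personas.jsonl post_persona
```

Passing the posting command as an argument keeps the loop testable with a stub before pointing it at a real project.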

Additional Options

  • Edit, duplicate, or delete scenarios from the Scenarios page.
  • Reuse scenarios across multiple eval sets.

Troubleshooting Scenarios

| Issue | Recommendation |
| --- | --- |
| Duplicate name error | Persona and scenario names must be unique within a project. Use a more specific name or delete the existing one. |
| Persona not behaving as expected in evals | Refine the Goals and Constraints fields. These are used as system prompt instructions for the simulated user LLM. |
| Scenario timing out | Increase the Max Turns value or simplify the expected conversation path. |

Evaluators

Evaluators define how agent conversations are scored and analyzed during the evaluation process. You can use evaluators to assess:
  • Response quality
  • Safety
  • Efficiency
  • Empathy
  • Tool correctness
  • Custom evaluation criteria

Create an Evaluator

  1. Go to Evaluate > Evals > Evaluators.
  2. Click Create Evaluator.
  3. Configure the evaluator:
    1. Select the evaluator type.
    2. Choose the evaluation category.
    3. Define the scoring scale and criteria.
  4. Click Create.
Lower evaluator temperatures generally produce more consistent scoring results.

Evaluator Types

Supported evaluator types include:
| Type | Description |
| --- | --- |
| LLM Judge | Uses an LLM to evaluate conversations. |
| Code Scorer | Uses deterministic programmatic scoring. |
| Trajectory | Evaluates conversation flow and milestones. |
| Human Review | Flags conversations for manual review. |

LLM Judge Evaluator

An LLM Judge evaluator uses a separate LLM to assess the quality of agent responses based on a scoring rubric you define.
| Field | Description |
| --- | --- |
| Judge Model | Defines which LLM is used as the evaluator judge. |
| Judge Prompt | Instructions that define what the judge should evaluate. |
| Temperature | LLM temperature for the judge. Lower values generally produce more consistent scoring results. |
| Chain of Thought | Defines whether the judge explains its reasoning before scoring. |

Write Effective Judge Prompts

The judge prompt is one of the most important evaluator configurations. Effective judge prompts:
  • Clearly define evaluation criteria
  • Focus on observable behavior
  • Avoid ambiguous language
  • Include examples when possible

Example Judge Prompt

JUDGE_PROMPT:
  You are evaluating an AI agent's response quality in a customer support context.

  Evaluate the conversation on these criteria:
    1. Did the agent correctly identify the customer's intent?
    2. Did the agent provide accurate information?
    3. Did the agent follow the expected conversation flow?
    4. Was the agent's tone appropriate and professional?

  Score each conversation using the provided rubric.

  Focus on the agent's responses, not the simulated user's messages.

Configure Bias Mitigation

LLM judges can exhibit scoring biases. Use bias mitigation settings to improve evaluation consistency and reliability.
| Setting | Description | Default |
| --- | --- | --- |
| Position Swap | Evaluates the conversation in both original and reversed order to reduce positional bias. | On |
| Blind Evaluation | Removes agent or persona identity information before judging. | On |
| Cross-Model Judge | Uses a different model family than the agent being evaluated. | Off |
| Evidence-First (RULERS) | Requires the judge to cite evidence before assigning scores. | On |

Trajectory Evaluators

Trajectory evaluators assess the agent’s execution behavior rather than response quality. Use them to validate:
  • Milestone completion — did the conversation hit expected checkpoints?
  • Handoff correctness — did the supervisor route to the right agent?
  • Path efficiency — how many unnecessary steps did the agent take?
  • Tool sequence — did the agent call tools in the right order?

Code Scorer Evaluators

Use Code Scorer evaluators for deterministic validations that do not require an LLM. Typical use cases include:
  • Regex matching
  • Keyword validation
  • Latency or response-time thresholds
  • Structured output validation
Code Scorer evaluators execute custom scoring logic to validate agent responses and runtime behavior using deterministic rules.
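
To make the idea of deterministic scoring concrete, the following sketch combines a regex check with a latency threshold and emits a pass/fail score. It only illustrates the kind of rule a Code Scorer applies; it is not the platform's scorer interface, and the function name, the confirmation-code pattern, and the 2000 ms limit are all assumptions.

```shell
# Illustration only: the platform's actual Code Scorer interface is not shown here.
# Prints 1 (pass) when the response contains a six-character confirmation code
# and the latency is within the limit; prints 0 (fail) otherwise.
score_response() {
  response="$1"
  latency_ms="$2"
  if printf '%s' "$response" | grep -Eq 'confirmation code: [A-Z0-9]{6}' \
     && [ "$latency_ms" -le 2000 ]; then
    echo 1
  else
    echo 0
  fi
}

score_response "Your confirmation code: AB12CD is ready." 850
```

Because the rule involves no LLM call, the same input always produces the same score, which makes this evaluator type a good fit for hard requirements.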

Human Review Evaluators

Use Human Review evaluators for subjective or manual quality assessments. Human Review evaluators flag conversations for manual inspection when evaluation scores fall below configured thresholds, allowing reviewers to validate agent behavior, response quality, or policy compliance before approval or release.

Scoring Scale Types

The scoring rubric defines how the evaluator assigns scores to conversations. Supported scale types include:
  • 1 to 5 scale
  • Pass or Fail

1 to 5 Scale

Use a 1 to 5 scale to define detailed evaluation criteria for each score level.
| Score | Description |
| --- | --- |
| 5 - Excellent | Fully addresses the user's request with accurate and complete information. |
| 4 - Good | Addresses the request with minor omissions. |
| 3 - Adequate | Partially addresses the request but misses important details. |
| 2 - Poor | Mostly misses the request or provides inaccurate information. |
| 1 - Failing | Completely fails to address the request or provides harmful information. |

Pass/Fail Scale

Use pass/fail scoring for binary evaluation criteria.
| Score | Description |
| --- | --- |
| 1 - Pass | The agent completes the task within the expected flow. |
| 0 - Fail | The agent fails to complete the task or deviates from expected behavior. |

Troubleshooting Evaluators

| Issue | Recommendation |
| --- | --- |
| Inconsistent scores across runs | Lower the evaluator temperature (try 0.1) and enable evidence-first mode. Run multiple variants per evaluation to get statistical confidence. |
| Judge ignores rubric criteria | Make rubric instructions more specific using examples. |
| The judge model is too expensive | Use a smaller model for initial screening and reserve larger models for detailed analysis. Set appropriate maxTokens limits. |
| Evaluation cost is too high | Use smaller judge models during development. |
| Scores appear random | Increase statistical sample size using variants. |

Eval Sets

Eval Sets combine personas, scenarios, and evaluators into reusable evaluation configurations that run as batches, systematically testing agents across every combination. You can use eval sets to:
  • Reuse evaluation pipelines.
  • Standardize testing across environments.
  • Execute multiple evaluations consistently.
  • Detect regressions over time.

Execution Model

During execution:
  • Every selected Persona interacts with every selected Scenario
  • Each conversation is independently executed
  • All configured Evaluators score the resulting conversations
This creates a full evaluation matrix across personas, scenarios, and evaluators.

Example Evaluation Matrix

EVAL_SET:
  personas: 3
  scenarios: 4
  evaluators: 2
  variants: 2

TOTAL_CONVERSATIONS:
  formula: "3 Personas × 4 Scenarios × 2 Variants"
  result: 24

TOTAL_EVALUATIONS:
  formula: "24 Conversations × 2 Evaluators"
  result: 48
Each conversation is executed as an independent multi-turn session where the persona LLM simulates the user according to the scenario definition.
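
The matrix arithmetic is worth checking before you start a run, since conversation and evaluation counts drive cost. The following minimal sketch reproduces the formulas above; the function names are illustrative and not platform commands.

```shell
# Hypothetical helpers that reproduce the evaluation matrix arithmetic.

# personas x scenarios x variants
total_conversations() {
  echo $(( $1 * $2 * $3 ))
}

# conversations x evaluators
total_evaluations() {
  echo $(( $1 * $2 ))
}

conversations=$(total_conversations 3 4 2)
evaluations=$(total_evaluations "$conversations" 2)
echo "conversations=$conversations evaluations=$evaluations"
```

With the example values (3 personas, 4 scenarios, 2 variants, 2 evaluators), this yields 24 conversations and 48 evaluations, matching the matrix above.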

Create an Eval Set

  1. Go to Evaluate > Evals > Eval Sets.
  2. Click Create Eval Set.
  3. Specify the following details.
  4. Click Create.
| Field | Description |
| --- | --- |
| Name | Unique eval set name, for example, Booking Flow Regression Suite. |
| Personas | Select one or more personas to simulate users. |
| Scenarios | Select one or more scenarios to test. |
| Evaluators | Select one or more evaluators to score conversations. |
| Variants | Number of times to repeat each evaluation combination for statistical confidence. Higher variant counts help reduce random scoring fluctuations, LLM non-determinism, and statistical anomalies. |
| Max Concurrency | Defines how many conversations run in parallel. Higher concurrency reduces overall execution time but increases resource usage and cost. |
| Persona Model | LLM used to simulate the persona (optional override). |

Enable CI/CD Integration

Use evaluation runs in CI/CD pipelines to automatically block deployments when regressions are detected. To enable CI integration:
  • Open the Eval Set.
  • Enable CI/CD integration.
  • Trigger evaluation runs using the Eval Run API from your CI pipeline.
  • Check the run result for regressionDetected: true and fail the deployment pipeline accordingly.
# Trigger evaluation run
RUN_ID=$(curl -s -X POST .../eval-runs -d '...' | jq -r '.id')

# Check run result
RESULT=$(curl -s "/api/projects/:projectId/eval-runs/$RUN_ID")
# -r prints the raw value so the string comparison below is unambiguous
REGRESSION=$(echo "$RESULT" | jq -r '.regressionDetected')

if [ "$REGRESSION" = "true" ]; then
  echo "Regression detected -- blocking deployment"
  exit 1
fi
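
A run triggered from CI starts in the Pending state, so a pipeline usually has to wait for a terminal status before reading regressionDetected. The following polling loop is a sketch under assumptions: the status field name, its exact casing, and the BASE_URL and PROJECT_ID variables are not confirmed by this page, and the function names are hypothetical. The status values come from the Run Statuses table.

```shell
# Sketch only. Assumes the run resource exposes a "status" field whose values
# match the Run Statuses table (Pending, Running, Completed, Failed, Cancelled).

# Hypothetical status fetcher; BASE_URL, PROJECT_ID, TOKEN, RUN_ID are placeholders.
fetch_run_status() {
  curl -s "$BASE_URL/api/projects/$PROJECT_ID/eval-runs/$RUN_ID" \
    -H "Authorization: Bearer $TOKEN" | jq -r '.status'
}

# Polls until the run reaches a terminal state.
# Returns 0 on Completed, 1 on Failed or Cancelled, 2 on timeout.
wait_for_run() {
  fetch_cmd="$1"
  max_attempts="${2:-60}"
  delay_seconds="${3:-10}"
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    status=$("$fetch_cmd")
    case "$status" in
      Completed) return 0 ;;
      Failed|Cancelled) return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
  return 2
}

# Usage: wait_for_run fetch_run_status 60 10 || exit 1
```

The timeout return code is distinct from the failure code so a pipeline can choose to retry a slow run instead of failing the build.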

Regression Detection

Eval sets support regression detection by comparing new runs against baseline runs. To configure regression detection:
  1. Open the Eval Set.
  2. Under Regression Settings, select a baseline run. Typically, this is the last known-good evaluation run.
  3. Specify the regression threshold. For example, 0.1 means that a 10% drop in score triggers a regression alert.
When a new run completes, the platform compares scores per evaluator and flags regressions with the evaluator name, persona/scenario combination, baseline score, current score, and score delta.
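
The comparison can be sketched as follows. This assumes the threshold is a relative drop from the baseline score, which is one reading of the 10% example above; the platform may instead compare absolute score differences. The function name is hypothetical.

```shell
# Sketch only: treats the threshold as a relative drop from the baseline score.
# Returns 0 (success) when the drop meets or exceeds the threshold.
is_regression() {
  awk -v b="$1" -v c="$2" -v t="$3" 'BEGIN {
    if (b <= 0) exit 1          # no meaningful baseline to compare against
    drop = (b - c) / b          # fraction of the baseline score that was lost
    exit !(drop >= t)
  }'
}

# Baseline 4.0, current 3.5, threshold 0.1: a 12.5% drop, so this is flagged.
if is_regression 4.0 3.5 0.1; then
  echo "regression: score dropped by at least 10%"
fi
```

Under the same assumption, a drop from 4.0 to 3.7 (7.5%) stays below the 0.1 threshold and is not flagged.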

Run a Subset of Scenarios

Instead of running the full evaluation set, use scenario tags to create smaller targeted evaluation batches.
  1. Tag your scenarios (for example, smoke-test, regression, or edge-case).
  2. Create separate eval sets for different test scopes - for example, a lightweight smoke-test set for every commit and a full regression set for release candidates.

Runs

Runs represent executed evaluations and their results. Each run is executed and tracked independently. Each run generates:
  • Conversation transcripts
  • Scores
  • Evaluator outputs
  • Execution metadata
  • Analysis results
Each run also tracks cost and usage data:
  • Estimated execution cost
  • Actual execution cost
  • Model usage
  • Token usage
Use this data to tune evaluation scale and model selection.

Run Evaluations

  1. Select an Eval Set.
  2. Click Start Run.
  3. Monitor evaluation progress from the Runs page.
The system automatically:
  • Executes conversations using the selected personas and scenarios.
  • Applies evaluators to generated conversations.
  • Stores scoring and transcript results.

Run Statuses

| Status | Description |
| --- | --- |
| Pending | Run is queued and waiting to start. |
| Running | Evaluations are currently in progress. |
| Completed | All evaluations finished successfully. |
| Failed | Run encountered an unrecoverable error. |
| Cancelled | Run was manually stopped. |

Run via API

Trigger evaluation runs programmatically for CI/CD integration and automated testing workflows.
curl -X POST /api/projects/:projectId/eval-runs \
 -H "Authorization: Bearer $TOKEN" \
 -H "Content-Type: application/json" \
 -d '{
   "evalSetId": "your-eval-set-id",
   "name": "CI Run #42",
   "triggerSource": "ci"
 }'
The triggerSource field tracks how the evaluation run was initiated:
  • manual - Triggered from Studio
  • ci - Triggered through API or CI/CD pipelines
  • scheduled - Triggered by a scheduled job or automation

Troubleshooting Runs

| Issue | Recommendation |
| --- | --- |
| Run is stuck in Pending status | Check that the project has valid LLM credentials configured. The evaluation pipeline requires both an agent model and a persona model. |
| High execution costs on large eval sets | Reduce variants to 1 during development, use smaller persona models, and limit max concurrency to control spend. |
| Inconsistent results between runs | Increase the variant count to 3+ for better statistical significance and lower persona and judge temperatures. |
| Run fails immediately | Ensure all referenced personas, scenarios, and evaluators still exist. Deleted components can cause run failures. |

Analyze Evaluation Results

After a run completes, you can:
  • Review conversation transcripts.
  • Analyze evaluator scores and reasoning.
  • Identify success and failure patterns.
  • Analyze agent behavior across scenarios.
  • Inspect execution traces and tool usage.

View Run Summary

After an eval run completes, open it from Evals > Runs to view the summary dashboard.
| Metric | Description |
| --- | --- |
| Avg Score | Overall average across all evaluators and conversations. |
| Scores by Evaluator | Breakdown of average score per evaluator. |
| Total Conversations | Total number of persona-scenario conversations executed. |
| Total Evaluations | Total evaluator judgments generated (conversations × evaluators). |
| Duration | Total time taken for the run. |
| Estimated Cost | Projected LLM cost before execution. |
| Actual Cost | Actual LLM cost tracked during execution. |

Understand Statistical Metrics

| Metric | Description |
| --- | --- |
| Standard Deviation | Measures how much individual scores vary from the average. Lower values indicate more consistent results. |
| Confidence Interval | Reliability range of the average score. Narrow intervals indicate more reliable evaluation results. |
| Pass@K | The probability that at least one of K attempts passes the evaluation criteria. Useful for creative or open-ended tasks. |
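
To show how these metrics relate to raw scores, the following sketch computes them from exported values. The platform's exact formulas are not documented here, so the choices below are assumptions: sample standard deviation (n - 1), a 95% confidence interval half-width from the normal approximation, and Pass@K derived from a per-attempt pass rate with attempts treated as independent. The function names are hypothetical.

```shell
# Sketch only: the formulas are common textbook choices, not confirmed platform behavior.

# Reads one score per line on stdin; prints the mean, sample standard deviation,
# and the 95% confidence interval half-width (1.96 * sd / sqrt(n)).
score_stats() {
  awk '{ n++; sum += $1; sumsq += $1 * $1 }
       END {
         if (n < 2) { print "need at least 2 scores"; exit 1 }
         mean = sum / n
         var = (sumsq - sum * sum / n) / (n - 1)
         if (var < 0) var = 0    # guard against floating-point round-off
         sd = sqrt(var)
         printf "mean=%.3f sd=%.3f ci95=%.3f\n", mean, sd, 1.96 * sd / sqrt(n)
       }'
}

# $1 = per-attempt pass rate (0 to 1), $2 = K.
# Probability that at least one of K independent attempts passes.
pass_at_k() {
  awk -v p="$1" -v k="$2" 'BEGIN { printf "%.3f\n", 1 - (1 - p) ^ k }'
}

printf '%s\n' 4 5 3 4 4 | score_stats
pass_at_k 0.5 3
```

For the five sample scores, the mean is 4.000 with a standard deviation of 0.707, and a 50% per-attempt pass rate gives a Pass@3 of 0.875.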

Understand Score Distributions

| Pattern | Meaning | Recommended Action |
| --- | --- | --- |
| High average, low deviation | The agent performs consistently well. | Ready for deployment. |
| High average, high deviation | The agent performs well overall but inconsistently. | Investigate low-scoring outliers. |
| Low average, low deviation | The agent consistently underperforms. | Review agent instructions and flow design. |
| Low average, high deviation | The agent's behavior is unstable or unpredictable. | Investigate failing scenarios and edge cases. |

Review Regression Details

If a run detects regressions, the regression panel shows:
  • The evaluator that flagged the regression
  • Persona/scenario combination
  • Baseline score
  • Current score
  • Score delta
Focus first on regressions with the largest negative score delta. Open individual conversations to inspect traces and understand the causes of failures.
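
If you copy regression details out of the panel for triage, sorting by score delta puts the worst regressions first. The following sketch assumes a hypothetical comma-separated layout (evaluator, persona/scenario, baseline score, current score); the platform does not necessarily provide this format.

```shell
# Sketch only: the CSV layout is an assumption made for illustration.
# Input columns: evaluator,persona/scenario,baseline,current
# Output columns: evaluator,persona/scenario,delta (most negative delta first)
rank_regressions() {
  awk -F, '{ printf "%s,%s,%.2f\n", $1, $2, $4 - $3 }' | LC_ALL=C sort -t, -k3,3n
}

rank_regressions <<'EOF'
Quality,Traveler/Rebooking,4.5,3.2
Safety,Injector/Refund,5.0,4.8
Efficiency,Novice/Booking,4.0,3.4
EOF
```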

Conversation Analysis

Select a conversation to drill into:
  • Full Transcript - Every message exchanged between the persona and the agent
  • Evaluator Scores - Per-evaluator scores with judge reasoning (when Chain of Thought is enabled)
  • Trace Timeline - Execution trace showing tool calls, handoffs, and decisions
  • Milestone Tracking - Expected milestones that were completed or missed
  • Tool Usage and Failures - Tool execution details and runtime failures

Read Evaluator Reasoning

When Chain of Thought is enabled, evaluator scores include the judge's reasoning, which explains how the score was determined. Use evaluator reasoning to identify improvements needed in:
  • Agent instructions
  • Flow design
  • Tool configuration

Example Evaluator Reasoning

EVALUATION_RESULT:
  score: 3/5

  reasoning: >
    The agent correctly identified the customer's intent
    but failed to provide complete rebooking options
    before requesting additional information.

Compare Runs Over Time

Use run history to:
  • Compare evaluator score changes
  • Track regressions
  • Measure improvement trends
  • Validate prompt or workflow updates
To compare runs:
  1. Go to Evals > Runs.
  2. Sort runs by date to view chronological progression.
  3. Compare runs to analyze score deltas across evaluators and scenarios.
Run evaluations regularly after significant changes to agents, prompts, tools, or workflows to identify regressions early.

Acting on Results

Recommended actions by issue:
Low quality scores:
  • Refine the agent's goal and persona to provide clearer behavioral guidance.
  • Add or improve limitations to prevent off-topic responses.
  • For flow-based agents, review step transitions and conditional logic.
Low safety scores:
  • Add or tighten guardrail rules for input and output filtering.
  • Create adversarial personas to stress-test edge cases.
Low efficiency scores:
  • Reduce unnecessary tool calls by improving agent instructions.
  • Optimize flow step sequences to minimize the number of conversation turns.
  • Check whether the agent is requesting information already available in context.
Handoff correctness issues:
  • Review the handoff conditions for the supervisor agent.
  • Verify that when clauses match the intended routing patterns.
  • Validate expected agent paths configured in scenarios.

Troubleshooting

| Issue | Recommendation |
| --- | --- |
| Scores seem random | Increase the number of variants in the eval set. Using 3–5 variants generally provides better statistical significance. Lower the judge temperature for more consistent scoring. |
| All scores are perfect (5/5) | The scoring rubric may be too lenient. Add more specific failure conditions and use adversarial personas to test edge cases. |
| Regression detected, but the agent improved | Review the baseline run. It may contain an anomalously high score. Set a more recent and stable run as the new baseline. |
| Cost higher than expected | Review the selected persona and judge models. Using smaller persona models can significantly reduce evaluation cost. |