
Simulations let you run your LLM configuration against a prepared dataset of test cases before deploying a change. Instead of discovering in production that a prompt change broke something, you run a simulation, review the outputs and evaluation scores, and promote the change only when you’re confident it behaves correctly.
Simulations require the Team or Scale plan.

Core concepts

Dataset

A collection of test cases. Each case has an input (the user message or prompt variables) and optionally an expected output or reference answer.

Scenario

A test configuration: which dataset to use, which prompt version to test, which evaluators to run, and which model to call.

Simulation run

One execution of a scenario — the platform runs each dataset item through your LLM configuration and collects outputs and scores.

Batch run

Multiple simulation runs executed in parallel, typically used to compare different prompt versions side by side.

Creating a dataset

Before running simulations, you need a dataset of test cases.
1. Open the Simulations page

Navigate to Simulations in the sidebar.
2. Click Datasets, then New dataset

Name your dataset (e.g., support-test-cases) and add a description.
3. Add test items

Each item has:
  • Input — the user message or template variables to inject into your prompt
  • Expected output (optional) — a reference answer for similarity scoring
  • Tags — optional labels for grouping items
You can add items manually, paste them from a CSV, or import from a JSON file.
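If you build test cases outside the UI, a JSON import is usually the quickest route. The sketch below generates such a file in Python; the field names (input, expected_output, tags) mirror the item fields listed above, but the exact import schema LumiqTrace expects is an assumption, so check it against the import dialog.

```python
import json

# Hypothetical test items; the "input", "expected_output", and "tags" keys
# mirror the item fields described above, but the exact import schema is an assumption.
items = [
    {
        "input": {"user_message": "How do I reset my password?"},
        "expected_output": "Walk the user through the password-reset email flow.",
        "tags": ["auth", "happy-path"],
    },
    {
        "input": {"user_message": "Cancel my subscription immediately."},
        "expected_output": "Confirm intent, then explain the cancellation steps.",
        "tags": ["billing"],
    },
]

# Write the dataset so it can be imported on the Datasets page.
with open("support-test-cases.json", "w") as f:
    json.dump(items, f, indent=2)
```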
4. Save the dataset

Click Save. The dataset is ready to use in a scenario.

Creating a scenario

1. Click New scenario

On the Simulations page, click New scenario.
2. Choose a dataset

Select the dataset of test cases to run against.
3. Configure the prompt

Either select a prompt from your prompt library (by name and version/label) or paste a prompt directly into the editor.
4. Choose a model

Select the model to call for each test case.
5. Add evaluators

Attach one or more evaluators to score the outputs. You can use any evaluator defined on the Evaluations page, plus built-in ones:
| Evaluator | What it measures |
| --- | --- |
| exact-match | Whether the output exactly matches the expected value |
| contains | Whether the output contains a required substring |
| similarity | Semantic similarity to the expected output (0–1) |
| length | Whether the output length is within a specified range |
| Custom | Any LLM-judge evaluator you’ve defined |
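To make the built-in scores concrete, here is a rough Python sketch of what each evaluator checks. It only illustrates the descriptions in the table, not LumiqTrace’s implementation; in particular, the similarity function uses a character-overlap ratio as a stand-in for real semantic (embedding-based) similarity.

```python
from difflib import SequenceMatcher

def exact_match(output: str, expected: str) -> float:
    # 1.0 only when the output matches the expected value exactly.
    return 1.0 if output == expected else 0.0

def contains(output: str, required: str) -> float:
    # 1.0 when the required substring appears anywhere in the output.
    return 1.0 if required in output else 0.0

def similarity(output: str, expected: str) -> float:
    # Stand-in for semantic similarity: a 0-1 character-overlap ratio.
    # A real evaluator would compare embeddings rather than characters.
    return SequenceMatcher(None, output, expected).ratio()

def length_in_range(output: str, min_chars: int, max_chars: int) -> float:
    # 1.0 when the output length falls inside the configured range.
    return 1.0 if min_chars <= len(output) <= max_chars else 0.0
```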
6. Save the scenario

Click Save. The scenario is ready to run.

Running a simulation

Click Run now on any scenario to start a simulation run. LumiqTrace:
  1. Iterates over every item in the dataset
  2. Calls your chosen model with the prompt + item input
  3. Records the response, latency, token count, and cost
  4. Runs each configured evaluator on the output
  5. Aggregates results into a run summary
Simulation runs appear in the Run history table. Click any run to see per-item results.
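Conceptually, a run boils down to the loop below. This Python sketch follows the five steps above under a few assumptions: a generic call_model client that returns text, token, and cost fields, plus evaluator functions like the ones sketched earlier. It is not LumiqTrace’s internal code.

```python
import statistics
import time

def run_simulation(dataset, prompt, call_model, evaluators):
    """Sketch of a simulation run: one model call and one set of scores per item."""
    results = []
    for item in dataset:
        start = time.monotonic()
        # Assumed client: returns a dict with "text", "tokens", and "cost" keys.
        response = call_model(prompt=prompt, variables=item["input"])
        latency = time.monotonic() - start
        # Run every configured evaluator on the output.
        scores = {name: fn(response["text"], item.get("expected_output", ""))
                  for name, fn in evaluators.items()}
        results.append({
            "input": item["input"],
            "output": response["text"],
            "latency_s": latency,
            "tokens": response.get("tokens"),
            "cost_usd": response.get("cost"),
            "scores": scores,
        })
    # Aggregate per-evaluator averages into a run summary.
    summary = {name: statistics.mean(r["scores"][name] for r in results)
               for name in evaluators}
    return {"summary": summary, "items": results}
```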

Batch runs — comparing versions

A batch run executes the same dataset against multiple configurations simultaneously, making it easy to compare prompt versions head-to-head.
1. Click New batch run

Select two or more scenarios (or one scenario with multiple prompt version variants).
2. Start the batch

Click Run batch. All variants run in parallel.
3. Compare results

When all runs complete, the comparison view shows side-by-side scores, latency, and cost for each variant. Rows with significant differences are highlighted.
Use batch runs before promoting a prompt from staging to production: run the current production version and the candidate through the same dataset, and promote the candidate only if it scores better on your key evaluators.
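One way to act on the comparison is a simple promotion gate: the candidate must not regress on any key evaluator and must improve on at least one. A hypothetical sketch, assuming run summaries shaped like the simulation sketch above:

```python
def should_promote(production_summary, candidate_summary, key_evaluators):
    """Promote only if the candidate never regresses and improves on at least one key evaluator."""
    no_regression = all(candidate_summary[e] >= production_summary[e] for e in key_evaluators)
    improves = any(candidate_summary[e] > production_summary[e] for e in key_evaluators)
    return no_regression and improves

# Example with two variants from a batch run (scores are made up):
prod = {"similarity": 0.81, "exact-match": 0.64}
cand = {"similarity": 0.86, "exact-match": 0.64}
print(should_promote(prod, cand, ["similarity", "exact-match"]))  # True
```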

Reading run results

Each simulation run’s detail view shows:
  • Summary cards — average score per evaluator, total cost, average latency
  • Per-item table — each dataset item with its output, scores, and a link to the full trace
  • Score distribution — histogram of score spread across items
  • Failed items — items where the model returned an error or an evaluator score fell below the threshold
Click any item row to see the full output text and all evaluator scores for that item.
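The numbers on the detail view can be reproduced from per-item results. Here is a sketch of that aggregation, reusing the item shape from the run loop above and treating an item as failed when the model errored or any evaluator score fell below a threshold (the 0.5 default here is an arbitrary assumption):

```python
def summarize_run(items, score_threshold=0.5):
    """Rebuild the summary cards and failed-item list from per-item results."""
    if not items:
        return {}, []
    evaluator_names = items[0]["scores"].keys()
    summary = {
        # Summary cards: average score per evaluator, total cost, average latency.
        "avg_scores": {name: sum(i["scores"][name] for i in items) / len(items)
                       for name in evaluator_names},
        "total_cost_usd": sum(i.get("cost_usd") or 0 for i in items),
        "avg_latency_s": sum(i["latency_s"] for i in items) / len(items),
    }
    # Failed items: a model error, or any evaluator score below the threshold.
    failed = [i for i in items
              if i.get("error") or any(s < score_threshold for s in i["scores"].values())]
    return summary, failed
```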

Plan limits

| Plan | Datasets | Scenarios | Monthly simulation runs |
| --- | --- | --- | --- |
| Free | None | None | None |
| Pro | 3 | 5 | 500 items total |
| Team | Unlimited | Unlimited | 10,000 items/month |
| Scale | Unlimited | Unlimited | Unlimited |