Curating Datasets

Confident AI provides your team with a centralized place to create, upload, and edit evaluation datasets online. Think of it as a Google Sheets or Notion CSV editor, with the difference that each row is already structured as an LLMTestCase and, most importantly, can be used directly in code when evaluating with deepeval.

info

An evaluation dataset on Confident AI is a collection of goldens, which are extremely similar to test cases. You can learn more about goldens here.

Create Your First Dataset

To create your first dataset, simply navigate to the Datasets page in your project space. There, you'll see a button that says Create Dataset, and you'll be required to name your first dataset by providing it with an alias. This alias will later be used to identify which dataset to evaluate against.

Populate Your Dataset With Golden(s)

Now that you've created a dataset, you can create a "golden" within your dataset that will later be converted to an LLMTestCase during evaluation time (we'll talk more about this later). There are a few ways to populate your dataset with goldens:

  1. Creating a golden individually using Confident AI's goldens editor.
  2. Importing a CSV of goldens to Confident AI.
  3. Uploading a list of Goldens to Confident AI via deepeval.
note

If you're not sure what to include in your goldens, simply enter the inputs you're already prompting your LLM application with when eyeballing outputs. You can then automate this process by turning that list of inputs into goldens, as shown in the sketch below.
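
For example, here's a minimal sketch of how you might turn a plain list of inputs into goldens with deepeval. The inputs below are purely illustrative — substitute the prompts you're already using:

from deepeval.dataset import Golden

# Illustrative inputs you're already eyeballing outputs for
inputs = [
    "I have a persistent cough and fever. Should I be worried?",
    "What should I do if I accidentally cut my finger deeply?",
]

# Each input becomes a golden; expected_output and other fields can be filled in later
goldens = [Golden(input=user_input) for user_input in inputs]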

A golden is basically a test case that isn't ready for evaluation yet. It holds additional information for a better dataset annotation experience, such as the ability to mark it as "ready" for evaluation, an actual_output field that can be left empty and populated at evaluation time, and extra columns and metadata that might be useful to you in code at evaluation time.

caution

We highly recommend AGAINST running evaluations on pre-computed evaluation datasets. You'll want to test your LLM application based on the latest actual_outputs generated as a consequence of your iteration, so if you find yourself filling in the actual_output field at any point, think again.

Create Individual Golden(s)

You can create goldens manually by clicking on the Create Golden button in the Datasets > Dataset Editor page, which will open an editor for you to fill in your golden information.

Import Golden(s) From CSV

Alternatively, you can choose to upload a list of goldens from a CSV file. Simply click on the Upload Goldens button, and you'll have the opportunity to map CSV columns to golden fields when importing (a sketch of preparing such a CSV follows the field list below).

The golden fields include:

  • input: a string representing the input to prompt your LLM application with during evaluation.
  • [Optional] actual_output: a string representing the generated actual_output of your LLM application for the corresponding input.
  • [Optional] expected_output: a string representing the ideal output for the corresponding input.
  • [Optional] retrieval_context: a list of strings representing the retrieved text chunks of your LLM application for the corresponding input. This is only relevant for RAG pipelines.
  • [Optional] context: a list of strings representing the ground truth as supporting context.
  • [Optional] comments: a string representing whatever comments your data annotators have for this particular golden (e.g. "Watch out for this expected output! It needs more work.").
  • [Optional] additional_metadata: a freeform JSON object containing any additional data you can later make use of in code at evaluation time.
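
If you're assembling the CSV in code, here's a minimal sketch that writes a file whose column headers mirror the golden fields above. The filename and rows are purely illustrative, and you'll still confirm the column-to-field mapping in the import dialog:

import csv

# Illustrative rows; column headers mirror the golden fields listed above
rows = [
    {
        "input": "I have a persistent cough and fever. Should I be worried?",
        "expected_output": "If your cough and fever persist or worsen, consult a healthcare provider.",
        "comments": "Double-check this expected output with a clinician.",
    },
]

# Write the CSV to upload via the Upload Goldens button
with open("goldens.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "expected_output", "comments"])
    writer.writeheader()
    writer.writerows(rows)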

Upload Golden(s) Via DeepEval

Pushing an EvaluationDataset to Confident AI using deepeval simply involves creating an EvaluationDataset with a list of Goldens and pushing it to Confident AI by supplying the dataset alias.

Here's the data structure of a Golden:

from typing import Optional, List, Dict

class Golden:
    input: str
    actual_output: Optional[str]
    expected_output: Optional[str]
    retrieval_context: Optional[List[str]]
    context: Optional[List[str]]
    comments: Optional[str]
    additional_metadata: Optional[Dict]
Here's the fake data we'll be uploading:

fake_data = [
    {
        "input": "I have a persistent cough and fever. Should I be worried?",
        "expected_output": (
            "If your cough and fever persist or worsen, it could indicate a serious condition. "
            "Persistent fevers lasting more than three days or difficulty breathing should prompt immediate medical attention. "
            "Stay hydrated and consider over-the-counter fever reducers, but consult a healthcare provider for proper diagnosis."
        ),
    },
    {
        "input": "What should I do if I accidentally cut my finger deeply?",
        "expected_output": (
            "Rinse the cut with soap and water, apply pressure to stop bleeding, and elevate the finger. "
            "Seek medical care if the cut is deep, bleeding persists, or your tetanus shot isn't up to date."
        ),
    },
]

And here's a quick example of how to push Goldens within an EvaluationDataset to Confident AI:

from deepeval.dataset import EvaluationDataset, Golden

# See above for contents of fake_data
fake_data = [...]

goldens = []
for fake_datum in fake_data:
    golden = Golden(
        input=fake_datum["input"],
        expected_output=fake_datum["expected_output"],
    )
    goldens.append(golden)

dataset = EvaluationDataset(goldens=goldens)

After creating your EvaluationDataset, all you have to do is push it to Confident AI by providing an alias as a unique identifier.

...

# Provide an alias when pushing a dataset
dataset.push(alias="QA Dataset")

You can also choose to overwrite or append to an existing dataset if one with the same alias already exists.

...

# Overwrite existing datasets
dataset.push(alias="QA Dataset", overwrite=True)
note

deepeval will prompt you in the terminal if no value for overwrite is provided.

What is a Golden?

A "Golden" is what makes up an evaluation dataset and is very similar to a test case in deepeval, but they:

  • Does not require an actual_output, so whilst test cases are always ready for evaluation, a golden isn't.
  • Only exists within an EvaluationDataset(), while test cases can be defined anywhere.
  • Contains an extra additional_metadata field, which is a dictionary you can define on Confident. Allows you to do some extra preprocessing on your dataset (eg., generating a custom LLM actual_output based on some variables in additional_metadata) before evaluation.

We introduced the concept of goldens because it allows you to create evaluation datasets on Confident without needing pre-computed actual_outputs. This is especially helpful if you are looking to generate responses from your LLM application at evaluation time.
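
To make this concrete, here's a minimal sketch of what evaluation time might look like: pull the dataset by its alias, generate fresh actual_outputs with your own LLM application, convert each golden into an LLMTestCase, and evaluate. The my_llm_app function and the choice of metric are hypothetical placeholders — substitute your own application and metrics:

from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def my_llm_app(user_input: str) -> str:
    # Hypothetical: replace with a call to your actual LLM application
    return "..."

# Pull the dataset we pushed earlier by its alias
dataset = EvaluationDataset()
dataset.pull(alias="QA Dataset")

# Convert each golden into an LLMTestCase with a freshly generated actual_output
# (golden.additional_metadata is also available here for any extra preprocessing)
test_cases = []
for golden in dataset.goldens:
    test_cases.append(
        LLMTestCase(
            input=golden.input,
            actual_output=my_llm_app(golden.input),
            expected_output=golden.expected_output,
        )
    )

# Run evaluations on the freshly generated outputs
evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric()])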