Choose evaluation methods

[This article is prerelease documentation and is subject to change.]

When you create test sets, choose from different test methods to evaluate your agent's responses. Each test method has its own strengths and suits different types of evaluations.

| Test method | Measures | Scoring | Configurations |
|---|---|---|---|
| General quality | How good the test case's answer is, based on specific qualities | Scored out of 100% | None |
| Compare meaning | How well the meaning of the test case's answer matches the expected answer | Scored out of 100% | Pass score, expected answer |
| Capability use | Whether the test case used the expected resources | Pass/fail | Expected capabilities |
| Keyword match | Whether the test case used all or any of the expected keywords or phrases | Pass/fail | Expected keywords or phrases |
| Text similarity | How well the text of the test case's answer matches the expected answer | Scored out of 100% | Pass score, expected answer |
| Exact match | Whether the test case's answer matches the expected answer exactly | Pass/fail | Expected answer |

To add test methods to a test set:

  1. When creating or editing a test set, select Add test method.
  2. Select all the methods you want to test with, then select OK. You can add multiple methods.
  3. Some methods require a pass score. The pass score is the threshold that determines whether a result counts as a pass or a failure. Set the score, then select OK.
  4. Some test methods require additional criteria.
  5. Select Save to save your changes to the test set.

Select an existing test method to edit that method's criteria or delete that method.

General quality

General quality helps you decide whether your agent's responses meet your standards. It uses a language model to assess how effectively an agent answers user questions.

General quality is especially helpful when there's no exact answer expected. It offers a flexible and scalable way to evaluate responses based on the retrieved documents and the conversation flow.

It uses these key criteria and applies a consistent prompt to guide scoring:

  • Relevance: To what extent the agent's response addresses the question. For example, does the agent's response stay on the subject and directly answer the question?

  • Groundedness: To what extent the agent's response is based on the provided context. For example, does the agent's response reference or rely on the information given in the context, rather than introducing unrelated or unsupported information?

  • Completeness: To what extent the agent's response provides all necessary information. For example, does the agent's response cover all aspects of the question and provide sufficient detail?

  • Abstention: Whether the agent attempted to answer the question.

To be considered high quality, a response must meet all these key criteria. If one criterion isn't met, the response is flagged for improvement. This scoring method ensures that only responses that are both complete and well-supported receive top marks. In contrast, answers that are incomplete or lack supporting evidence receive lower scores.
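The all-criteria gating described above can be sketched in a few lines. This is an illustrative sketch only; the criterion names come from the list above, but the boolean scores and function are hypothetical, not the product's internals.

```python
# Hypothetical sketch: a response earns top marks only when every key
# quality criterion is satisfied; otherwise it's flagged for improvement.
CRITERIA = ("relevance", "groundedness", "completeness", "abstention")

def overall_quality(scores: dict) -> str:
    """Return the verdict implied by the all-criteria scoring rule."""
    if all(scores.get(c, False) for c in CRITERIA):
        return "high quality"
    return "flagged for improvement"

print(overall_quality({"relevance": True, "groundedness": True,
                       "completeness": True, "abstention": True}))
# high quality
print(overall_quality({"relevance": True, "groundedness": False,
                       "completeness": True, "abstention": True}))
# flagged for improvement
```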

When adding or editing test methods, select General quality. All test sets start with this method by default.

You don't need to add expected answers to test cases to complete a general quality evaluation.

Compare meaning

Compare meaning evaluates how well the agent's answer reflects the intended meaning of the expected response. Instead of focusing on exact wording, it uses intent similarity: it compares the ideas and meaning behind the words to judge how closely the response aligns with what you expected.

Like general quality, compare meaning is especially helpful when there's no exact answer expected. It offers a flexible and scalable way to evaluate responses based on the retrieved documents and the conversation flow.

You can set a passing score threshold to determine what constitutes a passing score for an answer. The default passing score is 50. The compare meaning test method is useful when an answer can be phrased in different correct ways, but the overall meaning or intent still needs to come through.
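The pass score threshold works as a simple cutoff on the similarity score. In this sketch, the similarity percentage is a stand-in input (the product derives it with a language model); only the thresholding logic and the default of 50 come from the text above.

```python
# Illustrative sketch of turning a meaning-similarity score into a verdict.
DEFAULT_PASS_SCORE = 50  # default pass score from the documentation

def verdict(similarity_pct: float, pass_score: int = DEFAULT_PASS_SCORE) -> str:
    """Pass when the similarity score meets or exceeds the threshold."""
    return "Pass" if similarity_pct >= pass_score else "Fail"

print(verdict(72.5))  # Pass
print(verdict(41.0))  # Fail
```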

  1. When adding or editing test methods, select Compare meaning.

  2. Set the pass score for this method.

  3. Add the expected answers. Any test case without one produces an Invalid result for this test method.

    1. Select a test case.
    2. Add the answer you expect.
    3. Select Apply to save the expected answer.
    4. Repeat for all the test cases you want to test by using this method.

Capability use

Capability use tests whether the agent used specific tools or topics to generate an answer. If it did, the test case passes. If it didn't, it fails.

You can select whether a pass requires Any of the tools or topics or All of them. Choosing Any means that if the agent called at least one, the test case passes. Choosing All means that all the expected tools or topics must match for a test case to pass.
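The Any/All rule reduces to set operations. A minimal sketch, assuming the invoked and expected capabilities are plain string names (the tool names below are made up for illustration):

```python
# Sketch of the Any/All matching rule for capability use.
def capability_pass(invoked: set, expected: set, mode: str = "any") -> bool:
    """Any: at least one expected capability was used. All: every one was."""
    if mode == "any":
        return bool(invoked & expected)   # non-empty intersection
    return expected <= invoked            # expected is a subset of invoked

invoked = {"SearchKnowledge", "CreateTicket"}  # hypothetical tool names
print(capability_pass(invoked, {"CreateTicket", "SendEmail"}, "any"))  # True
print(capability_pass(invoked, {"CreateTicket", "SendEmail"}, "all"))  # False
```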

  1. When adding or editing test methods, select Capability use.

  2. Select whether a test case needs Any or All tools or topics to match.

  3. Add the expected tools or topics. Any test case without one produces an Invalid result for this test method.

    1. Select a test case.
    2. Select the capabilities you expect that case's answer to use.
    3. Select OK.
    4. Select Apply to save your changes.
    5. Repeat for all the test cases you want to test for capability use.

Keyword match

Keyword match checks whether the agent’s answer contains some or all of the words or phrases from the expected response that you define. If it does, it passes. If it doesn’t, it fails.

You can select whether a pass requires Any of the keywords or All of them. Choosing Any means that if at least one word or phrase matches, the test case passes. Choosing All means that all expected words or phrases must match for a test case to pass.

Keyword match is useful when an answer can be phrased in different correct ways, but key terms or ideas still need to be included in the response.
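The same Any/All logic applies to keyword matching. A minimal sketch, assuming a case-insensitive substring match (the product's exact matching rules may differ):

```python
# Sketch of the Any/All keyword rule: does the answer contain the
# expected words or phrases?
def keyword_pass(answer: str, keywords: list, mode: str = "any") -> bool:
    """Check keyword presence case-insensitively, honoring Any/All mode."""
    hits = [kw.lower() in answer.lower() for kw in keywords]
    return any(hits) if mode == "any" else all(hits)

answer = "You can reset your password from the account settings page."
print(keyword_pass(answer, ["reset", "password"], "all"))     # True
print(keyword_pass(answer, ["password", "helpdesk"], "any"))  # True
print(keyword_pass(answer, ["password", "helpdesk"], "all"))  # False
```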

  1. When adding or editing test methods, select Keyword match.

  2. Select whether a test case needs Any or All keywords to match.

  3. Add the expected keywords. Any test case without one produces an Invalid result for this test method.

    1. Select a test case.
    2. Add a keyword or phrase you expect that case's answer to have.
    3. Select + to add more keywords or phrases, or select Delete to remove one.
    4. Select Apply to save the expected keywords.
    5. Repeat for all the test cases you want to test for keyword matching.

Text similarity

The text similarity test method compares the agent's responses to the expected responses you define in your test set. It's useful when an answer can be phrased in different correct ways, but the overall meaning or intent still needs to come through.

It uses a cosine similarity metric to score how similar the agent's answer is to the wording and meaning of the expected response. The score ranges from 0 to 1, where 1 indicates the answer closely matches the expected response and 0 indicates it doesn't. You can set a passing score threshold to determine what constitutes a passing score for an answer.
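Cosine similarity itself is a simple vector computation. A minimal sketch: in a real evaluation the texts are first turned into embedding vectors, so the toy vectors below stand in for those embeddings.

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Parallel vectors score ~1; orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```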

  1. When adding or editing test methods, select Text similarity.

  2. Set the pass score for this method.

  3. Add the expected answers. Any test case without one produces an Invalid result for this test method.

    1. Select a test case.
    2. Add the answer you expect.
    3. Select Apply to save the expected answer.
    4. Repeat for all the test cases you want to test by using this method.

Exact match

Exact match checks whether the agent’s answer exactly matches the expected response in the test: character for character, word for word. If it's the same, it passes. If anything differs, it fails. Exact match is useful for short, precise answers such as numbers, codes, or fixed phrases. It doesn't suit answers that people can phrase in multiple correct ways.
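Exact match reduces to strict string equality, as a quick sketch shows; even a case or whitespace difference fails. The example answers are illustrative.

```python
# Exact match: pass only on a character-for-character match.
def exact_match(answer: str, expected: str) -> bool:
    return answer == expected

print(exact_match("Order #4521", "Order #4521"))  # True
print(exact_match("order #4521", "Order #4521"))  # False (case differs)
```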

  1. When adding or editing test methods, select Exact match.

  2. Add the expected answers. Any test case without one produces an Invalid result for this test method.

    1. Select a test case.
    2. Add the answer you expect.
    3. Select Apply to save the expected answer.
    4. Repeat for all the test cases you want to test by using this method.