Relevant, robust evaluation data is essential for effective evaluations. This data can be generated manually, can include production data, or can be assembled with the help of AI. There are two main types of evaluation data:
- Bring Your Own Data: You can create and update a “golden dataset” with realistic customer questions or inputs paired with expert answers, ensuring quality for generative AI experiences. This dataset can also include samples from production data, offering a realistic evaluation dataset derived from actual queries your AI application has encountered.
- Simulators: If evaluation data is not available, simulators can play a crucial role in generating evaluation data by creating both topic-related and adversarial queries.
- Context-related simulators test the AI system’s ability to handle relevant interactions within a specific context, ensuring it performs well under typical use scenarios.
- Adversarial simulators, on the other hand, generate queries designed to challenge the AI system, mimicking potential security threats or attempting to provoke undesirable behaviors. This approach helps identify the model’s limitations and prepares it to perform well in unexpected or hostile conditions.
Leave a Reply