Statement 12: Define success criteria
Agencies must
Criterion 41: Identify, assess, and select metrics appropriate to the AI system.
Relying on a single metric could lead to false confidence, while tracking irrelevant metrics could lead to false alarms. To mitigate these risks, analyse the capabilities and limitations of each metric, select multiple complementary metrics, and implement methods to test assumptions and to find missing information.
Considerations for metrics include:
- value-proposition metrics – benefits realisation, social outcomes, financial measures, or productivity measures
- performance metrics – precision and recall for classification models, mean absolute error for regression models, bilingual evaluation understudy (BLEU) for text generation tasks such as summarisation, inception score for image generation models, or mean opinion score for audio generation (see the sketch after this list)
- training data metrics – data diversity and data quality related measures
- bias-related metrics – demographic parity to measure group fairness, fairness through awareness to measure individual fairness, or counterfactual fairness to measure causality-based fairness
- safety metrics – likelihood of harmful outputs, adversarial robustness, or potential data leakage measures
- reliability metrics – availability, latency, mean time between failures (MTBF), mean time to failure (MTTF), or response time
- citation metrics – measures related to proper acknowledgement of, and references to, direct content and specialised ideas
- adoption-related metrics – adoption rate, frequency of use, daily active users, session length, abandonment rate, or sentiment analysis
- human-machine teaming metrics – total time or effort taken to complete a task, reaction time when human control is needed, or number of times human intervention is needed
- qualitative measures – checking the well-being of the humans operating or using the AI system, or interviewing participants and observing them while using the AI system to identify usability issues
- drift in AI system inputs and outputs – changes in input distribution, outputs, and performance over time.
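To make the criterion concrete, the sketch below computes two complementary metric families from the list above: performance metrics (precision and recall) and a bias-related metric (a demographic parity gap). It is a minimal illustration in plain NumPy; the array names, the example data, and the two-group setup are assumptions for the example, not a prescribed implementation.

```python
# A minimal sketch of computing complementary metrics for a binary
# classifier. The data and two-group setup are illustrative assumptions.
import numpy as np

def precision_recall(y_true, y_pred):
    """Performance metrics: precision and recall for binary labels."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

def demographic_parity_gap(y_pred, group):
    """Bias-related metric: difference in positive-prediction rates
    between two demographic groups (0 indicates parity)."""
    rate_a = y_pred[group == 0].mean()
    rate_b = y_pred[group == 1].mean()
    return float(abs(rate_a - rate_b))

# Illustrative data: true labels, model predictions, group membership.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
group  = np.array([0, 0, 1, 1, 0, 1, 0, 1])

p, r = precision_recall(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} "
      f"parity_gap={demographic_parity_gap(y_pred, group):.2f}")
```

Reporting the metrics side by side, as here, makes it harder for a strong score on one metric to mask a weakness revealed by another.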
After metrics have been identified, understand and assess the trade-offs between them.
This includes:
- assessing trade-offs between different success criteria
- determining the possible harms of an incorrect output, such as a false positive or false negative
- analysing how the output of the AI system could be used. For example, determine which instance would have greater consequences: a false negative that fails to detect a cyberattack, or a false positive that incorrectly flags a legitimate user as a threat (see the threshold sketch after this list)
- assessing the trade-offs among the performance metrics
- understanding the trade-offs with costs, explainability, reliability, and safety
- understanding the limitations of the selected metrics and ensuring mitigating measures are considered when building the AI system, such as when selecting data and training methods
- documenting trade-offs so they are understood by stakeholders and accounted for in selecting AI models and systems
- optimising the metrics appropriate to the use case.
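Where false negatives and false positives carry different consequences, as in the cyberattack example above, one concrete way to account for the trade-off is to choose the decision threshold that minimises expected cost. The sketch below is a minimal illustration of this idea; the cost values, labels, and scores are assumptions for the example, not recommended settings.

```python
# A minimal sketch of threshold selection under asymmetric error costs.
# The cost ratio and the example data are illustrative assumptions.
import numpy as np

COST_FN = 10.0  # e.g. a missed cyberattack is costly
COST_FP = 1.0   # e.g. flagging a legitimate user is an inconvenience

def expected_cost(y_true, scores, threshold):
    """Total misclassification cost at a given decision threshold."""
    y_pred = (scores >= threshold).astype(int)
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return COST_FN * fn + COST_FP * fp

# Illustrative true labels and model scores.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.6, 0.55, 0.9])

# Sweep candidate thresholds and keep the one with the lowest cost.
thresholds = np.linspace(0.0, 1.0, 101)
costs = [expected_cost(y_true, scores, t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"lowest-cost threshold: {best:.2f} (cost={min(costs):.1f})")
```

Making the cost ratio explicit in this way also gives stakeholders a single, documented number to review when the trade-off is revisited later in the AI lifecycle.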
Agencies should
Criterion 42: Re-evaluate the selection of appropriate success metrics as the AI system moves through the AI lifecycle.
Criterion 43: Continuously verify the correctness of the metrics.
Before relying on the metrics, verify the following:
- metrics accurately reflect cases where the AI system does not have enough information
- metrics correctly reflect errors, failures, and successful task performance (a verification sketch follows).
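As a minimal illustration of such verification, the sketch below tests an accuracy metric against cases with known outcomes, including abstentions where the AI system does not have enough information. The abstention convention (a None prediction is excluded from the score rather than counted as an error) is an assumption for the example, not a prescribed design.

```python
# A minimal sketch of verifying a metric before relying on it, written
# as plain asserts. Treating None as "not enough information" is an
# illustrative assumption.

def accuracy(y_true, y_pred):
    """Accuracy over cases the AI system actually answered; abstentions
    (None) are excluded so they cannot inflate or deflate the score."""
    answered = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    if not answered:
        raise ValueError("no answered cases: metric is undefined")
    return sum(t == p for t, p in answered) / len(answered)

# The metric correctly reflects successful task performance ...
assert accuracy([1, 0, 1], [1, 0, 1]) == 1.0
# ... correctly reflects errors and failures ...
assert accuracy([1, 0, 1], [0, 1, 0]) == 0.0
# ... and abstentions do not silently count as errors or successes.
assert accuracy([1, 0, 1], [1, None, None]) == 1.0
print("metric verification checks passed")
```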