Statement 26: Adapt test strategies and practices for AI systems
Agencies must
Criterion 91: Mitigate bias in the testing process.
This includes:
- differentiating the test data used for formal testing from the data used during model development
- ensuring test subjects are not involved in the development of the system under test (SUT)
- providing testers and developers with a degree of independence from each other
- using test-driven development, such as designing test cases based on the requirements before implementing the data, models, and non-AI components
- conducting peer reviews, providing bias awareness training, and documenting test-related decisions and processes.
Criterion 92: Define test criteria approaches.
The test criteria determine whether a test case has passed or failed. This involves comparing an actual output with an expected output.
This includes:
- considering statistical techniques – in probabilistic and non-deterministic systems, a single run of a test case may not be sufficient to determine success or failure. Testing may involve repeating a test case multiple times and defining thresholds, minimum values, and average efficiency. An example of a statistical approach in a regulatory setting is the EU’s General Safety Regulation on driver drowsiness (see the pass-rate sketch after this list)
- performing baseline testing against a reference system or operation – this is useful where there are no comprehensive specifications to define the expected output. It entails comparing the AI system’s behaviour and performance against a reference system, which may be a non-AI system, manual processing, robotic process automation, or an earlier system version (see the baseline-comparison sketch after this list)
- considering metamorphic testing – for GenAI or other systems where an exact output cannot be specified for a given input. Metamorphic testing verifies metamorphic relations: rules about how the output should change relative to a change in the input. Examples include semantic consistency, intended style and tone, and preservation of facts. These relations can be checked computationally, for example by comparing embedding vectors, or through user acceptance testing (see the metamorphic-relation sketch after this list).
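As an illustration of a statistical pass/fail criterion, the following Python sketch repeats a non-deterministic test case a fixed number of times and applies a pass-rate threshold. The `run_inference` stub, the number of runs, and the 95% threshold are hypothetical placeholders, not prescribed values.

```python
import random

def run_inference(test_input: str) -> bool:
    """Hypothetical stand-in for a single execution of the system under test.

    Returns True when the output meets the expected-output check for this
    test case; replace with a real call to the SUT.
    """
    return random.random() > 0.04  # simulated 96% per-run success rate

def statistical_pass(test_input: str, runs: int = 50, threshold: float = 0.95) -> bool:
    """Repeat a non-deterministic test case and apply a pass-rate threshold.

    The test case passes only if the proportion of successful runs meets or
    exceeds the agreed threshold.
    """
    successes = sum(run_inference(test_input) for _ in range(runs))
    return successes / runs >= threshold

if __name__ == "__main__":
    print("PASS" if statistical_pass("example prompt") else "FAIL")
```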
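Where baseline testing is used, a simple agreement measure against the reference system can support the comparison. In the sketch below, `legacy_classifier` and `ai_classifier` are hypothetical stand-ins for the reference system and the AI system; the acceptable agreement level would be set in the test criteria.

```python
from typing import Callable, Sequence

def agreement_rate(ai_system: Callable[[str], str],
                   reference_system: Callable[[str], str],
                   inputs: Sequence[str]) -> float:
    """Proportion of inputs for which the AI system's output matches the
    reference system's output (e.g. a rule-based or earlier-version system)."""
    matches = sum(ai_system(x) == reference_system(x) for x in inputs)
    return matches / len(inputs)

# Hypothetical stand-ins for the reference system and the AI system under test.
def legacy_classifier(text: str) -> str:
    return "urgent" if "urgent" in text.lower() else "routine"

def ai_classifier(text: str) -> str:
    lowered = text.lower()
    return "urgent" if "urgent" in lowered or "asap" in lowered else "routine"

if __name__ == "__main__":
    sample = ["Please respond ASAP", "Urgent: system outage", "Monthly report attached"]
    print(f"Agreement with baseline: {agreement_rate(ai_classifier, legacy_classifier, sample):.0%}")
```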
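A minimal sketch of a metamorphic relation for semantic consistency follows. It uses a bag-of-words cosine similarity as a simple stand-in for an embedding-based measure, a hypothetical `generate` function in place of the GenAI system under test, and an illustrative 0.8 threshold.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a stand-in for an embedding-based
    semantic similarity measure."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def generate(prompt: str) -> str:
    """Hypothetical stand-in for the GenAI system under test."""
    return "The leave policy allows 20 days of annual leave."

def test_semantic_consistency():
    """Metamorphic relation: paraphrasing the input should not materially
    change the meaning of the output."""
    original = generate("How many days of annual leave do staff receive?")
    paraphrased = generate("What is the annual leave entitlement for staff?")
    assert cosine_similarity(original, paraphrased) >= 0.8

if __name__ == "__main__":
    test_semantic_consistency()
    print("Metamorphic relation held")
```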
Agencies should
Criterion 93: Define how test coverage will be measured.
This includes:
- identifying the limitations of existing static and dynamic coverage measures
- ensuring there is a method for measuring code coverage, including coverage of added code such as pre-processing, post-processing, prompts and integration components, as well as model-specific coverage measures. Code coverage tools can surface untested code and reduce defects or harms
- tracing test cases against requirements, design, and risks to check for gaps and demonstrate test coverage (see the traceability sketch below).
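A lightweight way to support this traceability is to record the links between test cases and requirements or risks, then report any items with no linked test. The requirement identifiers and test names in this Python sketch are hypothetical.

```python
# Hypothetical requirement and risk identifiers for the system under test.
requirements = {"REQ-01", "REQ-02", "REQ-03", "RISK-07"}

# Which requirements or risks each test case exercises.
test_traceability = {
    "test_accuracy_on_holdout_set": {"REQ-01"},
    "test_refuses_out_of_scope_requests": {"REQ-02", "RISK-07"},
}

def coverage_gaps(requirements: set[str], traceability: dict[str, set[str]]) -> set[str]:
    """Return requirements and risks with no linked test case, so coverage
    gaps can be identified and justified."""
    covered = set().union(*traceability.values()) if traceability else set()
    return requirements - covered

if __name__ == "__main__":
    print("Untraced items:", sorted(coverage_gaps(requirements, test_traceability)))
```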
Criterion 94: Define a strategy to ensure test adequacy.
Achieving full test coverage for AI systems may be challenging or not viable. To maximise test coverage, consider:
- performing combinatorial testing where comprehensive testing of inputs, preconditions, and corresponding outputs is not feasible due to the complexity of real-world combinations. Combinatorial testing derives test data by sampling from an extremely large set of possible inputs, preconditions, and outputs (see the sampling sketch after this list)
- automating as much as practicable. Given the volume of testing required to validate an AI system during the development and maintenance phases, manual testing alone will not provide sufficient coverage
- performing appropriate virtual and real-world testing. This includes:
- using a mix of data sources such as real-world data and synthetic data
- understanding limitations when using synthetic data to test AI systems
- considering requirements for embedded systems, such as model-in-the-loop, software-in-the-loop, processor-in-the-loop, and hardware-in-the-loop
- understanding limitations of the scope and data sample size for real-world testing
- documenting virtual environment assumptions and correlations with the real world
- documenting the quantity of real-world testing and the rationale for selecting the real-world test cases
- determining the statistical significance of the test results while considering limitations (see the confidence-interval sketch below).
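As a sketch of the combinatorial sampling approach mentioned above, the following Python example draws test cases from a large combination space instead of enumerating every combination. The input dimensions are hypothetical, and dedicated combinatorial (for example pairwise) test design tools may give stronger coverage guarantees than simple random sampling.

```python
import random

# Hypothetical input dimensions for a document-processing AI system.
parameters = {
    "document_type": ["invoice", "letter", "form", "report"],
    "language": ["en", "fr", "vi", "zh"],
    "scan_quality": ["high", "medium", "low"],
    "length_pages": [1, 5, 50, 500],
    "contains_handwriting": [True, False],
}

def sample_test_cases(parameters: dict, n: int, seed: int = 0) -> list[dict]:
    """Draw n test cases from the full combination space (4 * 4 * 3 * 4 * 2 =
    384 combinations here; real systems may have millions) rather than
    enumerating every combination."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in parameters.items()}
            for _ in range(n)]

if __name__ == "__main__":
    for case in sample_test_cases(parameters, n=5):
        print(case)
```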
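To make the statistical limitations of test results explicit, a confidence interval can be reported alongside the observed pass rate; a small real-world sample gives a wide interval. The sketch below uses the Wilson score interval, and the 47-out-of-50 result is an illustrative figure only.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval (95% for z = 1.96) for an observed
    pass rate, clipped to [0, 1]."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2))
    return (max(0.0, centre - half_width), min(1.0, centre + half_width))

if __name__ == "__main__":
    low, high = wilson_interval(successes=47, trials=50)
    print(f"Observed pass rate 94.0%, 95% CI: {low:.1%} to {high:.1%}")
```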