Criterion 79: Set assessment criteria for the AI model, with respect to pre-defined metrics for the AI system.
These criteria should address:
Considerations for modelling include:
Criterion 80: Identify and address situations when AI outputs should not be provided.
These situations include:
For GenAI, implementing techniques such as threshold settings or content filtering could address these situations.
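One such technique can be sketched as a threshold-and-filter check before releasing output. This is a minimal illustration only; the toxicity score, threshold value, and blocked-term list are all hypothetical assumptions, not part of the criterion.

```python
# Hypothetical sketch: withhold a GenAI output when a moderation score
# exceeds a configured threshold, or when a content filter matches.
# All names and values here are illustrative assumptions.

def should_withhold(output_text: str, toxicity_score: float,
                    threshold: float = 0.8) -> bool:
    """Return True when the AI output should not be provided."""
    blocked_terms = {"example-blocked-term"}  # placeholder filter list
    if toxicity_score >= threshold:
        return True
    return any(term in output_text.lower() for term in blocked_terms)
```

In practice the score would come from a moderation model and the term list from agency policy; the shape of the check, not the values, is the point.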
Criterion 81: Apply considerations for reusing existing agency models, off-the-shelf, and pre-trained models.
These include:
Criterion 82: Create or fine-tune models optimised for the target domain environment.
This includes:
Criterion 83: Create and train using multiple model architectures and learning strategies.
Systematically track model training implementations to speed up the discovery and development of models and to support selection of the best-performing trained model.
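The tracking described above can be sketched as a simple in-memory run log; a real project would use a dedicated experiment-tracking tool or database, and the architectures and metrics below are illustrative assumptions.

```python
# Minimal sketch of systematic training-run tracking (in-memory only).
import time

class RunTracker:
    def __init__(self):
        self.runs = []

    def log_run(self, architecture: str, hyperparams: dict, metrics: dict):
        """Record one training run with its configuration and results."""
        self.runs.append({
            "architecture": architecture,
            "hyperparams": hyperparams,
            "metrics": metrics,
            "timestamp": time.time(),
        })

    def best_run(self, metric: str):
        # Assumes higher is better for the chosen metric.
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = RunTracker()
tracker.log_run("cnn-small", {"lr": 1e-3}, {"accuracy": 0.91})
tracker.log_run("cnn-large", {"lr": 1e-4}, {"accuracy": 0.94})
best = tracker.best_run("accuracy")
```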
Criterion 84: Set techniques to validate trained AI models.
There are multiple qualitative and quantitative techniques and tools for model validation, informed by the AI system success criteria (see Design section), including:
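One widely used quantitative validation technique is k-fold cross-validation. The sketch below shows only the fold construction; the model, metric, and data are left out and would be supplied by the agency's own pipeline.

```python
# Pure-Python sketch of k-fold cross-validation index splitting.
# Real pipelines would typically use a library implementation.

def k_fold_indices(n_samples: int, k: int):
    """Yield (train_idx, val_idx) pairs covering all samples once each."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 5))  # 5 folds over 10 samples
```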
Criterion 85: Evaluate the model against training boundaries.
Evaluation considerations include:
Criterion 86: Evaluate the model for bias, implement and test bias mitigations.
This includes:
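One simple quantitative bias check is the demographic parity difference: the gap in positive-outcome rates between groups. The groups, outcomes, and any acceptable threshold below are illustrative assumptions, not values mandated by the criterion.

```python
# Illustrative bias metric: demographic parity difference between
# two groups' positive-outcome rates (data values are hypothetical).

def selection_rate(outcomes):
    """Share of positive (1) outcomes in a group."""
    return sum(outcomes) / len(outcomes)

def demographic_parity_diff(group_a_outcomes, group_b_outcomes):
    """Absolute difference in positive-outcome rates between groups."""
    return abs(selection_rate(group_a_outcomes)
               - selection_rate(group_b_outcomes))

# Group A: 3 of 4 positive (0.75); Group B: 1 of 4 positive (0.25).
dpd = demographic_parity_diff([1, 1, 0, 1], [1, 0, 0, 0])
```

A non-zero difference would trigger investigation and testing of mitigations, with the tolerable gap set by the AI system's pre-defined metrics.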
Criterion 87: Identify relevant model refinement methods.
These considerations may trigger model refinement or retirement and can include:
Criterion 88: Assess a pool of trained models against acceptance metrics to select a model for the AI system.
This involves:
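The selection step above can be sketched as screening candidates against acceptance thresholds and then ranking the survivors. The candidate names, metrics, and thresholds are all illustrative assumptions.

```python
# Sketch: screen a pool of trained models against acceptance metrics,
# then select the best passing candidate (all values are hypothetical).

candidates = [
    {"name": "model-a", "accuracy": 0.92, "latency_ms": 40},
    {"name": "model-b", "accuracy": 0.95, "latency_ms": 120},
    {"name": "model-c", "accuracy": 0.90, "latency_ms": 25},
]
acceptance = {"min_accuracy": 0.91, "max_latency_ms": 100}

passing = [m for m in candidates
           if m["accuracy"] >= acceptance["min_accuracy"]
           and m["latency_ms"] <= acceptance["max_latency_ms"]]
selected = max(passing, key=lambda m: m["accuracy"]) if passing else None
```

Here model-b is rejected on latency and model-c on accuracy, leaving model-a; real acceptance metrics would also cover fairness, robustness, and other criteria from the Design section.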
Criterion 89: Establish interface tools and feedback channels for machines and humans.
This also involves providing appropriate human-machine interface tools for human interrogation and oversight.
Criterion 90: Perform model version control.
AI model versioning and tracking are key to comparing performance over time, identifying factors that affect performance, and updating the model when needed.
AI model tracking and versioning can involve:
Set up rollback options to historical models to provide safety nets in the AI system, reduce the risk of deploying new models, and add deployment flexibility.
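The versioning-with-rollback pattern above can be sketched as a small registry; a production system would persist artefacts and metadata rather than hold them in memory, and the version labels here are illustrative.

```python
# Minimal sketch of model version tracking with a rollback safety net.

class ModelRegistry:
    def __init__(self):
        self.versions = []   # ordered history of deployed versions
        self.current = None

    def deploy(self, version: str, artefact):
        """Record and activate a new model version."""
        self.versions.append((version, artefact))
        self.current = version

    def rollback(self):
        """Revert to the previously deployed version, if one exists."""
        if len(self.versions) < 2:
            return self.current
        self.versions.pop()
        self.current = self.versions[-1][0]
        return self.current

registry = ModelRegistry()
registry.deploy("v1.0", object())
registry.deploy("v1.1", object())
registry.rollback()  # v1.1 misbehaves: fall back to v1.0
```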
Criterion 91: Mitigate bias in the testing process.
This includes:
Criterion 92: Define test criteria approaches.
The test criteria determine whether a test case has passed or failed by comparing an actual output with an expected output.
This includes:
Criterion 93: Define how test coverage will be measured.
This includes:
Criterion 94: Define a strategy to ensure test adequacy.
Achieving full test coverage for AI systems may be challenging or not viable. To maximise test coverage, consider:
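One way to make coverage measurable is to partition the input space into categories and track the share of category combinations the test suite exercises. The categories below are illustrative assumptions only.

```python
# Sketch: coverage as the share of input-category combinations tested
# (the partitioning scheme and categories are hypothetical).
from itertools import product

categories = {
    "age_band": ["<18", "18-65", ">65"],
    "region": ["metro", "regional"],
}
all_combos = set(product(*categories.values()))  # 6 combinations

tested = {("<18", "metro"), ("18-65", "metro"), (">65", "regional")}
coverage = len(tested & all_combos) / len(all_combos)  # 3 of 6
```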
Criterion 95: Undertake human verification of test design and implementation for correctness, consistency, and completeness.
Criterion 96: Conduct functional performance testing to verify the correctness of the system under test (SUT) as per the pre-defined metrics.
This includes testing for fairness and bias to inform affirmative actions.
For off-the-shelf systems, consider benchmark testing using industry-standard benchmark suites and comparisons with competing AI systems.
Criterion 97: Perform controllability testing to verify human oversight and control, and system control requirements.
Criterion 98: Perform explainability and transparency testing as per the requirements.
This involves:
Criterion 99: Perform calibration testing as per the requirements.
This involves:
Criterion 100: Perform logging tests as per the requirements.
This involves verifying that the system records:
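A logging test of this kind can be sketched as a completeness check over required record fields. The field names below are illustrative assumptions, not fields mandated by the criterion.

```python
# Sketch: verify that a system log record contains all required
# fields (field names are hypothetical examples).

REQUIRED_FIELDS = {"timestamp", "input_hash", "model_version",
                   "output", "confidence"}

def log_record_complete(record: dict) -> bool:
    """Return True when every required field is present in the record."""
    return REQUIRED_FIELDS.issubset(record.keys())

record = {
    "timestamp": "2024-01-01T00:00:00Z",
    "input_hash": "abc123",
    "model_version": "v1.1",
    "output": "approve",
    "confidence": 0.87,
}
ok = log_record_complete(record)
```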
Criterion 101: Test the computational performance of the system.
This includes:
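One common computational performance test is a latency check against a budget. The sketch below times repeated calls and reports a 95th-percentile figure; the stand-in inference function and the budget value are assumptions.

```python
# Sketch: measure p95 latency of repeated inference calls.
import time

def p95_latency_ms(fn, runs: int = 100) -> float:
    """Return the 95th-percentile latency of fn over `runs` calls."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return samples[int(0.95 * (len(samples) - 1))]

def dummy_inference():
    # Stand-in for a real model call (hypothetical workload).
    sum(range(1000))

latency = p95_latency_ms(dummy_inference)
```

A real test would call the deployed model and compare `latency` against the system's pre-defined latency budget.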
Criterion 102: Test safety measures through negative testing methods, failure testing, and fault injection.
This includes:
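Fault injection can be sketched as wrapping the model call so a failure can be triggered on demand, then checking that the surrounding system degrades safely. The exception type, fallback behaviour, and function names are illustrative assumptions.

```python
# Sketch of fault injection: force a model failure and verify the
# system falls back to a safe default rather than crashing.

class ModelUnavailable(Exception):
    pass

def model_call(x, inject_fault: bool = False):
    if inject_fault:
        raise ModelUnavailable("injected fault")
    return x * 2  # stand-in for real inference

def safe_predict(x, inject_fault: bool = False):
    """System-level wrapper: on model failure, return a safe default."""
    try:
        return model_call(x, inject_fault)
    except ModelUnavailable:
        return None  # safe default; a real system might alert or retry

ok = safe_predict(3)                            # normal path
fallback = safe_predict(3, inject_fault=True)   # negative test
```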
Criterion 103: Test the reliability of AI output through stress testing over an extended period, simulating edge cases, and operating under extreme conditions.
Criterion 104: Undertake adversarial testing (red team testing), attempting to break security and privacy measures to identify weaknesses.
AI-specific attacks can be executed before, during, and after training.
Examples of attacks that can be made before and during training include:
Examples of attacks that can be made after training include:
Criterion 105: Verify compliance with relevant policies, frameworks, and legislation.
Criterion 106: Verify conformance against organisation and industry-specific coding standards.
This includes static and dynamic source code analysis. While agencies may use traditional analysis tools for the whole system, these tools have limitations with respect to AI models; consider tools built specifically for analysing AI model code.
Criterion 107: Perform vulnerability testing to identify any well-known vulnerabilities.
This includes: