Statement 25: Implement continuous improvement frameworks
Agencies must
Criterion 89: Establish interface tools and feedback channels for machines and humans.
This involves providing appropriate human-machine interface tools for human interrogation and oversight.
Criterion 90: Perform model version control.
AI model versioning and tracking are key to comparing performance over time, identifying factors affecting performance, and updating the model when needed.
AI model tracking and versioning can involve:
- dataset tracking and versioning
- model tracking and versioning – each trained model can have the following details:
- algorithm, learning type and hyperparameter settings
- compile-time parameters
- the tools or version of tools used to compile the model.
Set up rollback options to historical models to provide a safety net in the AI system, reduce the risk of deploying new models, and provide deployment flexibility (a minimal registry sketch follows).
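The following is an illustrative Python sketch of model version tracking with a rollback lookup. The ModelRegistry class, its fields, and the file-based storage are assumptions for demonstration only; in practice agencies would use dedicated model-tracking tooling.

```python
import json
import time
from pathlib import Path

class ModelRegistry:
    """Minimal file-based registry; illustrative only."""

    def __init__(self, root: str = "model_registry"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def register(self, version: str, metadata: dict) -> None:
        # Record algorithm, learning type, hyperparameters and tooling.
        record = {"version": version, "registered_at": time.time(), **metadata}
        (self.root / f"{version}.json").write_text(json.dumps(record, indent=2))

    def versions(self) -> list:
        return sorted(p.stem for p in self.root.glob("*.json"))

    def rollback_target(self, current: str):
        """Return the most recent version prior to `current`, if any."""
        older = [v for v in self.versions() if v < current]
        return older[-1] if older else None

registry = ModelRegistry()
registry.register("v1.0", {"algorithm": "gradient_boosting",
                           "learning_type": "supervised",
                           "hyperparameters": {"max_depth": 6},
                           "compile_tool": "onnxruntime 1.17"})
registry.register("v1.1", {"algorithm": "gradient_boosting",
                           "learning_type": "supervised",
                           "hyperparameters": {"max_depth": 8},
                           "compile_tool": "onnxruntime 1.17"})
print(registry.rollback_target("v1.1"))  # -> v1.0
```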
Statement 26: Adapt test strategies and practices for AI systems
Agencies must
Criterion 91: Mitigate bias in the testing process.
This includes:
- differentiating test data for formal testing from the data used during model development
- ensuring test subjects are not involved in the development of the SUT
- providing testers and developers a degree of independence from each other
- using test-driven development, such as designing test cases based on the requirements and prior to implementing data, models, and non-AI components
- conducting peer reviews and bias awareness training, and documenting test-related decisions and processes.
Criterion 92: Define test criteria approaches.
Test criteria determine whether a test case has passed or failed, typically by comparing an actual output with an expected output.
This includes:
- considering statistical techniques – in probabilistic and non-deterministic systems, a single test case run may not be sufficient to determine success or failure. It may involve repeating a test case multiple times and defining thresholds, minimum values, and average efficiency (see the sketch after this list). An example of a statistical approach in a regulatory setting is the EU's General Safety Regulation on Driver Drowsiness
- performing baseline testing against a reference system or operation – this is useful where there are no comprehensive specifications defining the expected output. It entails comparing the AI system's behaviour and performance against a reference system, which may be non-AI, manual processing, robotic process automation, or an earlier system version
- considering metamorphic testing – for GenAI or systems where an exact output cannot be specified for a given input. Metamorphic testing involves verifying relations: rules about how the output should change relative to how the input is changed. Examples include semantic consistency, intended style and tone, and preservation of facts. Verification can be done computationally (for example, through vector similarity) or through user acceptance testing.
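As an illustration of the statistical approach above, the sketch below repeats a test case and applies a pass threshold to the success rate. The run_sut function is a hypothetical stand-in for invoking the real system, and the probability, run count, and threshold are illustrative.

```python
import random

def run_sut(rng: random.Random) -> bool:
    # Stand-in for one execution of a non-deterministic system under test.
    return rng.random() < 0.93  # simulated per-run success probability

def statistical_verdict(runs: int = 100, threshold: float = 0.90) -> bool:
    """Pass the test case only if the success rate over repeated runs
    meets a pre-defined minimum threshold."""
    rng = random.Random(0)  # fixed seed so the example is reproducible
    successes = sum(run_sut(rng) for _ in range(runs))
    rate = successes / runs
    print(f"success rate over {runs} runs: {rate:.2f} (threshold {threshold})")
    return rate >= threshold

print("test case passed:", statistical_verdict())
```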
Agencies should
Criterion 93: Define how test coverage will be measured.
This includes:
- identifying the limitations of existing static and dynamic coverage measures
- ensuring there is a method for measuring code coverage, including coverage of code added around the model (such as pre-processing, post-processing, prompts, and integrated components) and model-specific coverage. Code coverage tools can surface untested code and reduce defects or harms
- tracing test cases against requirements, design, and risks to check for gaps and demonstrate test coverage (a traceability sketch follows this list).
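A traceability check can be as simple as set arithmetic over requirement identifiers, as in the sketch below. The IDs and mappings are hypothetical; agencies would draw them from their own test management tooling.

```python
# Hypothetical requirement/risk identifiers and test case mappings.
requirements = {"REQ-01", "REQ-02", "REQ-03", "RISK-07"}
test_cases = {
    "TC-001": {"REQ-01"},
    "TC-002": {"REQ-01", "REQ-02"},
}

covered = set().union(*test_cases.values())
gaps = requirements - covered
print(f"covered: {sorted(covered)}")
print(f"uncovered requirements and risks: {sorted(gaps)}")  # REQ-03, RISK-07
```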
Criterion 94: Define a strategy to ensure test adequacy.
Achieving full test coverage for AI systems may be challenging or not viable. To maximise test coverage, consider:
- performing combinatorial testing where comprehensive testing of inputs, preconditions, and corresponding outputs is not feasible due to the complexity of real-world combinations. Combinatorial testing involves deriving test data by sampling from an extremely large set of possible inputs, preconditions, and outputs (a sampling sketch follows this list)
- automating as much as practicable. Given the volume of testing required to validate an AI system during development and maintenance, manual testing alone cannot achieve sufficient coverage
- performing appropriate virtual and real-world testing. This includes:
- using a mix of data sources such as real-world data and synthetic data
- understanding limitations when using synthetic data to test AI systems
- considering requirements for embedded systems, such as model-in-the-loop, software-in-the-loop, processor-in-the-loop, and hardware-in-the-loop
- understanding limitations of the scope and data sampling size for real-world testing
- documenting virtual environment assumptions and correlations with the real world
- documenting the quantity of real-world testing and the rationale for selecting the real-world test cases
- determining the statistical significance of the test results while considering limitations.
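The sketch below illustrates the combinatorial sampling referred to in the first point: it enumerates a (deliberately small) input space and samples test cases from it. The parameter names and values are hypothetical, and real systems would typically use dedicated pairwise or all-pairs tools rather than simple random sampling.

```python
import itertools
import random

# Hypothetical input parameters for an AI system.
params = {
    "language": ["en", "fr", "ar"],
    "input_length": ["short", "medium", "long"],
    "user_role": ["citizen", "staff"],
    "device": ["mobile", "desktop"],
}

# The full cartesian product is 3 * 3 * 2 * 2 = 36 combinations here, but
# real input spaces can be far too large to test exhaustively.
all_combos = list(itertools.product(*params.values()))
random.seed(42)  # reproducible test data selection
for combo in random.sample(all_combos, k=10):
    case = dict(zip(params.keys(), combo))
    print(case)  # each sampled case would be executed against the SUT
```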
Statement 27: Test for specified behaviour
Agencies must
Criterion 95: Undertake human verification of test design and implementation for correctness, consistency, and completeness.
Criterion 96: Conduct functional performance testing to verify the correctness of the SUT as per the pre-defined metrics.
This includes testing for fairness and bias to inform affirmative actions (a minimal fairness check is sketched below).
For off-the-shelf systems, consider benchmark testing using industry-standard benchmark suites and comparisons with competing AI systems.
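The sketch below illustrates one simple fairness check, comparing selection rates between two groups defined by a protected attribute. The predictions, group definitions, and threshold are all hypothetical; real testing would use the agency's pre-defined metrics.

```python
def selection_rate(preds):
    # Proportion of positive (selected) outcomes in a group.
    return sum(preds) / len(preds)

# Hypothetical binary predictions grouped by a protected attribute.
preds_group_a = [1, 0, 1, 1, 0, 1, 1, 0]
preds_group_b = [1, 0, 0, 0, 0, 1, 0, 0]

gap = abs(selection_rate(preds_group_a) - selection_rate(preds_group_b))
print(f"demographic parity difference: {gap:.3f}")
assert gap <= 0.4, "fairness threshold exceeded"  # illustrative threshold
```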
Criterion 97: Perform controllability testing to verify human oversight and control, and system control requirements.
Criterion 98: Perform explainability and transparency testing as per the requirements.
This involves:
- testing that AI outputs are understandable for the target audience, ensuring diversity of test subjects and representativeness of the target population
- testing that the right information is available for the right user.
Criterion 99: Perform calibration testing as per the requirements.
This involves:
- measuring functional performance across various operating or installation conditions
- testing that changes in calibration parameters are detected
- testing that any out-of-range calibration parameters are rejected by the AI system in a transparent and explainable way.
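As an illustration of the last point, the sketch below rejects out-of-range calibration parameters with a transparent, explainable error message. The parameter names and permitted ranges are hypothetical.

```python
# Hypothetical calibration parameters and their permitted ranges.
CALIBRATION_RANGES = {
    "sensor_gain": (0.5, 2.0),
    "confidence_threshold": (0.0, 1.0),
}

def apply_calibration(params: dict) -> None:
    for name, value in params.items():
        lo, hi = CALIBRATION_RANGES[name]
        if not lo <= value <= hi:
            # Reject transparently: name the parameter, value and range.
            raise ValueError(
                f"calibration parameter {name}={value} is outside the "
                f"permitted range [{lo}, {hi}]; change rejected")
    print(f"calibration accepted: {params}")

apply_calibration({"sensor_gain": 1.2})               # accepted
try:
    apply_calibration({"confidence_threshold": 1.5})  # rejected
except ValueError as err:
    print(err)
```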
Criterion 100: Perform logging tests as per the requirements.
This involves verifying that the system records:
- system warnings and errors
- relevant system changes with corresponding details of who made the change, timestamp, and system version.
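A logging test can assert that each change record carries the mandatory fields. The sketch below assumes JSON log records; the field names are illustrative.

```python
import json

REQUIRED_FIELDS = {"event", "changed_by", "timestamp", "system_version"}

def missing_fields(record: dict) -> list:
    """Return the mandatory fields absent from a system-change log entry."""
    return sorted(REQUIRED_FIELDS - record.keys())

sample = ('{"event": "model_updated", "changed_by": "jsmith", '
          '"timestamp": "2025-03-01T10:15:00Z"}')
gaps = missing_fields(json.loads(sample))
print(f"missing fields: {gaps}")  # the test should flag system_version
```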
Statement 28: Test for safety, robustness, and reliability
Agencies must
Criterion 101: Test the computational performance of the system.
This includes:
- testing for response times, latency, and resource usage under various loads
- network and hardware load testing.
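The sketch below shows a simple response-time check against a 95th-percentile latency budget. The call_system function is a hypothetical stand-in for a real request, and the budget and request count are illustrative.

```python
import statistics
import time

def call_system() -> None:
    time.sleep(0.01)  # stand-in for real inference work

def latency_test(requests: int = 50, p95_budget_s: float = 0.05) -> None:
    latencies = []
    for _ in range(requests):
        start = time.perf_counter()
        call_system()
        latencies.append(time.perf_counter() - start)
    p95 = statistics.quantiles(latencies, n=20)[-1]  # ~95th percentile
    print(f"p95 latency: {p95 * 1000:.1f} ms")
    assert p95 <= p95_budget_s, "latency budget exceeded under this load"

latency_test()
```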
Criterion 102: Test safety measures through negative testing methods, failure testing, and fault injection.
This includes:
- testing for incorrect or harmful inputs.
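Negative tests assert that the system rejects bad inputs rather than failing silently. The sketch below uses pytest; the classify entry point and its validation rules are hypothetical.

```python
import pytest

def classify(text: str) -> str:
    # Hypothetical entry point with basic input validation.
    if not isinstance(text, str) or not text.strip():
        raise ValueError("input must be non-empty text")
    if len(text) > 10_000:
        raise ValueError("input exceeds maximum supported length")
    return "ok"  # placeholder for real model inference

@pytest.mark.parametrize("bad_input", ["", "   ", "x" * 10_001])
def test_rejects_invalid_input(bad_input):
    with pytest.raises(ValueError):
        classify(bad_input)
```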
Criterion 103: Test the reliability of AI output through stress testing over an extended period, simulating edge cases, and operating under extreme conditions.
Agencies should
Criterion 104: Undertake adversarial testing (red team testing), attempting to break security and privacy measures to identify weaknesses.
AI-specific attacks can be executed before, during, and after training.
Examples of attacks that can be made before and during training include:
- dataset poisoning
- algorithm poisoning
- model poisoning
- backdoor attacks.
Examples of attacks that can be made after training include:
- input attack and evasion
- reverse engineering the model and data.
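The sketch below illustrates an input (evasion) attack on a toy linear classifier: each feature is nudged against its weight within a small budget to try to flip the decision. The model, input, and budget are all illustrative; real red-team testing uses purpose-built tooling and the deployed system.

```python
# Toy linear classifier; weights and bias are illustrative.
weights = [0.8, -0.4, 0.3]
bias = -0.1

def predict(x) -> int:
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return int(score > 0)

x = [0.6, 0.1, 0.4]
print("original prediction:", predict(x))  # 1

# Greedy evasion: perturb each feature against its weight within budget.
epsilon = 0.4
adversarial = [v - epsilon if w > 0 else v + epsilon
               for v, w in zip(x, weights)]
print("adversarial prediction:", predict(adversarial))  # flips to 0
```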
Statement 29: Test for conformance and compliance
Agencies must
Criterion 105: Verify compliance with relevant policies, frameworks, and legislation.
Criterion 106: Verify conformance against organisation and industry-specific coding standards.
This includes static and dynamic source code analysis. While agencies may use traditional analysis tools for the whole system, it is important to note their limitations with respect to AI models and to consider tools built specifically for AI models.
Criterion 107: Perform vulnerability testing to identify any well-known vulnerabilities.
This includes testing the entire AI system.
Statement 30: Test for intended and unintended consequences
Agencies must
Criterion 108: Perform user acceptance testing (UAT) and scenario testing, validating the system with a diversity of end-users in their operating contexts and real-world scenarios.
Agencies should
Criterion 109: Perform robust regression testing to mitigate the heightened risk of escaped defects resulting from changes, such as a step change in parameters.
Traditional software regression testing is insufficient.
This may include:
- back-to-back testing to compare two versions of the system or software using historical data (sketched below)
- A/B software testing to simultaneously compare multiple versions in a real-world setting. This allows agencies to assess the impact of a specific model or software package on the overall system in its intended operating environment.
- performance regression, checking for any degradation in model accuracy, fairness, or other key metrics.
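The sketch below illustrates the back-to-back approach from the first point: both model versions run over the same historical inputs and every divergence is surfaced for review. The model_v1 and model_v2 functions and the data are hypothetical.

```python
def model_v1(x: float) -> int:
    return int(x > 0.5)   # currently deployed version

def model_v2(x: float) -> int:
    return int(x > 0.55)  # candidate with a step change in a parameter

historical_inputs = [0.1, 0.52, 0.54, 0.7, 0.9]

divergences = [(x, model_v1(x), model_v2(x))
               for x in historical_inputs
               if model_v1(x) != model_v2(x)]
print(f"{len(divergences)} divergent cases: {divergences}")
# Each divergence is reviewed: expected improvement or escaped defect?
```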
Statement 31: Undertake integration planning
Agencies should
Criterion 110: Ensure the AI system meets architecture and operational requirements, in line with the Australian Government Security Authority to Operate (SATO).
This aspect of integration planning includes:
- assessing the AI system and its third-party dependencies against the agency’s requirements to identify risks
- assessing the AI system against the agency’s architecture principles
- identifying any gaps between the agency’s current and target infrastructure to support the AI system
- ensuring the AI system meets security and privacy requirements for handling classified data.
Criterion 111: Identify suitable tests for integration with the operational environment, systems, and data.
This includes:
- ensuring robust test methods are selected
- incorporating automated testing processes
- ensuring that environment controls satisfy security and privacy requirements for the data in the AI system.
Statement 32: Manage integration as a continuous practice