Agencies must
Criterion 76: Establish compute resources and infrastructure for the training environment.
This allows infrastructure and computational constraints to be considered in relation to business needs, and supports configuring learning strategies that are best optimised for the infrastructure environment.
Criterion 77: Secure the infrastructure.
Implement required security and access controls for infrastructure used for training, validating, and testing the AI model which are dependent on the security classification of the data. For details, see the Information security manual (ISM), Essential Eight maturity model, Protective Security Policy Framework and Strategies to mitigate cyber security incidents.
Agencies should
Criterion 78: Reuse approved AI modelling frameworks, libraries, and tools.
Statement 22: Implement model creation, tuning, and grounding
Agencies must
Criterion 79: Set assessment criteria for the AI model, with respect to pre-defined metrics for the AI system.
These criteria should address:
- success factors specific to user stories
- model quality thresholds and performance of the AI system
- explainability and interpretability requirements
- security and privacy requirements
- ethics requirements
- tolerance for error for model outputs
- tolerance for negative impacts
- error rates at scale, compared with similar processing performed by humans.
Considerations for modelling include:
- model training, maintenance, and support costs
- data and compute infrastructure constraints
- likelihood of the AI models becoming outdated
- whether the model can be legally used for the intended use case
- whether methods can be implemented to mitigate risk of new harms being introduced into the AI system
- bias, security, and ethical concerns
- whether the model meets the explainability and interpretability requirements
- use of model interpretability tools to analyse important features and decision logic.
Criterion 80: Identify and address situations when AI outputs should not be provided.
These situations include:
- low confidence scores
- when user input and context are ambiguous or lack reliable sources
- complex questions as input
- limited knowledge base
- privacy concerns and potential breach of safety
- harmful content
- unlawful content
- misleading content.
For GenAI, implementing techniques such as threshold settings or content filtering could address these situations.
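As an illustration, the following minimal Python sketch gates outputs on a confidence threshold combined with a simple content filter. The threshold value, blocklist, and function names are assumptions for illustration only, not prescribed settings.

```python
# Minimal sketch of output gating for a GenAI system. The threshold value,
# blocklist, and function names are illustrative assumptions only.
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.75               # assumed tolerance; set per Criterion 79
BLOCKED_TERMS = {"example-harmful-term"}  # placeholder content filter list

@dataclass
class ModelOutput:
    text: str
    confidence: float   # e.g. a calibrated probability or logit-derived score

def gate_output(output: ModelOutput) -> str:
    """Withhold the AI output when the situations in Criterion 80 apply."""
    if output.confidence < CONFIDENCE_THRESHOLD:
        return "I can't answer this reliably. Please rephrase or contact support."
    if any(term in output.text.lower() for term in BLOCKED_TERMS):
        return "This response was withheld by the content filter."
    return output.text

print(gate_output(ModelOutput(text="The office opens at 9 am.", confidence=0.92)))
print(gate_output(ModelOutput(text="Uncertain answer.", confidence=0.40)))
```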
Criterion 81: Apply considerations for reusing existing agency models, off-the-shelf, and pre-trained models.
These include:
- whether the model can be adapted to meet the KPIs for the AI system
- suitability of pre-defined AI architecture
- availability of AI specialist skills or skills required for configuration and integration
- whether the model is relevant to the target operating domain or can be adapted to it, such as fine-tuning, retrieval-augmented generation (RAG), and pre-processing and post-processing techniques
- cybersecurity assessment in line with Australian Government policies and guidance (see Whole of AI Lifecycle for more details).
Criterion 82: Create or fine-tune models optimised for target domain environment.
This includes:
- testing the model in the target operating environment and on the target infrastructure
- using pre-processing and post-processing techniques
- addressing input and output filtering requirements for safety and reliability
- grounding, such as RAG, which can augment a large language model (LLM) with trusted data from a database or knowledge base internal to an agency (see the sketch after this list)
- for GenAI, prompt engineering or establishing a prompt library, which can streamline and improve interactions with an AI model
- considering the cost and performance implications of the adaptation techniques
- performing unit testing for the training, pre-processing, and post-processing algorithms
- tracking model training implementations systematically to speed up the discovery and development of models.
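The following is a minimal, dependency-free sketch of RAG-style grounding. The knowledge base, overlap-based retrieval, and prompt template are illustrative assumptions; a production system would typically use vector embeddings and an LLM API.

```python
# Minimal sketch of grounding via retrieval-augmented generation (RAG).
# The knowledge base, scoring method, and prompt template are illustrative
# assumptions; production systems would use vector search and an LLM API.

KNOWLEDGE_BASE = [
    "Payments are processed within 5 business days.",
    "Appeals must be lodged within 28 days of a decision.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank internal documents by naive token overlap with the query."""
    q_tokens = set(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q_tokens & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_grounded_prompt(query: str) -> str:
    """Augment the user query with trusted agency context before the LLM call."""
    context = "\n".join(retrieve(query))
    return (
        "Answer using only the context below. If the context is "
        f"insufficient, say so.\nContext:\n{context}\nQuestion: {query}"
    )

print(build_grounded_prompt("How long do payments take to process?"))
```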
Agencies should
Criterion 83: Create and train using multiple model architectures and learning strategies.
Systematically track model training implementations to speed up the discovery and development of models. This will help select the best-performing trained model.
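A minimal sketch of systematic run tracking is below. The registry format and field names are assumptions; purpose-built experiment tracking tools offer equivalent functionality.

```python
# Minimal sketch of systematically tracking model training runs so that
# architectures and learning strategies can be compared later. Field names
# are illustrative assumptions, not a prescribed schema.
import json
import time
import uuid

def log_training_run(architecture: str, hyperparams: dict, metrics: dict,
                     registry_path: str = "training_runs.jsonl") -> str:
    """Append one training run to a JSON Lines registry and return its ID."""
    run_id = str(uuid.uuid4())
    record = {
        "run_id": run_id,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "architecture": architecture,
        "hyperparams": hyperparams,
        "metrics": metrics,
    }
    with open(registry_path, "a") as fh:
        fh.write(json.dumps(record) + "\n")
    return run_id

log_training_run("gradient_boosting", {"max_depth": 6}, {"f1": 0.87})
```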
Statement 23: Validate, assess, and update model
Agencies must
Criterion 84: Set techniques to validate AI trained models.
There are multiple qualitative and quantitative techniques and tools for model validation, informed by the AI system success criteria (see Design section), including:
- checking for correct classifications, predictions or forecasts, and for factual correctness and relevance
- verifying the model can identify positive and negative instances and distinguish between classes
- benchmarking
- checking consistency, clarity, and coherence of responses
- verifying source attribution
- data-centric validation approaches for GenAI models.
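As a quantitative illustration, the sketch below computes common classification validation metrics with scikit-learn. The sample labels and the acceptance threshold are assumptions for illustration.

```python
# Minimal sketch of quantitative validation for a classifier using
# scikit-learn metrics; data and thresholds are illustrative assumptions
# tied to the success criteria defined at design time.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # held-out validation labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # positive-instance identification
print("recall   :", recall_score(y_true, y_pred))
print("confusion matrix (class discrimination):")
print(confusion_matrix(y_true, y_pred))

assert accuracy_score(y_true, y_pred) >= 0.7, "below assumed acceptance threshold"
```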
Criterion 85: Evaluate the model against training boundaries.
Evaluation considerations include:
- poor or degraded performance of the model
- change of AI context or operational setting
- data retention policies
- model retention policies.
Criterion 86: Evaluate the model for bias, implement and test bias mitigations.
This includes:
- using suitable tools that test and discover unwarranted associations between an algorithm’s protected input features and its output
- evaluating performance across suitable and intersectional dimensions
- checking if bias could be managed through updating the training data (see Statement 18)
- implementing bias mitigation thresholds that can be configured post-deployment
- implementing pre-processing or post-processing techniques such as disparate impact remover, equalised odds post-processing, content filtering, and RAG.
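One widely used quantitative check is the disparate impact ratio, sketched below. The 0.8 threshold reflects the common four-fifths rule; the group labels and data are illustrative assumptions.

```python
# Minimal sketch of a disparate impact check across a protected attribute.
# The 0.8 threshold follows the common 'four-fifths rule'; group labels
# and data are illustrative assumptions.
import numpy as np

outcomes = np.array([1, 1, 0, 1, 0, 1, 0, 0])   # 1 = favourable model decision
group    = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

rate_a = outcomes[group == "a"].mean()   # privileged group rate
rate_b = outcomes[group == "b"].mean()   # unprivileged group rate
disparate_impact = rate_b / rate_a

print(f"disparate impact ratio: {disparate_impact:.2f}")
if disparate_impact < 0.8:               # configurable mitigation threshold
    print("Potential bias detected: apply mitigation and re-evaluate.")
```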
Agencies should
Criterion 87: Identify relevant model refinement methods.
Evaluation outcomes may trigger model refinement or retirement. Relevant refinement methods can include:
- model parameter or weight adjustments – further training or re-training the model on a new set of observations, or additional training data
- adjusting data pre-processing or post-processing components
- model pruning – to reduce redundant mathematical calculations and speed up operations.
Statement 24: Select trained models
Agencies should
Criterion 88: Assess a pool of trained models against acceptance metrics to select a model for the AI system.
This involves:
- defining clear needs and expectations
- comparing multiple trained models, usually generated based on different configurations
- prioritising based on metrics such as ‘simplest’ or ‘most effective’
- documenting the rationale for selection based on results from training models with various model architectures, learning strategies, and configurations
- any risk and mitigation plans
- a model refresh and re-training plan and register
- implementing mechanisms for explainability of model outputs to system users
- feedback channels and mechanisms implemented for monitoring and managing model performance
- an audit plan
- documenting a method for retiring the model.
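The sketch below illustrates one possible selection policy: among models that meet the acceptance metrics, prefer the simplest. Metric names, thresholds, and the complexity measure are assumptions for illustration.

```python
# Minimal sketch of selecting one model from a pool against acceptance
# metrics, preferring the simplest model among those meeting thresholds.
# Metric names, thresholds, and the complexity measure are assumptions.
candidates = [
    {"name": "logreg",   "f1": 0.86, "latency_ms": 4,  "params": 1_200},
    {"name": "xgboost",  "f1": 0.91, "latency_ms": 11, "params": 48_000},
    {"name": "deep_net", "f1": 0.92, "latency_ms": 40, "params": 3_500_000},
]

ACCEPTANCE = {"f1": 0.85, "latency_ms": 50}   # pre-defined acceptance metrics

acceptable = [m for m in candidates
              if m["f1"] >= ACCEPTANCE["f1"]
              and m["latency_ms"] <= ACCEPTANCE["latency_ms"]]

# Tie-break policy: among acceptable models, prefer the simplest.
selected = min(acceptable, key=lambda m: m["params"])
print("selected model:", selected["name"])    # document the rationale alongside
```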
Statement 25: Implement continuous improvement frameworks
Agencies must
Criterion 89: Establish interface tools and feedback channels for machines and humans.
This also involves providing appropriate human-machine interface tools for human interrogation and oversight.
Criterion 90: Perform model version control.
AI model versioning and tracking are key to comparing performance over time, identifying factors affecting performance, and updating the model when needed.
AI model tracking and versioning can involve:
- dataset tracking and versioning
- model tracking and versioning – each trained model can record:
  - algorithm, learning type, and hyperparameter settings
  - compile-time parameters
  - the tools, and versions of tools, used to compile the model.
Set up rollback options to historical models to help build safety nets into the AI system, reduce the risk of deploying new models, and provide deployment flexibility.
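A minimal sketch of a version registry with rollback is below. The structure and field names are assumptions; agencies would typically use an established model registry product instead.

```python
# Minimal sketch of a model version registry with rollback, as a safety net
# when deploying new models. Structure and field names are assumptions.
class ModelRegistry:
    def __init__(self):
        self._versions: list[dict] = []   # ordered history of deployed models

    def register(self, version: str, algorithm: str, hyperparams: dict,
                 dataset_version: str, tool_versions: dict) -> None:
        self._versions.append({
            "version": version, "algorithm": algorithm,
            "hyperparams": hyperparams, "dataset_version": dataset_version,
            "tool_versions": tool_versions,
        })

    def current(self) -> dict:
        return self._versions[-1]

    def rollback(self) -> dict:
        """Revert to the previous historical model version."""
        if len(self._versions) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._versions.pop()
        return self.current()

registry = ModelRegistry()
registry.register("1.0", "random_forest", {"n_estimators": 100},
                  "data-v3", {"sklearn": "1.4"})
registry.register("1.1", "random_forest", {"n_estimators": 300},
                  "data-v4", {"sklearn": "1.4"})
print(registry.rollback()["version"])   # -> 1.0
```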
Statement 26: Adapt test strategies and practices for AI systems
Agencies must
Criterion 91: Mitigate bias in the testing process.
This includes:
- differentiating test data for formal testing from the data used during model development
- ensuring test subjects are not involved in the development of the system under test (SUT)
- providing testers and developers a degree of independence from each other
- using test-driven development, such as designing test cases based on the requirements and prior to implementing data, models, and non-AI components
- conducting peer reviews, bias awareness training, and documenting test-related decisions and processes.
Criterion 92: Define test criteria approaches.
Test criteria determine whether a test case has passed or failed. This involves comparing an actual output with an expected output.
This includes:
- considering statistical techniques – in probabilistic and non-deterministic systems, a single test case run may not be sufficient to determine success or failure. It may involve repeating a test case multiple times and defining thresholds, minimum values, and average efficiency. An example of a statistical approach in a regulatory setting is the EU General Safety Regulation's requirements on driver drowsiness detection (a sketch of a statistical pass criterion follows this list)
- performing baseline testing against a reference system or operation – this is useful in cases where there are no comprehensive specifications to define expected output. This entails comparing the AI system’s behaviour and performance against a reference system. The reference system may be non-AI, manual processing, robotic process automation, or earlier system versions
- considering metamorphic testing – for GenAI or systems where an exact output cannot be specified for a given input. Metamorphic testing involves verifying relations, which are rules about how the output should change relative to how the input is changed. Examples include semantic consistency, intended style and tone, and preservation of facts. This can be done computationally through vectors or user acceptance testing.
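To illustrate the statistical technique above, the sketch below repeats a test case and applies a minimum pass rate as the pass/fail criterion. The stubbed system, run count, and threshold are assumptions for illustration.

```python
# Minimal sketch of a statistical pass/fail criterion for a non-deterministic
# system: repeat the test case and require a minimum pass rate. The stubbed
# system, run count, and threshold are illustrative assumptions.
import random

random.seed(0)   # reproducible runs for the illustration

def system_under_test(prompt: str) -> str:
    """Stand-in for a probabilistic AI component (assumption, for illustration)."""
    return "approved" if random.random() > 0.1 else "declined"

def statistical_test(runs: int = 100, min_pass_rate: float = 0.8) -> bool:
    passes = sum(system_under_test("standard case") == "approved"
                 for _ in range(runs))
    pass_rate = passes / runs
    print(f"pass rate over {runs} runs: {pass_rate:.2%}")
    return pass_rate >= min_pass_rate

assert statistical_test(), "test case failed against the statistical criterion"
```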
Agencies should
Criterion 93: Define how test coverage will be measured.
This includes:
- identifying the limitations of existing static and dynamic coverage measures
- ensuring there is a method for measuring code coverage. This may include coverage of added code, such as pre-processing, post-processing, prompts, and integrated components, as well as model-specific coverage. Code coverage tools can surface untested code and reduce defects or harms
- tracing test cases against requirements, design, and risks to check for gaps and demonstrate test coverage.
Criterion 94: Define a strategy to ensure test adequacy.
Achieving full test coverage for AI systems may be challenging and not viable. To maximise test coverage, consider:
- performing combinatorial testing where comprehensive testing of inputs, preconditions, and corresponding outputs is not feasible due to the complexity of real-world combinations. Combinatorial testing involves deriving test data by sampling from an extremely large set of possible inputs, preconditions, and outputs (see the sketch after this list)
- automating as much as practicable. Given the volume of testing required to validate an AI system during the development and maintenance phases, manual testing alone cannot achieve sufficient coverage
- performing appropriate virtual and real-world testing. This includes:
  - using a mix of data sources such as real-world data and synthetic data
  - understanding limitations when using synthetic data to test AI systems
  - considering requirements for embedded systems, such as model-in-the-loop, software-in-the-loop, processor-in-the-loop, and hardware-in-the-loop
  - understanding limitations of the scope and data sampling size for real-world testing
  - documenting virtual environment assumptions and correlations with the real world
  - documenting the quantity of real-world testing and the rationale for selecting the real-world test cases
- determining the statistical significance of the test results, while considering limitations.
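The sketch below illustrates combinatorial testing by sampling from a parameter space too large to test exhaustively. Parameter names and values are assumptions for illustration.

```python
# Minimal sketch of combinatorial testing: sample test inputs from a large
# space of parameter combinations that is infeasible to test exhaustively.
# Parameter names and values are illustrative assumptions.
import itertools
import random

parameters = {
    "user_age_band": ["18-25", "26-40", "41-65", "65+"],
    "channel": ["web", "mobile", "phone"],
    "language": ["en", "ar", "vi", "zh"],
    "document_type": ["passport", "licence", "none"],
}

all_combinations = list(itertools.product(*parameters.values()))
print("full space:", len(all_combinations), "combinations")   # 4*3*4*3 = 144

random.seed(42)   # reproducible sampling
sampled = random.sample(all_combinations, k=20)
for combo in sampled:
    test_case = dict(zip(parameters, combo))
    print(test_case)   # execute the SUT with each sampled test case here
```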
Statement 27: Test for specified behaviour
Agencies must
Criterion 95: Undertake human verification of test design and implementation for correctness, consistency, and completeness.
Criterion 96: Conduct functional performance testing to verify the correctness of the SUT as per the pre-defined metrics.
This includes testing for fairness and bias to inform affirmative actions.
For off-the-shelf systems, consider benchmark testing using industry-standard benchmark suites and comparisons with competing AI systems.
Criterion 97: Perform controllability testing to verify human oversight and control, and system control requirements.
Criterion 98: Perform explainability and transparency testing as per the requirements.
This involves:
- testing that AI outputs are understandable for the target audience, ensuring diversity of test subjects and representativeness of the target population
- testing that the right information is available for the right user.
Criterion 99: Perform calibration testing as per the requirements.
This involves:
- measuring functional performance across various operating or installation conditions
- testing that changes in calibration parameters are detected
- testing that any out-of-range calibration parameters are rejected by the AI system in a transparent and explainable way.
Criterion 100: Perform logging tests as per the requirements.
This involves verifying that the system records:
- system warnings and errors
- relevant system changes with corresponding details of who made the change, timestamp, and system version.
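A minimal logging test sketch follows; the log record structure and field names are assumptions for illustration.

```python
# Minimal sketch of a logging test verifying that required fields are
# recorded for system changes. The log record structure is an assumption.
import unittest

def record_change(log: list, user: str, timestamp: str,
                  system_version: str, description: str) -> None:
    log.append({"user": user, "timestamp": timestamp,
                "system_version": system_version, "description": description})

class LoggingTests(unittest.TestCase):
    def test_change_record_has_required_fields(self):
        log = []
        record_change(log, "analyst-01", "2025-01-01T10:00:00", "2.3.1",
                      "updated bias mitigation threshold")
        entry = log[-1]
        for field in ("user", "timestamp", "system_version", "description"):
            self.assertIn(field, entry)
            self.assertTrue(entry[field])

if __name__ == "__main__":
    unittest.main()
```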
Statement 28: Test for safety, robustness, and reliability
Agencies must
Criterion 101: Test the computational performance of the system.
This includes:
- testing for response times, latency, and resource usage under various loads
- network and hardware load testing.
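The sketch below measures response times under simple repeated load and checks a 95th percentile latency budget. The stubbed inference call and the budget are assumptions for illustration.

```python
# Minimal sketch of measuring response time under repeated load. The latency
# budget and the stubbed inference call are illustrative assumptions.
import statistics
import time

def model_inference(payload: str) -> str:
    """Stand-in for the deployed model endpoint (assumption, for illustration)."""
    time.sleep(0.005)
    return "ok"

latencies = []
for _ in range(200):   # simple sequential load
    start = time.perf_counter()
    model_inference("request")
    latencies.append((time.perf_counter() - start) * 1000)

p95 = statistics.quantiles(latencies, n=20)[18]   # 95th percentile in ms
print(f"mean: {statistics.mean(latencies):.1f} ms, p95: {p95:.1f} ms")
assert p95 < 50, "exceeds assumed latency budget"
```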
Criterion 102: Test safety measures through negative testing methods, failure testing, and fault injection.
This includes:
- testing for incorrect or harmful inputs.
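A minimal negative testing sketch follows: each malformed or harmful input must be rejected rather than processed. The input cases and validation rules are assumptions for illustration.

```python
# Minimal sketch of negative testing and fault injection: verify the system
# rejects malformed or harmful inputs instead of failing silently. Inputs
# and the validation rules are illustrative assumptions.
def handle_request(payload) -> str:
    if not isinstance(payload, str) or not payload.strip():
        raise ValueError("rejected: input must be non-empty text")
    if len(payload) > 10_000:
        raise ValueError("rejected: input exceeds size limit")
    return "processed"

negative_cases = [None, "", "   ", 12345, "x" * 20_000]   # fault-injection inputs
for case in negative_cases:
    try:
        handle_request(case)
        raise AssertionError(f"harmful/incorrect input was accepted: {case!r}")
    except ValueError as err:
        print(f"correctly rejected {type(case).__name__}: {err}")
```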
Criterion 103: Test reliability of the AI output through stress testing over an extended period, simulating edge cases, and operating under extreme conditions.
Agencies should
Criterion 104: Undertake adversarial testing (red team testing), attempting to break security and privacy measures to identify weaknesses.
AI-specific attacks can be executed before, during, and after training.
Examples of attacks that can be made before and during training include:
- dataset poisoning
- algorithm poisoning
- model poisoning
- backdoor attacks.
Examples of attacks that can be made after training include:
- input attack and evasion
- reverse engineering the model and data.
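As an illustration of a post-training input attack probe, the sketch below perturbs blocked inputs to look for filter evasions. The keyword filter is a deliberately naive stand-in (an assumption); real red-team testing targets the deployed model and its safeguards.

```python
# Minimal sketch of a red-team evasion probe after training: perturb inputs
# that the filter correctly blocks and check whether small changes evade it.
# The keyword-based filter is a deliberately simple stand-in (assumption).
def content_filter(text: str) -> bool:
    """Return True if the text is blocked. Naive keyword matching (stub)."""
    return "forbidden" in text.lower()

base_input = "this contains forbidden content"
perturbations = [
    base_input.replace("o", "0"),               # character substitution
    "this contains f-o-r-b-i-d-d-e-n content",  # delimiter insertion
    base_input.upper(),                         # case change (filter handles this)
]

for attempt in perturbations:
    if not content_filter(attempt):
        print(f"evasion succeeded, filter weakness found: {attempt!r}")
```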
Statement 29: Test for conformance and compliance