Agencies must:
Criterion 44: Create and collect data for the AI system and identify the purpose for its use.
It is important to identify:
- what data will be used and is fit-for-purpose for the AI system
- the sensitivity of the data, such as whether it is personal, protected, or otherwise sensitive
- consent provided on usage including when to retain or destroy data, ensuring the proposed uses in the AI system align with the original limits of the consent
- speed and mode of the data supply
- how the data will be used at each stage of the AI system
- where the data will be stored at each stage of the AI system
- changes to the data at different points of the AI system
- methods to manage and monitor data access
- methods to manage any real-time data changes
- data retention policies
- cross-agency or cross-border data governance, if relevant
- any risks and challenges associated with data elements of off-the-shelf AI models, products, or services in the AI system
- cyber supply chain management
- data quality monitoring and remediation
- comprehensive documentation at each stage of the AI system to facilitate traceability and accountability
- adherence to relevant legislation.
The consent framework for use of data across the AI system should satisfy the following:
- it is clearly defined
- it is kept up to date
- individuals provide informed consent for how their data will be used
- a dedicated team owns and maintains a register of how data is being used and demonstrates compliance with the terms of the consent.
The data should be thought of in groupings or packages, including:
- the data within the organisation
- the data surrounding the algorithm, APIs, and user interface
- the data used to train the AI system
- the data used for testing and integration
- data input at regular intervals for monitoring purposes
- the data used at deployment, including input and output data from and to users.
Criterion 45: Plan for data archival and destruction.
Consider the following:
- whether data will be made available for future use, and what data
- the restrictions and access controls in place
- whether data will be restricted until a specific date
- file formats to ensure data remains available during the archival period
- alignment with data sharing arrangements
- arrangements for data used to train and test AI models, and associated model management arrangements
- clear criteria for data archival and destruction for the data used at each stage of the AI lifecycle
- the guidelines in Information management for records created using Artificial Intelligence (AI) technologies | naa.gov.au.
Agencies should:
Criterion 46: Analyse data for use by mapping the data supply chain and ensuring traceability.
Mapping the data supply chain to the AI system involves capturing how data will be stored, shared, and processed, particularly at the training and testing stages, which involve regular injections of data. When mapping the data, account for:
- how data was sourced
- what data is required by the system, ensuring that excess data, or data irrelevant to its functioning, is not consumed
- the amount and type of data the system will use
- what could affect the reliable accessibility of data
- how data will be fused and transformed
- how the data will be secured at rest and in transit
- how the data will be used by the system.
Ensuring traceability entails maintaining awareness of the flow of data across the AI system.
This includes:
- data sovereignty controls and considerations including legal implications for geographic locations for data (including its metadata and logs) when at rest, in transit, or in use. For classified data processing on cloud platforms, it is recommended to use cloud service providers and cloud services located in Australia, as per Cloud assessment and authorisation | Cyber.gov.au
- providing the level of detail needed to debug data errors and troubleshoot issues
- enforcing organisational policies on information management
- enhancing visibility of changes to the data that occur during migrations, system updates, or errors
- supporting users to identify and fix data issues with a clear information audit trail
- supporting diagnosis for bias
- managing the quality of data to maintain availability and consistency.
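As an illustration of the audit-trail and change-visibility points above, the following minimal sketch (in Python; the dataset name, stages, and log path are hypothetical) records a lineage event with a content hash each time the data changes:

```python
import hashlib
import json
from datetime import datetime, timezone

def record_lineage_event(log_path, dataset, stage, action, payload_bytes):
    """Append one lineage record so changes to the data remain traceable.

    The content hash lets reviewers confirm the data was not altered
    between two recorded stages of the AI system.
    """
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataset": dataset,
        "stage": stage,    # e.g. "training", "testing", "deployment"
        "action": action,  # e.g. "transformed", "migrated"
        "sha256": hashlib.sha256(payload_bytes).hexdigest(),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

# Hypothetical usage: log a transformation applied during a system update
record_lineage_event("lineage.log", "claims_2024", "training",
                     "transformed", b"...dataset contents...")
```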
Criterion 47: Implement practices to maintain and reuse data.
This involves determining ongoing mechanisms for ensuring data is protected, accessible, and available for use in line with the original consent parameters.
Any changes in data scope, including expansion in scope and usage patterns, should be monitored and addressed.
Statement 14: Implement data orchestration processes
Agencies must:
Criterion 48: Implement processes to enable data access and retrieval, encompassing the sharing, archiving, and deletion of data.
Considerations include:
- security classifications and permissions of the data
- speed or mode of the data, such as streaming or batch data
- alignment to Guidelines for data transfers | Cyber.gov.au.
Agencies should:
Criterion 49: Establish standard operating procedures for data orchestration.
This includes:
- defining responsibilities between business areas and identifying mutual outcomes to be managed across teams. This is particularly important for business areas that are owners of datasets
- considering inclusion of infrastructure arrangements and use of cloud arrangements for data storage or processing.
Practices to be defined include:
- data governance
- data testing
- security and access controls.
Criterion 50: Configure integration processes to integrate data in increments.
This includes:
- enabling agencies to better manage incident identification and intervention during data integration
- ensuring risks of creating personally identifiable information from data integration are managed appropriately.
Criterion 51: Implement automation processes to orchestrate the reliable flow of data between systems and platforms.
Criterion 52: Perform oversight and regular testing of task dependencies.
This should involve having comprehensive backup plans in place to handle potential outages or incidents.
The following should be considered:
- regular backups of critical data
- failover mechanisms
- detailed recovery procedures to minimise downtime and data loss.
Criterion 53: Establish and maintain data exchange processes.
This includes:
- how often data will need to be accessed by the system
- at what points the frequency, magnitude, or speed of access will change
- how security processes will adapt when data is exposed to new risks across the AI system
- how data will be monitored for changes to accessibility or completeness
- whether the sensitivity of the data will change once processed or analysed
- how to validate data trust and authenticity (a basic check is sketched below).
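A basic authenticity check, sketched below in Python, compares a received file against a digest published by the supplying party (the file path and digest are placeholders):

```python
import hashlib

def verify_file_digest(path, expected_sha256, chunk_size=1 << 20):
    """Return True only if the file matches the digest supplied by the
    data provider - a simple trust and authenticity check before ingestion."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256

# Hypothetical usage: quarantine the batch if verification fails
if not verify_file_digest("incoming/batch_042.csv", "<digest from provider>"):
    raise ValueError("Data failed authenticity check")
```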
Statement 15: Implement data transformation and feature engineering practices
Agencies should:
Criterion 54: Establish data cleaning procedures to manage any data issues.
Data cleaning involves appropriately treating data errors, inconsistencies, or missing values to improve the performance of the AI system. Data cleaning should be documented each time it is conducted, possibly in the metadata, to manage issues such as:
- blanks, nulls, or trailing spaces
- structural errors or unwanted formatting
- missing data
- spelling mistakes
- repetition of words
- irrelevant characters
- content or observations irrelevant to the purpose of the AI system.
For open-source data, or data that has not yet been validated or cannot yet be trusted, consider using a sandbox environment.
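The cleaning treatments listed above can be scripted so each pass is repeatable and easy to document. A minimal pandas sketch (column names are illustrative):

```python
import pandas as pd

def clean_for_ai(df: pd.DataFrame) -> pd.DataFrame:
    """Apply basic cleaning treatments; record each step for the metadata."""
    df = df.copy()
    # Remove leading and trailing spaces in text fields
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()
    # Treat blanks as missing values so they can be handled consistently
    df = df.replace("", pd.NA)
    # Drop exact duplicate observations
    df = df.drop_duplicates()
    # Drop records missing a mandatory field (illustrative column name)
    return df.dropna(subset=["record_id"])
```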
Criterion 55: Define data transformation processes to convert and optimise data for the AI system.
This could leverage existing Extract, Transform and Load (ETL) or Extract, Load and Transform (ELT) processes.
Consider the following data transformation techniques:
- data standardisation – convert data from various sources into a consistent format
- data reorganisation – organise data to make it easier to query and analyse
- data integration – combine data from different sources for a single unified view
- discretisation – convert continuous data into discrete intervals
- missing value imputation – analyse what values need to be imputed and the method
- data conversion – convert data from one form to another, such as a log transformation
- smoothing – to even out fluctuations
- convert unstructured data to structured data
- Optical Character Recognition (OCR) – convert images of text into machine readable format
- object labelling and tracking – in images, audio, and video
- signal processing and transformation
- point in time of data – a snapshot of data at a specific point in time.
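Several of these transformation techniques can be demonstrated in a few lines of pandas (field names, values, and bin edges are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [35000.0, 52000.0, 410000.0, 61000.0],
                   "age": [23, 57, 41, 34]})

# Data standardisation: bring a field onto a consistent, comparable scale
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Log transformation: compress long-tailed values
df["income_log"] = np.log1p(df["income"])

# Discretisation: convert continuous age values into discrete intervals
df["age_band"] = pd.cut(df["age"], bins=[0, 25, 45, 65, 120],
                        labels=["0-25", "26-45", "46-65", "65+"])

# Missing value imputation: fill gaps with the median
# (the imputation method should be analysed per field, as noted above)
df["age"] = df["age"].fillna(df["age"].median())
```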
Criterion 56: Map the points where transformation occurs between datasets and across the AI system.
Consider:
- security checks.
Criterion 57: Identify fit-for-purpose feature engineering techniques.
Feature engineering techniques include:
- feature creation and extraction – deriving features from existing data to help the AI system produce better quality outputs
- feature selection – selecting attributes or fields that provide relevant context to the AI model
- encoding – converting data into a format that can be better used in AI algorithms
- binning – grouping data into categories
- specific conversion – changing data from one format to another for AI compatibility
- scaling – mapping all data to a specific range to help improve AI outputs.
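A short sketch using pandas and scikit-learn (illustrative fields; assumes a recent scikit-learn version) shows encoding, binning, and scaling in practice:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({"channel": ["online", "phone", "online", "counter"],
                   "wait_minutes": [3, 41, 7, 15]})

# Encoding: convert a categorical field into a format AI algorithms can use
encoded = OneHotEncoder(sparse_output=False).fit_transform(df[["channel"]])

# Binning: group waiting times into categories
df["wait_band"] = pd.cut(df["wait_minutes"], bins=[0, 5, 20, 60],
                         labels=["short", "medium", "long"])

# Scaling: map the numeric field to the range [0, 1]
df["wait_scaled"] = MinMaxScaler().fit_transform(df[["wait_minutes"]])
```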
Criterion 58: Apply consistent data transformation and feature engineering methods to support data reuse and extensibility.
Consider:
- metadata and tagging of the data
- data transformation not limited to AI models and processes.
Statement 16: Ensure data quality is acceptable
Agencies must:
Criterion 59: Define quality assessment criteria for the data used in the AI system.
Data quality can be measured across a variety of dimensions, in line with the ABS Data Quality Framework: institutional environment, relevance, timeliness, accuracy, coherence, interpretability, and accessibility.
A report on data quality can include:
- data quality statement (see ABS Data Quality Statement Checklist)
- metrics for measuring data quality, including its correctness and credibility
- frequency of reporting on data quality
- delegating ownership to a business area to be responsible for managing data quality
- monitoring for changes in quality across the supply chain
- intervening and addressing data quality issues as they arise.
Consider:
- any existing data standard frameworks that are used by the agency.
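Quality metrics of this kind can be computed automatically and fed into the report. A minimal pandas sketch (the metric set and thresholds should follow the agency's chosen framework):

```python
import pandas as pd

def quality_metrics(df: pd.DataFrame) -> dict:
    """Compute simple, reportable data quality metrics."""
    return {
        "rows": len(df),
        "completeness_pct": round(100 * (1 - df.isna().mean().mean()), 2),
        "duplicate_pct": round(100 * df.duplicated().mean(), 2),
        "columns_fully_missing": [c for c in df.columns if df[c].isna().all()],
    }
```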
Agencies should:
Criterion 60: Implement data profiling activities and remediate any data quality issues.
This involves analysing the structure, content, and quality of the data to determine its fitness for purpose for an AI system.
Data profiling can investigate the following characteristics:
- frequency
- volume, range, and distribution
- invalid entry identification
- error detection
- duplicates identification
- noise identification
- specific pattern identification.
Methods that can be used to explore and analyse the data include:
- descriptive statistics, such as mean, median, mode, or frequencies
- business rules – apply business knowledge
- clustering or dendrogram – group similar observations together
- visualisation – to get a visual representation of the data from various types of graphs and charts, such as histograms, bar-plot, boxplots, density plots, or heatmaps
- correlation analysis – measure relationships between variables, usually between numerical variables
- scatter plots – visualise relationships between two numerical variables
- cross-tabulations – analyse relationships between multiple categorical variables
- principal component analysis – identify the combinations of variables that explain the most variance
- factor analysis – helps reveal hidden patterns.
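Many of these methods are available directly in pandas. An illustrative profiling pass (the file and column names are placeholders; the histograms require matplotlib):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("dataset.csv")  # placeholder path

# Descriptive statistics: means, quartiles, ranges, and distributions
print(df.describe(include="all"))

# Frequency and duplicates identification
print(df["category"].value_counts())  # placeholder column
print("duplicates:", df.duplicated().sum())

# Correlation analysis between numerical variables
print(df.corr(numeric_only=True))

# Visualisation: histograms to inspect distributions and spot noise
df.hist(figsize=(10, 6))
plt.show()
```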
Criterion 61: Define processes for labelling data and managing the quality of data labels.
Data labelling can be done for managing and storing data, for audit purposes, and for AI model training. Humans with appropriate skills and knowledge can perform the data labelling, or it can be supported by automated labelling tools.
Setting data labelling practices can help optimise performance across the AI system by describing the context, categories, and relationships between data types, creating lineage of data through the AI system via versioning, distinguishing between pre-deployment and live data, and identifying what data will be reused, archived, or destroyed.
These include:
- establishing naming schemes, taxonomy, tagging, and data labelling practices
- considering different techniques such as manual or automated labelling, crowdsourcing, and quality checks
- defining quality control methods to improve consistency of labelling and assist in reducing bias
- considering changes to the raw data and data imputations, and associated impact
- providing data labels for AI training approaches or testing the AI models. Labels can provide the ground truth data for AI models and can influence AI validation. Different types of data labelling include:
- classification
- regression
- visual object labels
- audio labels
- entity tagging.
- applying quality assurance measures to data labels, labelling personnel, and automated data-labelling support tools
- implementing bias mitigation practices in labelling:
- establishing a review process. Diverse people could independently label the same data so correlation could be analysed. Final labels could go through spot check review by subject matter experts
- establishing feedback loops. Labellers should be able to report issues and suggest improvements, and automated systems should be updated to be consistent with corrections made by human labellers
- establishing performance management for staff. Data labellers should undergo periodic training, performance reviews, and random audits for quality control
- implementing metadata labelling techniques that capture the type of data categories within the system and the relationship between these categories. Metadata labels can be prepared for model bias evaluation by annotating metadata with suitable dimensions. Ensure the metadata labelling aligns to the Guide on Metadata Attributes | Office of the National Data Commissioner and Australian Government Recordkeeping Metadata Standard | naa.gov.au.
- assessing and monitoring quality of all automated data labelling support tools. Determine the regularity and criteria for these quality checks and report on findings
- updating and maintaining the labelling tools and processes to adapt to new data types, and labelling requirements
- considering potential harm to data labellers who may need to access sensitive or distressing content. This can occur when training an AI model to prevent responses including violence, hate speech, or sexual abuse.
Statement 17: Validate and select data
Agencies must:
Criterion 62: Perform data validation activities to ensure data meets the requirements for the AI system’s purpose.
This involves including AI-specific validations in schema migrations to ensure data pipelines and feature stores remain functional. Suitable data validation techniques include:
- type validation – ensuring data is in the correct data type
- format validation – ensuring data aligns to a predefined pattern
- range validation – checking whether data falls within a specific range
- outlier detection – checking for data points that significantly deviate from the general data pattern
- completeness – verifying that all required fields are filled
- diversity – ensuring the data represents a variety of data points.
Considerations include:
- a quality framework
- online near real-time and offline batch data validation mechanisms to support the purpose and operations of the AI system.
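The techniques above can be composed into a validation gate that a data batch must pass before use. A hedged sketch (the rules, fields, and thresholds are illustrative):

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of validation failures; an empty list means the batch passed."""
    issues = []
    # Type validation: data is in the correct data type
    if not pd.api.types.is_numeric_dtype(df["amount"]):
        issues.append("amount is not numeric")
    # Format validation: data aligns to a predefined pattern
    bad_ids = ~df["client_id"].astype(str).str.match(r"^[A-Z]{2}\d{6}$")
    if bad_ids.any():
        issues.append(f"{bad_ids.sum()} client_id values fail the format check")
    # Range validation: data falls within a specific range
    if ((df["amount"] < 0) | (df["amount"] > 1_000_000)).any():
        issues.append("amount outside the expected range")
    # Completeness: all required fields are filled
    if df[["client_id", "amount"]].isna().any().any():
        issues.append("required fields contain missing values")
    # Outlier detection: points that deviate from the general pattern
    z = (df["amount"] - df["amount"].mean()) / df["amount"].std()
    if (z.abs() > 4).any():
        issues.append("possible outliers in amount")
    return issues
```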
Criterion 63: Select data for use that is aligned with the purpose of the AI system.
This includes:
- alignment with the agency’s business intent and the goals of the AI system, as well as ensuring data meets the data quality criteria previously established
- maintaining a live test dataset to test the AI system in production, to help monitor and maintain the operational integrity of the AI system.
Statement 18: Enable data fusion, integration and sharing
Agencies should:
Criterion 64: Analyse data fusion and integration requirements.
This includes:
- datasets, their sources and their owners
- purpose of the datasets for the AI system and intended outcomes
- data interdependencies
- risks associated with the datasets and mitigation plans
- data fusion and integration methodology for the AI system
- metrics to assess the quality of the fusion and data integration process and its outputs
- security, storage, and access requirements
- scalability intentions
- documentation and traceability
- regular audits and reviews
- the data sharing principles and risk management framework as per the Data Availability and Transparency Act 2022 (DATA Scheme)
- compliance with the Guidelines on data matching in Australian Government administration
- ethical considerations and guidance on data use as per the Data Ethics Framework | Department of Finance.
Data fusion is a method of integrating or combining data from multiple sources, which can help an AI system create a more comprehensive, reliable, and accurate output. Meaningful data sharing practices across the agency can build interoperability between systems and datasets. Data sharing also promotes reuse, reducing the resources needed for collection and analysis.
Criterion 65: Establish an approach to data fusion and integration.
This approach should involve one or more of the following processes:
- ETL (Extract, Transform and Load) – batch movements of data
- ELT (Extract, Load and Transform) – batch movements of data
- Application programming interface (API) – allowing the movement and syncing of data across multiple applications
- data streaming – moving data in or near real-time from source to target
- data virtualisation – combining streaming data virtually from different sources on demand
- chaining of AI models – linking multiple AI models in a sequence where the output from one model becomes the input for another.
Consider:
- data migration guidelines and any agency data management agreements, if relevant.
Agencies can optimise data fusion and integration processes by automating scheduling and data integration tasks and by deploying intuitive interfaces to diagnose and resolve errors.
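As a concrete example, a small batch ETL step in Python (the source file, fields, and target store are all illustrative) follows the extract, transform, and load pattern listed above:

```python
import sqlite3
import pandas as pd

# Extract: read the batch from the source system
source = pd.read_csv("source_extract.csv")  # placeholder source

# Transform: standardise formats and keep only the fields the AI system needs
source["lodged_date"] = pd.to_datetime(source["lodged_date"])
transformed = source[["case_id", "lodged_date", "outcome"]]

# Load: append the batch to the analytical store
with sqlite3.connect("analytics.db") as conn:
    transformed.to_sql("cases", conn, if_exists="append", index=False)
```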
Criterion 66: Identify data sharing arrangements and processes to maintain consistency.
Data sharing considerations include:
- whether other systems could leverage the data analysed by the AI system
- which areas within the agency would benefit from analysed data being shared with them
- what data containers could improve with access to the system’s data sources
- whether data on how the system was trained could be used to train other systems
- documentation such as a memorandum of understanding, or similar, for data sharing arrangements intra-agency, inter-agency, or with external parties
- addressing risks of creating personally identifiable information
- what can be published for public, government, or internal benefit
- any legislative implications.
Statement 19: Establish the model and context dataset
Agencies must:
Criterion 67: Measure how representative the model dataset is.
Key considerations for measuring and selecting a model dataset include:
- whether it is representative of the true population relevant to the purpose of the AI system – this will improve model generalisation and minimise overfitting
- ensuring the dataset has the required features, volumes, distribution, representation, and demographics, including people with lived experience and intersectional dimensions. For example, someone with cultural or linguistic diversity may also be a person with disability; the dataset must consider how multiple dimensions of a person intersect to create unique experiences or challenges
- for GenAI, assess data quality thresholds and mechanisms in the data setup for modelling to help avoid unwanted bias and hallucinations.
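One way to quantify representativeness is a statistical test comparing the dataset's demographic counts with known population shares. A sketch using SciPy (all figures are made up for illustration):

```python
from scipy.stats import chisquare

# Observed counts per demographic group in the model dataset
observed = [830, 95, 45, 30]

# Expected counts if the dataset mirrored the true population
# (the population shares here are illustrative, not real figures)
population_share = [0.78, 0.11, 0.06, 0.05]
expected = [sum(observed) * p for p in population_share]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.05:
    print("Dataset deviates from the reference population; review sampling")
```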
Criterion 68: Separate the model training dataset from the validation and testing datasets.
Agencies must maintain the separation between these datasets to avoid misleading evaluation of trained models.
Agencies can refresh these datasets to account for timeframes, degradation in AI performance during operation, and compute resource constraints.
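A reproducible three-way split is straightforward with scikit-learn; the sketch below (synthetic data, illustrative proportions) yields 60% training, 20% validation, and 20% testing with no overlap:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data

# Split off the held-out test set first, then carve validation from the rest.
# A fixed random_state keeps the separation reproducible between refreshes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train)
```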
Criterion 69: Manage bias in the data.
Techniques for agencies to manage and mitigate problematic bias in their model dataset include:
- data collection analysis – examining how data was generated and verified, and checking the methodologies used to ensure the data is diverse and represents the real population
- data source analysis – investigating limitations and assumptions around the origin of the data
- data diversity – determining various demographics, sources and types of data, inclusion and exclusion considerations
- statistical testing – determining the likelihood of the population being accurately represented in the data
- class imbalance – analysing data for class imbalance before using it to train classification models, and applying relevant data and algorithm techniques and metrics, such as precision or F1-score, to address this
- outlier detection – identifying outliers or unusual data points in the data and ensuring they are handled appropriately
- exploratory data analysis – using descriptive statistics and data visualisation tools to identify patterns and discrepancies
- removing any irrelevant data from the training data that does not improve the performance of the model
- ensuring that any sensitive and protected data are retained in the test datasets for the purpose of evaluating for bias
- data augmentation – deploying measures to address the completeness of the model dataset, through supplementary data collection or synthetic data generation
- transparency – identifying bias and where it originated from through transparency on data sourcing and processing
- domain knowledge – ensuring practitioners have relevant domain knowledge on the datasets the AI system uses to serve the scope of the AI, including an understanding of the data characteristics and what it represents for the organisation
- documentation of data use – documenting the use of data by the AI system and any potential change of use, providing an audit trail of any incidence and causation of bias.
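For the class imbalance point above, metrics such as precision and F1-score expose problems that overall accuracy hides. A small illustration:

```python
from sklearn.metrics import classification_report, f1_score, precision_score

# With a 5% minority class, a model that always predicts the majority class
# reaches 95% accuracy while being useless for the minority class.
y_true = [0] * 95 + [1] * 5  # illustrative labels
y_pred = [0] * 100           # model ignores the minority class

print("precision:", precision_score(y_true, y_pred, zero_division=0))
print("F1-score:", f1_score(y_true, y_pred, zero_division=0))
print(classification_report(y_true, y_pred, zero_division=0))
```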
Agencies should:
Criterion 70: For Generative AI, build reference or contextual datasets to improve the quality of AI outputs.
A reference or contextual dataset for GenAI can take the form of (but is not limited to) a retrieval-augmented generation (RAG) dataset or a prompt dataset.
Key considerations include:
- building high-quality reference or contextual datasets to support more accurate and context-aware AI outputs, and reduce hallucinations
- implementing pre-defined prompts tailored to ensure consistent and reliable responses from GenAI models
- establishing workflows for prompt engineering and data preparation to streamline development and deployment of GenAI systems (a minimal retrieval sketch follows).
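A minimal retrieval step for a RAG dataset might look like the following sketch, using the sentence-transformers library (the embedding model name and reference passages are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

reference_dataset = [  # illustrative, agency-approved content
    "Payments are processed within 14 days of a complete claim.",
    "Appeals must be lodged within 28 days of the decision notice.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
doc_embeddings = model.encode(reference_dataset, convert_to_tensor=True)

# Retrieve the closest approved passage and supply it as context
query = "How long do payments take?"
scores = util.cos_sim(model.encode(query, convert_to_tensor=True), doc_embeddings)
context = reference_dataset[int(scores.argmax())]

prompt = f"Answer using only this approved context:\n{context}\n\nQuestion: {query}"
```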
Statement 20: Plan the model architecture
Agencies must:
Criterion 71: Establish success criteria that cover any AI training and operational limitations for infrastructure and costs.
Ensure alignment with AI system metrics selected at the design stage.
Consider:
- AI system purpose and requirements including explainability
- pre-defined AI system metrics including AI performance metrics
- impact and treatment for false positives and false negatives
- AI operational environment including scalability intentions
- frequency of change in context
- limitations on compute infrastructure
- cost constraints
- operational models such as ModelOps, MLOps, LLMOps, DataOps, and DevOps (see Statement 1).
AI training can occur in offline mode, or in online or real-time mode. This is dependent on the business case and the maturity of the data and infrastructure architecture. The risk of the model becoming stale is higher in offline mode, while the risk of the model exhibiting unverified behaviour is higher in online mode.
The training process is interdependent with the infrastructure in the training environment. Complex model architectures with highly specialised learning strategies and large model datasets generally require tailored infrastructure to manage costs.
Criterion 72: Define a model architecture for the use case suitable to the data and AI system operation.
The following will influence the choice of the model architecture and algorithms:
- business requirements – risk thresholds or performance criteria
- purpose of the system – identified stakeholders and the intended outcomes, safety, reproducibility level of AI model outputs, or explainability level for AI outputs
- data – bias, quality, and managing the supply of data to the system
- supporting infrastructure – computational demands, costs, and speed with respect to business needs
- resourcing – the capabilities involved with documentation, oversight, and intervention in training the AI model or reusable assets
- design – the training process will include necessary human oversight and intervention, to ensure responsible AI practices are in place. Consider embedding flexible architecture practices to avoid vendor lock-in.
The model architecture will highlight the variables that will impact the intended outcomes for the system. These variables will include the model dataset, use case application and scalability intentions. These variables will influence which algorithms and learning strategies are chosen to train the AI model.
An AI scientist can test and analyse the model architecture and dataset to identify what is needed to effectively train the system. Additionally, they can outline requirements for the model architecture to comply with data, privacy, and ethical expectations.
Consider starting with simple, small architectures and adding complexity progressively, depending on the purpose of the system, to simplify debugging and reduce errors. Note that an AI system can contain a combination of multiple models, which can add to the complexity.
Generally, a single type of algorithm and training process may not be sufficient to determine optimal models for the AI system. It is usually good practice to train multiple models with various algorithms and training methodologies.
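For instance, candidate models built with different algorithms can be compared under the same cross-validation protocol before one is selected (a scikit-learn sketch with stand-in data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "support_vector_machine": SVC(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```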
There are options to develop a chain of AI models, or add more complexity, if that better meets the intent of the AI system. Each model could use a different type of algorithm and training process.
Analysis of support and maintenance of the AI system in operation can influence the model architecture. For some use cases, a complete model refresh may be required, noting cost considerations. Alternatives such as updates to pre- or post-processing could be considered, including updates to the configuration or knowledge repository for RAG for GenAI.
It may not be necessary to retrain models every time new information becomes available, and this should be considered when defining the model architecture. For example, for GenAI, adding new information in RAG can help the AI system remain up to date without the need to retrain the AI model, saving on costs without impacting AI accuracy.
Criterion 73: Select algorithms aligned with the purpose of the AI system and the available data.
There are various forms of algorithms to train an AI model, and it is important to select them based on the AI system requirements, model success criteria, and the available model dataset. A learning strategy is a method to train an AI model and dictates the mathematical computations that will be required during the training process.
Depending on the use case, examples of training processes include:
- supervised learning – training an AI model with a dataset, made up of observations, that has desired outputs or labels, such as support vector machines or tree-based models
- unsupervised learning – training a model to learn patterns in the dataset itself, where the training dataset does not have desired outputs or labels, such as anomaly detection or transformer LLMs
- reinforcement learning – training a model to maximise pre-defined goals, such as Monte Carlo tree search or fine-tuning models
- transfer learning – a model trained on one task, such as a pre-trained model, is reused as a starting point to enhance model performance on a related, yet different, task
- parameter tuning – optimising a model’s performance by adjusting parameters or hyperparameters of a model, usually adjusted automatically
- model retraining – updating a model with new data
- online or real-time mode – continuously training the model using live data (note that this can significantly increase the vulnerability of the AI system to attacks such as data poisoning).
Like traditional software, there are options to reuse, reconfigure, buy, or build models. An agency could reuse off-the-shelf models as-is, fine-tune pre-trained models, use pre-built algorithms, or create new models. The approach taken to training will vary across model types.
Criterion 74: Set training boundaries in relation to any infrastructure, performance, and cost limitations.
Agencies should:
Criterion 75: Start small, scale gradually.
Consider:
- starting with simple and small architectures and adding complexity progressively, depending on the purpose of the system, to simplify debugging and reduce errors
- that an AI system can contain a combination of multiple models which can add to the complexity.
Statement 21: Establish the training environment