Criterion 48: Implement processes to enable data access and retrieval, encompassing the sharing, archiving, and deletion of data.
Considerations include:
Criterion 49: Establish standard operating procedures for data orchestration.
This includes:
Practices to be defined include:
Criterion 50: Configure integration processes to integrate data in increments.
This includes:
Criterion 51: Implement automation processes to orchestrate the reliable flow of data between systems and platforms.
Criterion 52: Perform oversight and regular testing of task dependencies.
This should include comprehensive backup plans to handle potential outages or incidents.
The following should be considered:
Criterion 53: Establish and maintain data exchange processes.
This includes:
Criterion 54: Establish data cleaning procedures to manage any data issues.
Data cleaning involves appropriately treating errors, inconsistencies, or missing values in the data to improve the performance of the AI system. Each time it is conducted, data cleaning should be documented, and possibly recorded in the metadata, to manage issues such as:
For open-source data, or data that has not yet been validated or cannot yet be trusted, consider using a sandbox environment.
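As an illustration only, the data cleaning practice described above can be sketched in a few lines of pandas; the column names, median imputation rule, and log fields below are assumptions for the example, not requirements of this criterion.

```python
import pandas as pd

# Illustrative sketch only: column names and treatment rules are assumptions,
# not prescribed by this criterion.
def clean_and_document(df: pd.DataFrame) -> tuple[pd.DataFrame, dict]:
    """Treat duplicates and missing values, returning the cleaned data
    plus a record of what was done (for inclusion in the metadata)."""
    log = {"rows_in": len(df)}
    df = df.drop_duplicates()                      # remove exact duplicate rows
    log["duplicates_removed"] = log["rows_in"] - len(df)
    missing = df["age"].isna().sum()               # count missing values
    df = df.assign(age=df["age"].fillna(df["age"].median()))  # impute with median
    log.update(missing_age_imputed=int(missing), rows_out=len(df))
    return df, log

raw = pd.DataFrame({"age": [34, None, 34, 51], "id": [1, 2, 1, 3]})
cleaned, log = clean_and_document(raw)
```

Returning the log alongside the cleaned data keeps the documentation obligation inseparable from the cleaning itself.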
Criterion 55: Define data transformation processes to convert and optimise data for the AI system.
This could leverage existing Extract, Transform and Load (ETL) or Extract, Load and Transform (ELT) processes.
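A minimal sketch of one ETL step, assuming a pandas source and a SQLite target; the table and column names are invented for illustration and are not part of the framework.

```python
import sqlite3
import pandas as pd

# Hedged sketch of a minimal ETL step; the table and column names are
# invented for illustration, not part of the framework.
def etl(source: pd.DataFrame, conn: sqlite3.Connection) -> int:
    # Extract: take only the columns the AI system needs
    extracted = source[["id", "amount"]]
    # Transform: scale the amount column to a 0-1 range
    lo, hi = extracted["amount"].min(), extracted["amount"].max()
    transformed = extracted.assign(amount_scaled=(extracted["amount"] - lo) / (hi - lo))
    # Load: write the optimised table into the target store
    transformed.to_sql("features", conn, index=False, if_exists="replace")
    return len(transformed)

conn = sqlite3.connect(":memory:")
source = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0], "note": ["a", "b", "c"]})
n = etl(source, conn)
scaled = [row[0] for row in conn.execute("SELECT amount_scaled FROM features ORDER BY id")]
```

An ELT variant would load the raw extract first and run the same transformation inside the target store.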
Consider the following data transformation techniques:
Criterion 56: Map the points where transformation occurs between datasets and across the AI system.
Consider:
Criterion 57: Identify fit-for-purpose feature engineering techniques.
Feature engineering techniques include:
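As a brief illustration of two common techniques (a derived ratio feature and one-hot encoding of a categorical column), with column names invented for the example:

```python
import pandas as pd

# Sketch of two common feature engineering techniques; the column names
# are invented for the example, not drawn from the framework.
df = pd.DataFrame({"income": [50000, 80000], "debt": [10000, 40000],
                   "state": ["NSW", "VIC"]})
df["debt_to_income"] = df["debt"] / df["income"]   # derived ratio feature
df = pd.get_dummies(df, columns=["state"])         # one-hot encode the category
```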
Criterion 58: Apply consistent data transformation and feature engineering methods to support data reuse and extensibility.
Consider:
Criterion 59: Define quality assessment criteria for the data used in the AI system.
Data quality can be measured across the seven dimensions of the ABS Data Quality Framework: institutional environment, relevance, timeliness, accuracy, coherence, interpretability, and accessibility.
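As a hedged sketch, some of these dimensions can be reduced to measurable proxies (completeness as a proxy for accuracy, record age for timeliness); the column names and the 30-day threshold below are assumptions for the example.

```python
import pandas as pd

# Illustrative proxies for two quality dimensions; thresholds and column
# names are assumptions, not framework requirements.
def quality_report(df: pd.DataFrame, as_of: pd.Timestamp) -> dict:
    completeness = 1 - df.isna().mean().mean()          # share of non-missing cells
    age_days = (as_of - df["updated"]).dt.days.max()    # staleness of oldest record
    return {
        "completeness": float(completeness),
        "max_age_days": int(age_days),
        "timely": bool(age_days <= 30),
    }

df = pd.DataFrame({
    "value": [1.0, None, 3.0, 4.0],
    "updated": pd.to_datetime(["2024-06-01", "2024-06-10", "2024-06-20", "2024-06-28"]),
})
report = quality_report(df, pd.Timestamp("2024-06-30"))
```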
A report on data quality can include:
Consider:
Criterion 60: Implement data profiling activities and remediate any data quality issues.
This involves analysing the structure, content, and quality of the data to determine its fitness for purpose for an AI system.
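A minimal profiling sketch along these lines; the characteristics profiled below (types, nulls, cardinality, ranges) are common examples rather than an exhaustive list.

```python
import pandas as pd

# Minimal profiling sketch; the characteristics chosen are common examples,
# not an exhaustive framework list.
def profile(df: pd.DataFrame) -> dict:
    return {
        col: {
            "dtype": str(df[col].dtype),
            "nulls": int(df[col].isna().sum()),
            "distinct": int(df[col].nunique()),
            "min": df[col].min() if pd.api.types.is_numeric_dtype(df[col]) else None,
            "max": df[col].max() if pd.api.types.is_numeric_dtype(df[col]) else None,
        }
        for col in df.columns
    }

p = profile(pd.DataFrame({"score": [0.2, 0.9, None], "label": ["a", "b", "b"]}))
```

The resulting profile can feed directly into the quality assessment and remediation steps above.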
Data profiling can investigate the following characteristics:
Criterion 61: Define processes for labelling data and managing the quality of data labels.
Data labelling can be performed for data management and storage, auditing, and AI model training purposes. Humans with appropriate skills and knowledge can perform the labelling, or it can be supported by automated labelling tools.
Setting data labelling practices can help optimise performance across the AI system by: describing the context, categories, and relationships between data types; creating lineage of data through the AI system via versioning; distinguishing between pre-deployment and live data; and identifying what data will be reused, archived, or destroyed.
These include:
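One way to make such labelling practices concrete is a structured label record; the field names below are hypothetical illustrations, not a mandated schema.

```python
from dataclasses import dataclass, asdict

# Hypothetical label record illustrating the practices above; the field
# names are assumptions, not a mandated schema.
@dataclass(frozen=True)
class DataLabel:
    category: str       # describes the data's context and category
    version: int        # supports lineage via versioning
    environment: str    # "pre-deployment" or "live"
    disposition: str    # "reuse", "archive", or "destroy"

label = DataLabel(category="customer-correspondence", version=2,
                  environment="pre-deployment", disposition="archive")
record = asdict(label)  # serialisable form for storage alongside the data
```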
Criterion 62: Perform data validation activities to ensure data meets the requirements for the AI system’s purpose.
This involves including AI-specific validations in schema migrations to ensure data pipelines and feature stores remain functional. Suitable data validation techniques include:
Considerations include:
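As an illustrative sketch of one AI-specific validation, the check below verifies incoming records against the schema a feature store expects; the field names are assumptions for the example.

```python
# Hedged sketch of one AI-specific validation: checking that incoming
# records still match the schema a feature store expects.
# The field names below are assumptions, not a prescribed schema.
EXPECTED_SCHEMA = {"customer_id": int, "balance": float, "segment": str}

def validate(record: dict, schema: dict = EXPECTED_SCHEMA) -> list[str]:
    """Return a list of schema violations (an empty list means valid)."""
    errors = [f"missing field: {k}" for k in schema if k not in record]
    errors += [f"wrong type for {k}: expected {t.__name__}"
               for k, t in schema.items()
               if k in record and not isinstance(record[k], t)]
    return errors

ok = validate({"customer_id": 7, "balance": 120.5, "segment": "retail"})
bad = validate({"customer_id": "7", "balance": 120.5})
```

Running such checks inside schema migrations helps keep data pipelines and feature stores functional as the data evolves.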
Criterion 63: Select data for use that is aligned with the purpose of the AI system.
This includes:
Criterion 64: Analyse data fusion and integration requirements.
This includes:
Data fusion integrates or combines data from multiple sources, helping an AI system create more comprehensive, reliable, and accurate outputs. Meaningful data sharing practices across the agency can build interoperability between systems and datasets. Data sharing also promotes reuse, reducing the resources needed for collection and analysis.
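A minimal pandas sketch of record-level fusion on a shared key, followed by aggregation into a more comprehensive view; the datasets and key are invented for illustration.

```python
import pandas as pd

# Sketch of record-level data fusion; the dataset names and the shared
# key ("site") are invented for illustration.
sensors = pd.DataFrame({"site": ["A", "B", "C"], "temp": [21.0, 19.5, 23.1]})
registry = pd.DataFrame({"site": ["A", "B", "C"], "region": ["N", "S", "N"]})

# Fuse the two sources on the shared key, then aggregate by region
fused = sensors.merge(registry, on="site", how="inner")
by_region = fused.groupby("region")["temp"].mean()
```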
Criterion 65: Establish an approach to data fusion and integration.
This approach should involve one or more of the following processes:
Consider:
Agencies can optimise data fusion and integration processes by automating scheduling and data integration tasks and by deploying intuitive interfaces to diagnose and resolve errors.
Criterion 66: Identify data sharing arrangements and processes to maintain consistency.
Data sharing considerations include:
Criterion 67: Measure how representative the model dataset is.
Key considerations for measuring and selecting a model dataset include:
Criterion 68: Separate the model training dataset from the validation and testing datasets.
Agencies must maintain the separation between these datasets to avoid misleading evaluation of trained models.
Agencies can refresh these datasets to account for timeframes, degradation in AI performance during operation, and compute resource constraints.
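A minimal sketch of producing disjoint training, validation, and test datasets; the 70/15/15 ratio and fixed seed below are assumptions for the example, not framework requirements.

```python
import random

# Sketch of a disjoint train/validation/test split; the 70/15/15 ratio
# and fixed seed are assumptions for the example.
def split(records: list, seed: int = 0) -> tuple[list, list, list]:
    shuffled = records[:]                  # never mutate the source dataset
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for auditability
    n = len(shuffled)
    n_train, n_val = n * 70 // 100, n * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split(list(range(100)))
```

Using a recorded seed makes the split reproducible, which supports auditing and dataset refreshes.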
Criterion 69: Manage bias in the data.
Techniques for agencies to manage and mitigate problematic bias in their model dataset include:
Criterion 70: For Generative AI, build reference or contextual datasets to improve the quality of AI outputs.
A reference or contextual dataset for GenAI can take the form of (but is not limited to) a retrieval-augmented generation (RAG) dataset or a prompt dataset.
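As a deliberately simplistic sketch of a RAG-style reference dataset, the example below scores documents by word overlap and grounds a prompt in the best match; the documents are invented, and a production retriever would use embeddings rather than word overlap.

```python
# Toy sketch of a RAG-style reference dataset; the documents and the
# word-overlap scoring are simplistic stand-ins for a real retriever.
reference = [
    "Applications close on 30 June each year.",
    "Fees are waived for concession card holders.",
    "Offices are open weekdays from 9am to 5pm.",
]

def retrieve(query: str, docs: list[str]) -> str:
    # Score each document by word overlap with the query; return best match
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

context = retrieve("when do applications close", reference)
prompt = f"Answer using only this context: {context}\nQuestion: When do applications close?"
```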
Key considerations include:
Criterion 71: Establish success criteria that cover any AI training and operational limitations for infrastructure and costs.
Ensure alignment with AI system metrics selected at the design stage.
Consider:
AI training can occur in offline mode, or in online or real-time mode. This is dependent on the business case and the maturity of the data and infrastructure architecture. The risk of the model becoming stale is higher in offline mode, while the risk of the model exhibiting unverified behaviour is higher in online mode.
The training process is interdependent with the infrastructure in the training environment. Complex model architectures with highly specialised learning strategies and large model datasets generally require tailored infrastructure to manage costs.
Criterion 72: Define a model architecture for the use case suitable to the data and AI system operation.
The following will influence the choice of the model architecture and algorithms:
The model architecture will highlight the variables that impact the intended outcomes for the system, including the model dataset, the use case application, and scalability intentions. These variables will influence which algorithms and learning strategies are chosen to train the AI model.
An AI scientist can test and analyse the model architecture and dataset to identify what is needed to effectively train the system. Additionally, they can outline requirements for the model architecture to comply with data, privacy, and ethical expectations.
Consider starting with small, simple architectures and progressively adding complexity as the purpose of the system requires; this simplifies debugging and reduces errors. Note that an AI system can contain a combination of multiple models, which can add to the complexity.
Generally, a single type of algorithm and training process may not be sufficient to determine optimal models for the AI system. It is usually good practice to train multiple models with various algorithms and training methodologies.
There are options to develop a chain of AI models, or add more complexity, if that better meets the intent of the AI system. Each model could use a different type of algorithm and training process.
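The practice of training multiple models with various algorithms can be sketched with scikit-learn; the candidate algorithms and synthetic dataset below are illustrative only, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Illustrative comparison of candidate algorithms on a synthetic dataset;
# the candidates and dataset are assumptions, not recommendations.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)  # candidate with highest mean CV accuracy
```

Cross-validated comparison like this makes the choice between candidate models evidence-based rather than assumed.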
Analysis of the support and maintenance of the AI system in operation can influence the model architecture. For some use cases, a complete model refresh may be required, noting cost considerations. Alternatives such as updates to pre- or post-processing could be considered, including updates to the configuration or knowledge repository used for RAG in GenAI.
It may not be necessary to retrain models every time new information becomes available, and this should be considered when defining the model architecture. For example, for GenAI, adding new information in RAG can help the AI system remain up to date without the need to retrain the AI model, saving on costs without impacting AI accuracy.
Criterion 73: Select algorithms aligned with the purpose of the AI system and the available data.
There are various forms of algorithms to train an AI model, and it is important to select them based on the AI system requirements, model success criteria, and the available model dataset. A learning strategy is a method to train an AI model and dictates the mathematical computations that will be required during the training process.
Depending on use case, some examples of the types of training processes may include:
Like traditional software, there are options to reuse, reconfigure, buy, or build models. An agency could reuse off-the-shelf models as-is, fine-tune pre-trained models, use pre-built algorithms, or create new models. The approach taken to training will vary across model types.
Criterion 74: Set training boundaries in relation to any infrastructure, performance, and cost limitations.
Criterion 75: Start small, scale gradually.
Consider:
Criterion 76: Establish compute resources and infrastructure for the training environment.
This allows for infrastructure and computational constraints to be considered in relation to business needs and supports configuration of learning strategies best optimised for the infrastructure environment.
Criterion 77: Secure the infrastructure.
Implement the required security and access controls, appropriate to the security classification of the data, for infrastructure used for training, validating, and testing the AI model. For details, see the Information Security Manual (ISM), the Essential Eight Maturity Model, the Protective Security Policy Framework and Strategies to Mitigate Cyber Security Incidents.
Criterion 78: Reuse approved AI modelling frameworks, libraries, and tools.