Statement 19: Establish the model and context dataset

Agencies must:

  • Criterion 67: Measure how representative the model dataset is.

    Key considerations for measuring and selecting a model dataset include:

    • whether it is representative of the true population relevant to the purpose of the AI system – this will improve model generalisation and minimise overfitting (one way to quantify this is sketched after this list)
    • ensuring the dataset has the required features, volumes, distribution, representation and demographics, including people with lived experience and intersectional dimensions. For example, a person with cultural or linguistic diversity may also be a person with disability; the dataset must consider how multiple dimensions of a person intersect to create unique experiences or challenges
    • for GenAI, assessing data quality thresholds and mechanisms in the data setup for modelling to help avoid unwanted bias and hallucinations.
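
    One simple way to measure representativeness is a statistical comparison against external population benchmarks. The sketch below is illustrative only: it assumes a categorical demographic column and benchmark proportions (for example, census figures) – the file, column and benchmark values are hypothetical, not prescribed by this standard – and uses a chi-square goodness-of-fit test to flag a significant mismatch.

    ```python
    # Illustrative sketch: compare a demographic distribution in the model
    # dataset against known population benchmarks. All file, column and
    # benchmark values are assumptions for the example.
    import pandas as pd
    from scipy.stats import chisquare

    df = pd.read_csv("model_dataset.csv")  # hypothetical model dataset

    # Illustrative population proportions per age band (e.g. from census data).
    population_props = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

    counts = df["age_band"].value_counts()
    observed = [counts.get(band, 0) for band in population_props]
    expected = [p * sum(observed) for p in population_props.values()]

    stat, p_value = chisquare(f_obs=observed, f_exp=expected)
    print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
    if p_value < 0.05:
        print("Dataset distribution differs significantly from the population benchmark.")
    ```
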
  • Criterion 68: Separate the model training dataset from the validation and testing datasets. 

    Agencies must maintain the separation between these datasets to avoid misleading evaluations of trained models.

    Agencies can refresh these datasets to account for timeframes, degradation in AI performance during operation, and compute resource constraints.
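
    One way to keep the separation stable across such refreshes is to derive each record's split from a hash of a stable identifier rather than a random seed, so a record never migrates between splits. A minimal sketch, in which the file name and `record_id` column are illustrative assumptions:

    ```python
    # Minimal sketch of a deterministic, hash-based split. Because the split
    # is derived from a stable record identifier, a record stays in the same
    # split when the dataset is refreshed, preserving separation between
    # training, validation and test data. Names are illustrative assumptions.
    import hashlib
    import pandas as pd

    def assign_split(record_id: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
        """Deterministically map a stable identifier to train/validation/test."""
        digest = hashlib.sha256(record_id.encode("utf-8")).hexdigest()
        bucket = (int(digest, 16) % 10_000) / 10_000  # uniform value in [0, 1)
        if bucket < test_frac:
            return "test"
        if bucket < test_frac + val_frac:
            return "validation"
        return "train"

    df = pd.read_csv("model_dataset.csv")
    df["split"] = df["record_id"].astype(str).map(assign_split)
    train, validation, test = (df[df["split"] == s] for s in ("train", "validation", "test"))
    ```
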

  • Criterion 69: Manage bias in the data.

    Techniques for agencies to manage and mitigate problematic bias in their model dataset include:

    • data collection analysis – examining how data was generated and verified, and checking the methodologies used to ensure the data is diverse and represents the real population
    • data source analysis – investigating limitations and assumptions around the origin of the data
    • data diversity – determining various demographics, sources and types of data, inclusion and exclusion considerations
    • statistical testing – determining the likelihood of the population being accurately represented in the data
    • class imbalance – analysing data for class imbalance before using it to train classification models, and applying relevant data and algorithmic techniques, with metrics such as precision or F1-score, to address it (see the sketch after this list)
    • outlier detection – identifying outliers or unusual data points in the data and ensuring they are handled appropriately
    • exploratory data analysis – using descriptive statistics and data visualisation tools to identify patterns and discrepancies
    • removing any irrelevant data from the training data that does not improve the performance of the model
    • ensuring that any sensitive and protected data are retained in the test datasets for the purpose of evaluating for bias
    • data augmentation – deploying measures to address the completeness of the model dataset, through supplementary data collection or synthetic data generation
    • transparency – identifying bias and where it originated from through transparency on data sourcing and processing
    • domain knowledge – ensuring practitioners have relevant domain knowledge on the datasets the AI system uses to serve the scope of the AI, including an understanding of the data characteristics and what it represents for the organisation
    • documentation of data use – documenting the use of data by the AI system and any potential change of use, providing an audit trail of any incidence and causation of bias.
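
    Two of these techniques lend themselves to a short illustration: checking the label distribution for class imbalance before training, and comparing precision and F1-score across a sensitive attribute retained in the test set. The sketch below assumes a binary classification task with predictions already available; all file and column names are illustrative assumptions.

    ```python
    # Minimal sketch, assuming a binary classification task. It illustrates
    # two techniques from the list above: inspecting the label distribution
    # for class imbalance, and comparing precision/F1-score across a
    # sensitive attribute to surface performance gaps that may indicate bias.
    import pandas as pd
    from sklearn.metrics import precision_score, f1_score

    train = pd.read_csv("train.csv")  # hypothetical training split
    test = pd.read_csv("test.csv")    # hypothetical test split with model predictions

    # 1. Class imbalance: inspect the label distribution before training.
    print(train["label"].value_counts(normalize=True))

    # 2. Per-group evaluation on a sensitive attribute retained in the test set.
    for group, subset in test.groupby("gender"):
        p = precision_score(subset["label"], subset["prediction"])
        f1 = f1_score(subset["label"], subset["prediction"])
        print(f"{group}: precision = {p:.3f}, F1 = {f1:.3f}")
    ```
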

Agencies should:

  • Criterion 70: For Generative AI, build reference or contextual datasets to improve the quality of AI outputs.

    A reference or contextual dataset for GenAI can take the form of (but is not limited to) a retrieval-augmented generation (RAG) dataset or a prompt dataset.

    Key considerations include:

    • building high-quality reference or contextual datasets to support more accurate and context-aware AI outputs, and reduce hallucinations
    • implementing pre-defined prompts tailored to ensure consistent and reliable responses from GenAI models
    • establishing workflows for prompt engineering and data preparation to streamline development and deployment of GenAI systems (a minimal retrieval-and-prompt sketch follows this list).
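
    The sketch below illustrates the retrieval step of a RAG-style setup: a curated reference dataset is indexed, the passages most relevant to a question are retrieved, and a pre-defined prompt template grounds the model's answer in them. TF-IDF retrieval is used for brevity where production systems commonly use dense embeddings; the file, column and template names are illustrative assumptions.

    ```python
    # Minimal sketch of RAG-style retrieval plus a pre-defined prompt
    # template. TF-IDF keeps the example self-contained; names are
    # illustrative assumptions, not prescribed by this standard.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    reference = pd.read_csv("reference_dataset.csv")  # hypothetical column: 'passage'

    vectorizer = TfidfVectorizer()
    index = vectorizer.fit_transform(reference["passage"])

    def retrieve(query: str, k: int = 3) -> list[str]:
        """Return the k reference passages most similar to the query."""
        scores = cosine_similarity(vectorizer.transform([query]), index)[0]
        top = scores.argsort()[::-1][:k]
        return reference["passage"].iloc[top].tolist()

    PROMPT_TEMPLATE = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    )

    question = "What services does the agency provide?"
    prompt = PROMPT_TEMPLATE.format(context="\n".join(retrieve(question)), question=question)
    # 'prompt' would then be sent to the GenAI model for a grounded response.
    ```
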


Statement 20: Plan the model architecture
