Data: statements 13 - 19
The data stage involves establishing the processes and responsibilities for managing data across the AI lifecycle. This stage covers data used to experiment with, train, test, and operate AI systems.
Data used by an AI system can be classified into development and deployment data.
Development data includes all inputs and outputs (and reference data for GenAI) used to develop the AI system. The development dataset is made up of smaller subsets – the train dataset, validation dataset, and test dataset.
- Train dataset – this dataset is used to train the AI system. The AI system learns patterns in the train dataset. The train dataset is the largest subset of the development dataset. For GenAI, the train dataset may also include reference or contextual datasets such as retrieval-augmented generation (RAG) datasets and prompt datasets.
- Validation dataset – this dataset is used to evaluate the model's performance during model training. It is used to fine-tune and select the best-performing model, such as through cross-validation.
- Test dataset – this dataset is used to evaluate the final model's performance on previously unseen data. This dataset helps provide unbiased evaluation of model performance.
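As an illustration only, the partitioning described above can be sketched in Python. The function name, split fractions, and seed below are assumptions for the example, not part of the framework:

```python
import random

def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle records and split them into train, validation, and test subsets.

    The train subset is the largest; whatever remains after the train and
    validation fractions becomes the test subset of previously unseen data.
    """
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test

records = list(range(100))
train, validation, test = split_dataset(records)
print(len(train), len(validation), len(test))  # 70 15 15
```

Holding the test subset aside until final evaluation is what allows an unbiased measure of model performance on unseen data.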
Deployment data includes AI system inputs such as live production data, user input data, configuration data, and AI system outputs such as predictions, recommendations, classifications, logs, and system health data. Deployment stage inputs are new and previously unseen by the AI system.
The performance of an AI system depends on robust management of data quality and data availability.
Key workstreams within this stage include:
- data orchestration – establishing central oversight of, and planning, the flow of data from multiple datasets to an AI system
- data transformation – converting and optimising data for use by the AI system
- feature engineering – transforming raw data into features that help the AI model identify and learn patterns in the data
- data quality – measuring dimensions of a dataset associated with greater performance and reliability
- data validation – testing the consistency, accuracy, and reliability of the data to ensure it meets the requirements of the AI system
- data integration and fusion – combining data from multiple sources to synchronise the flow of data to the AI system
- data sharing – promoting reuse, reducing resources required for collection and analysis, and helping to build interoperability between systems and datasets
- model dataset establishment – using real-world production data to build, refine, and contextualise a high-quality AI model.
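As a minimal sketch of the data quality and data validation workstreams, the Python below checks records for completeness and value ranges before they reach an AI system. The function name, field names, and rules are illustrative assumptions:

```python
def validate_records(records, required_fields, numeric_ranges):
    """Check each record for completeness and plausible value ranges.

    Returns the records that pass all checks and a list of
    (index, reason) tuples describing the failures.
    """
    valid, failures = [], []
    for i, rec in enumerate(records):
        # Completeness check: every required field must be present and non-null.
        missing = [f for f in required_fields if rec.get(f) is None]
        if missing:
            failures.append((i, f"missing fields: {missing}"))
            continue
        # Range check: numeric values must fall within agreed bounds.
        out_of_range = [
            f for f, (lo, hi) in numeric_ranges.items()
            if f in rec and not (lo <= rec[f] <= hi)
        ]
        if out_of_range:
            failures.append((i, f"out of range: {out_of_range}"))
            continue
        valid.append(rec)
    return valid, failures

records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},   # fails the completeness check
    {"age": 210, "income": 61000},    # fails the range check
]
valid, failures = validate_records(
    records,
    required_fields=["age", "income"],
    numeric_ranges={"age": (0, 120)},
)
print(len(valid), len(failures))  # 1 2
```

In practice such rules would be agreed as part of the data quality workstream and applied routinely as data flows into the AI system.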
Notes:
Requirements for handling personal and sensitive data within AI systems are included in the Privacy Act, the Australian Privacy Principles, the Privacy and Other Legislation Amendment Act 2024, and the Handling personal information guidance.
Data archival and destruction must comply with the Information management legislation.
The Framework for the Governance of Indigenous Data provides guidelines on Indigenous data sovereignty.
The Office of the Australian Information Commissioner (OAIC) provides Guidelines on data matching in Australian Government administration, which agencies must consider prior to data integration and fusion activities.
The Information management for records created using Artificial Intelligence (AI) technologies | naa.gov.au provides guidelines to manage data for AI.
The Data Availability and Transparency Act 2022 (DATA Scheme) requires agencies to identify data as open, shared, or closed.
The Guidelines for data transfers | Cyber.gov.au provide guidance on the processes and procedures for data transfers and transmissions.
The APS Data Ethics Use Cases provide guidance for agencies to manage and mitigate data bias.
The report on Responding to societal challenges with data | OECD provides guidance on data access, sharing, and reuse.