Data: statements 13 - 19

The data stage involves establishing the processes and responsibilities for managing data across the AI lifecycle. This stage includes data used in experimenting, training, testing, and operating AI systems.

Data used by an AI system can be classified into development and deployment data.

Development data includes all inputs and outputs (and reference data for GenAI) used to develop the AI system. The development dataset is made up of smaller datasets: the train dataset, the validation dataset, and the test dataset.

  • Train dataset – this dataset is used to train the AI system. The AI system learns patterns in the train dataset. The train dataset is the largest subset of the modelling dataset. For GenAI, the train dataset may also include reference or contextual datasets such as retrieval-augmented generation (RAG) datasets and prompt datasets
  • Validation dataset – this dataset is used to evaluate the model's performance during training. It is used to fine-tune and select the best-performing model, such as through cross-validation
  • Test dataset – this dataset is used to evaluate the final model's performance on previously unseen data. This dataset helps provide unbiased evaluation of model performance.
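The three-way split above can be sketched in code. This is a minimal illustration, not part of the guidance: the function name, split fractions, and seed are assumptions chosen for the example, and the train subset is kept the largest, as described above.

```python
import random

def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=42):
    """Split records into train, validation, and test subsets.

    The fractions are illustrative; the train subset is typically
    the largest, and the test subset is held out for an unbiased
    evaluation of the final model.
    """
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, unseen during training
    return train, validation, test

records = list(range(100))
train, validation, test = split_dataset(records)
```

With 100 records and the fractions above, this yields 70 training, 15 validation, and 15 test records, with no record appearing in more than one subset.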

Deployment data includes AI system inputs such as live production data, user input data, configuration data, and AI system outputs such as predictions, recommendations, classifications, logs, and system health data. Deployment stage inputs are new and previously unseen by the AI system. 

The performance of an AI system depends on robust management of data quality and data availability.

Key workstreams within this stage include: 

  • data orchestration – establishing central oversight and planning of the flow of data to an AI system from across datasets
  • data transformation – converting and optimising data for use by the AI system
  • feature engineering – methods to improve AI model training to better identify and learn patterns in the data
  • data quality – measuring dimensions of a dataset associated with greater performance and reliability
  • data validation – testing the consistency, accuracy, and reliability of the data to ensure it meets the requirements of the AI system
  • data integration and fusion – combining data from multiple sources to synchronise the flow of data to the AI system
  • data sharing – promoting reuse, reducing resources required for collection and analysis, and helping to build interoperability between systems and datasets
  • model dataset establishment – using real-world production data to build, refine, and contextualise a high-quality AI model.
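The data quality and data validation workstreams above can be made concrete with a small sketch. This is an illustrative example only: the field names, required fields, and value ranges are assumptions, not requirements from the guidance.

```python
def validate_record(record, required_fields, ranges):
    """Return a list of data-quality issues found in one input record.

    Checks two simple dimensions: completeness (required fields are
    present and non-empty) and accuracy (numeric values fall within
    an expected range).
    """
    issues = []
    for field in required_fields:
        if record.get(field) in (None, ""):
            issues.append(f"missing: {field}")
    for field, (lo, hi) in ranges.items():
        value = record.get(field)
        if value is not None and not (lo <= value <= hi):
            issues.append(f"out of range: {field}={value}")
    return issues

# Hypothetical schema for a record feeding an AI system
required = ["id", "age"]
ranges = {"age": (0, 120)}

good = {"id": "a1", "age": 34}
bad = {"id": "", "age": 150}

good_issues = validate_record(good, required, ranges)
bad_issues = validate_record(bad, required, ranges)
```

In practice, checks like these would run as part of a broader data validation pipeline before records reach the AI system, with failing records quarantined or corrected.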
     
