Statement 15: Implement data transformation and feature engineering practices
Agencies should
Criterion 54: Establish data cleaning procedures to manage any data issues.
Data cleaning involves appropriately treating data errors, inconsistencies, or missing values to improve the performance of the AI system. Each time data cleaning is conducted, it should be documented and, where appropriate, recorded in the metadata. It manages issues such as:
- blanks, nulls, or trailing spaces
- structural errors or unwanted formatting
- missing data
- spelling mistakes
- repetition of words
- irrelevant characters
- content or observations irrelevant to the purpose of the AI system.
For open-source data, or data that has not yet been validated or cannot yet be trusted, consider using a sandbox environment.
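As an illustration of the cleaning issues listed above, the following is a minimal sketch in Python using only the standard library. The function names `clean_text` and `clean_records`, the field being cleaned, and the log categories are hypothetical and chosen for this example. They are not prescribed by this standard. The sketch treats blanks, trailing spaces, irrelevant non-printable characters, and immediately repeated words, and returns a summary that could be kept as documentation of the cleaning run.

```python
import re
from collections import Counter


def clean_text(value):
    """Return a cleaned string, or None if the value is blank or missing."""
    if value is None:
        return None
    # Drop non-printable characters but keep whitespace for the next step
    text = "".join(ch for ch in value if ch.isprintable() or ch.isspace())
    # Collapse runs of whitespace and remove leading and trailing spaces
    text = re.sub(r"\s+", " ", text).strip()
    # Collapse immediately repeated words, e.g. "the the" -> "the"
    text = re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    return text or None


def clean_records(records, field):
    """Clean one field across all records and return (cleaned, log).

    The log counts how many values were missing, modified, or unchanged,
    so each cleaning run can be documented.
    """
    log = Counter()
    cleaned = []
    for record in records:
        original = record.get(field)
        new_value = clean_text(original)
        if new_value is None:
            log["missing_or_blank"] += 1
        elif new_value != original:
            log["modified"] += 1
        else:
            log["unchanged"] += 1
        cleaned.append({**record, field: new_value})
    return cleaned, dict(log)
```

Some repeated words are legitimate (for example "had had"), so a rule like the one above would need review against the agency's own data before use.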
Criterion 55: Define data transformation processes to convert and optimise data for the AI system.
This could leverage existing Extract, Transform and Load (ETL) or Extract, Load and Transform (ELT) processes.
Consider the following data transformation techniques:
- data standardisation – convert data from various sources into a consistent format
- data reorganisation – organise data to make it easier to query and analyse
- data integration – combine data from different sources for a single unified view
- discretisation – convert continuous data into discrete intervals
- missing value imputation – determine which values need to be imputed and the imputation method to use
- convert data from one scale to another, such as log transformation
- smoothing – to even out fluctuations
- convert unstructured data to structured data
- Optical Character Recognition (OCR) – convert images of text into machine readable format
- object labelling and tracking – in images, audio, and video
- signal processing and transformation
- point in time of data – a snapshot of data at a specific point in time.
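Several of the techniques above can be sketched in a few lines of Python. The example below is illustrative only and uses the standard library. The function names, the accepted date formats, the choice of the median for imputation, and the trailing moving average for smoothing are all assumptions made for this sketch, not methods required by this standard.

```python
import math
from datetime import datetime
from statistics import median


def standardise_date(text, formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Standardisation: parse a date in any known format, return ISO 8601."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {text!r}")


def impute_median(values):
    """Missing value imputation: replace None with the median of observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def log_transform(values):
    """Scale conversion: natural log of (1 + x), defined for x >= 0."""
    return [math.log1p(v) for v in values]


def moving_average(values, window=3):
    """Smoothing: trailing moving average over up to `window` points."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def discretise(value, edges, labels):
    """Discretisation: map a continuous value to the label of its interval.

    `labels` must have one more entry than `edges`; a value below edges[0]
    gets labels[0], and a value at or above the last edge gets labels[-1].
    """
    for upper, label in zip(edges, labels):
        if value < upper:
            return label
    return labels[-1]
```

For example, `discretise(30, [18, 65], ["child", "adult", "senior"])` returns `"adult"`. Whichever imputation or smoothing method is chosen, recording it alongside the data supports the documentation expectations in Criterion 54.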
Criterion 56: Map the points where transformation occurs between datasets and across the AI system.
Consider:
- security checks.
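One possible way to map transformation points is to keep a simple record of each point, the datasets it reads from and writes to, and the security checks applied there. The following Python sketch is a hypothetical illustration. The class `TransformationPoint`, its fields, and the helper functions are assumptions for this example, and the sketch assumes each dataset is produced by a single transformation.

```python
from dataclasses import dataclass, field


@dataclass
class TransformationPoint:
    """One point where data is transformed between two datasets."""
    name: str
    source: str
    target: str
    security_checks: list = field(default_factory=list)


def trace_lineage(points, dataset):
    """Return the names of the transformations that produced `dataset`,
    ordered from the most upstream step to the most recent."""
    by_target = {p.target: p for p in points}
    path, seen = [], set()
    # Walk backwards from the dataset; `seen` guards against cycles
    while dataset in by_target and dataset not in seen:
        seen.add(dataset)
        point = by_target[dataset]
        path.append(point.name)
        dataset = point.source
    return list(reversed(path))


def points_without_checks(points):
    """List transformation points that have no recorded security check."""
    return [p.name for p in points if not p.security_checks]
```

A map of this kind makes it straightforward to see which transformation points still need a security check recorded against them.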
Criterion 57: Identify fit-for-purpose feature engineering techniques.
Feature engineering techniques include:
- feature creation and extraction – deriving features from existing data to help the AI system produce better quality outputs
- feature selection – selecting attributes or fields that provide relevant context to the AI model
- encoding – converting data into a format that can be better used in AI algorithms
- binning – grouping data into categories
- specific conversion – changing data from one format to another for AI compatibility
- scaling – mapping all data to a specific range to help improve AI outputs.
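The feature engineering techniques above can be illustrated with a short Python sketch that uses only the standard library. The function names, the one-hot scheme for encoding, the min-max approach to scaling, and the example of deriving a duration from two dates are assumptions chosen for illustration. Other techniques may be a better fit for a given AI system.

```python
from bisect import bisect_right
from datetime import date


def days_between(start_iso, end_iso):
    """Feature creation: derive a duration in days from two ISO 8601 dates."""
    return (date.fromisoformat(end_iso) - date.fromisoformat(start_iso)).days


def one_hot(values):
    """Encoding: convert categories into 0/1 indicator vectors.

    Returns the encoded rows and the sorted category order used for columns.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def bin_value(value, edges, labels):
    """Binning: group a value into a category using ascending `edges`.

    `labels` must have one more entry than `edges`; a value equal to an
    edge falls into the higher bin.
    """
    return labels[bisect_right(edges, value)]


def min_max_scale(values):
    """Scaling: map values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

When scaling or encoding is fitted on one dataset, the same category order and minimum and maximum values would normally need to be reused on later data so that features stay comparable.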
Criterion 58: Apply consistent data transformation and feature engineering methods to support data reuse and extensibility.
Consider:
- metadata and tagging of the data
- data transformation that is not limited to AI models and processes, so it can be reused elsewhere.
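One way to apply transformations consistently is to define each step once, give it a name, and record the names of the steps applied as metadata travelling with the output. The Python sketch below is a minimal illustration under that assumption. The function `apply_pipeline` and the metadata layout are hypothetical and not part of this standard.

```python
def apply_pipeline(values, steps):
    """Apply named transformation steps in order and record them as metadata.

    `steps` is a list of (name, function) pairs. Each function takes one
    value and returns the transformed value.
    """
    metadata = {"steps": []}
    for name, func in steps:
        values = [func(v) for v in values]
        metadata["steps"].append(name)
    return values, metadata


# A shared, named definition can be reused by AI and non-AI processes alike
TEXT_STEPS = [("strip", str.strip), ("lower", str.lower)]

result, metadata = apply_pipeline([" Alpha ", "BETA"], TEXT_STEPS)
```

Because the step list is defined in one place, the same transformation can be applied to new datasets and the recorded metadata shows exactly which steps produced a given output.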