Statement 16: Ensure data quality is acceptable
Agencies must:
Criterion 59: Define quality assessment criteria for the data used in the AI system.
Data quality can be measured across a variety of dimensions, in line with the ABS Data Quality Framework: institutional environment, relevance, timeliness, accuracy, coherence, interpretability, and accessibility.
A report on data quality can include:
- data quality statement (see ABS Data Quality Statement Checklist)
- metrics for measuring data quality, including its correctness and credibility (see the sketch below)
- frequency of reporting on data quality
- delegation of ownership to a business area responsible for managing data quality
- monitoring arrangements for changes in quality across the supply chain
- processes for intervening in and addressing data quality issues as they arise
Consider:
- any existing data standards or frameworks used by the agency.
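For illustration, the sketch below computes a few simple, reportable quality metrics. It is a minimal example assuming tabular data held in a pandas DataFrame; the dataset, column names, and metric choices are illustrative assumptions rather than requirements of this standard.

```python
# Minimal sketch of reportable data quality metrics, assuming tabular data
# in a pandas DataFrame. Dataset and metric choices are illustrative only.
import pandas as pd

def data_quality_report(df: pd.DataFrame) -> dict:
    """Compute simple quality metrics suitable for periodic reporting."""
    return {
        # Completeness: share of non-missing cells across the dataset.
        "completeness": float(1 - df.isna().sum().sum() / df.size),
        # Uniqueness: share of rows that are not exact duplicates.
        "uniqueness": float(1 - df.duplicated().sum() / len(df)),
        # Per-column missingness, to pinpoint problem fields.
        "missing_by_column": df.isna().mean().to_dict(),
    }

# Hypothetical dataset: rows for client 102 are exact duplicates,
# and one age value is missing.
df = pd.DataFrame({
    "client_id": [101, 102, 102, 104],
    "age": [34, 29, 29, None],
})
print(data_quality_report(df))
```

Thresholds for acceptable values, and how often such a report is produced, would be set by the business area that owns data quality.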
Agencies should:
Criterion 60: Implement data profiling activities and remediate any data quality issues.
This involves analysing the structure, content, and quality of the data to determine its fitness for purpose for an AI system.
Data profiling can investigate the following characteristics:
- frequency
- volume, range, and distribution
- invalid entry identification
- error detection
- duplicate identification
- noise identification
- specific pattern identification.
Methods that can be used to explore and analyse the data include:
- descriptive statistics, such as mean, median, mode, or frequencies
- business rules – apply business knowledge
- clustering or dendrogram – group similar observations together
- visualisation – represent the data visually through various graphs and charts, such as histograms, bar plots, box plots, density plots, or heatmaps
- correlation analysis – measure relationships between variables, usually between numerical variables
- scatter plots – visualise relationships between two numerical variables
- cross-tabulations – analyse relationships between multiple categorical variables
- principal component analysis – identify the combinations of variables that capture the most variance
- factor analysis – reveal hidden patterns among variables.
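As a brief illustration of several of these methods, the sketch below profiles a small, hypothetical dataset with pandas; the age validity rule (0 to 120) is an assumed business rule, not part of this standard.

```python
# Minimal data profiling sketch using pandas. The dataset and the
# age validity rule (0-120) are illustrative assumptions only.
import pandas as pd

df = pd.DataFrame({
    "age": [34, 29, 29, 150, None],        # 150 is an invalid entry
    "income": [52000, 48000, 48000, 61000, 55000],
})

# Descriptive statistics: mean, quartiles, range, and distribution.
print(df.describe())

# Frequency: how often each age value occurs.
print(df["age"].value_counts())

# Invalid entry identification via the assumed business rule.
print(df[df["age"].notna() & ~df["age"].between(0, 120)])

# Duplicates identification (exact duplicate rows).
print(df[df.duplicated(keep=False)])

# Correlation analysis between numerical variables.
print(df[["age", "income"]].corr())
```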
Criterion 61: Define processes for labelling data and managing the quality of data labels.
Data labelling can support the management and storage of data, auditing, and AI model training. Labelling can be performed by humans with appropriate skills and knowledge, or supported by automated labelling tools.
Well-defined data labelling practices can help optimise performance across the AI system by describing the context, categories, and relationships between data types; creating lineage of data through the AI system via versioning; distinguishing between pre-deployment and live data; and identifying what data will be reused, archived, or destroyed.
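One way to make several of these practices concrete is a versioned label record, sketched below; all field names and values are illustrative assumptions rather than a prescribed schema.

```python
# Illustrative sketch of a versioned label record. Field names and values
# are assumptions for demonstration, not a prescribed schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LabelRecord:
    item_id: str           # identifier of the labelled data item
    label: str             # value drawn from an agreed taxonomy
    taxonomy_version: str  # versioning creates lineage through the system
    labeller: str          # human annotator or automated tool identifier
    stage: str             # distinguishes "pre-deployment" from "live" data
    retention: str         # e.g. "reuse", "archive", or "destroy"
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

print(LabelRecord(
    item_id="doc-0042",
    label="complaint",
    taxonomy_version="v1.2",
    labeller="annotator-07",
    stage="pre-deployment",
    retention="reuse",
))
```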
Practices for managing data labelling include:
- establishing naming schemes, taxonomy, tagging, and data labelling practices
- considering different techniques such as manual or automated labelling, crowdsourcing, and quality checks
- defining quality control methods to improve consistency of labelling and assist in reducing bias
- considering changes to the raw data and data imputations, and associated impact
- providing data labels for AI training or for testing AI models. Labels can provide the ground truth for AI models and can influence AI validation. Different types of data labelling include:
- classification
- regression
- visual object labels
- audio labels
- entity tagging.
- applying quality assurance measures to data labels, labelling personnel, and automated data-labelling support tools
- implementing bias mitigation practices in labelling:
- establishing a review process. Diverse people could independently label the same data so that agreement between labellers can be analysed (see the sketch after this list). Final labels could go through spot-check review by subject matter experts
- establishing feedback loops. Labellers should be able to report issues and suggest improvements, and automated systems should be updated to be consistent with corrections made by human labellers
- establishing performance management for staff. Data labellers should undergo periodic training, performance reviews, and random audits for quality control
- implementing metadata labelling techniques that capture the types of data categories within the system and the relationships between these categories. Metadata labels can be prepared for model bias evaluation by annotating metadata with suitable dimensions. Ensure the metadata labelling aligns with the Office of the National Data Commissioner's Guide on Metadata Attributes and the National Archives of Australia's Australian Government Recordkeeping Metadata Standard (naa.gov.au)
- assessing and monitoring the quality of all automated data-labelling support tools. Determine the regularity and criteria for these quality checks and report on findings
- updating and maintaining the labelling tools and processes to adapt to new data types and labelling requirements
- considering potential harm to data labellers who may need to access sensitive or distressing content. This can occur when training an AI model to prevent responses including violence, hate speech, or sexual abuse.
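To illustrate the review process above, the sketch below computes Cohen's kappa, a chance-corrected measure of agreement between two labellers who independently labelled the same items; the labels shown are illustrative assumptions.

```python
# Minimal sketch of an inter-labeller agreement check (Cohen's kappa).
# The labels are illustrative; low kappa flags items for expert spot checks.
from collections import Counter

def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two labellers."""
    n = len(labels_a)
    # Observed agreement: proportion of items labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both labellers assigned labels at random,
    # keeping their individual label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

labeller_1 = ["spam", "ok", "ok", "spam", "ok"]
labeller_2 = ["spam", "ok", "spam", "spam", "ok"]
print(f"kappa = {cohen_kappa(labeller_1, labeller_2):.2f}")  # kappa = 0.62
```

Items where labellers disagree, or datasets with low overall kappa, are natural candidates for the expert spot checks and feedback loops described above.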