Executive summary
In the few years since its public introduction, generative AI (artificial intelligence) has become available and accessible to millions of people. The growing availability and speed of uptake in publicly available tools, such as ChatGPT, meant the Australian Public Service (APS) had to respond quickly to allow its workforce to experiment with generative AI in a safe, responsible and integrated way. To facilitate this experimentation, an appropriate generative AI tool needed to be selected.
Microsoft 365 Copilot (formerly Copilot for Microsoft 365) was one of the solutions available to enable the APS to undertake safe and responsible generative AI experimentation. On 16 November 2023, the Australian Government announced a 6-month whole-of-government trial of Copilot.
This decision was predicated on how swiftly and seamlessly Copilot, as a capability nested within existing whole-of-government contracting arrangements with Microsoft, could be deployed for rapid APS experimentation purposes. Further, as Copilot is a supplementary product that integrates with the existing applications within the Microsoft 365 suite, it also allowed staff to experiment and learn within the context of applications that were already familiar to them.
The trial involved the distribution of over 5,765 Copilot licences from January to June 2024. The trial was non-randomised, with agencies nominating staff to be allocated a licence. Trial participants comprised a range of APS classifications, job families, experience levels with generative AI, and expectations of generative AI capabilities.
Further detail on the background of the evaluation can be found in Appendix A.
More broadly, this trial and evaluation tested the extent to which much of the wider promise of generative AI capabilities would translate into real-world adoption by workers. The results will help the government consider future opportunities and challenges related to adopting generative AI.
This was just the first trial of a generative AI tool within the Australian Government and the future brings exciting opportunities to understand what other tools are available to explore a broad landscape of use cases.
Overview of the evaluation
Nous Group (Nous) was engaged by the Digital Transformation Agency (DTA) to assist with the evaluation of the trial. The Australian Centre for Evaluation was consulted on methodology and approach to ensure best practice. The evaluation was guided by 4 objectives and key lines of enquiry (KLEs) outlined in the table below.
Evaluation objective | Objective description | Key lines of enquiry |
---|---|---|
Employee related outcomes | Evaluate APS staff sentiment about the use of Copilot. | What are the perceived effects of Copilot on APS employees? |
Productivity | Determine whether Copilot, as an example of generative AI, benefits APS productivity in terms of efficiency, output quality, process improvements and agency ability to deliver on priorities. | What are the perceived productivity benefits of Copilot? |
Whole-of-government adoption of generative AI | Determine whether and to what extent Copilot, as an example of generative AI, can be implemented in a safe and responsible way across government. | What are the identified adoption challenges of Copilot, as an example of generative AI, in the APS in the short and long term? |
Unintended outcomes | Identify and understand unintended benefits, consequences, or challenges of implementing Copilot as an example of generative AI and the implications on adoption of generative AI in the APS. | Are there any perceived unintended outcomes from the adoption of Copilot? Are there broader generative AI effects on the APS? |
The findings of the evaluation and the resulting implications are outlined at a high level in this executive summary. Further details are provided in the body of the report.
Overarching findings
Generative AI is a disruptive technology that could transform APS’ productivity and ways of working. However, agencies will need to carefully weigh the potential benefits of efficiency and quality improvements against the costs, risks and suitability of generative AI to meet their agency’s needs.
There are perceived improvements to efficiency and quality for summarisation, preparing a first draft of a document and information searches.
Copilot users, in both this evaluation and agency-specific evaluations, consistently reported quality and efficiency improvements in 3 key activities: summarising content, creating first drafts and searching for information.
Trial participants estimated time savings of around an hour a day when completing these 3 activities. Trial participants at junior levels (APS 3-6), EL1s and those in information and communications technology (ICT)-related roles reported the greatest efficiency gains in these activities. In addition, 40% of post-use survey respondents reported they were able to reallocate their time to higher-value activities such as staff engagement, culture building and mentoring, and building relationships with end users and stakeholders.
Overall, trial participants across all job classifications and job families were satisfied with Copilot and the majority wish to continue using it.
However, the adoption of generative AI requires a concerted effort to address technical, cultural and capability barriers and to improve usage.
Agencies faced adoption challenges during the trial. Technical barriers to adoption included needing to ensure information systems and processes were configured to safely accommodate Copilot.
Capability challenges were highlighted as a key barrier to adoption as trial participants needed both tailored training that provided agency-specific use cases as well as general generative AI training in prompt engineering. In addition, there were also cultural barriers with perceived stigma in using generative AI and discomfort with the use of meeting transcriptions. Some focus group participants reported feeling uncomfortable about being recorded and transcribed and perceived they were being pressured to consent.
Trial participants also noted the need for clear guidance and information regarding their accountabilities and the security of prompt information which in turn affected their use of Copilot. Finally, focus group participants also acknowledged the need to have change management supports in place including identifying ‘champions’ to illustrate generative AI's benefits to drive adoption.
These adoption challenges contributed to the moderate use of Copilot during the trial: only a third of trial participants used Copilot daily, with use concentrated in summarising meetings and information and re-writing content. Only a small number of novel or job-specific use cases were identified across job families, and broader Copilot functionalities saw limited use.
There are a range of longer-term costs and risks that agencies will need to monitor and account for.
Interviews with government agencies highlighted that generative AI may have a large impact on the composition of APS jobs and skills, especially for women and junior staff who are perceived to be at a greater risk of job displacement by generative AI.
Trial participants also highlighted a range of broader concerns regarding the use of generative AI – from vendor lock-in to its environmental impact – that reflect the general unknown nature of generative AI but also its potentially wide-ranging impacts that will require close monitoring.
Findings summary
Employee related outcomes
77% of trial participants were optimistic about Microsoft 365 Copilot at the end of the trial.
1 in 3 used Copilot daily.
Over 70% of trial participants used Microsoft Teams and Word during the trial, mainly for summarising and re-writing content.
75% of participants who received 3 or more forms of training were confident in their ability to use Copilot, 28 percentage points higher than those who received one form of training.
Most trial participants were positive about Copilot and wish to continue using it
86% of trial participants wished to continue to use Copilot.
Senior Executive Service (SES) staff (93%) and Corporate (81%) roles had the highest positive sentiment towards Copilot.
Despite the positive sentiment, use of Copilot was moderate
Moderate usage was consistent across classifications and job families but specific use cases varied. For example, a higher proportion of SES and Executive Level (EL) 2 staff used meeting summarisation features, compared to other APS classifications.
Microsoft Teams and Word were used most frequently and met participants’ needs. Poor Excel functionality and access issues in Outlook hampered use.
Content summarisation and re-writing were the most used Copilot functions.
Other generative AI tools may be more effective at meeting users’ needs in reviewing or writing code, generating images or searching research databases.
Tailored training and propagation of high-value use cases could drive adoption
Training significantly enhanced confidence in Copilot use and was most effective when it was tailored to an agency’s context.
Identifying specific use cases for Copilot could lead to greater use of Copilot.
Productivity
69% of survey respondents agreed that Copilot improved the speed at which they could complete tasks.
61% agreed that Copilot improved the quality of their work.
40% of survey respondents reported reallocating their time to:
- mentoring / culture building
- strategic planning
- engaging with stakeholders
- product enhancement.
Most trial participants believed Copilot improved the speed and quality of their work
Improvements in efficiency and quality were perceived to occur in a few tasks with perceived time savings of around an hour a day for these tasks. These tasks include:
- summarisation
- preparing a first draft of a document
- information searches.
Copilot had a negligible impact on certain activities such as communication.
APS 3-6 and EL1 classifications and ICT-related roles experienced the highest time savings of around an hour a day on summarisation, preparing a first draft of a document and information searches.
Around 65% of managers observed an uplift in productivity across their team.
Around 40% of trial participants were able to reallocate their time to higher value activities.
Copilot’s inaccuracy reduced the scale of productivity benefits.
Quality gains were more subdued relative to efficiency gains.
Up to 7% of trial participants reported Copilot added time to activities.
Copilot’s unpredictability and lack of contextual knowledge meant time had to be spent verifying and editing outputs, which negated some of the efficiency savings.
Whole-of-government adoption of generative AI
61% of managers in the pulse survey could not confidently identify Copilot outputs.
There is a need for agencies to engage in adaptive planning while ensuring governance structures and processes appropriately reflect their risk appetites.
Adoption of generative AI requires a concerted effort to address key barriers.
Technical
There were integration challenges with non-Microsoft 365 applications, particularly JAWS and Janusseal, although such integrations were out of scope for the trial. Note: JAWS (Job Access With Speech) is a screen reader that makes on-screen content accessible to people who are blind or have low vision. Janusseal is a data classification tool used to easily distinguish between sensitive and non-sensitive information.
Copilot may magnify poor data security and information management practices.
Capability
Prompt engineering, identifying relevant use cases and understanding the information requirements of Copilot across Microsoft Office products were significant capability barriers.
Legal
Uncertainty regarding the need to disclose Copilot use, accountability for outputs and lack of clarity regarding the applicability of Freedom of Information requirements were barriers to Copilot use – particularly for meeting transcriptions.
Cultural
Negative stigmas and ethical concerns associated with generative AI adversely impacted its adoption.
Governance
Adaptive planning is needed to reflect the rolling release cycle nature of generative AI tools, alongside relevant governance structures aligned to agencies’ risk appetites.
Unintended outcomes
There are both benefits and concerns that will need to be actively monitored.
Benefits
Generative AI could improve inclusivity and accessibility in the workplace, particularly for people who are neurodivergent, people with disability, or people from culturally and linguistically diverse backgrounds.
The adoption of Copilot and generative AI more broadly in the APS could help the APS attract and retain employees.
Concerns
There are concerns regarding the potential impact of generative AI on APS jobs and skills needs in the future. This is particularly true for administrative roles, with a disproportionate flow-on impact on marginalised groups, entry-level positions and women, who tend to have greater representation in these roles as pathways into the APS.
Copilot outputs may be biased towards Western norms and may not use cultural data and information appropriately, for example by misusing First Nations images or misspelling First Nations words.
The use of generative AI might lead to a loss of skills in summarisation and writing. Conversely, a lack of adoption of generative AI may create a false assumption that people who use it are more productive than those who do not.
Participants expressed concerns relating to vendor lock-in; however, the realised benefits were limited to specific features and use cases.
Participants were also concerned about the APS’ increased environmental impact resulting from generative AI use.
Recommendations
Detailed and adaptive implementation
1. Product selection
Agencies should consider which generative AI solutions are most appropriate for their overall operating environment and specific use cases, particularly for AI assistant tools.
2. System configuration
Agencies must configure their information systems, permissions, and processes to safely accommodate generative AI products.
3. Specialised training
Agencies should offer specialised training reflecting agency-specific use cases and develop general generative AI capabilities, including prompt engineering.
4. Change management
Effective change management should support the integration of generative AI by identifying ‘Generative AI Champions’ to highlight the benefits and encourage adoption.
5. Clear guidance
The APS must provide clear guidance on using generative AI, including when consent and disclaimers are needed, such as in meeting recordings, and a clear articulation of accountabilities.
Encourage greater adoption
6. Workflow analysis
Agencies should conduct detailed analyses of workflows across various job families and classifications to identify further use cases that could improve generative AI adoption.
7. Use case sharing
Agencies should share use cases in appropriate whole-of-government forums to facilitate the adoption of generative AI across the APS.
Proactive risk management
8. Impact monitoring
The APS should proactively monitor the impacts of generative AI, including its effects on the workforce, to manage current and emerging risks effectively.
Glossary
Term | Meaning |
---|---|
AI in Government Taskforce | Co-led by the DTA and the Department of Industry, Science and Resources (DISR), the AI in Government Taskforce aimed to deliver policies, standards, and guidance for the safe, ethical and responsible use of AI technologies by government. |
Confidence interval | A confidence interval is a statistical concept used to estimate a population parameter based on sample data. It provides a range of values that likely contain the true population parameter with a certain level of confidence. |
Generative AI | Generative AI is a branch of artificial intelligence focused on designing algorithms that generate novel outputs, such as text, images or sounds, based on learned patterns from data. |
Hallucinations | Large Language Models (LLMs) such as Microsoft 365 Copilot are trained to predict patterns rather than understand facts, which can lead them to return plausible-sounding but inaccurate information, referred to as ‘hallucinations’. |
Large Language Model (LLM) | Large language models are a category of foundation models trained on immense amounts of data making them capable of understanding and generating natural language and other types of content to perform a wide range of tasks. |
Microsoft 365 | A cloud-based suite of productivity and collaboration tools offered by Microsoft, including Office applications, email and other services. |
Microsoft Office | A suite of desktop productivity applications from Microsoft, including Word, Excel, PowerPoint and others. |
Microsoft Graph | A Microsoft application programming interface (API) that provides access to data and intelligence across Microsoft 365 services, enabling developers to build apps that interact with organisational data. |
Mixed methods | Combined use of both qualitative and quantitative research approaches to provide a more comprehensive understanding of the subject being evaluated. |
Microsoft 365 Copilot | AI-enabled functionality embedded into the Microsoft 365 application suite. Formerly called Copilot for Microsoft 365. |
P-value | A statistical measure that indicates the probability of observing a result as extreme as, or more extreme than, the one observed, assuming the null hypothesis is true. |
Trial participant | An Australian Public Service staff member who participated in the whole-of-government Microsoft 365 Copilot trial, between January and July 2024. |
T-test | A statistical test used to compare the means of 2 groups to determine if they are significantly different from each other, accounting for the variability in the data and sample size. |
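The statistical terms above (t-test, p-value and confidence interval) can be illustrated with a minimal sketch in Python. The data below are hypothetical and not drawn from the trial, and the p-value uses a normal approximation rather than the exact t distribution; the sketch is illustrative only, not part of the evaluation methodology.

```python
import math
from statistics import NormalDist, mean, stdev

def welch_t_test(a, b):
    """Two-sample (Welch's) t-test: compares the means of two groups.

    Returns the t statistic, an approximate two-sided p-value
    (normal approximation, reasonable for larger samples), and a
    95% confidence interval for the difference in means.
    """
    ma, mb = mean(a), mean(b)
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    se = math.sqrt(va / len(a) + vb / len(b))  # standard error of the difference
    t = (ma - mb) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))     # two-sided p-value (approximate)
    z = NormalDist().inv_cdf(0.975)            # ~1.96 for a 95% interval
    ci = ((ma - mb) - z * se, (ma - mb) + z * se)
    return t, p, ci

# Hypothetical data: minutes saved per day by two groups of participants
group_a = [55, 62, 48, 70, 66, 59, 61, 58]
group_b = [40, 45, 38, 52, 47, 44, 49, 41]
t, p, ci = welch_t_test(group_a, group_b)
print(f"t = {t:.2f}, p = {p:.4f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```

A small p-value suggests the difference in group means is unlikely under the null hypothesis of no difference, while the confidence interval gives a plausible range for the size of that difference.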