The statements below are intended as an addendum to the AI technical standard for Australian Government. These updates build upon the current framework to address the specific considerations associated with agentic AI. All existing statements, criteria, and general guidance outlined in the AI technical standard still apply. Some criteria in this standard may also apply to non-agentic forms of AI. Agencies exploring or using agentic AI should use both standards.

Evaluate

Continuous evaluation is essential in agentic AI systems, which operate with a degree of automation and rely on humans to intervene when necessary. Ongoing evaluation mechanisms are needed to ensure the reliability and safety of agent actions and their alignment with intended outcomes. Ongoing monitoring helps identify and address unintended behaviours or errors that can occur at any step performed without direct human validation. This helps maintain trust and accountability.

Statement AGT.6: Implement continuous evaluation mechanisms for AI agents 

Agencies must:

Criterion AGT.6.1: Test agents for robustness against unexpected inputs or scenarios

This includes: 

  • assessing explainability and traceability of decision-making in agentic AI systems 
  • ensuring actions performed by agents are compliant and safe
  • ensuring agents operate within their pre-defined constraints and access controls
  • testing the scalability of the system and how its efficiency changes as task complexity increases
  • allowing agents to evaluate their output by adopting a secondary reasoning stage
  • measuring each agent’s success rates for task completion
  • measuring adaptability and efficiency of individual agents
  • conducting periodic review and re-validation to ensure ongoing compliance and efficacy
  • evaluating the effectiveness of prompts when using LLMs.

Evaluation methods that can be used include:

  • ground truth evaluation: involves creating a ground truth set that identifies the criteria for success. Tests should have input and output sets that align with the ground truth. Results generated by the AI agent or system are then compared to the ground truth. ML evaluation metrics such as completeness, accuracy, precision, and recall can be used to provide a score for the agent’s response based on the ground truth criteria
  • robustness testing: evaluates how well the system operates when agents are faced with adversarial situations. This helps ensure that measures against data leakage and misuse of personal data are considered at the agent and system level. Robustness testing should provide simple, easy-to-measure metrics.
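
As an illustration of ground truth evaluation, the following is a minimal sketch that scores one agent response against an expected set using precision and recall. The function name, the Score type and the document identifiers are hypothetical and are not defined by this standard; agencies would substitute their own test cases and metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    precision: float
    recall: float


def score_against_ground_truth(predicted: set[str], expected: set[str]) -> Score:
    """Compare the items an agent returned with the ground truth set."""
    true_positives = len(predicted & expected)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return Score(precision=precision, recall=recall)


# Hypothetical test case: items the agent should have returned.
expected_items = {"doc-1", "doc-2", "doc-3", "doc-4"}
# Hypothetical agent output for the same input.
agent_items = {"doc-1", "doc-2", "doc-9"}

result = score_against_ground_truth(agent_items, expected_items)
print(f"precision={result.precision:.2f} recall={result.recall:.2f}")
```

In practice the same scoring would be run over the whole ground truth set and aggregated, so that changes in score can be tracked between evaluation runs.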

Criterion AGT.6.2: Evaluate the interaction between agents and integrated tools

This includes:

  • evaluating whether the agent is selecting the appropriate tool for a given task
  • verifying whether parameters passed to tools are accurate
  • ensuring tools return consistent responses
  • evaluating each tool’s cost, latency, and consistency
  • evaluating the reliability of tools and assessing their fault recovery capabilities
  • evaluating agent–tool interaction, and ensuring responsibilities and controls are not duplicated across the agentic workflow and orchestration layer.
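
The first two checks above, tool selection and parameter accuracy, can be sketched as follows. This is an illustrative example only: the tool names, the TOOL_SCHEMAS table and the check_tool_call function are assumptions made for the sketch, not part of any particular agent framework.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    params: dict


@dataclass
class ToolCheck:
    tool_ok: bool
    param_errors: list = field(default_factory=list)


# Hypothetical tool schemas: required parameter names and their types.
TOOL_SCHEMAS = {
    "lookup_record": {"record_id": str},
    "send_notice": {"recipient": str, "template_id": int},
}


def check_tool_call(call: ToolCall, expected_tool: str) -> ToolCheck:
    """Check that the agent chose the expected tool and passed valid parameters."""
    tool_ok = call.tool == expected_tool
    errors = []
    schema = TOOL_SCHEMAS.get(call.tool)
    if schema is None:
        errors.append(f"unknown tool: {call.tool}")
    else:
        for name, expected_type in schema.items():
            if name not in call.params:
                errors.append(f"missing parameter: {name}")
            elif not isinstance(call.params[name], expected_type):
                errors.append(f"wrong type for parameter: {name}")
        for name in call.params:
            if name not in schema:
                errors.append(f"unexpected parameter: {name}")
    return ToolCheck(tool_ok=tool_ok, param_errors=errors)


# Correct tool, valid parameters.
good = check_tool_call(
    ToolCall("lookup_record", {"record_id": "A-17"}), "lookup_record"
)
# Wrong tool for the task, and template_id passed as a string.
bad = check_tool_call(
    ToolCall("send_notice", {"recipient": "x", "template_id": "7"}), "lookup_record"
)
```

Recording the two results separately makes it possible to report tool selection errors and parameter errors as distinct measures.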

Criterion AGT.6.3: Evaluate the interaction between agents and their environment

In the context of agent–environment interaction, the environment includes the set of systems, data sources, infrastructure, integrations, and governance constraints that collectively constitute the operational context within which an AI system operates, influencing its perception, decision-making, and actions. The environment also defines the system’s trust boundaries, control constraints, and points of exposure to risk.

This includes:

  • ensuring the environment does not negatively influence AI agent behaviour 
  • checking for environmental constraints and boundaries to ensure agents act within limits to mitigate unintended or harmful consequences.
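
One way to check that agents act within limits is to compare a recorded trace of agent actions against an explicit boundary. The sketch below assumes a simple allowlist model; the Boundary type, the violations function, and the action and resource names are hypothetical, and real environments will usually need richer constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Boundary:
    allowed_actions: frozenset
    allowed_resources: frozenset


def violations(trace, boundary):
    """Return the (action, resource) pairs in the trace that fall outside the boundary."""
    return [
        (action, resource)
        for action, resource in trace
        if action not in boundary.allowed_actions
        or resource not in boundary.allowed_resources
    ]


# Hypothetical boundary: the agent may only read two named data sources.
boundary = Boundary(
    allowed_actions=frozenset({"read"}),
    allowed_resources=frozenset({"case_db", "policy_library"}),
)

# Hypothetical trace of actions recorded during an evaluation run.
trace = [
    ("read", "case_db"),
    ("write", "case_db"),      # action outside the boundary
    ("read", "payroll_db"),    # resource outside the boundary
]

out_of_bounds = violations(trace, boundary)
```

An empty result indicates that every recorded action stayed within the defined limits for that run.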

Criterion AGT.6.4: Evaluate the interaction between agents and memory

This includes:

  • evaluating how well agents update and manage stored information
  • evaluating whether agents are accurately retrieving information from memory
  • evaluating how long it takes agents to access information.
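
The three memory checks above can be combined in a small test harness. The following is a sketch under stated assumptions: DictMemory is a stand-in for whatever memory store the agent actually uses, and evaluate_memory and the keys shown are hypothetical.

```python
import time


class DictMemory:
    """Stand-in memory store; a real agent memory would replace this."""

    def __init__(self):
        self._store = {}

    def write(self, key, value):
        self._store[key] = value

    def read(self, key):
        return self._store.get(key)


def evaluate_memory(memory, cases):
    """Return (retrieval accuracy, worst read latency in seconds) for the given cases.

    cases is a list of (key, expected_value) pairs.
    """
    correct = 0
    worst_latency = 0.0
    for key, expected in cases:
        start = time.perf_counter()
        value = memory.read(key)
        elapsed = time.perf_counter() - start
        worst_latency = max(worst_latency, elapsed)
        if value == expected:
            correct += 1
    accuracy = correct / len(cases) if cases else 0.0
    return accuracy, worst_latency


memory = DictMemory()
memory.write("case-42/status", "open")
memory.write("case-42/status", "closed")  # an update should overwrite the old value
memory.write("case-43/status", "open")

cases = [
    ("case-42/status", "closed"),  # checks the update was applied
    ("case-43/status", "open"),    # checks plain retrieval
    ("case-44/status", "open"),    # never written, so retrieval should fail
]
accuracy, worst_latency = evaluate_memory(memory, cases)
```

Latency measured against an in-process stand-in is not representative; the same harness would need to run against the deployed memory store to give meaningful access times.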
