Agentic AI Addendum statements: Evaluate

The statements below are intended as an addendum to the AI technical standard for Australian Government. These updates build upon the current framework to address the specific considerations associated with agentic AI. All existing statements, criteria, and general guidance outlined in the AI technical standard still apply. Some criteria in this standard may also apply to non-agentic forms of AI. Agencies exploring or using agentic AI should use both standards.

Evaluate

Continuous evaluation is essential in agentic AI systems, which operate with a degree of automation and rely on humans to intervene when necessary. Ongoing evaluation mechanisms are needed to ensure that agent actions are reliable, safe, and aligned with intended outcomes. Ongoing monitoring helps identify and address unintended behaviours or errors that may occur at any step performed without direct human validation. This helps maintain trust and accountability.

Statement AGT.6: Implement continuous evaluation mechanisms for AI agents

Agencies must:

Criterion AGT.6.1: Test agents for robustness against unexpected inputs or scenarios

This includes:

assessing explainability and traceability of decision-making in agentic AI systems
ensuring actions performed by agents are compliant and safe
ensuring agents operate within their pre-defined constraints and access controls
testing the scalability of the system and how its efficiency changes as task complexity increases
allowing agents to evaluate their output by adopting a secondary reasoning stage
measuring each agent’s success rates for task completion
measuring adaptability and efficiency of individual agents
conducting periodic review and re-validation to ensure ongoing compliance and efficacy
evaluating the effectiveness of prompts when using LLMs.

Evaluation methods that can be used include:

ground truth evaluation: involves creating a ground truth set that identifies the criteria for success. Tests should have input and output sets that align with the ground truth. Results generated by the AI agent or system are then compared to the ground truth. ML evaluation metrics such as completeness, accuracy, precision, and recall can be used to score the agent’s response against the ground truth criteria
robustness testing: evaluates how well the system operates when agents are faced with adversarial situations. This helps confirm that measures against data leakage and misuse of personal data have been considered at the agent and system level. Robustness testing should provide simple, easy-to-measure metrics.
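The ground truth comparison described above can be sketched in Python as follows. This is an illustration only and not part of the standard: the names `GroundTruthCase` and `score_against_ground_truth` are hypothetical, the agent output is hard-coded rather than produced by a real agent, and treating a response as a set of items is an assumption made to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundTruthCase:
    """One test case: an input and the items a correct response must contain."""
    prompt: str
    expected: frozenset


def score_against_ground_truth(expected: frozenset, produced: frozenset) -> dict:
    """Compare produced items with the ground truth; return precision and recall."""
    true_positives = len(expected & produced)
    # Precision: share of produced items that are correct.
    precision = true_positives / len(produced) if produced else 0.0
    # Recall: share of expected items that were actually produced.
    recall = true_positives / len(expected) if expected else 0.0
    return {"precision": precision, "recall": recall}


case = GroundTruthCase(
    prompt="List the documents required for the application",
    expected=frozenset({"passport", "proof_of_address", "tax_file_number"}),
)

# Hypothetical agent output, hard-coded here in place of a real agent call.
agent_output = frozenset(
    {"passport", "proof_of_address", "drivers_licence", "medicare_card"}
)

scores = score_against_ground_truth(case.expected, agent_output)
print(scores)
```

In this example the agent returns two of the three expected items plus two unexpected ones, so precision is 2/4 and recall is 2/3. A real evaluation would aggregate such scores over the whole ground truth set.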

Criterion AGT.6.2: Evaluate the interaction between agents and integrated tools

This includes:

evaluating whether the agent is selecting the appropriate tool for a given task
verifying whether parameters passed to tools are accurate
ensuring tools return consistent responses
evaluating each tool’s cost, latency, and consistency
evaluating the reliability of tools and assessing their fault recovery capabilities
evaluating agent–tool interaction, and ensuring responsibilities and controls are not duplicated across the agentic workflow and orchestration layer.
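Several of the checks listed above (tool selection, parameter accuracy, response consistency, and latency) can be combined in one small harness. The sketch below is an assumption-laden illustration, not a prescribed method: `lookup_postcode`, the `TOOLS` registry, and `evaluate_tool_call` are hypothetical names, and the parameter schema is simply a mapping from parameter name to Python type.

```python
import time
from typing import Any, Callable, Dict, Tuple


def lookup_postcode(suburb: str) -> str:
    """Hypothetical tool: returns a postcode for a suburb from a fixed table."""
    return {"Canberra": "2601"}.get(suburb, "unknown")


# Hypothetical registry: tool name -> (callable, required parameter types).
TOOLS: Dict[str, Tuple[Callable[..., Any], Dict[str, type]]] = {
    "lookup_postcode": (lookup_postcode, {"suburb": str}),
}


def evaluate_tool_call(
    expected_tool: str, chosen_tool: str, params: Dict[str, Any], repeats: int = 3
) -> Dict[str, Any]:
    """Check tool selection, parameter validity, consistency and latency for one call."""
    result: Dict[str, Any] = {"correct_tool": chosen_tool == expected_tool}
    if chosen_tool not in TOOLS:
        result.update(valid_params=False, consistent=False, mean_latency_s=None)
        return result

    func, schema = TOOLS[chosen_tool]
    # Valid when the keys match the schema and every value has the declared type.
    result["valid_params"] = params.keys() == schema.keys() and all(
        isinstance(params[name], kind) for name, kind in schema.items()
    )
    if not result["valid_params"]:
        result.update(consistent=False, mean_latency_s=None)
        return result

    responses, durations = [], []
    for _ in range(repeats):
        start = time.perf_counter()
        responses.append(func(**params))
        durations.append(time.perf_counter() - start)

    # Consistent when repeated identical calls return the same response.
    result["consistent"] = all(r == responses[0] for r in responses)
    result["mean_latency_s"] = sum(durations) / repeats
    return result


report = evaluate_tool_call("lookup_postcode", "lookup_postcode", {"suburb": "Canberra"})
bad_report = evaluate_tool_call("lookup_postcode", "lookup_postcode", {"suburb": 2601})
print(report)
print(bad_report)
```

The second call passes an integer where a string is required, so it is flagged as having invalid parameters and the tool is never invoked. Cost and fault recovery are not covered here and would need checks suited to the tools in use.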

Criterion AGT.6.3: Evaluate the interaction between agents and their environment

In the context of agent–environment interaction, the environment includes the set of systems, data sources, infrastructure, integrations, and governance constraints that collectively constitute the operational context within which an AI system operates, influencing its perception, decision-making, and actions. The environment also defines the system’s trust boundaries, control constraints, and points of exposure to risk.

This includes:

ensuring the environment does not negatively influence AI agent behaviour
checking for environmental constraints and boundaries to ensure agents act within limits to mitigate unintended or harmful consequences.
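One way to check that agents act within their environmental boundaries is to compare a log of agent actions against an allow-list. The following is a minimal sketch under stated assumptions: the `AgentAction` record, the `ALLOWED` policy table, the agent and resource names, and `find_violations` are all hypothetical and stand in for whatever logging and access-control mechanisms an agency actually uses.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class AgentAction:
    """One logged action taken by an agent in its environment."""
    agent_id: str
    operation: str  # e.g. "read", "write", "delete"
    resource: str   # e.g. "case_records"


# Hypothetical policy: for each agent, the (operation, resource) pairs it may perform.
ALLOWED: Dict[str, Set[Tuple[str, str]]] = {
    "intake-agent": {("read", "case_records"), ("write", "intake_queue")},
}


def find_violations(actions: List[AgentAction]) -> List[AgentAction]:
    """Return every logged action that falls outside the agent's allowed set."""
    return [
        action
        for action in actions
        # Agents with no policy entry are treated as having no permissions.
        if (action.operation, action.resource) not in ALLOWED.get(action.agent_id, set())
    ]


log = [
    AgentAction("intake-agent", "read", "case_records"),
    AgentAction("intake-agent", "delete", "case_records"),
    AgentAction("unknown-agent", "read", "case_records"),
]

violations = find_violations(log)
for action in violations:
    print(f"Out of bounds: {action.agent_id} {action.operation} {action.resource}")
```

The sketch denies by default: an agent that does not appear in the policy table has every action reported as a violation.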

Criterion AGT.6.4: Evaluate the interaction between agents and memory

This includes:

evaluating how well agents update and manage stored information
evaluating whether agents are accurately retrieving information from memory
evaluating how long it takes agents to access information.
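The three memory checks above (updates, retrieval accuracy, and access time) can be measured with a small harness. This sketch is illustrative only: `DictMemory` is a hypothetical stand-in for an agent memory store, `evaluate_memory` is an assumed helper name, and a real system would run the same measurements against its actual memory backend.

```python
import time
from typing import Dict, Optional


class DictMemory:
    """Hypothetical stand-in for an agent memory store, backed by a dict."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self._items[key] = value

    def read(self, key: str) -> Optional[str]:
        return self._items.get(key)


def evaluate_memory(memory: DictMemory, expected: Dict[str, str]) -> Dict[str, float]:
    """Measure retrieval accuracy and mean access time over expected key-value pairs."""
    correct = 0
    elapsed = 0.0
    for key, value in expected.items():
        start = time.perf_counter()
        retrieved = memory.read(key)
        elapsed += time.perf_counter() - start
        if retrieved == value:
            correct += 1
    count = len(expected)
    return {
        "retrieval_accuracy": correct / count if count else 0.0,
        "mean_access_s": elapsed / count if count else 0.0,
    }


memory = DictMemory()
memory.write("applicant_name", "A. Citizen")
memory.write("status", "received")
memory.write("status", "under_review")  # update: the later value should replace the earlier one

metrics = evaluate_memory(
    memory,
    {
        "applicant_name": "A. Citizen",
        "status": "under_review",       # checks that the update was applied
        "case_officer": "unassigned",   # never stored, so retrieval should fail
    },
)
print(metrics)
```

Two of the three expected values are retrieved correctly, giving a retrieval accuracy of 2/3. The access time measured here reflects an in-process dictionary and is not indicative of a real memory store.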
