Isaiah Weaver, Software Engineer II
Our team has been developing a Java application with Spring Boot and the langchain4j framework that uses LLM (Large Language Model) agents. The application has multiple agents, but one primary agent, the conversation agent, reasons about what the end user types into the chat and directs the request to the appropriate functions to call. Within its limits, this agent is free to decide how to accomplish the task. Because our agents carry a key piece of the application's logic, we want unit tests and integration tests for the code paths that make LLM calls. The problem we encountered was that we needed to test an agent's output and evaluate its reasoning to pass or fail a test. This led us to the llm-as-judge design pattern for our tests, which we will explore in this blog post.
Using AI to Test Outputs
Llm-as-judge simply means writing tests that use an LLM to evaluate something and decide whether the test passes or fails. In our case, we set up an agent to judge the response of the primary conversation agent. In the langchain4j framework, our evaluator agent consists of a class that defines the agent's configuration and an interface that defines the agent itself, whose purpose is to evaluate the conversation agent's response. The interface declares an evaluate function, which the evaluator agent uses to call the LLM once we have a response from the conversation agent; this is the function we call to evaluate that response. Finally, we have a class called MajorityVoteEvaluator, which has its own evaluate function that runs the agent's evaluate function a configurable number of times and returns the majority of the pass/fail evaluations from the LLM judge.
The diagram below shows all these steps in sequence, as a step-by-step unit test process:
1. We create the necessary mocks and call our application code.
2. We call the majority vote evaluator.
3. The majority vote evaluator calls the evaluator agent x number of times.
4. The majority vote evaluator determines whether the majority passed or failed and returns that result, which the unit test uses to pass or fail.
Let’s walk through each of these steps and look at the accompanying code.
We already had a language model config set up (advancedModel), so we used that to create the EvaluatorAgentConfig class below.
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.service.AiServices;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class EvaluatorAgentConfig {

    // Builds the evaluator agent from its interface, backed by the existing advancedModel bean
    @Bean
    public EvaluatorAgent evaluatorAgent(@Qualifier("advancedModel") ChatLanguageModel advancedModel) {
        return AiServices.builder(EvaluatorAgent.class)
                .chatLanguageModel(advancedModel)
                .build();
    }
}
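We won't cover the advancedModel bean itself, since it already existed in our project, but for readers starting from scratch, a minimal sketch of such a bean using langchain4j's OpenAI module might look like the following (the class name, property name, and model name are placeholders, not our actual configuration):

import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.openai.OpenAiChatModel;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ModelConfig {

    // Hypothetical bean; any langchain4j ChatLanguageModel implementation could back the judge
    @Bean("advancedModel")
    public ChatLanguageModel advancedModel(@Value("${openai.api-key}") String apiKey) {
        return OpenAiChatModel.builder()
                .apiKey(apiKey)
                .modelName("gpt-4o")  // placeholder model name
                .temperature(0.0)     // keep the judge as deterministic as possible
                .build();
    }
}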
With the config set up for our new testing agent, we define the agent as an interface for the langchain4j framework to use.
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import dev.langchain4j.service.V;

public interface EvaluatorAgent {

    @SystemMessage("""
            Your purpose is to evaluate the results of a test. You will be employed in a unit
            testing environment, and must critically evaluate the provided condition against
            the provided result to determine if the test has passed or failed. Your return value
            should only be a string "true" or "false", nothing else. Double check to ensure that
            there are no extra characters or symbols.
            """)
    @UserMessage("""
            Evaluate the following:
            Condition: {{condition}}
            Result: {{result}}
            -----
            Here are some examples:
            {{examples}}
            """)
    boolean evaluate(@V("condition") String condition, @V("result") String result, @V("examples") String examples);
}
Let’s break down what is being done here. The @SystemMessage annotation defines the evaluator agent’s base instructions. The @UserMessage annotation defines, most importantly, the condition and the result. The result is the text output of our conversation agent, and the condition is the desired state for a true evaluation, for example “an error did not occur”. We also have the option to pass in some examples to help the agent make a decision.
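For illustration, a one-off call to the evaluator agent might look like the snippet below. The condition and response strings here are made up for this example; in practice we go through the MajorityVoteEvaluator covered next.

// Hypothetical one-off call to the judge agent
String condition = "an error did not occur";
String agentResponse = "I found 3 types of pizza in the store. Want to hear more about any of them?";
boolean passed = evaluatorAgent.evaluate(condition, agentResponse, "");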
Let’s now take a look at the MajorityVoteEvaluator class, then we can look at an example of a test using all these pieces.
import lombok.Builder;

@Builder
public class MajorityVoteEvaluator {

    private final EvaluatorAgent evaluatorAgent;
    private final int invocationCount;

    /**
     * Returns true if the majority of the votes are true.
     *
     * @param condition the desired state for a passing evaluation
     * @param result    the text output of the conversation agent
     * @param examples  example condition/result/conclusion pairs to guide the judge
     * @return true if more evaluations came back true than false
     */
    public boolean evaluate(String condition, String result, String examples) {
        int trueCount = 0;
        int falseCount = 0;
        // Invoke the evaluator agent invocationCount times and count true/false votes
        for (int i = 0; i < this.invocationCount; i++) {
            if (evaluatorAgent.evaluate(condition, result, examples)) {
                trueCount++;
            } else {
                falseCount++;
            }
        }
        return trueCount > falseCount;
    }
}
The purpose of this class is to call the agent’s evaluate function x number of times and take a majority vote, which gives us the best picture of the outcome. We found that some kind of voting mechanism is needed in case the evaluator agent hallucinates on a single run and decides our response is a failure. Even something like 5 runs with a majority vote adds significant stability to the tests. Note that because the vote requires trueCount > falseCount, a tie counts as a failure, so an odd invocation count avoids ties entirely.
The code block below shows how we built our MajorityVoteEvaluator in our testing class. We give it an invocation count and the agent it will call. Then we can call it to evaluate a given response, and the test will pass if more than 50% of the evaluations pass.
// Inside testing class
@PostConstruct
public void init() {
    evaluator = MajorityVoteEvaluator.builder()
            .invocationCount(5)
            .evaluatorAgent(evaluatorAgent)
            .build();
}
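For context, the surrounding test class follows a standard Spring Boot test setup. The sketch below is illustrative rather than our exact code: the class name, the chatId value, and the use of @MockBean are assumptions, while SearchService, ChatService, and the other types appear in the test shown next.

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.test.mock.mockito.MockBean;

@SpringBootTest
public class ConversationAgentTest {

    @Autowired
    private EvaluatorAgent evaluatorAgent;    // the judge agent defined above

    @Autowired
    private ChatService chatService;          // application code under test

    @MockBean
    private SearchService searchService;      // tool dependency mocked in the tests

    private MajorityVoteEvaluator evaluator;  // built in init() above
    private final String chatId = "test-chat-id";  // placeholder value

    // init() from the snippet above and the tests below go here
}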
Now let’s take a look at an example.
This test asks the conversation agent to search for something, but the tool that the conversation agent needs to call is mocked to throw an error. The conversation agent should then give the end user a response indicating that an error occurred during execution. This is the functionality we want to test.
@Test
public void Search_ThrowException_AgentRespondsToErrorMsg() throws Exception {
    String messageStr = "Does the store have any pizza?";
    ClientRequestMessage message = ClientRequestMessage.builder().message(messageStr).build();

    // Force the search tool to fail so the conversation agent has to handle the error
    when(searchService.search(anyString(), anyString(), anyString()))
            .thenThrow(new RuntimeException("could not get product information from the store"));

    AgentResponseMessage agentMsg = chatService.processMessage(message, chatId);
    String agentResponse = agentMsg.getMessage();

    // Ask the judge: does the agent's response acknowledge that an error occurred?
    assertTrue(evaluator.evaluate("An error should have occurred",
            agentResponse,
            """
            Example 1:
            Condition: There should not have been any indication of errors
            Result: I couldn't find any products due to an issue with the search functionality.
            Conclusion: false
            -----
            Example 2:
            Condition: There should not have been any indication of errors
            Result: Here are some products I found: a, b, c, etc. Do you want me to do anything else?
            Conclusion: true
            """
    ));
}
In this test, we create a message (messageStr) that gets passed to the conversation agent through the processMessage function, and the conversation agent uses the tools it has access to. When it calls the search function, which it should because it needs to search for pizza, it gets back an error. The conversation agent takes this tool-call error into account when creating its response for the end user. We then take the response from the conversation agent and evaluate it to pass or fail the test.
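The conversation agent's tool wiring is outside the scope of this post, but for readers wondering how the agent ends up calling searchService, a rough sketch of what such a langchain4j tool class might look like is shown below. The class name, tool description, and parameter names are assumptions based on the mocked call in the test above.

import dev.langchain4j.agent.tool.Tool;

public class SearchTools {

    private final SearchService searchService;

    public SearchTools(SearchService searchService) {
        this.searchService = searchService;
    }

    @Tool("Search the store for product information")
    public String search(String query, String storeId, String category) {
        // If searchService throws (as the mock does in the test above), the error makes
        // its way back to the conversation agent, which then explains the failure to the user.
        return searchService.search(query, storeId, category);
    }
}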
Conclusion
Using the llm-as-judge design gives us a reliable signal about what happened during the conversation agent’s execution by having another agent evaluate the response. These tests are most useful when we are updating our application, whether that means the tools the agent can call or the underlying LLM models, because they show us when the agent behaves in a way we don’t expect.
One downside to this kind of testing is that invoking an agent incurs costs, whether through OpenAI, Anthropic, or another provider, and if you run these tests often, that cost can add up quickly. Something we may look into in the future is using a model like llama3 locally to judge our unit tests so that we don't incur costs through another service - we'll blog about that when it happens, so come back soon!
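As a teaser for that follow-up, swapping the judge over to a local model could be as simple as changing the bean that backs the evaluator agent, for example with langchain4j's Ollama module. This is an untested sketch; the profile name, base URL, and model name are assumptions.

import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.ollama.OllamaChatModel;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Profile;

@Configuration
@Profile("local-judge")  // hypothetical profile for local test runs
public class LocalJudgeModelConfig {

    // Replaces the hosted advancedModel bean with a locally served llama3 model
    @Bean("advancedModel")
    public ChatLanguageModel localJudgeModel() {
        return OllamaChatModel.builder()
                .baseUrl("http://localhost:11434")  // default Ollama endpoint
                .modelName("llama3")
                .temperature(0.0)
                .build();
    }
}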
References:
langchain4j: https://docs.langchain4j.dev/
Spring Boot: https://spring.io/projects/spring-boot