In our previous articles, we explored why accuracy is essential in AI-driven software engineering and how hybrid techniques – like combining Reinforcement Learning (RL) with Code Execution (CE) – can push that accuracy beyond 95%. In this final installment, we’ll look at how these accuracy-focused advances are enabling “agents” that can plan, build, test, and refine software autonomously.
Vision for Autonomous Agents
Right now, most AI coding solutions act as assistants. They provide suggestions in the form of snippets or partial implementations that engineers must review and integrate manually. The leap to autonomy envisions AI handling entire user stories or features: gathering requirements, implementing functionality, writing and refining tests, debugging errors, and ultimately delivering a working component with minimal human intervention.
Realizing this kind of autonomy requires near-perfect accuracy. An autonomous agent that churns out faulty code would quickly erode developer trust and create more overhead than it eliminates. That’s why advanced verification steps – ranging from static analysis to sandboxed test execution – must be built into the development process. Rather than relying on human attention, the AI itself should be the first line of defense against bugs, edge-case failures, and regressions.
Tools that combine reinforcement learning with thorough code execution feedback can produce test suites that reach or exceed human-level coverage. With deeper integration, agentic AI solutions could even propose architectural changes or refactoring strategies, evaluate them in a secure sandbox, and refine the approach based on real-time data. The future of AI-driven development, therefore, is less about code suggestion and more about continuous self-correction.
Essential Building Blocks
Solutions aiming to become autonomous agents must systematically enhance the AI’s ability to self-verify and adapt through a combination of the following:
- Code Execution Infrastructure (Sandboxes): By running AI-generated code in an isolated environment, organizations protect their production systems while ensuring every code snippet is tested under realistic conditions. Sandboxed execution clarifies whether the AI’s output truly compiles, performs as expected, and integrates with dependencies.
- Advanced Verification Techniques: Static analysis, symbolic execution, and mutation testing help confirm that generated tests (and the underlying code) are robust. Mutation testing is especially valuable: it systematically injects minor “faults” to evaluate whether the AI’s tests can catch them, which makes it an excellent measure of test-suite thoroughness (a minimal illustration follows this list).
- Domain-Specific Data Refinement: Training or fine-tuning on high-quality, enterprise-specific data helps the AI model grasp domain nuances. This ensures that the AI’s output aligns with actual business logic and coding standards, rather than echoing generic or outdated open-source practices.
- Continuous Integration with CI/CD Pipelines: Incorporate AI-verification loops into the existing build process. By automatically running static analyzers, coverage tools, and test suites on each AI-generated snippet, teams lock in higher accuracy without adding burdensome manual gates.
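To make the mutation-testing idea from the list above concrete, here is a minimal Java illustration. The class, method, and test names are hypothetical; in practice, tools such as PIT generate and run mutants automatically:

```java
// Production code under test.
public class Discount {
    // Orders of 100 items or more get 10% off.
    public static double apply(double price, int quantity) {
        if (quantity >= 100) {
            return price * 0.9;
        }
        return price;
    }
}
```

A mutation tool would inject a small fault, for example changing `>=` to `>`. A thorough test suite must contain a boundary case that “kills” this mutant:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class DiscountTest {
    @Test
    void exactlyOneHundredItemsGetsDiscount() {
        // Passes against the original code but fails against the ">" mutant,
        // so the injected fault is detected (the mutant is "killed"). A suite
        // without this boundary case would let the mutant survive, exposing a gap.
        assertEquals(90.0, Discount.apply(100.0, 100), 1e-9);
    }
}
```

The fraction of mutants a suite kills gives a quantitative measure of how thoroughly the generated tests exercise the code.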
Together, these pillars help an AI evolve from passively suggesting code to proactively validating it – raising the ceiling on how autonomously it can operate in real-world engineering workflows.
Diffblue’s Approach: Reinforcement Learning using Code Execution
Diffblue Cover is an AI solution that automatically writes unit tests for Java code. Unlike many code generation tools that rely on LLMs, Cover uses reinforcement learning, allowing it to “learn by doing”: it generates tests, gets feedback on their effectiveness, and updates its strategy. The AI receives a positive reward when a generated test meets specific quality criteria (e.g., it compiles, passes all its assertions, increases code coverage, and contains no smells), and is penalized for undesirable outcomes (such as syntax errors, failing tests, or trivial assertions). Over many iterations, the model learns to prefer generating test cases that maximize the reward – effectively tuning it to produce higher-quality tests.

This approach has been shown to boost accuracy significantly. A Microsoft study applied RL with static code quality metrics (such as ensuring each test has proper assertions and no anti-patterns); the RL-optimized model outperformed the base model, reducing test anti-patterns and improving overall test quality by up to 23%. Notably, this RL-tuned model even outperformed a larger GPT-4 model on those quality metrics, despite being trained on a smaller base. The key takeaway is that RL can drive an AI not just to generate any test, but the best test it can find for a given piece of code, by systematically exploring and evaluating many possibilities.
The RL component essentially “injects ground truth into the learning process”, as Diffblue’s researchers describe it, by rewarding the AI when its code actually works – thereby reducing hallucinations over time. By running each generated test, Diffblue’s system gets definitive feedback on whether the test is valid and how the code behaves. This is akin to how a human developer might write a test, run it to see if it fails, and then refine it – except Cover does this autonomously, at machine speed (up to 1,000 times per second) and at scale. Instead of just guessing what a correct test might look like, the AI empirically verifies each guess by executing it. This tight feedback loop is what enables Diffblue Cover to reach very high levels of accuracy.
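As an illustration only – Diffblue has not published Cover’s internals, so every name below is hypothetical – the reward logic described above might look something like this minimal sketch:

```java
// Hypothetical sketch of the reward signal in a generate-execute-score loop;
// not Diffblue's actual implementation.
public class RewardSketch {

    // Illustrative stand-in for the ground-truth feedback that a sandboxed
    // execution of one candidate test provides.
    record ExecutionResult(boolean compiles, boolean passes,
                           double coverageGain, int smellCount) {}

    // Score one candidate test, mirroring the criteria described above:
    // positive when it compiles, passes, and adds smell-free coverage;
    // negative for compile errors or failing assertions.
    static double reward(ExecutionResult r) {
        if (!r.compiles()) return -1.0;   // compile/syntax errors: hardest penalty
        if (!r.passes())   return -0.5;   // failing assertions: penalized
        return r.coverageGain()           // reward newly covered branches
             - 0.1 * r.smellCount();      // penalize trivial or smelly tests
    }

    public static void main(String[] args) {
        // Compiles, passes, covers 12% new branches, no smells -> 0.12
        System.out.println(reward(new ExecutionResult(true, true, 0.12, 0)));
        // Does not compile -> -1.0
        System.out.println(reward(new ExecutionResult(false, false, 0.0, 0)));
    }
}
```

Over many iterations, a generator guided by such a signal keeps the candidates that earn the highest rewards and shifts its policy toward the patterns that produce them.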
How high? In internal evaluations, Diffblue reports that Cover achieves over 95% accuracy in generating unit tests, with very high compile-and-pass rates. In a head-to-head study we conducted comparing Diffblue Cover with GitHub Copilot for Java, Cover’s tests achieved nearly perfect accuracy (~99% success rate), while Copilot’s tests were only about 65% accurate.
This gap underscores the power of combining RL with code execution: Cover isn’t just predicting tests from patterns; it’s learning from actual outcomes. Every time it writes a test, it runs it against the code, ensuring that the tests it ultimately delivers do what they’re supposed to do (and if they don’t, the system knows and fixes them before presenting them to the user). As a result, developers can trust Cover’s tests to meaningfully exercise the target code with minimal or no manual tweaking.
Overcoming Engineering Challenges
To deliver this level of accuracy, we had to solve a very concrete engineering problem: executing arbitrary project code in a safe, automated way. Unit tests, by definition, run against your codebase, which may have complex dependencies (external libraries, frameworks) and intricate build processes. We engineered a robust Java sandbox (or “code executor”) that can take a real-world Java project, compile it, and run tests on it – all without human intervention. This involves automatically understanding and invoking the project’s build system (for example, Maven or Gradle for Java) to resolve dependencies and build the code.
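To give a flavor of what that automation involves, here is a simplified sketch – not Diffblue’s implementation – that detects the build system from well-known marker files and invokes it in the project’s working directory. A production executor would additionally resolve dependencies, capture structured test reports, and enforce sandbox limits such as no network access and bounded CPU and memory:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Simplified build-system detector/runner for a Java project sandbox.
public class ProjectBuilder {

    // Detect the build tool from marker files in the project root.
    static String[] compileCommand(Path projectRoot) {
        if (Files.exists(projectRoot.resolve("pom.xml"))) {
            return new String[] {"mvn", "-q", "test-compile"};
        }
        if (Files.exists(projectRoot.resolve("build.gradle"))
                || Files.exists(projectRoot.resolve("build.gradle.kts"))) {
            return new String[] {"gradle", "testClasses"};
        }
        throw new IllegalStateException("Unsupported build system in " + projectRoot);
    }

    // Compile the project and its tests without human intervention.
    static int compile(Path projectRoot) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(compileCommand(projectRoot));
        pb.directory(projectRoot.toFile());
        pb.inheritIO();               // surface build output for diagnostics
        return pb.start().waitFor();  // non-zero exit code signals build failure
    }
}
```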
Diffblue chose to focus on Java first for several reasons. Java is widely used in enterprise environments, and its broadly standardized build/test ecosystem (JUnit for testing, Maven/Gradle for builds) lends itself to automation. By mastering one language deeply, including all its tooling quirks, we can deliver a product with high accuracy.
One key challenge lies in replicating this approach for other programming languages, as each language requires a dedicated code executor with the same depth of integration. This requirement largely explains why other AI coding tools have been slow to match Diffblue’s accuracy in Java unit test generation: while many competitors acknowledge the need to improve accuracy, they often opt for partial solutions. For example, GitHub and Amazon employ lightweight but low-precision type checking and source-based static analysis to increase the accuracy of their tools – better than nothing, but insufficient (GitHub Copilot showed 65% accuracy in our study). Ultimately, only code execution offers a realistic, scalable solution.
Another challenge is performance and scalability. Running a complete build and test cycle for a large project can be time-consuming. Diffblue addresses this with a combination of optimizations, such as instrumenting tests to focus on one target method at a time and using isolated classloading to avoid repeated overhead (sketched below). The result is that Cover can generate and verify tests far faster than a human could: one case study found that Diffblue’s tool could update or create tests roughly 250 times faster than manual effort, and 10x faster than a Copilot-assisted developer, without requiring human review. This combination of speed and accuracy is crucial for integrating into continuous integration (CI) pipelines in enterprise environments: for large codebases, an AI like Cover can generate reliable tests in minutes or hours, where a team of engineers might take weeks.
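To illustrate the isolated-classloading idea – again a sketch under assumed names, not Cover’s implementation – loading each verification pass through a fresh, disposable `URLClassLoader` prevents static state from one test run leaking into the next, without restarting the JVM between runs:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;

// Sketch: give each verification pass its own throwaway classloader.
public class IsolatedRunner {

    static void runIsolated(Path classesDir, String testClassName) throws Exception {
        URL[] classpath = { classesDir.toUri().toURL() };
        // Parenting to the platform loader means application classes are NOT
        // shared between runs, so static fields start fresh each time.
        try (URLClassLoader loader =
                 new URLClassLoader(classpath, ClassLoader.getPlatformClassLoader())) {
            Class<?> testClass = Class.forName(testClassName, true, loader);
            // A real executor would hand this class to a test engine here
            // (e.g. the JUnit Platform Launcher); omitted for brevity.
            System.out.println("Loaded " + testClass.getName() + " in isolation");
        } // closing the loader lets its classes be garbage-collected
    }
}
```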
A Blueprint for High-Accuracy AI
Instead of relying solely on neural networks, which can generate unreliable results, Diffblue Cover learns from actual code behavior, reducing errors and improving reliability. It shows how combining advanced AI with proven software engineering methods can create useful, real-world tools for developers.