In our first article, we discussed the critical role of accuracy in AI-driven software engineering. Conventional large language models (LLMs) can generate code that looks correct but often lacks true reliability, making it unsuitable for mission-critical enterprise applications. The need for higher accuracy has led many researchers to question whether pure neural network approaches can ever attain the consistent correctness that enterprise environments require. This second installment explores the challenges and potential solutions for achieving higher accuracy in AI-driven software engineering.
The Limits of LLM-Only Approaches
Today’s LLM-based coding assistants (like GitHub Copilot, Amazon Q Developer, or Devin) demonstrate both the power and the limits of pure neural approaches. They excel at pattern recognition – for instance, scanning a codebase and writing a new function in a similar style, or integrating known API calls correctly. They can rapidly produce boilerplate code or suggest solutions that would take a human considerable time to recall or write from scratch. This strength comes from their training: by ingesting millions of code examples, LLMs become very good at guessing what “looks right” in a given context.
However, that very same mechanism is also why they can’t always be trusted to get the details right. Some of their limitations include:
- Hallucinations: Because LLMs rely on statistical patterns, they sometimes fabricate code, invent non-existent library calls, or reference API endpoints that simply do not exist (see the illustrative snippet after this list).
- No Inherent Correctness Verification: An LLM’s internal architecture cannot execute or test its own output. It provides best guesses based on textual patterns, not actual runtime outcomes.
- Lack of Semantic Grounding: Even though LLMs appear to “understand” code, their comprehension is probabilistic, not grounded in the formal rules and behaviors of programming languages.
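To make the first point concrete, here is a small, hypothetical Java illustration (the class, dates, and scenario are invented for this example): the suggested call matches the surrounding pattern and reads naturally, yet names a method that does not exist on java.time.LocalDate.

```java
import java.time.LocalDate;

public class DueDateExample {

    static LocalDate dueDate(LocalDate orderedOn) {
        // A pattern-matching model could plausibly suggest:
        //   return orderedOn.addDays(3);   // looks idiomatic, but LocalDate has no addDays()
        // The suggestion does not compile; the real API is plusDays():
        return orderedOn.plusDays(3);
    }

    public static void main(String[] args) {
        System.out.println(dueDate(LocalDate.of(2025, 1, 1))); // prints 2025-01-04
    }
}
```

The specific method is beside the point; what matters is that plausibility and correctness are judged by different mechanisms, and a statistical model only optimizes for the former.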
An LLM has no innate sense of the intent behind the code beyond what it infers probabilistically. It doesn’t truly “know” the rules of the runtime environment or the exact semantics of every API; it approximates them. As a result, every output from an LLM-based coder needs human scrutiny and testing. If an AI tool saves you typing time but still demands that you meticulously debug its work, the net productivity gain shrinks dramatically. That need for human involvement remains the major obstacle to scalability.
Hybrid AI Approaches for Higher Accuracy
The key strategies for overcoming the limitations of LLMs are hybrid approaches that combine generative AI with other techniques that can verify code correctness. They include:
- Reinforcement Learning (RL): Reinforcement learning is a machine-learning paradigm in which an agent learns by receiving feedback (rewards or penalties) on its actions. In software engineering, an AI model can generate a snippet of code, execute it (or test it), and then receive a score based on the results: did the code compile, pass all test cases, or produce the correct output? This loop feeds real-world correctness signals into the model’s learning process. Over multiple iterations, the AI adjusts its parameters to maximize its reward, effectively internalizing best coding practices and minimizing hallucinations.
- Integrating Code Execution (CE): Code execution is the ultimate arbiter in the development cycle. By running the generated code in a sandbox (a code executor), the AI can see, unambiguously, whether a snippet succeeds or fails at a task. This enables automated debugging: the system can discard or revise failing candidates and move toward a correct version. Practically, this might involve generating multiple solution candidates, executing each against unit tests, and selecting (or refining) the best-performing outputs, as in the sketch after this list. Because many enterprise codebases already have test suites, integrating an AI that runs those tests autonomously is straightforward in principle (though still complex in practice).
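Below is a minimal sketch of that “generate, execute, select” idea. The `Candidate` record and the `passesTests` oracle are hypothetical stand-ins, not part of any particular tool; in a real system the oracle would compile each candidate in a sandbox and run the project’s unit tests against it.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

public class CandidateSelector {

    /** A generated code candidate; in practice `source` would be compiled and run in a sandbox. */
    record Candidate(String id, String source) {}

    /**
     * Keep the first candidate that actually passes the tests.
     * Execution, not textual plausibility, is the acceptance criterion.
     */
    static Optional<Candidate> selectFirstPassing(List<Candidate> candidates,
                                                  Predicate<Candidate> passesTests) {
        return candidates.stream()
                         .filter(passesTests)
                         .findFirst();
    }

    public static void main(String[] args) {
        // Toy oracle: pretend only candidates whose source contains "return" compile and pass.
        Predicate<Candidate> toyOracle = c -> c.source().contains("return");
        List<Candidate> candidates = List.of(
                new Candidate("a", "int add(int x, int y) { x + y; }"),       // would not compile
                new Candidate("b", "int add(int x, int y) { return x + y; }"));
        System.out.println(selectFirstPassing(candidates, toyOracle));        // Optional[Candidate[id=b, ...]]
    }
}
```

In practice the oracle is the expensive part: compiling, sandboxing, and running the test suite is exactly where the engineering effort discussed later in this article goes.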
Why the Hybrid Approach Matters
Relying purely on neural networks to perform high-accuracy code analysis creates a kind of Münchhausen trilemma: the networks would already need to be accurate just to verify their own correctness. It may not even be desirable to use neural networks for this, as they are computationally highly inefficient at it. For instance, training an LLM to do arithmetic is far less practical than having it generate a small program that computes the result directly. Consequently, combining optimized analysis techniques (e.g., static or symbolic analysis) with neural networks is a more sustainable way to achieve the necessary level of accuracy in code analysis.
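As a toy illustration of the arithmetic point (the numbers are arbitrary), a hybrid system would not ask the model to compute the product inside its weights; it would have the model emit a trivial program and let an executor produce the answer:

```java
public class Arithmetic {
    public static void main(String[] args) {
        // The executor, not the language model, guarantees this result.
        System.out.println(37L * 481L); // 17797
    }
}
```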
Combined, RL and CE transform code generation into an iterative, self-correcting process. Instead of passively relying on an LLM’s “one-shot” guess, the AI actively tests each hypothesis and learns from its mistakes. This drastically reduces hallucinations and lifts accuracy closer to the 95% mark. As we will see in the next section, real-world evidence from solutions like Diffblue Cover shows how effective hybrid approaches can be.
The Importance of Code Execution for Building Coding Agents
Diffblue Cover currently achieves >95% accuracy on the unit test generation task (in our Copilot study, even >99%), thanks to its use of a Code Execution-based approach combined with Reinforcement Learning (RL).
RL is automated learning from feedback on the output. Its purpose is to inject ground truth into the learning process: reinforcement learning produces highly accurate systems when the feedback reflects correctness. We achieve this through Code Execution, which shows exactly what the code actually does.
Here’s how it works in simplified terms (a schematic code sketch follows the list):
- Initial Code Generation: A baseline test candidate is created for each method after analysing your project’s bytecode. Reinforcement learning then selects the inputs that exercise every testable pathway and the best inputs for writing mocks and assertions.
- Execution and Feedback: The proposed tests are then executed against the actual code. If tests fail or do not compile, Cover receives a clear signal that its proposal is incorrect or incomplete.
- Reinforcement Learning Loop: Based on the failures, Cover refines its approach, gradually reaching a test that reliably passes. Over hundreds of computationally efficient iterations, the AI “learns” how to construct high-quality, logically consistent unit tests.
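Schematically, the loop described above looks something like the sketch below. This is only an illustration of the general generate-execute-refine pattern, not Diffblue Cover’s implementation; the `execute` and `refine` functions and the `TestOutcome` type are invented for the example, and the “reward” is simply whether the proposed test compiles and passes.

```java
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.Function;

public class TestRefinementLoop {

    /** Result of executing one proposed test against the real code under test. */
    enum TestOutcome { COMPILE_ERROR, FAILED, PASSED }

    /**
     * Generate-execute-refine: keep revising a candidate test using execution
     * feedback until it passes or the iteration budget is exhausted.
     *
     * @param initialTest   baseline candidate (e.g. derived from analysing the code under test)
     * @param execute       compiles and runs the candidate, returning the observed outcome
     * @param refine        revises the candidate in response to the last outcome
     * @param maxIterations computational budget for the loop
     */
    static Optional<String> refineUntilPassing(String initialTest,
                                               Function<String, TestOutcome> execute,
                                               BiFunction<String, TestOutcome, String> refine,
                                               int maxIterations) {
        String candidate = initialTest;
        for (int i = 0; i < maxIterations; i++) {
            TestOutcome outcome = execute.apply(candidate);      // ground truth comes from real execution
            if (outcome == TestOutcome.PASSED) {
                return Optional.of(candidate);                   // "reward" achieved: a reliably passing test
            }
            candidate = refine.apply(candidate, outcome);        // use the failure signal to revise the proposal
        }
        return Optional.empty();                                 // no passing test within the budget
    }
}
```

The essential property is that the accept/reject decision comes from running the code, not from the model’s own confidence.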
Executing the code of real projects is hard: the entire project and its dependencies need to be available, which in turn requires understanding the project’s build system, and so on. This is a complex, language-specific engineering problem that Diffblue has solved for Java.
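To see why, consider the most naive possible executor: shell out to the build tool and read the exit code, as sketched below. Even this assumes the project uses Maven, that mvn is installed and on the PATH, and that every dependency resolves; a production-grade executor must additionally detect the build system, isolate the sandbox, and recover per-test results rather than a single pass/fail bit.

```java
import java.io.File;
import java.io.IOException;

public class NaiveTestExecutor {

    /** Runs the project's test suite via Maven and reports whether everything passed. */
    static boolean runTests(String projectDir) throws IOException, InterruptedException {
        Process build = new ProcessBuilder("mvn", "-q", "test")
                .directory(new File(projectDir))   // run inside the project checkout
                .inheritIO()                       // stream build output to the console
                .start();
        return build.waitFor() == 0;               // exit code 0 means the build and tests succeeded
    }

    public static void main(String[] args) throws Exception {
        String projectDir = args.length > 0 ? args[0] : ".";
        System.out.println(runTests(projectDir) ? "tests passed" : "tests failed or did not run");
    }
}
```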
Our internal benchmarks have shown that, once integrated into a continuous integration (CI) pipeline, Cover can handle test generation for large-scale Java projects with minimal human oversight. This stands in stark contrast to purely neural approaches, which might generate plausible test syntax but fail when run, or omit crucial edge cases.
Practical Challenges
Despite its potential, implementing Code Execution + Reinforcement Learning at scale is not easy. Below are some of the challenges we have encountered, and addressed, at Diffblue.
- Engineering Complexity: Building robust code executors is labor-intensive. It requires deep expertise in compilers, build systems, and runtime environments, which many AI startups lack.
- Computational Overhead: Running code repeatedly for RL feedback is time-consuming, particularly in large projects with an extensive suite of tests and dependencies. A high-performance infrastructure is necessary to handle the load efficiently.
- Edge Cases and Toolchain Diversity: Code execution must consider different frameworks, dependencies, and project structures.
- Maintenance and Continuous Learning: Reinforcement learning systems don’t automatically remain accurate when projects evolve. Therefore, continuous training and updates are necessary, as are stable budgets and specialized staff.
For enterprises that recognize the value of reliable AI, bridging LLMs with RL and CE is the only viable path to the levels of accuracy that can deliver breakthrough productivity gains. This approach directly tackles the trust issue by allowing AI to correct its own hallucinations instead of relying on human intervention. In the next article, we’ll explore how Diffblue uses reinforcement learning and code execution to virtually eliminate hallucinations in unit test generation.