In the four months since OpenAI took the world by storm with its launch of ChatGPT, it has reached over 100M users and caused industries and schools alike to rethink their attitude to AI. It’s since become clear that many organizations had similar Large Language Models (LLMs) in the works, but for one reason or another had generally been more reticent about releasing them. Love them or loathe them, it looks like mainstream use of generative AI tools is here to stay.
GPT-4: A black box
GPT-4, the latest iteration of the Large Language Model (LLM) on which ChatGPT is based, was released earlier this month and it’s undoubtedly a step forward in terms of capability and the quality of the output it can produce. The model is now multimodal: it can accept image inputs such as photos and diagrams alongside text, though its output remains text only for now.
As a company building generative AI for code, we’re always excited to see such rapid advancement in this space. There is no doubt that GPT-4 does some amazingly clever things and may well change the technology landscape forever. But does that mean we’re really any closer to a ‘general’ AI solution that’s the best tool for every job?
We know comparatively little about how GPT-4 actually works – a major shift in approach for OpenAI. The company published far more technical detail about earlier versions, but has chosen not to share what data GPT-4 was trained on, how many parameters it uses (although there are rumours of an increase to 1 trillion), what safeguards and guardrails are in place, and so on. All we can really do is judge the model on what it produces. And despite the advances that have been made, key limitations found in earlier versions remain.
As OpenAI themselves put it:
“Despite its capabilities, GPT-4 has similar limitations as earlier GPT models. Most importantly, it still is not fully reliable (it “hallucinates” facts and makes reasoning errors.)”
The hallucination problem
In a recent example, ChatGPT recommended a company called OpenCage as the provider of an API that turns a mobile phone number into the location of the phone. No such API exists, leading to much frustration among ChatGPT users and a real headache for OpenCage. OpenCage co-founder Ed Freyfogle pointed out in his blog: “The key difference is that humans have learned to be sceptical when getting advice from other humans, for example via a video coding tutorial. It seems though that we haven’t yet fully internalized this when it comes to AI in general or ChatGPT specifically. The other key difference is the sheer scale of the problem. Bad tutorial videos got us a handful of frustrated sign-ups. With ChatGPT the problem is several orders of magnitude bigger.”
Hallucination is endemic to LLMs like GPT-4. Each is ‘trained’ on a specific dataset (these days, essentially the internet) and uses that data, via a model consisting of billions (or maybe even trillions) of parameters, to ‘transform’ inputs into what it considers to be the most likely related output. Such systems are intrinsically dependent on the inputs given and how they are trained. GPT-4 now also includes a layer of reinforcement learning from human feedback to refine the responses it generates (surely somewhat ironic, given some of the claims being made!).
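To see why that matters, here’s a minimal sketch in Java of what “most likely output” means in practice. The class and the token scores are invented for illustration – this is not OpenAI’s implementation – but the principle holds: the model picks the continuation it scores as most probable, whether or not it happens to be true.

```java
import java.util.Map;

// Illustrative only: a toy "language model" that always picks the
// highest-scoring next token. Real LLMs work over billions of
// parameters, but the output is still the most statistically
// plausible continuation, not a verified fact.
public class GreedyCompletion {

    // Hypothetical scores for possible continuations of a prompt like
    // "An API that locates a phone number is provided by ...".
    static final Map<String, Double> NEXT_TOKEN_SCORES = Map.of(
            "OpenCage", 0.41,            // plausible-sounding, but wrong
            "no public provider", 0.33,
            "your mobile carrier", 0.26
    );

    static String mostLikelyNextToken(Map<String, Double> scores) {
        // Pick the token with the highest score: plausibility, not truth.
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();
    }

    public static void main(String[] args) {
        System.out.println(mostLikelyNextToken(NEXT_TOKEN_SCORES)); // prints "OpenCage"
    }
}
```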
This key limitation is one of the main reasons why LLMs are not always the best approach to automatically writing code.
Another way: reinforcement learning
An alternative type of AI takes an unsupervised approach, based on reinforcement learning. At a simplistic level, a model is given the parameters of a ruleset to work within and teaches itself each time it runs: it tries candidate options in turn, compares each against the best result found so far, and continues until it arrives at the best solution it can find to the problem. Perhaps the most famous example of reinforcement learning success is Google’s AlphaGo project.
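For a rough sense of what that trial-and-error loop looks like, here is a minimal sketch – not AlphaGo’s algorithm and not Cover’s, just an invented example with a made-up reward function – of an agent that tries actions, observes the results, and converges on the best one.

```java
import java.util.Random;

// A minimal reinforcement-learning-style loop: try actions, observe a
// reward, keep a running estimate of each action's value, and converge
// on the best one. The "environment" here is invented for illustration.
public class TrialAndError {

    static final int ACTIONS = 4;
    static final Random RNG = new Random(42);

    // Hypothetical environment: action 2 is secretly the best choice.
    static double reward(int action) {
        double[] trueValue = {0.2, 0.5, 0.9, 0.4};
        return trueValue[action] + RNG.nextGaussian() * 0.1; // noisy feedback
    }

    public static void main(String[] args) {
        double[] estimate = new double[ACTIONS];
        int[] pulls = new int[ACTIONS];

        for (int step = 0; step < 1000; step++) {
            // Mostly exploit the best estimate so far, occasionally explore.
            int action = RNG.nextDouble() < 0.1
                    ? RNG.nextInt(ACTIONS)
                    : argMax(estimate);
            double r = reward(action);
            pulls[action]++;
            // Incremental average: the agent "teaches itself" from each run.
            estimate[action] += (r - estimate[action]) / pulls[action];
        }
        System.out.println("Best action found: " + argMax(estimate));
    }

    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) best = i;
        }
        return best;
    }
}
```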
Reinforcement learning can be a more effective approach than LLMs when the “most likely” output simply isn’t good enough. For example, our product Diffblue Cover uses it to write Java unit tests completely autonomously.
GPT-4 has no understanding of the Java language: its ‘knowledge’ is simply based on learned text patterns. That means when it comes to generating Java unit tests to catch regressions, it doesn’t know whether the code it suggests will even compile, let alone whether or not it’s a good test for the method in question. Often neither will be true. Human supervision and review are essential.
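As a purely hypothetical illustration (the class and method names below are invented), an LLM can quite happily suggest a test like this – it reads plausibly, but it calls a method that doesn’t exist, so it won’t even compile:

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

// Hypothetical LLM-suggested test for an imagined PriceCalculator class.
// It looks reasonable, but in this imagined codebase PriceCalculator has
// no applyDiscountCode(String) method, so this test does not compile --
// and without a human (or a tool that compiles and runs it), nobody notices.
class PriceCalculatorTest {

    @Test
    void appliesDiscount() {
        PriceCalculator calculator = new PriceCalculator();
        double total = calculator.applyDiscountCode("SPRING10"); // no such method
        assertEquals(90.0, total, 0.001);
    }
}
```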
By contrast, Cover understands what a unit test is and how Java works. It takes compiled Java code, evaluates it against its rulesets and writes a unit test. It then runs that test against the method to measure how good it is, looking at coverage and other qualities, predicts the changes that would make the test better, and runs and checks it again. It repeats this cycle until it arrives at the best test(s) for that particular method – tests that exercise all the possible pathways through the method. All in a matter of seconds.
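Conceptually – and this is only a sketch, not Diffblue’s actual code; the CandidateTest type and the coverage and mutation helpers are stand-ins – the loop looks something like this:

```java
import java.util.List;

// Conceptual sketch of a search loop that improves a candidate test by
// running it and measuring coverage. All names here are invented
// stand-ins, not Diffblue Cover's API.
public class TestSearchLoop {

    record CandidateTest(String source) {}

    // Stand-in: a real tool would compile the test, execute it against
    // the method under test, and return the coverage it achieves.
    static double runAndMeasureCoverage(CandidateTest test) {
        return Math.min(1.0, test.source().length() / 200.0); // placeholder score
    }

    // Stand-in: propose small changes to the test (new inputs, extra assertions).
    static List<CandidateTest> mutate(CandidateTest test) {
        return List.of(
                new CandidateTest(test.source() + "\n// try another input"),
                new CandidateTest(test.source() + "\n// assert on another field")
        );
    }

    static CandidateTest searchForBestTest(CandidateTest seed, int iterations) {
        CandidateTest best = seed;
        double bestScore = runAndMeasureCoverage(best);
        for (int i = 0; i < iterations; i++) {
            for (CandidateTest next : mutate(best)) {
                double score = runAndMeasureCoverage(next);
                if (score > bestScore) {   // keep only candidates that are
                    best = next;           // measurably better
                    bestScore = score;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        CandidateTest seed = new CandidateTest("// @Test void testFoo() { ... }");
        System.out.println(searchForBestTest(seed, 10).source());
    }
}
```

The crucial point is that every candidate is actually run and measured, so nothing unverified survives the loop.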
Thanks to reinforcement learning, Cover will not “hallucinate” to fill a gap. Once Cover has written a test, it will compile and run it to make sure it passes and will be ready to catch regressions when a code change is made. What’s more, thanks to its understanding of the ‘rules’ of unit testing, Cover won’t write spurious tests for code that’s actually untestable.
These properties allow Cover to run completely autonomously. It needs no human interaction to check the tests it has written and so has the ability to scale massively and generate tests for an entire project from one simple command – something that is not possible with any of today’s LLM-based tools.
ChatGPT is pretty amazing and GPT-4 is a real step forward in how LLMs can be used, but it is clear that LLMs and the tools based on them are not yet the best answer to every problem.
To learn more about Diffblue Cover and how it can help you generate Java unit tests, you can try it now or contact us for a demo.