On 11 May, IBM Research released Project CodeNet, a curated collection of code examples to help “AI for code” applications. As far as I’m aware, this is the first large-scale polyglot dataset that is well labelled, making it useful for supervised learning.
The AI in Diffblue Cover
One thing we’re often asked at Diffblue is whether our product learns across codebases and customers in order to improve its test-generation capabilities. Diffblue Cover itself uses reinforcement learning to search the space of possible tests for a method. The question is whether that search can be improved by using predictions from supervised machine learning (e.g. neural networks).
Today, we do not extract examples from different customer codebases and aggregate them, because customers don’t want us to do that. This is why Diffblue Cover is software you install into your development environment: you bring our product to your code, and not the other way around.
This matters because supervised learning needs a large number of labelled examples to be effective. This is often a stumbling block to deploying neural network approaches: there just isn’t enough data. But some of our prospects and customers have very large Java codebases: one bank evaluating Cover mentioned that they have over 400 million lines of Java code. So there is scope for learning inside a single customer’s aggregate codebase, because there is enough data to make it possible.
IBM’s New Dataset Makes Supervised Learning Easier
In producing a curated dataset, IBM Research has solved the second part of the problem: labelling the large amount of data. The code is drawn from programming problem submissions in different languages. Programming problems are sometimes used as a form of candidate testing, and in some cases recreationally in coding challenges. Candidates submit solutions to a coding problem and are graded and given feedback. In the CodeNet dataset, there is a history of submissions from the first one through to the accepted solution, so you can see how the code evolved.
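To make the idea of a labelled submission history concrete, here is a minimal sketch of how such a history could be turned into training pairs of a rejected attempt and the solution that was eventually accepted. The `Submission` record, its field names, and the status strings are hypothetical and chosen for illustration; they are not CodeNet's actual metadata format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    # All field names are hypothetical, chosen for illustration only.
    problem_id: str
    user_id: str
    timestamp: int  # any sortable submission time
    status: str     # assumed labels, e.g. "Accepted" or "Wrong Answer"
    code: str


def fix_pairs(submissions):
    """Pair each rejected attempt with the same user's first accepted solution."""
    history = defaultdict(list)
    for s in submissions:
        history[(s.problem_id, s.user_id)].append(s)

    pairs = []
    for attempts in history.values():
        attempts.sort(key=lambda s: s.timestamp)
        accepted = next((s for s in attempts if s.status == "Accepted"), None)
        if accepted is None:
            continue  # this user never solved the problem
        for s in attempts:
            if s.timestamp < accepted.timestamp and s.status != "Accepted":
                pairs.append((s.code, accepted.code))
    return pairs


SAMPLE = [
    Submission("p1", "u1", 1, "Wrong Answer", "print(1)"),
    Submission("p1", "u1", 2, "Accepted", "print(2)"),
    Submission("p1", "u2", 1, "Wrong Answer", "print(0)"),
]
```

Calling `fix_pairs(SAMPLE)` keeps only the history that ends in an accepted solution, so the second user's unresolved attempt is dropped.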
There are over 4,000 programming problems, 14 million code samples and 500 million lines of code in the CodeNet dataset. Because the problem statement is the same regardless of programming language, accepted solutions to the same problem can be treated as functionally equivalent across languages. This makes the dataset potentially useful for automated code translation as well as automated refactoring or debugging tools. IBM reports that models it trained using the dataset reduced the time needed to refactor 3,500 Java files from one year to just four weeks.
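As a rough illustration of why cross-language equivalence is useful, the sketch below groups accepted solutions by problem and emits every pair written in two different languages, which is the kind of parallel data a translation model would need. The `(problem_id, language, code)` tuple layout and the sample values are assumptions made for this example, not the dataset's real schema.

```python
from collections import defaultdict
from itertools import combinations


def translation_pairs(accepted):
    """Build cross-language pairs from accepted solutions.

    `accepted` is an iterable of (problem_id, language, code) tuples;
    this layout is a hypothetical one used only for illustration.
    """
    by_problem = defaultdict(list)
    for problem_id, language, code in accepted:
        by_problem[problem_id].append((language, code))

    pairs = []
    for solutions in by_problem.values():
        # Every unordered pair of solutions to the same problem...
        for (lang_a, code_a), (lang_b, code_b) in combinations(solutions, 2):
            # ...is kept only if the two are in different languages.
            if lang_a != lang_b:
                pairs.append(((lang_a, code_a), (lang_b, code_b)))
    return pairs


ACCEPTED = [
    ("p1", "Java", "class Main { /* ... */ }"),
    ("p1", "Python", "print(42)"),
    ("p2", "Java", "class Main { /* ... */ }"),
]
```

Here `translation_pairs(ACCEPTED)` yields a single Java and Python pair for `p1`; `p2` contributes nothing because it has a solution in only one language.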
The trade-off of this approach is that you essentially get code fragments rather than complete applications. One thing we have learned at Diffblue is that the way a program is structured makes a big difference to how challenging it is to unit test. The interactions between Java classes matter and, as IBM’s researchers note, what makes machine translation between programming languages difficult is that the context of the code matters a great deal.
We’re delighted to see this dataset from Ruchir Puri and the IBM Research team, and we’ll be taking a look to see if it can be used to improve our products and approach.