Code-generating AI — systems that can write code in different programming languages from a prompt — promises to reduce development costs while letting coders focus on creative, less repetitive tasks. But while research labs like OpenAI and Alphabet-backed DeepMind have developed powerful code-generating AI, many of the most capable systems aren’t available open source. For example, the training data for OpenAI’s Codex, which powers GitHub’s Copilot feature, has not been made public, preventing researchers from refining the model or studying aspects of it such as interpretability.
To remedy this, researchers at Carnegie Mellon University – Frank Xu, Uri Alon, Graham Neubig and Vincent Hellendoorn – developed PolyCoder, a model based on OpenAI’s GPT-2 language model that was trained on a 249 GB dataset of code in 12 programming languages. While PolyCoder doesn’t match the performance of top code generators in every task, the researchers claim that it writes C code with greater accuracy than any known model, including Codex.
“When GitHub’s Copilot was released last summer, it became clear that these very large code language models can be very useful in helping developers and increasing their productivity. But no model even close to this scale was publicly available,” the researchers told VentureBeat via email. “So [PolyCoder] started with Vincent just trying to see what was the largest model that could be trained on our lab server, which ended up being 2.7 billion parameters… and that model was a league ahead of other code-oriented models that were publicly available at the time.”
In machine learning, parameters are the parts of a model that are learned from historical training data. Generally speaking, the correlation between the number of parameters and sophistication has held up remarkably well.
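To make the idea concrete, here is a minimal, purely illustrative sketch — unrelated to PolyCoder’s actual code, and with all data invented for the demo. A toy “model” with just two parameters is fitted to historical data by gradient descent, the same basic mechanism that tunes a language model’s billions of parameters at vastly larger scale.

```python
# Toy illustration: the "model" here is just two parameters (w, b) for y = w*x + b.
# Training adjusts them to fit historical data, which is all that "learning
# parameters" means; a code model does the same with billions of parameters.

# Historical training data generated from the true relationship y = 3x + 1.
data = [(x, 3 * x + 1) for x in range(10)]

w, b = 0.0, 0.0          # the model's two parameters, initially untrained
learning_rate = 0.01

for _ in range(2000):    # gradient-descent training loop
    for x, y in data:
        error = (w * x + b) - y
        # Nudge each parameter in the direction that reduces the error.
        w -= learning_rate * error * x
        b -= learning_rate * error

print(round(w, 2), round(b, 2))  # parameters recovered from data: ~3.0, ~1.0
```

After training, the parameters encode the pattern in the data rather than anything a programmer wrote by hand — which is why withholding a model’s training data makes its behavior hard to study.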
Code Generation Survey
A growing number of organizations are exploring code-generating AI. At its Build developer conference in May 2021, Microsoft detailed a new feature in Power Apps that leverages OpenAI’s GPT-3 language model to help users choose formulas. Intel’s ControlFlag can autonomously detect errors in code. And Facebook’s TransCoder converts code from one programming language to another.
DeepMind recently announced AlphaCode, which the lab claims is among the first code generation systems competitive with human programmers. In contests hosted on Codeforces, a competitive programming platform, DeepMind reports that AlphaCode placed, on average, in the top 54.3% in recent contests with more than 5,000 participants.
But the Carnegie Mellon researchers note that “hardly anyone” outside of well-resourced companies can train models the size of AlphaCode or Codex. A 2020 study by startup AI21 Labs pegged the cost of training a text-generating model with 1.5 billion parameters (roughly half the size of PolyCoder) at between $80,000 and $1.6 million. The Codex model behind Copilot has 12 billion parameters.
“Big tech companies don’t publicly release their models, which really hampers scientific research and the democratization of these large code language models,” the researchers said. “To some degree, we hope our open source efforts will convince others to do the same. But overall, the community should be able to train these models on their own. Our model pushed the limit of what you can train on a single server – anything larger requires a cluster of servers, which dramatically increases the cost.”
Setbacks in code generation
While developing PolyCoder, the researchers also studied and compared the performance of different code-generating AI systems, including Codex (via its API). Interestingly, they found that models trained mostly on English text, with only a small amount of source code, turned out to be quite good at generating code – perhaps because they picked up code knowledge from resources such as Stack Overflow, the developer Q&A website, included in their training data.
“A promising approach to building strong code generation models seems to be training on diverse sources of programming knowledge, including code in a wide range of programming languages, but also code-related text from around the web,” the researchers said.
The researchers worry that models like PolyCoder could be incentivized to generate buggy programs, including ones with hard-to-detect security vulnerabilities. In the future, they fear, adversaries could “hide” malicious behavior in code generation models that surfaces only with the right prompt, such as a keyword (e.g., a company or product name), or upload vulnerable code that ends up being picked up by legitimate code generation models.
They suggest Codex-sized open source models as a way to combat this, since these would allow security researchers to probe such models for failure modes. As a side benefit, open sourcing would let developers customize models or “teach” them new programming languages through a process known as fine-tuning, which is far less expensive than training models from scratch.
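The intuition behind that cost argument can be sketched with a hedged toy example — invented for illustration and unrelated to PolyCoder’s real training setup. A two-parameter model is “pretrained” on one task, then adapted to a closely related one; starting from the pretrained parameters takes fewer gradient updates than starting over from scratch.

```python
# Toy sketch of fine-tuning vs. training from scratch. The "model" is a single
# parameter pair (w, b) for y = w*x + b; real code models have billions of
# parameters, but the principle is the same. All tasks and numbers are made up.

def train(data, w, b, lr=0.01, tol=0.01, max_steps=1_000_000):
    """Run SGD until every prediction is within tol, counting update steps."""
    steps = 0
    while max(abs((w * x + b) - y) for x, y in data) > tol and steps < max_steps:
        for x, y in data:
            error = (w * x + b) - y
            w -= lr * error * x
            b -= lr * error
            steps += 1
    return w, b, steps

old_task = [(x, 3 * x + 1) for x in range(10)]   # the original task
new_task = [(x, 3 * x + 2) for x in range(10)]   # a closely related new task

w0, b0, _ = train(old_task, 0.0, 0.0)            # "pretraining"
_, _, scratch_steps = train(new_task, 0.0, 0.0)  # new task, from scratch
w1, b1, finetune_steps = train(new_task, w0, b0) # new task, from pretrained start

print(scratch_steps, finetune_steps)  # fine-tuning needs fewer updates
```

Because the pretrained parameters already sit close to a good solution for the related task, far less compute is spent closing the gap — the same reason fine-tuning an open code model on a new programming language is cheaper than training one from nothing.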
“While the industry currently has many more computing resources, there is still plenty of room for innovation from academia and the research community, including building smaller, faster custom models that don’t rely on an internet connection, and useful applications such as bug detection and fixing, automatic code review, and more. These are tasks for which the research community has built promising prototypes that could really benefit from the power of these kinds of very large language models,” the researchers said. “Decentralized training, where multiple groups band together to train a large model jointly, could make a big difference here. Research grants and collaborations between companies and universities could also help.”