New technology often brings with it a bit of controversy. When considering stem cell therapies, self-driving cars, genetically modified organisms, or nuclear power plants, fears and concerns come to mind as much as, if not more than, excitement and hope for a brighter tomorrow. New technologies force us to evolve perspectives and establish new policies in hopes that we can maximize the benefits and minimize the risks. Artificial Intelligence (AI) is certainly no exception. The stakes, including our very position as Earth’s apex intellect, seem exceedingly weighty. Mathematician Irving Good’s oft-quoted wisdom that the “first ultraintelligent machine is the last invention that man need make” describes a sword that cuts both ways. It is not entirely unreasonable to fear that the last invention we need to make might just be the last invention that we get to make.
Artificial Intelligence and Learning
Artificial intelligence is currently the hottest topic in technology. AI systems are being tasked to write prose, make art, chat, and generate code. Setting aside the horrifying notion of an AI programming or reprogramming itself, what does it mean for an AI to generate code? It should be obvious that an AI is not just a normal program whose code was written to spit out any and all other programs. Such a program would need to have all programs inside itself. Instead, an AI learns from being trained. How it is trained is raising some interesting questions.
Humans learn by reading, studying, and practicing. We learn by training our minds with collected input from the world around us. Similarly, AI and machine learning (ML) models learn through training. They must be provided with examples from which to learn. The examples that we provide to an AI are referred to as the data corpus of the training process. The robot Johnny 5 from “Short Circuit”, like any curious-minded student, needs input, more input, and more input.
Learning to Program
A primary input that humans use to learn programming is a collection of example programs. These example programs are generally printed in books, provided by teachers, or found in various online samples or projects. Such example programs make up the corpus for training the student programmer. Students can carefully read through example programs and then attempt to recreate those programs or modify them to create different programs. As a student advances, they usually study increasingly complex programs and they start combining techniques discovered from multiple example programs into more complex patterns.
Just as humans learn to program by studying program code, an AI can learn to program by studying existing programs. Stated more precisely, the AI trains on a corpus of existing program code. The corpus is not stored within the AI model any more than the books studied by a human programmer are stored within that student. Instead, the corpus is used to train the model in a statistical sense. Outputs generated by the trained AI do not come from copies of programs in the corpus, because the trained AI does not contain those programs. The outputs should instead be generated from the statistical model of the corpus that has been trained into the AI system.
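To make "training in a statistical sense" a little more concrete, here is a toy sketch in Python. It is not how Codex or any production system works (those are large neural networks); it is a simple bigram model, assumed here purely for illustration, that records how often one token follows another. After training, the model holds only those counts, not the corpus lines themselves, and it generates output by sampling from the counts. All function names and the three-line corpus are made up for this example.

```python
import random
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count which token follows which across all corpus lines."""
    model = defaultdict(Counter)
    for line in corpus:
        # <s> and </s> are artificial start and end markers.
        tokens = ["<s>"] + line.split() + ["</s>"]
        for current, following in zip(tokens, tokens[1:]):
            model[current][following] += 1
    return model

def generate(model, rng, max_tokens=20):
    """Sample a token sequence by repeatedly drawing a likely next token."""
    token, output = "<s>", []
    for _ in range(max_tokens):
        followers = model[token]
        choices = list(followers)
        weights = [followers[choice] for choice in choices]
        token = rng.choices(choices, weights=weights)[0]
        if token == "</s>":
            break
        output.append(token)
    return " ".join(output)

# A tiny, made-up training corpus of code-like lines.
corpus = [
    "total = total + x",
    "count = count + 1",
    "total = count + x",
]

model = train_bigram_model(corpus)
sample = generate(model, random.Random(0))
print(sample)
```

The sampled line may be a new combination of tokens, or it may happen to coincide with a line from the corpus. With a corpus this small, coincidences are common, which is a useful reminder that a statistical model can still reproduce its training data.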
AI Systems that Generate Code
GitHub Copilot is based on the OpenAI Codex. It uses comments in the code of a human programmer as its natural language prompts. From these prompts, Copilot can suggest code blocks directly into the human programmer’s editor screen. The programmer can accept the code blocks, or not, and then test the new code as part of their program. The OpenAI Codex has been trained on a corpus of publicly available program code along with associated natural language text. Public GitHub repositories are included in that corpus.
Copilot documentation does claim that its outputs are generated from a statistical model and that the model does not contain a database of code. Even so, it has been found that code suggested by the AI model matches a code snippet from the training set about one percent of the time. One reason for this happening at all is that some natural language prompts correspond to a relatively universal solution. Similarly, if we were to ask a group of programmers to write C code for using binary trees, the results might largely resemble the code in chapter six of Kernighan & Ritchie because that is a common component in the training corpus for human C programmers. If accused of plagiarism, some of those programmers might even retort, “That’s just how a binary tree works.”
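As a small illustration of a "relatively universal solution," here is a minimal binary search tree sketch. It is written in Python for brevity rather than C, and it is not the Kernighan & Ritchie code; the class and function names are simply the conventional ones most programmers would reach for.

```python
class Node:
    """One node of a binary search tree."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(node, key):
    """Insert key into the subtree rooted at node and return the subtree root."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    # Equal keys are ignored, so the tree holds each key once.
    return node

def in_order(node):
    """Return the keys in ascending order."""
    if node is None:
        return []
    return in_order(node.left) + [node.key] + in_order(node.right)

root = None
for key in [5, 2, 8, 1, 9, 2]:
    root = insert(root, key)

print(in_order(root))  # prints [1, 2, 5, 8, 9]
```

There are only so many sensible ways to write the recursive insert, so independently written versions tend to look nearly identical.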
But [sometimes Copilot will recreate code _and comments_ verbatim](https://github.blog/2021-06-30-github-copilot-research-recitation/). Copilot has implemented a filter to detect and suppress code suggestions that match public code from GitHub. The filter can be enabled or disabled by the user. There are plans to eventually provide references for code suggestions that match public code from GitHub so that the user can look into the match and decide how to proceed.
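The details of Copilot's filter are not described here, so the following Python sketch is only one plausible way such a check could work, not GitHub's implementation. It assumes a hypothetical index of token windows drawn from public code and suppresses any suggestion that shares a window with that index. The function names, the window size, and the sample snippets are all invented for illustration.

```python
def token_windows(code, size):
    """Return every run of `size` consecutive whitespace-separated tokens."""
    tokens = code.split()  # splitting on whitespace ignores indentation changes
    return {
        " ".join(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
    }

def build_index(public_snippets, size):
    """Collect the token windows of all known public snippets."""
    index = set()
    for snippet in public_snippets:
        index |= token_windows(snippet, size)
    return index

def matches_public_code(suggestion, index, size):
    """True if the suggestion shares at least one token window with the index."""
    return not token_windows(suggestion, size).isdisjoint(index)

def filter_suggestions(suggestions, index, size, enabled=True):
    """Drop matching suggestions; pass everything through when disabled."""
    if not enabled:
        return list(suggestions)
    return [s for s in suggestions if not matches_public_code(s, index, size)]

# Hypothetical data: one public snippet and two candidate suggestions.
WINDOW = 6
public = ["def add(a, b):\n    return a + b  # sum two numbers"]
index = build_index(public, WINDOW)

suggestions = [
    "def add(a, b):\n        return a + b  # sum two numbers",  # verbatim apart from indentation
    "def mul(a, b):\n    return a * b",
]

kept = filter_suggestions(suggestions, index, WINDOW)
print(kept)
```

With the filter enabled, only the second suggestion survives; passing `enabled=False` returns both, mirroring the user-controlled switch described above. The same index could in principle record where each window came from, which is the kind of bookkeeping a reference feature would need.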
Is Learning Always Encouraged?
Even if it’s very rare that an AI model trained on a corpus of example code later generates code matching the corpus, we should still consider instances where the code should not have been used to train the model to begin with. There may be limits to when and which source code can be used for training AI models. Looking to the field of intellectual property, software can be protected by patent, copyright, trademark, and trade secret.
Patents generally offer the broadest protection. When a system or method practices one or more claims of a patent, it is said to infringe the patent. It does not matter who wrote the code, where it came from, or even if the programmer had no idea of the existence of the patent. Objections to software patents aside, this one is straightforward. If an AI model generates code that practices a patented method, it does not matter whether that code matches any existing code; there is a real risk of patent infringement.
Trade secret only applies in the highly pathological situation where the source code was misappropriated, or stolen, from the original owner who was acting to keep the source code secret. Obviously, stolen source code should not be used for any purpose including the training of AI models. Source code that has been published online by its author or owner is not being protected as a trade secret. Trademarks only really apply to names, logos, slogans, or other identifying marks associated with the software and not to the source code itself.
When considering AI model training, copyright concerns can be a little more nuanced. Copyright protection covers original works of authorship fixed in a tangible medium of expression including literary, dramatic, musical, and artistic works, such as poetry, novels, movies, songs, computer software, and architecture. Copyrights do not protect facts, ideas, systems, or methods of operation. Generally, studying copyrighted code and then writing your own code is not an infringement of the original copyright. Copyright does not protect the concepts or operations of computer code; it merely protects the specific expression or presentation of the code. Anyone else can write their own code that accomplishes the same thing without offending the copyright.
Copyright can protect computer code from being reproduced into other code that is substantially similar to the original. However, copyright does not protect against reading, studying, or learning from computer code. If the code has been published online, it is generally accepted that others are allowed to read it and learn from it. At one extreme, the concept clearly does not extend to reading the protected work with a photocopier to make a duplicate. So it remains to be seen if, and to what extent, the concept of being free to read will extend to “reading” the copyrighted work into an AI model.
Law and Ethics Controlling the Corpus
There is litigation pending against GitHub, Microsoft, and OpenAI alleging that the AI systems violate the legal rights of programmers who have posted code on public GitHub repositories. The lawsuits specifically point out that much of the public code was posted under one of several open-source licenses that require derivative works to include attribution to the original author, notice of that author’s copyright, and a copy of the license itself. These include the GPL, Apache, and MIT licenses. The lawsuits accuse defendants of training on computer code that does not belong to them without proper attribution, ignoring privacy policies, violating online terms of service, and offending the Digital Millennium Copyright Act (DMCA) provisions that protect against removal or alteration of copyright management information.
It is interesting to note, however, that the pending suits do not explicitly allege copyright violation. The defendants posit that any assertion of copyright would be defeated under the fair use doctrine. The facts do appear to parallel those in Authors Guild v. Google, where Google scanned in the contents of books to make them searchable online. Publishers and authors complained that Google did not have permission to scan in their copyrighted works. However, the court granted summary judgment in favor of Google, affirming that Google met the legal requirements of the fair use doctrine.
An interesting open project for the development of source code models is The Stack. The Stack is part of BigCode and maintains a 6.4 TB corpus of source code under permissive licenses. The project seems strongly rooted in ethical transparency. For example, The Stack allows creators to request removal of their code from the corpus.
Projects like Copilot, OpenAI Codex, and The Stack will likely continue to bring very interesting questions to light. As AI technology advances in its ability to suggest code blocks, or eventually write code itself, clarity around authorship rights will evolve. Of course, authorship rights may be the least of our worries.