According to a press release, the Joseph Saveri Law Firm has filed a lawsuit it calls “the first major step in the battle against intellectual-property violations in the tech industry arising from artificial-intelligence systems.”
To what extent this is a “battle” clearly depends on your position and interests within the industry. But what does appear clear is that this is among the first examples of a lawsuit challenging the practice of training large generative content models on publicly available data.
Overview of the Case
Let me preface this by saying that I’m not a lawyer. Nothing here is meant to provide legal advice, and you should speak with your own lawyer for any specific legal queries. As a journalist, though, I can share some analysis on what makes this case interesting.
This specific suit is against OpenAI, Microsoft, and GitHub and deals with their Copilot product. Copilot helps programmers write code faster by producing “code completions” as they work.
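To make the mechanism concrete, here is a hypothetical sketch of the kind of interaction at issue: a developer types a function signature and docstring, and a completion engine suggests a body. The completion shown is my own illustration, not actual Copilot output.

```python
# What the developer types: a signature and a docstring.
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards,
    ignoring case and non-alphanumeric characters."""
    # What a completion engine might suggest as the body:
    cleaned = "".join(c.lower() for c in s if c.isalnum())
    return cleaned == cleaned[::-1]


print(is_palindrome("A man, a plan, a canal: Panama"))  # True
print(is_palindrome("hello"))                           # False
```

The legal question raised by the suit is not whether such suggestions are useful, but where the model learned to produce them.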
The suit alleges that the companies use large amounts of open-source code from millions of developers in order to train their generative AI models, in violation of the license terms under which the code was released.
The main plaintiff in the case is Matthew Butterick, who is an open-source developer and also a lawyer. Butterick claims that his code was unfairly used to train AI models.
In the press release, Butterick says: “As a longtime open-source programmer, it was apparent from the first time I tried Copilot that it raised serious legal concerns, which have been noted by many others since Copilot was first publicly previewed in 2021.”
Potential Impacts
Although this lawsuit deals specifically with AI used to generate code, its outcome could have major implications for other industries, including visual AI.
As expert Mark Milstein shared in a recent interview with Synthetic Engineers, many of today’s generative AI platforms “have mined PII (Personally identifiable information) from vast portions of the world’s population without their permission. This presents a major barrier at present to this technology’s potential disruptive powers.”
Whether or not companies can train AI systems using publicly available data is a major question in the industry at the moment. Shira Perlmutter, the U.S. Register of Copyrights, discussed AI and copyright at the recent Digital Media Licensing Conference, and talked specifically about fair use.
Fair use may become a major factor in the current case. The Verge is already calling it “the lawsuit that could rewrite the rules of AI copyright.” In an interview with that publication, Butterick repeatedly compares today’s generative AI systems to Napster, and advocates for versions of AI systems that integrate licensed content.
Butterick also makes clear that the same questions raised in this case could apply to fields like generative AI images. He specifically mentions Shutterstock’s recent partnership with OpenAI.
Again, I’m not a lawyer, but I expect that the impact may come down to the facts in this specific case. As Butterick notes, code generators are known to sometimes reproduce whole chunks of code from other sources verbatim. One programmer cited in the Verge’s article claims that Codex does this.
Wholesale copying, if proven, may be considered quite different from analyzing sources and creating something new and potentially transformative. Generative image AI systems rarely produce an exact duplicate of an existing image, as Butterick alleges that Codex does with code.
Courts may also have to grapple with the question of when code becomes sufficiently creative to deserve copyright protection. As attorney Russ Pearlman has written, purely functional code or short code snippets may not be eligible for protection.
Either way, this is a big step towards clarifying rules and norms around generative AI systems. It’s a case we’ll certainly continue to follow here at Synthetic Engineers.