Urgent: Google's data grab grabs mathematics

May 26, 2023

"For advice specific to your situation you will need to talk to a lawyer."

16 Comments

May 29, 2023

It's always an exciting sign when a writer references Quine's "gavagai" story with its "undetached rabbit parts." The issue of whether the world can be "cut at its joints" was also central to medieval philosophical debates about the nature of language and representation. Good to see we are still debating this 800 years later. I hope you'll write more about this question vis-a-vis the 'naturalness' of the object of mathematics.


Michael Harris

May 30, 2023

Credit really goes to one of my housemates in graduate school, a philosopher whose friends used to hang out at our house and talk about undetached rabbit parts and the like. Since we prepared meals together, we kept planning to have an undetached rabbit parts dinner, but we couldn't find them in the local Star Market. Later it turned out that they were in the exotic meat section of Savenor's, along with lion loins and elephant steaks. Savenor's was Julia Child's grocery store and I remember the time she had Quine on "The French Chef," her PBS show, but I'm not sure what was on the menu.


Margaret Wertheim

May 30, 2023

That's hilarious. I presume they ate some very detached parts of something.


Michael Vielhaber

May 28, 2023

1. Google, for a long time now, IS ABOVE the law. Nobody may flat-out copy entire books against all copyright provisions. Google Books does just that, nobody cares (?), nobody acts (!). That is the main problem here.

Same now with arXiv etc., which is public anyway, so up for grabs (legally!) by anybody, incl. Google. There IS, though, a license attached to arXiv material, and it requires citing and/or non-modifying etc. Will they bother?

2. Garbage in, garbage out. AI up to now (in particular ChatGPT) mixes input snippets into output oeuvres. Nice enough, but always within the bounds of the received input material. Nothing genuinely new. It is difficult to tell, though, where it comes from, and whether it is just copy and paste. The same might happen to math articles, where already today some 99% of us do not enter into the details of stuff too distant from our own tiny circle of competence. In the future, we will see Sokal-style "Fashionable Nonsense" in math as well.


Peter Gerdes

May 27, 2023

I couldn't disagree more with these objections to the use of mathematics to train models. Ultimately, we all make use of the intellectual work of others and use ideas (both cited and uncited because they can't be traced to a specific source) when we do our work.

Copyright doesn't reflect any natural right to stop others from using your work. It's merely a kludge to ensure the incentive exists to make valuable work. There is no reason to believe harvesting arXiv undermines those incentives, so I don't see a moral problem. Indeed, far from Google behaving like Elsevier, it's those who wish to keep this precious info locked down using copyright who are more like Elsevier in this situation.

Regarding the legality, Google is probably in a pretty decent place. Copyright governs only the copying of data and doesn't prevent you from being inspired or getting an idea from it, so the ultimate output of the ML process is probably no more a violation than it would be to read the paper yourself and have an idea. The initial training may be copying, but it's probably covered under fair use, just like creating an index for web search is.


Michael Harris

May 27, 2023

I suspect you actually could disagree more, but let's check. Do you disagree that Google or any commercial enterprise should ignore the explicit preference of authors who choose the NC-ND license (which will be my choice going forward)? Do you disagree that authors are entitled to choose one of the licenses proposed by arXiv? Do you disagree that "wealth makes might and might makes right" is morally problematic?

If you really couldn't disagree more, then your position is that the interests of Silicon Valley take precedence over those of the mathematicians who post their articles online. If that is really your position, then I am grateful to you for sharing it here; it makes my point more forcefully than I ever could.

The legal experts I have consulted anticipated the points you made — including the analogy with reading a paper — as well as other more technical points that are likely to be raised in the litigation that everyone assumes is inevitable. The Supreme Court did not uphold your position in the recent Andy Warhol case, but US copyright law is very complex and no one dares to predict how this will be settled. However, the US is not the only country with interests in arXiv. European courts and regulators have shown little sympathy for your positions in the past. We can agree that a fine of a few billion dollars will probably not dissuade these corporations from doing whatever they believe is in their long-term interest; but that is hardly a moral argument.


Peter Gerdes

May 27, 2023 (edited)

In the Warhol case the court considered the fact that the copy was very close to the original, competed in exactly the same market and wasn't particularly transformative. That is miles away from training an AI, where the product looks nothing like the original.

But I agree that what the courts do is unclear. I'm arguing that what they *should* do is let anyone who wants to use the data to train an AI do so regardless of what the authors want.


Michael Harris

May 27, 2023 (edited)

Again, there's nothing in your messages that wasn't already explained to me by the legal experts with whom I have been speaking. It's kind of pointless to be making these arguments here when the courts will take years to establish a framework for dealing with these issues. Moreover, your comments regarding the purpose of US copyright law are largely irrelevant in Europe. But I am grateful to you for taking the time to set out the opposing position in such an extreme form, where "creation" apparently means "creation for the sole benefit of the rich and powerful."

Whether or not "the product looks nothing like the original" is definitely at issue in the artists' class action suit, as reported at https://news.artnet.com/art-world/class-action-lawsuit-ai-generators-deviantart-midjourney-stable-diffusion-2246770. I've already explained why I reject the kind of mindset that refers to what mathematicians create as "product," so again you are helping me make my main points.


Peter Gerdes

May 29, 2023

Oh, and speaking of who is handing things over to big business: there is no option here where no one trains AI on math papers.

There is the option where copyright law restricts training to only those with a license, so only Springer and Elsevier get to train AI on our mathematical work (they hold copyright on almost every published work in math), and the option where anyone who wants to can go harvest it.


Peter Gerdes

May 29, 2023

First, really? Where in any of my statements did I say anything about a special status for the rich and powerful? The same legal rules should obviously apply to anyone who wants to use that data. Or are you suggesting that it should be legal if a nonprofit uses the information but not if a corporation does?

But yes, ultimately, I think that the default needs to be that ideas are free for anyone to use and that we only restrict such use when it's shown to be necessary to incentivize creation. And no, I don't see a problem if a corporation uses the data, any more than I think the courts should have stopped web search because only someone with lots of computational power could set up a search engine.

This situation is much more like Authors Guild v. Google than the Warhol case. Indeed, the use of the scanned copies in those cases was much less transformative than the uses here. In those cases, the literal text was maintained and simply turned into a reverse index. If that's sufficiently transformative, then why wouldn't this be? But this is neither here nor there, because the courts will do whatever the courts will do and neither of us is a legal expert (though we can both cite plenty). The real question is whether that should be allowed. I'll say more about that in my reply below.


Michael Harris

May 29, 2023

As you point out, this is not the place to make predictions about the outcome of future litigation, in the US or elsewhere. You have made your preferences clear and I don't see any reason to pursue this exchange. But for the sake of clarity: (1) I am not defending corporate copyright but rather the right of authors to decide whether or not their papers can be used for commercial purposes, or for any purposes by bad actors such as Google (or Elsevier, for that matter); (2) Only the "rich" have the resources to fund the training and only the "powerful" have the resources to defend their questionable practices in court; (3) the option of public funding, following robust deliberation in the interest of all parties directly concerned, with full democratic oversight and with no commercial intentions, is missing from your list.


Peter Gerdes

May 27, 2023

As you know, 'couldn't disagree more' is a phrase that means 'I disagree a lot'.

My position is that the choice of a license doesn't give you extra rights beyond copyright law. They aren't actually violating copyright law (you could have said "all rights reserved" and it wouldn't matter). And if it were a violation of copyright law, that would be a reason to change it.

Copyright law exists only to incentivize creation, and it's an unfortunate fact that to do this we need to make it illegal to copy that item. When you have a case like this one, where the original item is almost completely transformed (no different than someone reading it and getting an idea) so that you aren't selling copies of the original, that incentive consideration is highly attenuated, which is why courts have generally recognized these kinds of transformative uses (e.g. copying for a web search db) to be fair use regardless of whether the copyright holder gives permission. This is the same principle.


Michael Harris

Jul 20, 2023

From a recent article in the MIT Technology Review (link below):

"Last week, the Federal Trade Commission opened an investigation into whether OpenAI violated consumer protection laws by scraping people’s online data to train its popular AI chatbot ChatGPT.…

An agency like the FTC can take companies to court, enforce standards against the industry, and introduce better business practices, says Marc Rotenberg, the president and founder of the Center for AI and Digital Policy (CAIDP), a nonprofit. CAIDP filed a complaint to the FTC in March asking it to investigate OpenAI. The agency has the power to effectively create new guardrails that tell AI companies what they are and aren’t allowed to do, says Myers West.

The FTC could require OpenAI to pay fines or delete any data that has been illegally obtained, and to delete the algorithms that used the illegally collected data, Rotenberg says."

https://www.technologyreview.com/2023/07/17/1076416/judges-lawsuits-dictate-ai-rules/?truid=&utm_source=the_download&utm_medium=email&utm_campaign=the_download.unpaid.engagement&utm_term=&utm_content=07-20-2023&mc_cid=9f8f0ac5a5&mc_eid=c2ebf1fa97


Michael Harris

May 27, 2023

There's something confusing in your reference to the purpose of copyright law. It seems its purpose is to "incentivize creation," but after the creation is finished the result is (or should be) up for grabs. So where is the incentive?


Peter Gerdes

May 29, 2023

The incentive is unchanged from before. If someone wants a literal copy of your paper they need to get that under the license the copyright holder offers.

More generally, AI doesn't really change the incentives around publication much, other than making it much easier to create new works or summarize existing knowledge. Does that decrease the incentive to create works? Sure, to some extent, the same way the creation of textbooks and encyclopedias decreases the incentive to create the original work, because some people might just happily read the summary. But we have to balance those interests, and I think it's pretty clear that it's good that it's legal to summarize and describe (hence why we don't allow patents for writing or math). This use also seems on the positive side (I expect more creative works if AI is allowed to train on existing works than if not).

But, for math specifically, that incentive isn't really what's driving any publication. If I could pass any law, I'd completely revoke copyright protection for academic math papers. That would solve the problem of for-profit academic publishers gouging universities for access to back issues, and wouldn't really disincentivize any mathematical discovery, because mathematicians have never relied on remuneration from copyright law to encourage discovery. Indeed, the net effect of copyright on mathematics has almost entirely been negative.


jean-michel Kantor

May 27, 2023

More than the philosophical and political arguments (like those in your companion text "mathematics and the undead"), I think that the absence of autoformalization of definitions and the ridiculous identification of correlations with intuition cast serious doubt on the future of this attempt by Google to "take charge" of mathematics.


Silicon Reckoner
