All the textbooks will have to change, say donors

ChatGPT agrees, in part

Jan 06, 2024

Headlines like this may be in store for us

In December, Rep. Virginia Foxx (R-N.C.), the chair of the House Education and the Workforce Committee, sent Pritzker a letter asking for documents related to plagiarism allegations against Gay. The letter cites that Harvard’s federal funding “is conditioned upon the school’s adherence to the standards of a recognized accreditor,” which in Harvard’s case requires efforts against plagiarism.

(“Harvard agitators turn their ire toward Penny Pritzker,” Politico, January 3, 2024)

The editorialists I read have different reactions to the hounding of two women out of prestigious university presidencies. But they all agree that the target was the Ivy League as the most emblematic symbol of “woke” higher education, and that the details of the accusations were secondary at best. This is hardly controversial; Christopher Rufo, the “conservative mastermind” of the successful campaign, hasn’t been “shy about revealing the true motives” behind this and other operations.

Mathematicians should not assume we won’t soon be on Rufo’s hit list.1 That is why I am postponing finishing the second half of my report on Kevin Buzzard’s “Formalizing Fermat” grant to think about how we should best prepare for the coming onslaught. My thoughts naturally turned first to the colleagues who have been basking in the feral embrace of the tech industry, showing no sign of remembering that these monsters can bite when crossed — and so can the investors on whose good opinion their survival depends, like the hedge fund managers who turned against the Harvard and Penn Presidents.

But concern for my own professional survival is not far from my thoughts. The day before Claudine Gay resigned from the Harvard presidency, the NSF reminded me that

The Project Outcomes Report for the General Public for the award referenced above is NOW DUE.

After the Congressional fiasco, this allusion to the “General Public” set off alarm bells. If an unscrupulous congressional inquisitor can go unchallenged when she makes up spurious translations of Arabic words, what Project Outcomes Report can survive the withering (though irrelevant) sarcasm of a committee hearing? Seeing a “folklore conjecture” mentioned in a colleague’s recent “General Public” report confirmed my fears that the tabloids may soon be superimposing unflattering banner headlines on our least flattering photos, like the one pictured above.2 I imagine the Stefanik bot yielding to the honorable chatbot from North Carolina, who then asks “and do they also teach basket-weaving in your department?”

           It’s only a matter of time before the donors realize we are all plagiarists

The issues were found in Dr. Gay’s 1997 doctoral dissertation, in which Harvard said it had found two examples of “duplicative language without appropriate attribution.”…
Altogether, the allegations accuse Dr. Gay, a political scientist, of using material from other sources without proper attribution in her dissertation and about half of the 11 journal articles listed on her résumé.
The examples range from brief snippets of technical definitions to paragraphs summing up other scholars’ research that are only lightly paraphrased, and in some cases lack any direct citation of the other scholars.3

“[S]ome of the most extreme and clear-cut cases of plagiarism yet,” according to the Washington Free Beacon, seem to have involved Gay’s nearly verbatim reproduction of four sentences and a footnote from a book by University of Wisconsin political scientist David Canon, without quotation marks and with citation only in the bibliography.

Illustration accompanying the plagiarism accusations in the *Washington Free Beacon*

Contacted by a local news station in Wisconsin, Canon said the accusations aren't “even close to an example of academic plagiarism.”

"Both Dr. Gay and I are defining basic terms such as Section 2 and Section 5 of the Voting Rights Act, covered jurisdictions, and bailout," he said. "Good definitions of these terms would have to use similar language or they would not be accurate."4

This is where the working mathematician should begin to get nervous. Haven’t we all copied definitions, and whole theorems, from the work of other mathematicians, verbatim, and reproduced them in our papers, without quotation marks, and often with only minimal citation?5 As I’ve been doing with increasing frequency, I turned to ChatGPT for guidance. The answer, which arrived instantaneously, could not have been more unsettling.

MH to ChatGPT: Suppose mathematician X needs to use a definition by mathematician Y. Which of the following is most appropriate to avoid plagiarism? (a) Copying the definition verbatim, without quotation marks, with a reference to mathematician Y's paper. (b) Copying the definition verbatim with a reference to mathematician Y's paper and placing it in quotation marks. (c) Paraphrasing the definition in different words, without reference to mathematician Y.

ChatGPT: The most appropriate option to avoid plagiarism while using mathematician Y's definition is:

(b) Copying the definition verbatim with a reference to mathematician Y's paper and placing it in quotation marks.

This method explicitly acknowledges that the definition is directly quoted from mathematician Y and gives proper credit by referencing their paper.

I had not yet read the Canon endnote presented in the illustration above when I asked ChatGPT this question and assumed that — as the accusers strongly insinuated — Gay’s incriminated text was citing Canon’s original work, rather than his description of aspects of the Voting Rights Act of 1965. A parallel example, closer to Gay’s transgression than to the hypothetical situation I brought up to ChatGPT, would be mathematician X’s verbatim copying of a statement of a standard definition, presumably well known to intended readers, in a paper by mathematician Y. I hereby confess to my 1038 subscribers and to any innocent bystanders that I have been that mathematician X! I have lifted verbatim, and without a hint of remorse, definitions of terms — including Shimura datum, de Rham complex, crystalline Galois representation — from papers by more mathematicians Y than could fit in an SUV, without quotation marks, and occasionally without attribution! I have even used technological tools to scan these definitions directly into LaTeX, when they include elaborate formulas, to spare myself the chore of typesetting them, so that they look identical in print to mathematician Y’s original!!!

Let me not get carried away. In almost none of these cases was the definition the original discovery of mathematician Y, any more than Canon was the originator rather than the recorder of the Voting Rights Act. I fear, however, that these fine points will be lost on the journalists of the Washington Free Beacon and the New York Post when our donors decide to turn on us or, what is more likely, when Elise Stefanik or a clone thereof decides to raise her profile at our expense.

Even our first-year textbooks may betray us. I know that “right-wing activists like” Christopher Rufo have been inspecting math textbooks for content, but they have concentrated on elementary school material, presumably because of their limited understanding of advanced mathematics. Now that Rufo’s congressional experiment has demonstrated that it’s unnecessary to understand the content, as long as one can read the punctuation, we can look forward to looking on helplessly as our libraries burn. Once again, ChatGPT confirmed my worst fears.

MH: The wording of the definition of the limit in calculus is essential. If the order of the words is changed the definition no longer has the same meaning. So should calculus textbooks state the definition of the limit in quotation marks?

ChatGPT: Given that the wording of the definition of the limit in calculus is crucial and altering the order of the words changes the meaning, it would be best to use direct quotations when presenting this definition in a textbook. Quoting the definition with the original wording and citing the source helps maintain accuracy and ensures that the concept is conveyed precisely as intended by the original mathematician. This approach helps preserve the integrity of the definition and avoids any misinterpretation due to rephrasing.

We could learn to live with this.

Plagiarism-free definition for calculus textbook, footnote omitted
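
The premise of my question about word order is at least sound. As a minimal sketch in standard notation (the symbols f, a, L, ε and δ are the usual textbook conventions, assumed here rather than quoted from any particular source), compare the definition of the limit with the statement obtained by exchanging its first two quantifiers:

```latex
% Standard epsilon-delta definition (common textbook form, no particular source):
\[
\lim_{x \to a} f(x) = L
\iff
\forall \varepsilon > 0 \;\, \exists \delta > 0 \;\, \forall x \;
\bigl( 0 < |x - a| < \delta \implies |f(x) - L| < \varepsilon \bigr).
\]
% Swapping the first two quantifiers gives a different, much stronger statement:
\[
\exists \delta > 0 \;\, \forall \varepsilon > 0 \;\, \forall x \;
\bigl( 0 < |x - a| < \delta \implies |f(x) - L| < \varepsilon \bigr),
\]
% which holds only when f(x) = L for every x with 0 < |x - a| < \delta.
```

The second statement uses exactly the same words in a different order, yet it is satisfied only by functions that are constant and equal to L near a.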

                             The mechanical mathematician is a born plagiarist

The only consolation in this sorry tale is that, in order to maintain an appearance of consistency, the defenders of the Congressional Inquisition will now have to require, upon pain of humiliation by Elise Stefanik, that LLMs insert quotation marks around every sample they extract from their training set and reproduce in their output. Investors, take note: the market for punctuation is about to get very tight!

ChatGPT, needless to say, doesn’t see things that way. I end with a (verbatim!) reproduction of my attempt to get ChatGPT to accept the consequences of its (her? their?) reasoning.

MH: Now suppose a large language model is trained on a data set including the work of mathematicians X and Y and thousands of others and is then asked to solve a mathematical problem. The answer makes use of definitions and formulas from thousands of sources. Should they all be cited?

ChatGPT: In the scenario you've described, the usage of a large language model trained on a diverse dataset that includes contributions from various mathematicians doesn't require citing individual sources for each definition or formula used in providing an answer.

Since the model's responses are generated based on the patterns and information it learned from the aggregated dataset during training, it doesn't directly copy or retrieve specific content from individual sources. Therefore, citations for each specific definition or formula used within the model's response aren't necessary.

However, if the model generates a response that includes unique or specialized information that can be directly traced back to a particular mathematician or source, it would be good practice to acknowledge or cite that source if known. But generally, due to the nature of how the model operates, citing individual sources for each component of its output isn't feasible or expected.

Everybody’s favorite LLM is trying to have things both ways. I pressed the point with an explicit reference to copyright law.

MH: Suppose mathematicians X and Y publish their work with a copyright provision that specifically rules out commercial use and requires that any citation of their results in whole or in part must be explicitly acknowledged. Can the large language model be designed in such a way as to respect this requirement?

ChatGPT: Yes, a large language model can be designed or configured to respect copyright provisions and citation requirements set forth by mathematicians X and Y. Developers and users of such models can implement certain mechanisms or protocols to ensure compliance with these restrictions.

For instance, the model's training data could be filtered or modified to exclude content that doesn't comply with specified copyright provisions. Additionally, the model's output generation could include a system that recognizes and respects citation requirements, such as providing explicit acknowledgments when using or referencing the work of mathematicians X and Y.

This might involve programming the model to:

1. Identify specific copyrighted content and refrain from reproducing it in a way that violates the copyright restrictions.
2. Acknowledge and cite the work of mathematicians X and Y whenever their results or content are utilized in the generated output, following the stipulated citation requirements.

By incorporating these checks and balances into the model's design and usage guidelines, it can be tailored to uphold the copyright provisions and citation requirements established by mathematicians X and Y.

This answer left me unconvinced, but instead of asking why ChatGPT thought such a design might be implemented, I moved on to a question that more closely approximates the plans for training LLMs to generate new mathematics.

MH: Suppose all the thousands of mathematicians whose work is included in the training set use the same copyright provisions. Can all the thousands of mathematicians be acknowledged in the output?

ChatGPT: Acknowledging all the individual mathematicians whose work is included in the training set of a large language model might be challenging due to the sheer volume of contributors. It could also potentially clutter or overwhelm the output, making it less readable or usable for practical purposes.

However, there are ways to acknowledge the collective contributions of the mathematical community without explicitly listing each individual mathematician. For instance:

1. General Acknowledgment: The model's output or accompanying documentation could include a general acknowledgment of the diverse contributions from the mathematical community that contributed to its training. This could express gratitude for the collective knowledge and research without specifying individual names.
2. References to Datasets or Sources: Instead of citing individual mathematicians, the model's output could reference the datasets or sources used for its training. This approach recognizes the collective nature of the data without singling out specific contributors.
3. Metadata or Documentation: The model could be accompanied by detailed documentation or metadata that outlines the sources and datasets used in its training. This information could be made available separately for those interested in the specifics of the training data.

While it might not be feasible to acknowledge each mathematician individually in the output due to the volume of contributors, finding a way to honor the collective contributions or providing transparent information about the training data sources can still uphold the spirit of acknowledging the community's work.

Unsurprisingly, ChatGPT defended the company line. It remains to be seen whether the courts will be convinced.

Silicon Reckoner