Artificial Intelligence and Inherent Mathematical Difficulty (summary)
A guest post by Walter Dean and Alberto Naibo
The talk that had begun when I was leaving the Isaac Newton Institute, where I had dropped in briefly to scope out its Big Proof conference, had the title
Will mathematical theorem proving be solved by scaling laws?
This should be read alongside the following use of the word “solved,” which had appeared just a few days before the Newton Institute meeting —
“If Governor Gavin Newscum, of California, and Mayor Karen Bass, of Los Angeles, can’t do their jobs, which everyone knows they can’t, then the Federal Government will step in and solve the problem, RIOTS & LOOTERS, the way it should be solved!!!”
— as well as alongside Evgeny Morozov’s writings on solutionism.
If the idea that Silicon Valley determines the future of mathematics appeals to you then you may be disappointed by the conclusions of the article “Artificial Intelligence and Inherent Mathematical Difficulty” by Walter Dean[1] and Alberto Naibo[2], soon to appear in Philosophia Mathematica. The authors have graciously agreed to write a quick review of their conclusions for Silicon Reckoner’s readers.
Artificial Intelligence and Inherent Mathematical Difficulty, a summary by Walter Dean and Alberto Naibo
We’d like to thank Michael for the opportunity to make this post on a blog primarily directed to mathematicians. As philosophers with interests in logic and computation, we’re slightly outside of the target audience. But we’re sufficiently nearby to also realize that the recent interaction of mathematics and artificial intelligence raises a number of questions we think philosophers ought to be actively engaging and might even help to clarify.
We hope our forthcoming paper in Philosophia Mathematica will help to facilitate this. However, we come not to praise the application of AI to mathematics but rather to offer a principled assessment of what we might hope it to achieve. Our skepticism is grounded in traditional limitative theorems – i.e. mathematical results such as the undecidability of first-order logic or the NP-completeness of propositional satisfiability. These theorems make explicit that there are tasks we cannot solve computationally, either “in principle” or “in practice”.
It would be an understatement to say that such results are “well-known” in mathematical logic and theoretical computer science. But they are often left out of popular reporting on the growing role of AI-related methods in mathematics. Such reports typically showcase recent results set alongside reflections by mathematicians on the growing role of automation and its implications. Some examples include:
•“Computer Math Proof Shows Reasoning Power”, Gina Kolata, New York Times, December 10, 1996.
•“How Close Are Computers to Automating Mathematical Reasoning?”, Stephen Ornes, Quanta Magazine, August 27, 2020.
•“A.I. Is Coming for Mathematics, Too”, Siobhan Roberts, New York Times, July 28, 2023.
•“We’re Entering Uncharted Territory for Math”, Matteo Wong, The Atlantic, October 4, 2024.
•“Mathematical Beauty, Truth and Proof in the Age of AI”, Jordana Cepelewicz, Quanta Magazine, April 30, 2025.
•“At Secret Math Meeting, Researchers Struggle to Outsmart AI”, Lyndie Chiou, Scientific American, June 6, 2025.
Accounts like these promote the impression that we are on the threshold of an era in which automation will radically transform mathematics, perhaps even taking over from humans for certain tasks. We will not contest here that some applications of automation – e.g. with respect to formal verification or even conjecture exploration – will become routine parts of mathematical research in the future. But as Michael’s recent experiences attest, such reporting sometimes de-emphasizes or even excludes dissenting views about both the scope and desirability of the envisioned transformation. In light of this, what we sought to address is how automation bears on what is presumably a central goal of our mathematical activities – that of resolving recognized open questions by finding suitable proofs or refutations.
In our paper, we argued that not only is this task difficult for humans, but that there are also good reasons to believe it will remain so for automated systems. What we refer to as the basic argument to this effect runs as follows:
(P1) i) Proof discovery is a central goal of mathematical practice.
ii) This task can be naturally associated with the question of deciding whether a mathematical statement φ is provable from a set of axioms Γ – i.e. determining if φ ∈ Prov(Γ), where the latter is defined as {φ : Γ ⊢ φ} and ⊢ denotes derivability in a proof system for first-order or higher-order logic.
iii) The difficulty of proof discovery is approximated by the computational complexity of Prov(Γ), understood as the problem of deciding if Γ ⊢ φ in the general case.[3]
(P2) For a wide range of relevant mathematical theories Γ, the problem Prov(Γ) is of high computational complexity.
(P3) If a problem is of high computational complexity, then it is inherently difficult to decide in the general case using computing hardware that we can construct and apply in practice.
Therefore:
(C) Advances in computing theory or technology – inclusive of those associated with artificial intelligence – will not radically alter the following aspect of mathematical practice: Settling the status of significant open questions is an inherently difficult problem.
The anticipation of this argument in the 1920s and 1930s is directly linked to developments in automated theorem proving in the 1950s and 1960s. These in turn serve as the background for developments that are still transpiring today – e.g. like this. Indeed, in the case where Γ is a finite set of first-order axioms, the question of whether Prov(Γ) is decidable by an algorithm is just the classical Entscheidungsproblem.
This term was introduced by Heinrich Behmann, who in a 1921 lecture described its significance as follows:
[T]he development of a general, deterministic procedure for proving mathematical propositions may seem like a hopeless, indeed ludicrous endeavor, perhaps comparable to the medieval quest for the philosopher’s stone. And indeed the achievement of this goal would be no less significant than the wonders believed to be inherent in that mythical substance, not just for mathematics but for knowledge as a whole. If we look more closely, the entirety of mathematics would indeed be transformed into one enormous triviality (p. 175).
Later in the 1920s, John von Neumann[4] and G.H. Hardy[5] made similar remarks about both the difficulty of resolving the Entscheidungsproblem and the significance which a positive solution would have for mathematics. While they remained doubtful, in their 1928 textbook exposition Hilbert and Ackermann went so far as to describe the question of whether a decision method existed as the “main problem of mathematical logic”.
As many readers will be aware, Church and Turing showed in 1936 that there is no algorithm for determining whether a statement is derivable from a set of first-order axioms Γ. This result is now known as Church’s Theorem. To see why it retains its significance as an undecidability result for all of mathematics, it suffices to note that we can take Γ to be a finite set of axioms sufficient for formalizing all contemporary mathematics – e.g. a suitable extension of Gödel-Bernays set theory.
Reflection on what it means for a problem to be decidable by an algorithm, which was required to formulate a negative mathematical result of this sort, helped to initiate computer science as a discipline. But it also did not inhibit attempts to apply early computers to automate theorem proving – a development which is itself bound up with the early history of artificial intelligence. Such attempts also directly informed the famous (or infamous) 1973 Lighthill Report which remarked
In the nineteen-fifties and early nineteen-sixties. . . a great deal of optimism was generated from the concept of realising on a computer the algorithms for theorem proving. . . Those most involved now emphasize that this is particularly an area where hopes have been disappointed through the power of the combinatorial explosion in rapidly cancelling out any advantages from increase in computer power.
Not only did the Lighthill report contribute to the first of the so-called “AI winters”, its submission coincided with the discovery of NP-completeness. This made precise the sense in which many of the in-principle decidable problems involved in “good old fashioned AI” – e.g. in logic, graph theory, and linear algebra – are indeed “combinatorially explosive”. This is to say that their general solution requires searching through a class of cases that is (at least) exponential in size relative to that of a problem instance.
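A minimal sketch may make “combinatorially explosive” concrete. The following illustrative Python function (our own, not drawn from the paper) decides propositional satisfiability by exhaustive enumeration: for n variables it may inspect all 2^n truth assignments before answering, which is precisely the exponential search that NP-completeness results suggest cannot in general be avoided.

```python
from itertools import product

def brute_force_sat(num_vars, clauses):
    """Decide satisfiability by enumerating all 2^n truth assignments.

    `clauses` is a CNF formula in DIMACS-style encoding: each clause is
    a list of nonzero integers, where k stands for variable k and -k
    for its negation.
    """
    for assignment in product([False, True], repeat=num_vars):
        # A clause is satisfied if at least one of its literals is true.
        if all(any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assignment  # satisfying assignment found
    return None  # unsatisfiable: all 2**num_vars candidates were checked

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
print(brute_force_sat(3, [[1, 2], [-1, 3], [-2, -3]]))
```

Modern SAT-solvers prune this search aggressively, but in the worst case the known bounds still leave them exponential, which is the point of the Lighthill-era disappointment.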
This observation provided the context for a more incisive formulation of the basic argument by Michael Rabin within a 1974 address entitled “Theoretical Impediments to Artificial Intelligence”:
The above results. . . point to the possibility that this bad behavior of theorem proving programs is inherent. That [is to say] at least as long as we have a general purpose theorem prover either for first-order logic, or for the whole of propositional logic, or for all statements of even a very elementary and decidable fraction of mathematics, the following bad thing will happen. For rather short sentences it will very often, perhaps even almost always, be the case that the shortest proof or shortest decision algorithm will be impractically long to an extent that settling the question whether the sentence is a theorem, or producing a proof for the sentence, will be impossible in practice. (p. 617)
The results to which Rabin alludes (obtained with Fischer) concern the super-exponential hardness of deciding statements in the language of Presburger arithmetic – i.e. the (first-order) theory of the natural numbers with only addition. In contradistinction to its extension Peano arithmetic, Presburger arithmetic has been known to be algorithmically decidable since the 1920s. But the Fischer-Rabin result is striking because it shows that even in-principle decidable theories are often “practically undecidable”. In particular, it entails that there exist infinite classes of formulas whose shortest proof or refutation is “infeasibly long” as computed by any algorithm or proof system that can be employed in practice.[6]
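The lower bound can be stated schematically as follows (our paraphrase; the constant c and the exact machine model vary across presentations):

```latex
% Fischer-Rabin (1974): deciding truth in Presburger arithmetic
% Th(N, +) requires double-exponential time. Paraphrased statement:
\exists c > 0 \ \text{ such that for every decision procedure } A
\text{ for } \mathrm{Th}(\mathbb{N}, +),\quad
\mathrm{time}_A(\varphi) \;\geq\; 2^{2^{c\,|\varphi|}}
\quad \text{for infinitely many sentences } \varphi .
```

Since the bound holds for every correct procedure A, no future algorithmic insight can remove it; only the constant c is negotiable.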
In this case, the formulas in question only concern the additive structure of the natural numbers. But Fischer and Rabin also proved a doubly exponential lower bound on the theory of Real Closed Fields. This theory is sufficient for interpreting Euclidean geometry and thus retains its significance in light of developments like AlphaGeometry (because of its relation to Wu’s method). Such decidable theories are exceptional in the sense that they do not even allow for the interpretation of integer multiplication as is necessary for expressing familiar number-theoretic questions like the Goldbach or Twin Prime conjectures. And once we consider even a weak theory that makes this possible, Gödel’s first incompleteness theorem kicks in again to show that Prov(Γ) is not decidable by an algorithm at all.[7]
Results of this sort provide the evidence we take to stand behind premise (P2) in the basic argument. This contrasts with widely publicized reports that developments in artificial intelligence have made possible the automated resolution of problems that were previously thought of as intractable – e.g. in regard to protein folding, board games, object recognition, natural language translation, or text generation. At least to the uninitiated, such applications may indeed appear like philosopher’s stones within their respective domains. So why should we expect the situation to be different for proof discovery?
Readers from logic will have a ready answer to this. For note that the proofs of limitative theorems often show not only that Prov(Γ) is undecidable (or intractable) but also that it is complete for the relevant complexity class – e.g. Turing complete for first-order logic, coNP-complete for propositional logic, etc. This is to say that they are universal in the sense that if we could solve them algorithmically, then we could also solve all of the other problems in the relevant class, inclusive of ones which are known to be undecidable (or intractable). And this presumably isn’t the case for the other tasks to which artificial intelligence-inspired techniques have been successfully applied (even putting aside questions of their well-definedness).
Note, however, that this sort of response is premised on treating Prov(Γ) as a problem with infinitely many instances. On the other hand, what we are faced with in practice are specific statements corresponding to open questions we hope to resolve by finding specific proofs or refutations. One way the basic argument might come to be regarded as obsolete is if automation accrued enough successes to convince us that the cases we care about in practice are within the purview of available methods (the difficulty of Prov(Γ) in the logician’s sense notwithstanding).
This may already reflect prevailing opinions. However we suggest that the following observations should also be taken into account:
(O1) The statements that have been successfully resolved by automated methods thus far are not what would typically be regarded as significant open questions by mathematicians.
(O2) When the details are taken into account, this seems to reflect in-principle limitations on the relevant automated methods and the computing hardware on which they must be implemented.
To illustrate what we mean by (O1), note that some recently publicized successes still pertain to high school-style geometry problems. And for (O2), formal results calibrating the capacities of machine learning systems relative to established computational hierarchies are starting to appear (including, e.g., deep neural networks and transformers).
But before commenting further on the details, we should not pass over the fact that a claim like (O1) may only be framed once we accept a criterion for evaluating the significance of mathematical statements. In our paper we settled for a characterization of a finite number of examples in virtue of their inclusion in historically influential lists of open questions (the Millennium Problems perhaps being the best-known, although many others will be known to specialists).
Of course the judgments about significance which inform inclusion on such lists ultimately cannot be divorced from a more general discussion of what we value in the practice of mathematics. And indeed several authors have pointed out that developments in automation may already be affecting the sorts of statements we choose to focus on, the characteristics we value in proofs, and even the sort of understanding we hope they will impart. These questions are in full view in a recent volume of the Bulletin of the AMS, including Akshay Venkatesh’s essay and Michael’s reflection on it.
We share these concerns. But regardless of whether one takes them to heart, we also think that a better appreciation of (O1) and (O2) may help to allay them. A familiar reason for this is that headline-grabbing applications of automation have, at least to this point, been confined to questions originating in subjects like combinatorics that concern specific finite objects or classes of objects – e.g. the computation of Schur numbers. This may account in part for the indifference towards recent trends in automation that Michael reports among his colleagues. Indeed even in the course of advocating for further engagement between mathematics and computer science, Jeremy Avigad acknowledges that “most mathematicians aren’t interested” in the results obtained thus far.
In his earlier essay “Varieties of mathematical understanding”, Avigad also contrasts the content and reception of the results which have been obtained by automated methods and those of “fashionable” branches of mathematics – a term he borrows from Kevin Buzzard. Based on Buzzard’s discussion here and here, these subjects are taken to be exemplified by algebraic geometry in general and the Langlands program in particular. While no precise analysis is offered, Avigad suggests that “fashionable” subjects may be distinguished from combinatorics in virtue of involving “complicated networks of algebraic definitions” leading to results generating “conceptual understanding” relating structures across different branches of mathematics. He also suggests that fashionable mathematics has long emphasized “notions over notation” and “concepts over calculation”.
This prompts Avigad to ask directly
Can/should/will computational methods like the ones used toward obtaining results [thus far by automation] become, in and of themselves, part of a proper mathematical understanding?
He then suggests a positive answer based on the observation that mathematical values have indeed changed over time – e.g. away from the tendency to devalue algebraic proofs and methods in favor of geometric ones, which persisted through the 17th century.
But of course the rise of algebra and the concomitant decline in the value assigned to geometric reasoning over the course of the 18th through 20th centuries was a gradual process that can reasonably be characterized as internal to the practice of mathematics. On the other hand, it is the possibility of a realignment of values imposed by forces external to the mathematical community that Michael highlighted in his talk at the 2025 Joint Mathematics Meeting. Indeed it even appears to have influenced the selection of this year’s title “We Decide Our Future: Mathematics in the age of AI”.[8]
We again share this concern. But when we finally turn to the details, we also think that the milieu of mathematical logic which originally gave rise to the basic argument not only offers some reassurance about what may be on the horizon, but perhaps also suggests a strategy for resisting external realignment. To this end we offered two refinements of (O1) and (O2) in our paper:
(O3) Successful applications of automated methods to date have mostly been obtained by variants of brute force search.
(O4) The statements they have been used to resolve are of a simple logical form that allows us to see in advance that, if true, they may be demonstrated by brute force. In particular, they are equivalent to Σ1-formulas[9] – i.e. those of the form “there exists an n ∈ N such that φ(n)”, where φ(x) is a property of natural numbers which can be easily checked by an algorithm.
In support of these claims we considered three examples in detail:
McCune’s resolution of the Robbins problem in 1996 using traditional techniques from automated theorem proving.
Heule et al.’s resolution of the Boolean Pythagorean Triple Conjecture in 2016 using a SAT-solver.
Romera-Paredes et al.’s 2024 improvement on a lower bound result for cap sets using a combination of classical methods (greedy and genetic algorithms) and machine learning (a large language model to suggest programs for generating cap sets).
In each of these cases, the questions that were resolved were genuinely open in the sense of being stated in the prior literature. But they are also insignificant relative to both Avigad’s criteria and ours. Since a number of other examples could now also be discussed alongside them, it is important to understand why these methods are representative and also why they satisfy (O3) and (O4).
In regard to (O3), brute force is traditionally characterized as a form of enumerative search – i.e. a method that determines if a statement asserting the existence of a certain kind of object is true by successively generating and testing potential witnesses. While demonstrations proceeding in this manner are typically devalued by mathematicians, the analysis of brute force has an interesting history and also a potentially evolving sense – e.g. specifically in light of recent work on SAT-solvers.
So we also sought to sharpen the traditional characterization by observing that the searches at issue were not guided by the specific mathematical content of the statements to which they were applied and led to unsurveyably large demonstrations (often measured in terabytes).[10]
With respect to (O4), suppose that a statement ψ has Σ1 form and that α is an algorithm which allows us to efficiently test whether φ(n) holds for a given n. Then if it turns out that ψ is true, this may be discovered by applying α to 0, 1, 2, ... until a witness n ∈ N for which φ(n) holds is found. And as we argued, when implementational details are taken into account, this is indeed a reasonable description of how the aforementioned applications of automation proceeded.
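The semi-decision procedure just described can be sketched in a few lines of Python (our own illustration; the function and property names are hypothetical):

```python
from itertools import count

def sigma1_search(phi, bound=None):
    """Semi-decide a Sigma_1 statement 'there exists n with phi(n)'.

    Enumerate n = 0, 1, 2, ... and test each candidate with the
    decidable property `phi` (playing the role of the algorithm alpha).
    If the statement is true, a witness is eventually found; if it is
    false, the search runs forever, hence the optional `bound` for
    demonstration purposes.
    """
    candidates = count() if bound is None else range(bound)
    for n in candidates:
        if phi(n):
            return n  # witness found: the Sigma_1 statement is proved
    return None  # inconclusive: only shows no witness below `bound`

# Toy Sigma_1 statement: "there exists n with n**2 > 50".
print(sigma1_search(lambda n: n * n > 50))  # least witness is 8
```

The asymmetry this exposes is exactly the one at issue: a true Σ1-statement is eventually verified by such a loop, while a false one merely makes the loop run forever.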
We also suggested that this condition already appears sufficient to distinguish most of the successes of automation to date from significant open questions in the sense characterized above. In particular, it is straightforward to see that most well-known problems that can be expressed number-theoretically have at least Π1-complexity – i.e. they are equivalent to statements of the form “for all n ∈ N, φ(n) holds” for decidable φ(x) (like the Goldbach conjecture) or require more complicated quantifier alternations (like the Collatz/3x + 1 conjecture or P ≠ NP, which have the Π2-form “for all n, there exists m such that ...”).[11] As far as we are aware, no significant open question of this form has been resolved by automated means. But this is perhaps not surprising because it is easy to see that a true Π1-statement cannot be confirmed by brute force alone.
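To illustrate the Π1 classification of Goldbach (our formalization, in standard arithmetical notation): the existential quantifiers over the primes p and q are bounded by n, and bounded quantifiers do not raise logical complexity, so the whole statement has a single unbounded universal quantifier:

```latex
% Goldbach's conjecture as a Pi_1 sentence: the matrix after the
% initial universal quantifier is algorithmically decidable.
\forall n \, \Big( \big( n > 2 \,\wedge\, 2 \mid n \big) \rightarrow
  \exists p \leq n \;\, \exists q \leq n \;
  \big( \mathrm{prime}(p) \,\wedge\, \mathrm{prime}(q)
        \,\wedge\, n = p + q \big) \Big)
```

Checking the matrix for any single n is routine; the difficulty lies entirely in the unbounded “for all n”, which no finite enumeration can exhaust.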
Such reflections are based on individual applications of specific automated techniques. But it is still useful to compare them with the means by which longstanding open questions have historically been resolved – e.g. by employing analogical reasoning, introducing so-called impure concepts and methods from prima facie unrelated subjects, or by developing novel “abstract” notions or theories, etc. Attending to the contrasts between such cases and the “success stories” of automation – e.g. in regard to the epistemological and normative dimensions of proof recently highlighted by Andrew Arana – provides a reminder of just how far away current techniques are from replicating at least the methods by which resolutions to significant open questions have traditionally been obtained.
At the same time, statements about provability from axioms can themselves be formalized metamathematically as Σ1-statements. So nothing in principle rules out that tomorrow’s newspaper will announce that a Millennium-style question has been resolved by an automated system operating in an essentially brute force-like manner.
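Schematically, using standard metamathematical notation (not specific to our paper): with Proof_Γ(d, x) the decidable relation “d codes a derivation from Γ of the formula coded by x”, provability is a single unbounded existential over derivation codes, and hence Σ1:

```latex
% Provability as a Sigma_1 statement: Proof_Gamma(d, x) is
% algorithmically decidable, so Prov_Gamma is an unbounded
% existential quantification over proof codes d.
\mathrm{Prov}_\Gamma(\ulcorner \varphi \urcorner) \;:\equiv\;
\exists d \; \mathrm{Proof}_\Gamma(d, \ulcorner \varphi \urcorner)
```

This is why brute force over proof codes is not ruled out in principle, even though the lower bounds discussed above suggest it will remain infeasible for the statements we care about.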
Assessing the likelihood of this requires going deeper into how current and foreseeable automated methods operate and also what (if anything) makes the open questions we currently regard as significant inherently difficult to resolve. To our colleagues in philosophy, we propose these as central conceptual questions concerning the role of artificial intelligence in mathematics. In a sequel we will additionally consider how other recent developments appear to bear on them – e.g. Gowers’ Manifesto, Tao’s Equational Theories Project, AlphaGeometry and AlphaProof, and the Frontier Math benchmarks.
But as long as a headline announcing an automated proof of a significant open question has not appeared – or, as is perhaps more likely, some partial success is reported but then seen on reflection to resemble prior cases satisfying (O3) and (O4) – solace may still perhaps be found in the following observations. First, the basic argument continues to provide a reason for believing that resolving the specific questions we have historically regarded as significant will continue to be difficult for automated systems. Second, if an aegis against the external imposition of values is desired, then we should not lose sight of how and why mathematics has developed to lead us to these questions.
1. Department of Philosophy, University of Warwick, Coventry CV4 7AL, U.K.
2. Department of Philosophy, Université Paris 1 Panthéon-Sorbonne, IHPST (UMR 8590), Paris 75005, France
3. As is so for many classical problems in AI, what we actually care about in practice is better characterized as a search problem – i.e. given a statement, return either a proof or refutation. But in most relevant cases the set of axioms in the background will be incomplete. We thus settle for considering the decision problem in light of the fact that most automated methods for (partially) deciding φ ∈ Prov(Γ) return, by design, proof-like objects attesting to one of these possibilities.
4. “As of today we cannot in general decide whether an arbitrary well-formed formula can or cannot be proved from the axiom schemata given below. And the contemporary practice of mathematics, using as it does heuristic methods, only makes sense because of this undecidability. When the undecidability fails then mathematics, as we now understand it, will cease to exist; in its place there will be a mechanical prescription for deciding whether a given sentence is provable or not.” (pp. 11-12)
5. “Suppose. . . that we could find a finite system of rules which enabled us to say whether any given formula was demonstrable or not. This system would embody a theorem of metamathematics. There is of course no such theorem, and this is very fortunate, since if there were we should have a mechanical set of rules for the solution of all mathematical problems, and our activities as mathematicians would come to an end.” (p. 16)
6. In drawing such a conclusion a link must of course be maintained between the theoretical models involved in stating limitative results – e.g. Turing machines – and the sorts of concrete computing devices we can construct and apply in practice. But despite familiar talk of how the “parallel architectures” of machine learning systems surpass classical constraints, the hardware on which such applications are implemented – including graphical and tensor processors – remains classical in design. As such, current and foreseeable AI-inspired technologies remain subject to limitative results.
7. Of course this does not entail that specific conjectures in number theory are formally undecidable. Rather it illustrates that the general decision problem for number theory is astronomically more difficult than that of the case of decidable theories like Presburger arithmetic (which is already highly intractable). Whether the restriction of this problem to “naturally occurring” or “significant” statements (as discussed below) remains difficult is a question which work in automation may help to clarify.
8. This trades the significant/insignificant or fashionable/non-fashionable distinctions for a perhaps equally ill-defined one involving “internality/externality”. But attention to how modern mathematics developed – as Avigad himself displays in his contribution to this volume – as well as its presumptive autonomy from other disciplines – as also highlighted by Michael in this newsletter (for example here) – helps to focus what is at issue.
9. Apologies to logicians: I don’t know how to make Substack typeset subscripts. Likewise for Π1 and Π2 that appear later.
10. Of course the search strategies employed were not entirely “naive”. But the factors which controlled the order in which potential witnesses were tested were determined not by the properties of the mathematical objects under consideration – e.g. Robbins algebras, 2-colorings, or cap sets – but rather by parameters internal to the systems at issue – e.g. the length or “age” of terms, symmetries in propositional encodings, or the output of a large language model trained on general data. In this sense, the methods are also “brute” in the sense of being “general purpose” rather than tailored to the specific mathematical questions to which they were applied.
11. Perhaps the most significant case of the interaction of logical complexity and number theory concerns the Riemann Hypothesis. Turing himself observed that it could be formulated as a Π2-statement, which was then improved to Π1 by Kreisel. This treatment by Lagarias provides a simple exposition.
Scientific American has been unreliable for some time now. It became clear a few years ago that the editorial board had been captured by certain fashionable ideological factions, owing to the willfully incomplete and slanted coverage of topics bearing on certain lightning-rod cultural issues.
I have not yet completely read the article by the philosophers, but I want to stress two points:
-- The idea of referring to complexity arguments is already in Manin's article
'' Kolmogorov's complexity and .... '' see ARXiv ...
-- I doubt a mixture of logic and complexity arguments could answer what was called by Michael (thanks to him) '' Kantor's problem '' - see an older blog.
Best wishes
Jean-Michel KANTOR