Terrence Tao
Primitive sets and von Mangoldt chains: Erdős Problem #1196 and beyond
Boris Alexeev, Kevin Barreto, Yanyang Li, Jared Duker Lichtman, Liam Price, Jibran Iqbal Shah, Quanyu Tang, and I have just uploaded to the arXiv our paper Primitive sets and von Mangoldt chains: Erdős Problem #1196 and beyond. This paper (which is a work in progress) represents our efforts to digest and document the recent flurry of developments around the following problem of Erdős, Sárközy, and Szemerédi on primitive sets:
Conjecture 1 (Erdős problem #1196) Suppose that is a primitive set of integers, which means that no element of divides another. Then as .
One can show that the upper bound of is best possible up to the error by taking to be the set of products of primes for some suitable parameter . This was one of the most well-known open problems in the study of primitive sets, and had attracted some number of partial results (for instance, Lichtman was able to show the upper bound of ). It was thus notable that this problem was first solved by an autonomous AI query (by the fifth author) a few weeks ago. This solution introduced a proof technique – based on Markov chains in the divisibility poset – which in retrospect is very natural for controlling primitive sets, but which had not been explicitly used in previous literature, though in retrospect many of the arguments in that literature involved a specific Markov chain which we call the downwards Mertens chain. The proof instead revolved around a different Markov chain, which we call the downwards von Mangoldt chain, which manages to neatly avoid the “” type losses in the previous Mertens-based arguments, and resolve Conjecture 1. In this paper we develop the Markov chain approach more systematically, and show that it settles several further conjectures concerning primitive sets, and also provides simpler proofs of some previous results in the literature. More precisely, in addition to Conjecture 1, we establish the following:
Theorem 2 (Erdős primitive set conjecture, #164) For any primitive consisting of numbers greater than ,
Theorem 3 (Odd Banks–Martin) Let and suppose is a primitive set consisting of odd numbers with at most prime factors. Then
where denotes the primes appearing as factors of elements of , and is the collection of products of primes from .
Theorem 4 ( is Erdős-strong) If is a primitive set consisting of even numbers, then
Theorem 5 (Ahlswede–Khachatrian–Sárközy) If is a primitive set, then whenever .
Theorem 6 (Erdős–Sárközy–Szemerédi, #1217) Let be such that the upper doubly logarithmic density is positive. Then there exists a strictly increasing infinite divisibility chain in such that
Theorem 2 and Theorem 5 had been previously established by Lichtman and Ahlswede–Khachatrian–Sárközy respectively, but the Markov chain formalism gives shorter (and more unified) proofs of both. Theorems 3, 4, 6 were open conjectures that can now be settled by this method. These results were obtained with varying levels of AI involvement, ranging from completely autonomous AI queries to traditional pen-and-paper calculations, to various hybrid approaches (for instance, with humans suggesting key inequalities that could then be rapidly tested numerically or even proved by various AI tools).
— 1. Chain/antichain duality and Markov chains —
I’ll now discuss the basic method of proof and try to motivate the main ideas, which have become much clearer in retrospect. Primitive sets can be viewed as antichains in the divisibility poset , in which the partial ordering is given by the divisibility relation . So, one can pose the following more abstract question: given a general poset and a weight function , what is the maximal value of as ranges over all antichains in ?
One can attack this problem using the well known duality between antichains and chains (totally ordered subsets of ): every antichain and chain can meet in at most one point, thus one has
for any chain and any antichain . In particular, if one has a measure on the space of all chains (viewed as a compact subspace of the power set of , equipped with the product topology) with the property that for all , then by integrating the previous inequality against and using Tonelli’s theorem one would obtain the upper boundIn fact this duality is completely tight:
Proposition 7 (Chain/antichain duality) Let be a poset, let be a weight function, and let . Then the following are equivalent:
- (i) for all antichains .
- (ii) There exists a measure of total mass at most on the space of chains, such that (1) holds for all .
Proof: We have already indicated how (ii) implies (i). Now we need to show that (i) implies (ii). A standard compactness argument allows us to reduce to the case when is finite. If (i) holds, then we also have
for all in the Stanley chain polytope, defined as the convex hull of the indicator functions of antichains. By a classic result of Stanley, this polytope can also be defined as the space of all obeying the inequalities and Applying linear programming duality (or the Farkas lemma), we conclude that the inequality (2) must be a non-negative linear combination of the inequalities (3), (4) (as well as the trivial inequality ). Equivalently, we can find non-negative weights for each chain such that and for all . The claim follows by viewing as a measure on the space of chains.Thus, the “universal” problem of obtaining a uniform upper bound on for all antichains is replaced with the equivalent “existential” dual problem of exhibiting a single measure on chains of controlled mass, which hits each element of the poset with a mass of at least the original weight . Thus, such problems are now reduced to that of finding a sufficiently clever construction of such a measure . If the mass of was normalized to equal , this becomes a probability problem: find a random chain process in the poset that hits each element of the poset with a sufficiently high probability. (Though in our paper we found it more convenient for technical reasons to not normalize the measure, and allow the mass to take values other than .)
It turns out (in a manner that was not explicitly appreciated in past literature) that particularly good choices of random chain to use here can come from Markov chains. (Here, the term “chain” is now being used in two different ways, but fortunately the order-theoretic concept of a chain and the Markov process-theoretic concept of a chain will be quite compatible in this discussion.) There will be two types of Markov chains on the poset that will be relevant: downwards Markov chains and upwards Markov chains. Here is our notation for a downwards Markov chain:
Definition 8 (Downwards Markov chain) Let be a poset, and suppose we designate some subset of to be the “absorbing states” (in practice these will be the minimal elements of , although they do not have to be). A downwards Markov chain on with absorbing states is a collection of transition probabilities for obeying the following axioms:
- (i) , with equality unless and , or if and .
- (ii) For any , one has .
Given such a downwards Markov chain and an initial state , one can generate a random decreasing sequence by having each transition to with probability after conditioning on the past history of the chain. This sequence will (almost surely) be strictly decreasing until it hits an absorbing state, in which case it stays there forever, although if the descending chain condition is not satisfied it is also possible for the sequence to be strictly decreasing indefinitely. We let denote the law of this decreasing sequence. This construction is already enough to recover the Lubell-Yamomoto-Meshalkin (LYM) inequality:
Theorem 9 (LYM inequality) If is the power set of with the inclusion partial ordering, and is an antichain in this poset (i.e., a Sperner family), then
Proof: We introduce the downwards Markov chain with absorbing state in which each non-empty subset of of some cardinality transitions to a -element subset chosen uniformly at random (i.e., with probability of transitioning to each). If we start the descending sequence from the maximal element of , then one can easily check that each is hit with probability . Applying Proposition 7 with , , and , we obtain the claim.
In the above argument we fixed the initial location of the Markov chain, but more generally one can start with any source mass and work with the measure for the purposes of applying Proposition 7.
One can also define upwards Markov chains in exact analogy with downwards Markov chains (reversing the order in the poset), which now generate random increasing sequences rather than decreasing sequences. There is a useful adjoint construction that can convert a downwards Markov chain into an upwards Markov chain: if we have a positive weight which is invariant under the chain in the sense that
for any , then we can define an adjoint upwards Markov chain (with no absorbing state) by the formula for any in . More generally, if is merely sub-invariant in the sense that for all , one can still construct an adjoint upwards chain as before, but now one must also add an additional absorbing maximal state to ensure that the transition probabilities still sum to one.A downwards or upwards Markov chain, when equipped with an invariant or sub-invariant measure, also induces a flow network on the poset, in which an edge from to is assigned a flow capacity of
One can rewrite the Markov chain arguments in the paper in terms of such flow networks, in which case the arguments often boil down to an application of the discrete divergence theorem, giving very short proofs of many of the above results; see the paper for more discussion. However, we chose to focus more on the Markov chain approach in our presentation, as this formalism is also natural and could potentially be more flexible for further applications.
— 2. The Mertens and von Mangoldt chains —
For the purpose of analyzing primitive sets, there are two downward chains on the natural numbers (with absorbing state )that, in retrospect, are particularly natural to use:
- The downwards Mertens chain, in which each transitions deterministically to , where is the largest prime factor of ; and
- the downwards von Mangoldt chain, in which each transitions to with probability for each dividing , with the von Mangoldt function.
The von Mangoldt weight is a natural choice here thanks to the fundamental identity
which encodes the fundamental theorem of arithmetic. The two chains are similar in many way to each other: the von Mangoldt process favors the division by the largest prime factor, but does not require it.The Mertens chain generates deterministic downward divisibility chains
starting from a product of primes , and as such this process was implicit in much of the previous literature on primitive sets. However, it does not quite interact well with the weight, which is not invariant or subinvariant with respect to this process. Intuitively, the process of dividing by tends to increasingly select for numbers for which is smaller than expected. Instead, the natural invariant measure for this chain is the Mertens weight the verification that this is indeed an invariant weight is a nice exercise in telescoping series. Taking the adjoint of the downwards Mertens chain with respect to this weight and running that chain from gives the upwards Mertens divisibility chain in which each transitions to for some prime with probability . A routine induction shows that each is hit by this chain with a probability of ; this for instance gives a weak version of Theorem 1, and similarly for the other results discussed above.The key innovation (which was uncovered by the AI-assisted proofs, though not quite in the notation and framework presented here) is to switch to the von Mangoldt chain, which removes the bias towards numbers whose largest prime factor is small. Indeed, the weight now turns out to be sub-invariant (after removing ) under this chain, and there is a modification
of this weight which turns out to be perfectly invariant. (We have an interpretation of this formula in terms of a zeta process that couples together various zeta distributions into a continuous divisibility chain; see the paper for further details.) Taking the adjoint with respect to this weight (or with the original weight ) can eliminate the loss in the previous argument, and give one of the proofs of Theorem 1 recorded in our paper (there are several other variants of this method that we also present).One slight defect of the von Mangoldt chain, as compared to the Mertens chain, is that it can “jump over” primitive sets (such as the set of products of primes) due to the fact that it will sometimes multiply or divide by a power of a prime rather than a prime itself. This turns out to be a technical difficulty for many of our applications, resulting in a need to make various small ad hoc modifications to the von Mangoldt chain to eliminate this type of jump.
In order to establish some crucial sub-invariance properties, it turns out (after some standard manipulations) to be useful to obtain good bounds on the negative log-derivative
of the Riemann zeta function in the region . Here it turns out that there is a clean (and rather efficient) upper bound which is equivalent to the non-decreasing nature of the Dirichlet eta function in this region. There are many proofs of this fact in the literature, but I would like to record a particularly cute proof, that is in the spirit of other arguments in the paper: one can interpret probabilistically as where is a gamma random variable of shape and scale . Because the sum of independent copies of and has the distribution of , one can couple together all the so that they are increasing in , at which point the claim follows.
— 3. Further directions —
Will Sawin and Ofir Gorodetsky have obtained analogues of several of the above results for function field models or permutation models respectively; we briefly discuss these in the paper, although we do not plan to cover these models in depth. We also note another recent use of this technique to solve a separate Erdős problem (#858) relating to antichains in a variant of the divisibility poset.
The zeta process that we have discovered hints at an emerging theory of the “developmental anatomy of integers”, which differs from the existing topic of anatomy of integers in that it views a large integer (and its prime factor “organs”) not as a static entity, but rather as an evolving process in which primes (or powers of primes, in the case of the von Mangoldt process) are added or removed to the integer over time. With this perspective, primitive sets can be viewed as singular moments in such a developmental process, which are only encountered at most once in the life cycle of a given integer. It seems of interest to study this developmental perspective further.
The paper is currently a work in progress; we have released an early version due to the public interest in this problem. We plan to explore some further applications, and also to formalize more of the above results in Lean (currently two of the six main theorems are formalized), before submitting the paper for publication. The situation here highlights a distinction I have recently made between three components of the problem solving process in mathematics, namely proof generation, proof verification, and proof digestion. In this particular case, the first two steps were extremely rapid due to modern AI tools; however, properly digesting the AI-generated proofs into a coherent exposition that places the arguments in context with both past literature and future directions remains a slower process that requires expert human attention.
Products of consecutive integers with unusual anatomy
I’ve just uploaded to the arXiv my paper “Products of consecutive integers with unusual anatomy“. This paper answers some questions of Erdős and Graham which were initially motivated by the study of the Diophantine factorial equation
where and are positive integers. Writing , one can rewrite this equation as where denotes the squarefree part of (the smallest factor of formed by dividing out a perfect square). For instance, we have which corresponds to the solution to the original equation.The equation (1) ties into the general question of what the anatomy (prime factorization) of the product looks like. This is a venerable topic, with the first major result being the Sylvester-Schur theorem from 1892 that the largest prime factor of is greater than whenever . Another notable result is the Erdős-Selfridge theorem that the product is never a perfect power for .
Erdős and Graham were able to show that solutions to (1) were somewhat rare, in that the set of possible values of had density zero. For them, the hardest case to treat was when the interval was what they called bad, in the sense that was divisible by the square of its largest prime factor. They were able, with some effort, to show that the union of all bad intervals also had density zero, which was a key ingredient in to prove the previous result about solutions to (1). They isolated a subcase of the bad intervals, which they called the very bad intervals, in which the product was a powerful number (divisible by the square of every prime factor).
A later paper of Luca, Saradha, and Shorey made the bounds more quantitative, showing that both the set of values of , as well as the union of bad intervals, had density for some absolute constant . In the other direction, just by considering the case , one can show that the number of possible values of up to is , where is the constant
As for the bad intervals, by again considering the case , it is possible to show that the number of bad points up to is see for instance this paper of Ivic. Similarly, the union of the very bad intervals contains as the set of powerful numbers; Golomb worked out that the number of powerful numbers up to is .It was conjectured by Erdős and Graham that all of these lower bounds are in fact sharp (up to multiplicative factors); this is Erdos Problem 380 (and a portion of Erdos Problem 374). The main result of this paper is to confirm this conjecture in two cases and come close in the third:
Theorem 1
- The number of numbers up to that lie in a bad interval of length is of the number of bad points up to .
- The number of numbers up to that lie in a very bad interval of length is .
- The number of numbers up to of the form for a solution to (1) is .
Not surprisingly, the methods of proof involve many standard tools in analytic number theory, such as the prime number theorem (and its variants in short intervals), zero density estimates, Vinogradov’s bounds on exponential sums, asymptotics for smooth numbers, the large sieve, the fundamental lemma of sieve theory, and the Burgess bound for character sums. There was one point where I needed a small amount of algebraic number theory (the classification of solutions to a generalized Pell equation), which was the one place where I turned to AI for assistance (though I ended up rewriting the AI argument myself). One amusing point is that I specifically needed the recent zero density theorem of Guth and Maynard (as converted to a bound on exceptions to the prime number theorem in short intervals by Gafni and myself); previous zero density theorems were barely not strong enough to close the arguments.
A few more details on the methods of proof. It turns out that very bad intervals, or intervals solving (1), are both rather short, in that the bound holds. The reason for this is that the primes that are larger than (in the very bad case) or for a large constant (in the (1) case) cannot actually divide any of the unless they divide it at least twice. This creates a constraint on the fractional parts of and that turns out to be inconsistent with the equidistribution results on those fractional parts coming from Vinogradov’s bounds on exponential sums unless is small. In the very bad case, this forces a linear relation between two powerful numbers; expressing powerful numbers as the product of a square and a cube, matters then boil down to counting solutions to an equation such as
with say . The number of solutions here turns out to be by work of Aktas-Murty and of Chan; we generalize the arguments in the former to handle a slightly more general equation. A similar argument handles solutions to (1), except in one regime where the parameter is somewhat large (comparable to ), in which case one instead collects some congruence conditions on and applies the large sieve.The situation with bad intervals is more delicate, because there is no obvious way to make small in all cases. However, by the large sieve (as well as the Guth–Maynard theorem), one can show that the contribution of large is negligible, and from bounds on smooth numbers one can show that the interval contains a number with a particularly specific anatomy, of the form where are all primes of roughly the same size, and is a smoother factor involving smaller primes. The rest of the bad interval creates some congruence conditions on the product . Using some character sum estimates coming from the Burgess bounds, we find that the residue of becomes fairly equidistributed amongst the primitive congruence classes to a given modulus when one perturbs the primes randomly (there are some complications from exceptional characters of Siegel zero type, but we can use a large values estimate to keep their total contribution under control). This allows us to show that the congruence conditions coming from the bad interval are restrictive enough to make non-trivial bad intervals quite rare compared to bad points. One innovation in this regard is to set up an “anti-sieve”: the elements of a bad interval tend to have an elevated chance of being divisible by small primes, and one can use moment methods to show that an excessive number of small prime divisors is somewhat rare. This can be compared to standard sieve arguments, which often seek to limit the event that a number has an unexpectedly deficient number of small prime divisors.
Mathematical methods and human thought in the age of AI
Tanya Klowden and I have uploaded to the arXiv our preprint “Mathematical methods and human thought in the age of AI“. This is an unabridged version of a solicited article for a forthcoming Blackwell Companion to the Philosophy of Mathematics. I rarely write article-length essays of a philosophical nature (perhaps the last one was in 2007), but given the topical interest in AI and formalization for mathematics, which has begun to raise increasingly fundamental questions about what the nature, purpose and practice of mathematics actually is (or ought to be), it seemed like it was a timely opportunity to write about these matters. Other mathematicians seem to have recently come to this conclusion also; see for instance this paper of Avigad, or this paper of Commelin, Jamnik, Ochigame, Taelman, and Venkatesh, both of which have come out in the last few weeks.
Our piece took over a year to write – which means, at the current pace of development in the field, that some of it is already slightly out of date. Nevertheless, it was an instructive exercise for both of us to try to look beyond the immediate technical issues presented by current AI and formalization tools and try to point out the philosophical questions that we will have to grapple with as these tools become increasingly capable and integrated into our profession, using prior examples of technological advancement as a guide. We don’t pretend to have definitive answers to most of these questions, but as with mathematics itself, the first step is to pose the questions and then try to make partial progress on them (or at least identify some negative results and eliminate some failed approaches). One point we particularly felt worth stressing is that AI tools and applications (in mathematics or elsewhere) should not be viewed purely through the technical lens of what microscale problems they solve and how effective or efficient they are at solving them, but also through the macroscopic humanitarian lens of how our society, our shared body of knowledge and understanding, and our species benefits (or is harmed) as a whole from these technologies.
Our initial submission ended up significantly exceeding the page limits of the submission and ventured beyond the philosophy of mathematics into broader philosophical and ethical questions about AI in general. A streamlined version of the paper will appear in the forthcoming companion, but we have decided to make the original longer version available on the arXiv.
Local Bernstein theory, and lower bounds for Lebesgue constants
I’ve just uploaded to the arXiv my paper “Local Bernstein theory, and lower bounds for Lebesgue constants“. This paper was initially motivated by a problem of Erdős} on Lagrange interpolation, but in the course of solving that problem, I ended up modifying some very classical arguments of Bernstein and his contemporaries (Boas, Duffin, Schaeffer, Riesz, etc.) to obtain “local” versions of these classical “Bernstein-type inequalities” that may be of independent interest.
Bernstein proved many estimates concerning the derivatives of polynomials, trigonometric polynomials, and entire functions of exponential type, but perhaps his most famous inequality in this direction is:
Lemma 1 (Bernstein’s inequality for trigonometric polynomials) Let be a trigonometric polynomial of degree at most , with for all . Then for all .
Similar inequalities concerning norms of derivatives of Littlewood-Paley components of functions are now ubiquitious in the modern theory of nonlinear dispersive PDE (where they are also called Bernstein estimates), but this will not be the focus of this current post.
A trigonometric polynomial of degree is of exponential type in the sense that for complex . Bernstein in fact proved a more general result:
Lemma 2 (Bernstein’s inequality for functions of exponential type) Let be an entire function of exponential type at most , with for all . Then for all .
There are several proofs of this lemma – see for instance this survey of Queffélec and Zarouf. In the case that is real-valued on , there is a nice proof by Duffin and Schaeffer, which we sketch as follows. Suppose we normalize , and adjust by a suitable damping factor so that actually decays slower than as . Then, for any and , one can use Rouche’s theorem to show that the function has the same number of zeroes as in a suitable large rectangle; but on the other hand one can use the intermediate value theorem to show that has at least as many zeroes than in the same rectangle. Among other things, this prevents double zeroes from occuring, which turns out to give the desired claim after some routine calculations (in fact one obtains the stronger bound for all real ).
The first main result of the paper is to obtain localized versions of Lemma 2 (as well as some related estimates). Roughly speaking, these estimates assert that if is holomorphic on a wide thin rectangle passing through the real axis, is bounded by on the intersection of the real axis with this rectangle, and is “locally of exponential type” in the sense that it is bounded by on the upper and lower edges of this rectangle (and obeys some very mild growth conditions on the remaining sides of this rectangle), then can be bounded by plus small errors on the real line, with some additional estimates away from the real line also available. The proof proceeds by a modification of the Duffin–Schaeffer argument, together with the two-constant theorem of Nevanlinna (and some standard estimates of harmonic measures on rectangles) to deal with the effect of the localization. (As a side note, this latter argument was provided to me by ChatGPT, as I was not previously aware of the Nevanlinna two-constant theorem.)
Once one localizes this “Bernstein theory”, it becomes suitable for the analysis of (real-rooted, monic) polynomials of a high degree , which are not bounded globally on (and grow polynomially rather than exponentially at infinity), but which can exhibit “local exponential type” behavior on various intervals, particularly in regions where the logarithmic potential
behaves like a smooth function (here is the empirical measure of the roots of ). A key example is the (monic) Chebyshev polynomials , which locally behave like sinusoids on the interval (and are locally of exponential type above and below this interval):
This becomes relevant in the theory of Lagrange interpolation. Recall that if are real numbers and is a polynomial of degree less than then one has the interpolation formula
where the Lagrange basis functions are defined by the formula In terms of the monic polynomial , we can write The stability and convergence properties of Lagrange interpolation are closely related to the Lebesgue function and for a given interval , the quantity is known as the Lebesgue constant for that interval.If one chooses the interpolation points poorly, then the Lebesgue constant can be extremely large. However, if one selects these points to be the roots of the aforementioned monic Chebyshev polynomials, then it is known that for all fixed intervals in . In the case , it was shown by Erdős} that this is the best possible value of the Lebesgue constant up to errors for interpolation on , thus
whenever (a more precise bound was later shown by Vertesi). Erdős and Turán then asked if the same lower bound held for more general intervals . This is shown in our paper; a variant integral bound is also established, answering a separate question of Erdős. These lower bounds had previously obtained up to constants by Erdős and Szabados}; the main issue was to obtain the sharp constant in the main term.In terms of the monic polynomial , these two estimates can be written as
and Using the intuition that should behave locally like a trigonometric polynomial, and performing some renormalizations, one can extract the following toy problem to work with first:Problem 3 Let be a trigonometric polynomial of degree with roots in .
It is easy to check that the lower bounds of and are sharp by considering the case when is a sinusoid .
The bound (3) is immediate from Bernstein’s inequality (Lemma 1). By applying a local version of this inequality, I was able to get a weak version of the claim (1) in which was replaced with ; see this early version of the paper, which was developed through conversations with Nat Sothanaphan and Aron Bhalla. By combining this argument with ideas from the older work of Erdős}, I was able to establish (1).
The bound (2) took me longer to establish, and involved a non-trivial amount of playing around with AI tools, the story of which I would like to share here. I had discovered the toy problem (4), but initially was not able to establish this inequality; AlphaEvolve seemed to confirm it numerically (with sinusoids appearing to be the extremizer), but did not offer direct clues on how to prove this rigorously. At some point I realized that the left-hand side factorized into the expressions and , and tried to bound these expressions separately. Perturbing around a sinusoid , I was able to show that the norm was a local minimum as long as one only perturbed by lower order Fourier modes, keeping the frequency coefficients unchanged. Guessing that this local minimum was actually a global minimum, this led me to conjecture the general lower bound
whenever was a degree trigonometric polynomial with highest frequency components . AlphaEvolve numerically confirmed this inequality to be likely to be true also. I still did not see how to prove this inequality, but I decided to try my luck giving it to ChatGPT Pro, which recognized it as an approximation problem and gave me a duality-based proof (based ultimately on the Fourier expansion of the square wave). With some further discussion, I was able to adapt this proof to functions of global exponential type (replacing the Fourier manipulations with contour shifting arguments, in the spirit of the Paley-Wiener theorem), which roughly speaking gave me half of what I needed to establish (2). However, I still needed the matching lower bound on the other factor in the toy problem. Again, AlphaEvolve could numerically confirm that this inequality was likely to be true, but now the quantity I was trying to control did not look convex or linear in , and so the previous duality method did not seem to apply. At this point I switched to pen and paper; eventually I realized that the expression almost looked like a sum of residues, and eventually after playing around with contour integrals of using the residue theorem I was able to establish (4), and then with a bit more (human) effort I could move from the toy problem back to the original problem to obtain (2). Quite possibly AI tools would also have been able to assist with these steps, but they were not necessary here; their main value for me was in quickly confirming that the approach I had in mind was numerically plausible, and in recognizing the right technique to solve one part of the toy problem I had isolated. (I also used AI tools for several other secondary tasks, such as literature review, proofreading, and generating pictures, but these applications of AI have matured to the point where using them for this purpose is almost mundane.)
Mathematics Distillation Challenge – Equational Theories
Mathematical research traditionally involves a small number of professional mathematicians working closely on difficult problems. However, I have long believed that there is a complementary way to do mathematics, in which one works with a broad community of mathematically minded people on problems which may not be as deep as the problems one traditionally works on, but still are of mathematical interest; and that modern technologies, including AI, are more suitable for contributing the latter type of workflow. The “Polymath projects” were one example of this broad type of collaboration, where internet platforms such as blogs and wikis were used to facilitate such collaboration. Some years later, collaborative formalization projects (such as the one to formalize the Polynomial Freiman–Ruzsa conjecture of Marton, discussed previously on this blog here) became popular in some circles. And in 2024, I launched the Equational Theories Project (ETP) (discussed on this blog here and here), combining the rigor of Lean formalization with “good old fashioned AI” (in the form of automated theorem provers) to settle (with formal verification) over 22 million true-false problems in universal algebra.
Continuing in this spirit, Damek Davis and I are launching a new project, in the form of an experimental competitive challenge hosted by the SAIR Foundation (where I serve as a board member, and which is supplying technical support and compute). The idea of this challenge, motivated in part by this recent paper of Honda, Murakami, and Zhang, is to measure the extent to which the 22 million universal algebra true-false results obtained by the ETP can be “distilled” into a short, human-readable “cheat sheet”, similar to how a student in an undergraduate math class might distill the knowledge learned from that class into a single sheet of paper that the student is permitted to bring into an exam.
Here is a typical problem in universal algebra that the ETP was able to answer:
Problem 1 Suppose that is a binary operation such that for all . Is it true that for all ?
Such a problem can be settled either by algebraically manipulating the initial equation to deduce the target equation, or by finding a counterexample to the target equation that still satisfies the initial equation. There are a variety of techniques to achieve either of these, but this sort of problem is difficult, and even undecidable in some cases; see this paper of the ETP collaborators for more discussion. Nevertheless, many of these problems can be settled with some effort by humans, by automated theorem provers, or by frontier AI systems; here for instance is an AI-generated solution to the above problem.
However, these AI models are expensive, and do not reveal much insight as to where their answers come from. If one instead tries a smaller and cheaper model, such as one of the many open-source models available, it turns out that these models basically perform no better than random chance, in that when asked to say whether the answer to a question such as the above is true or false, they only answer correctly about 50% of the time.
But, similarly to how a student struggling with the material for a math class can perform better on an exam when provided the right guidance, it turns out that such cheap models can perform at least modestly better on this task (with success rates increasing to about 55%-60%) if given the right prompt or “cheat sheet”.
“Stage 1” of the distillation challenge, which we launched today, asks for contestants to design a cheat sheet (of at most 10 kilobytes in size) that can increase the performance of these models on the above true-false problems to as high a level as possible. We have provided a “playground” with which to test one’s cheat sheet (or a small number of example cheat sheets) some cheap models against a public set of 1200 problems (1000 of which were randomly selected, and rather easy, together with 200 “hard” problems that were selected to resist the more obvious strategies for resolving these questions); a brief video explaining how to use the playground can be found here.
Submissions stage will end on April 20, after which we will evaluate the submissions against a private subset of test questions. The top 1000 submissions will advance to a second stage which we are currently in the process of designing, which will involve more advanced models, but also the more difficult task of not just providing a true-false answer, but also a proof or counterexample to the problem.
The competition will be coordinated on this Zulip channel, where I hope there will be a lively and informative discussion.
My hope is that the winning submissions will capture the most productive techniques for solving these problems, and/or provide general problem-solving techniques that would also be applicable to other types of mathematical problems. We started with the equational theory project data set for this pilot competition due to its availability and spectrum of difficulty levels, but if this type of distillation process leads to interesting results, one could certainly run in on many other types of mathematical problem classes to get some empirical data on how readily they can be solved, particularly after we learn from this pilot competition on how to encourage participation and share of best practices.
SAIR will also launch some other mathematical challenges in the coming months that will be of a more cooperative nature than this particular competitive challenge; stay tuned for further announcements.
SLMath Deputy Director Search
[This post is written in my capacity as Vice-Chair of the Board of Trustees of SLMath. -T.]
SLMath, formerly MSRI, has launched the search for the next Deputy Director. This key position is a close advisor to the Director and shares in the internal management of the scientific team and programs at SLMath. This position is ideal for an experienced professional with a PhD in mathematical sciences seeking a new opportunity to leverage their strengths in program and grant management, financial management, and people management.
Six Math Essentials
Just a brief announcement that I have been working with Quanta Books to publish a short book in popular mathematics entitled “Six Math Essentials“, which will cover six of the fundamental concepts in mathematics — numbers, algebra, geometry, probability, analysis, and dynamics — and how they connect with our real-world intuition, the history of math and science, and to modern practice of mathematics, both in theory and in applications. The scheduled publication date is Oct 27, but it is currently available for preorder.
