# Matematički blogovi

### News from Theorem of the Day

### New Theorem

### New Theorem

### New Theorem

### New Theorem

### Theorem Update

### New Theorem

### New Theorem

### New Theorem

### New Theorem

### New Theorem

### New Theorem

### Theorem Update

### New Theorem

### New Theorem

### Serrapilheira Postdoctoral Fellowship at Universidade Federal do Ceará (Brazil)

Yuri Lima asked me to announce in this blog the following opportunity for a post-doctoral position in Dynamical Systems and Ergodic Theory at Universidade Federal do Ceará (located in Fortaleza, Brazil):

**Serrapilheira Postdoctoral Fellowship – UFC**

The Department of Mathematics at Universidade Federal do Ceará (UFC) invites applications for a Serrapilheira Postdoctoral Fellowship in dynamical systems and ergodic theory. The position is for one year with start date at any moment between March 2020 and September 2020, with possibility of extension for another year.

**Qualifications and expectations**

The position is part of the project “Jangada Dinâmica – boosting dynamical systems in Brazil’s Northeastern region”, which is funded by Instituto Serrapilheira and aims to boost dynamical systems and ergodic theory in the mathematical community of universities located in the Northeastern region of Brazil. The applicant must have completed a PhD and be qualified for conducting research in either dynamical systems and/or ergodic theory. There are NO teaching duties. As part of the program, and to foster interaction, the fellow shall visit another department of Mathematics in the Northeast for one month each semester or two months per year. Applications from underrepresented groups in Mathematics are highly encouraged.

**Salary**

The salary will range from 5000–6000 Brazilian Reais monthly, tax free, in a twelve month-base calendar, according to the applicant’s qualifications. There will be an extra 5000 Brazilian Reais for each of the two months of visits to another institution in the Northeast. The salary is more attractive than those offered by regular Brazilian funding agencies.

**Department of Mathematics at UFC**

The Department of Mathematics at UFC currently holds the highest rank among Brazilian Mathematics departments. Having a strong history in the field of differential geometry, during the last 15 years it has developed new research groups in analysis, graph optimization and, more recently, in dynamical systems. Currently, the group of dynamical systems has two members, with expertise on nonuniform hyperbolicity, partial hyperbolicity, and symbolic dynamics.

**Location**

UFC is located in the city of Fortaleza, which has approximately 2.5 million inhabitants and is the fifth largest city of Brazil. Located in the Northeastern region of Brazil, Fortaleza is becoming a common port of entry to the country, with many direct flights to the US and Europe. Historically known for touristic reasons, it is nearby beaches with warm water and white sand dunes, and its cost of living is cheaper than bigger cities like Rio de Janeiro and São Paulo, thus making the monthly stipend affordable.

**Documentation required**

– CV with publication list

– Research statement

– Two (or more) letters of recommendation.

All documents must be sent to jangadadinamica@gmail.com. The applicant must send the first two documents, and ask two (or more) professors to directly send their letters of recommendation.

**Deadline:** December 31, 2019.

**More information:** jangadadinamica@gmail.com.

### 254A, Notes 9 – second moment and entropy methods

In these notes we presume familiarity with the basic concepts of probability theory, such as random variables (which could take values in the reals, vectors, or other measurable spaces), probability, and expectation. Much of this theory is in turn based on measure theory, which we will also presume familiarity with. See for instance this previous set of lecture notes for a brief review.

The basic objects of study in analytic number theory are deterministic; there is nothing inherently random about the set of prime numbers, for instance. Despite this, one can still interpret many of the averages encountered in analytic number theory in probabilistic terms, by introducing random variables into the subject. Consider for instance the form

of the prime number theorem (where we take the limit ). One can interpret this estimate probabilistically as

where is a random variable drawn uniformly from the natural numbers up to , and denotes the expectation. (In this set of notes we will use boldface symbols to denote random variables, and non-boldface symbols for deterministic objects.) By itself, such an interpretation is little more than a change of notation. However, the power of this interpretation becomes more apparent when one then imports concepts from probability theory (together with all their attendant intuitions and tools), such as independence, conditioning, stationarity, total variation distance, and entropy. For instance, suppose we want to use the prime number theorem (1) to make a prediction for the sum

After dividing by , this is essentially

With probabilistic intuition, one may expect the random variables to be approximately independent (there is no obvious relationship between the number of prime factors of , and of ), and so the above average would be expected to be approximately equal to

which by (2) is equal to . Thus we are led to the prediction

The asymptotic (3) is widely believed (it is a special case of the *Chowla conjecture*, which we will discuss in later notes; while there has been recent progress towards establishing it rigorously, it remains open for now.

How would one try to make these probabilistic intuitions more rigorous? The first thing one needs to do is find a more quantitative measurement of what it means for two random variables to be “approximately” independent. There are several candidates for such measurements, but we will focus in these notes on two particularly convenient measures of approximate independence: the “” measure of independence known as covariance, and the “” measure of independence known as mutual information (actually we will usually need the more general notion of conditional mutual information that measures conditional independence). The use of type methods in analytic number theory is well established, though it is usually not described in probabilistic terms, being referred to instead by such names as the “second moment method”, the “large sieve” or the “method of bilinear sums”. The use of methods (or “entropy methods”) is much more recent, and has been able to control certain types of averages in analytic number theory that were out of reach of previous methods such as methods. For instance, in later notes we will use entropy methods to establish the logarithmically averaged version

of (3), which is implied by (3) but strictly weaker (much as the prime number theorem (1) implies the bound , but the latter bound is much easier to establish than the former).

As with many other situations in analytic number theory, we can exploit the fact that certain assertions (such as approximate independence) can become significantly easier to prove if one only seeks to establish them *on average*, rather than uniformly. For instance, given two random variables and of number-theoretic origin (such as the random variables and mentioned previously), it can often be extremely difficult to determine the extent to which behave “independently” (or “conditionally independently”). However, thanks to second moment tools or entropy based tools, it is often possible to assert results of the following flavour: if are a large collection of “independent” random variables, and is a further random variable that is “not too large” in some sense, then must necessarily be nearly independent (or conditionally independent) to many of the , even if one cannot pinpoint precisely which of the the variable is independent with. In the case of the second moment method, this allows us to compute correlations such as for “most” . The entropy method gives bounds that are significantly weaker quantitatively than the second moment method (and in particular, in its current incarnation at least it is only able to say non-trivial assertions involving interactions with residue classes at small primes), but can control significantly more general quantities for “most” thanks to tools such as the Pinsker inequality.

** — 1. Second moment methods — **

In this section we discuss probabilistic techniques of an “” nature. We fix a probability space to model all of random variables; thus for instance we shall model a complex random variable in these notes by a measurable function . (Strictly speaking, there is a subtle distinction one can maintain between a random variable and its various measure-theoretic models, which becomes relevant if one later decides to modify the probability space , but this distinction will not be so important in these notes and so we shall ignore it. See this previous set of notes for more discussion.)

We will focus here on the space of complex random variables (that is to say, measurable maps ) whose *second moment*

of is finite. In many number-theoretic applications the finiteness of the second moment will be automatic because will only take finitely many values. As is well known, the space has the structure of a complex Hilbert space, with inner product

and norm

for . By slight abuse of notation, the complex numbers can be viewed as a subset of , by viewing any given complex number as a constant (deterministic) random variable. Then is a one-dimensional subspace of , spanned by the unit vector . Given a random variable to , the projection of to is then the *mean*

and we obtain an orthogonal splitting of any into its mean and its mean zero part . By Pythagoras’ theorem, we then have

The first quantity on the right-hand side is the square of the distance from to , and this non-negative quantity is known as the variance

The square root of the variance is known as the standard deviation. The variance controls the distribution of the random variable through Chebyshev’s inequality

for any , which is immediate from observing the inequality and then taking expectations of both sides. Roughly speaking, this inequality asserts that typically deviates from its mean by no more than a bounded multiple of the standard deviation .

A slight generalisation of Chebyshev’s inequality that can be convenient to use is

for any and any complex number (which typically will be a simplified approximation to the mean ), which is proven similarly to (6) but noting (from (5)) that .

Informally, (6) is an assertion that a square-integrable random variable will concentrate around its mean if its variance is not too large. See these previous notes for more discussion of the concentration of measure phenomenon. One can often obtain stronger concentration of measure than what is provided by Chebyshev’s inequality if one is able to calculate higher moments than the second moment, such as the fourth moment or exponential moments , but we will not pursue this direction in this set of notes.

Clearly the variance is homogeneous of order two, thus

for any and . In particular, the variance is not always additive: the claim fails in particular when is not almost surely zero. However, there is an important substitute for this formula. Given two random variables , the inner product of the corresponding mean zero parts is a complex number known as the covariance:

As are orthogonal to , it is not difficult to obtain the alternate formula

The covariance is then a positive semi-definite inner product on (it basically arises from the Hilbert space structure of the space of mean zero variables), and . From the Cauchy-Schwarz inequality we have

If have non-zero variance (that is, they are not almost surely constant), then the ratio

is then known as the correlation between and , and is a complex number of magnitude at most ; for real-valued that are not almost surely constant, the correlation is instead a real number between and . At one extreme, a correlation of magnitude occurs if and only if is a scalar multiple of . At the other extreme, a correlation of zero is an indication (though not a guarantee) of independence. Recall that two random variables are *independent* if one has

for all (Borel) measurable . In particular, setting , for and integrating using Fubini’s theorem, we conclude that

similarly with replaced by , and similarly for . In particular we have

and thus from (8) we thus see that independent random variables have zero covariance (and zero correlation, when they are not almost surely constant). On the other hand, the converse fails:

**Exercise 1** Provide an example of two random variables which are not independent, but which have zero correlation or covariance with each other. (There are many ways to produce some examples. One comes from exploiting various systems of orthogonal functions, such as sines and cosines. Another comes from working with random variables taking only a small number of values, such as .

for any finite collection of random variables . These identities combine well with Chebyshev-type inequalities such as (6), (7), and this leads to a very common instance of the second moment method in action. For instance, we can use it to understand the distribution of the number of prime factors of a random number that fall within a given set . Given any set of natural numbers, define the *logarithmic size* to be the quantity

Thus for instance Euler’s theorem asserts that the primes have infinite logarithmic size.

**Lemma 2 (Turan-Kubilius inequality, special case)** Let be an interval of length at least , and let be an integer drawn uniformly at random from this interval, thus

for all . Let be a finite collection of primes, all of which have size at most . Then the random variable has mean

and variance

In particular,

and from (7) we have

for any .

*Proof:* For any natural number , we have

We now write . From (11) we see that each indicator random variable , has mean and variance ; similarly, for any two distinct , we see from (11), (8) the indicators , have covariance

and the claim now follows from (10).

The exponents of in the error terms here are not optimal; but in practice, we apply this inequality when is much larger than any given power of , so factors such as will be negligible. Informally speaking, the above lemma asserts that a typical number in a large interval will have roughly prime factors in a given finite set of primes, as long as the logarithmic size is large.

If we apply the above lemma to for some large , and equal to the primes up to (say) , we have , and hence

Since , we recover the main result

of Section 5 of Notes 1 (indeed this is essentially the same argument as in that section, dressed up in probabilistic language). In particular, we recover the Hardy-Ramanujan law that a proportion of the natural numbers in have prime factors.

**Exercise 3 (Turan-Kubilius inequality, general case)** Let be an additive function (which means that whenever are coprime. Show that

where

(Hint: one may first want to work with the special case when vanishes whenever so that the second moment method can be profitably applied, and then figure out how to address the contributions of prime powers larger than .)

**Exercise 4 (Turan-Kubilius inequality, logarithmic version)** Let with , and let be a collection of primes of size less than with . Show that

**Exercise 5 (Paley-Zygmund inequality)** Let be non-negative with positive mean. Show that

This inequality can sometimes give slightly sharper results than the Chebyshev inequality when using the second moment method.

Now we give a useful lemma that quantifies a heuristic mentioned in the introduction, namely that if several random variables do not correlate with each other, then it is not possible for any further random variable to correlate with many of them simultaneously. We first state an abstract Hilbert space version.

**Lemma 6 (Bessel type inequality, Hilbert space version)** If are elements of a Hilbert space , and are positive reals, then

*Proof:* We use the duality method. Namely, we can write the left-hand side of (13) as

for some complex numbers with (just take to be normalised by the left-hand side of (14), or zero if that left-hand side vanishes. By Cauchy-Schwarz, it then suffices to establish the dual inequality

The left-hand side can be written as

Using the arithmetic mean-geometric mean inequality and symmetry, this may be bounded by

Since , the claim follows.

**Corollary 7 (Bessel type inequality, probabilistic version)** If , and are positive reals, then

*Proof:* By subtracting the mean from each of we may assume that these random variables have mean zero. The claim now follows from Lemma 6.

To get a feel for this inequality, suppose for sake of discussion that and all have unit variance and , but that the are pairwise uncorrelated. Then the right-hand side is equal to , and the left-hand side is the sum of squares of the correlations between and each of the . Any individual correlation is then still permitted to be as large as , but it is not possible for multiple correlations to be this large simultaneously. This is geometrically intuitive if one views the random variables as vectors in a Hilbert space (and correlation as a rough proxy for the angle between such vectors). This lemma also shares many commonalities with the large sieve inequality, discussed in this set of notes.

One basic number-theoretic application of this inequality is the following sampling inequality of Elliott, that lets one approximate a sum of an arithmetic function by its values on multiples of primes :

**Exercise 8 (Elliott’s inequality)** Let be an interval of length at least . Show that for any function , one has

(*Hint:* Apply Corollary 7 with , , and , where is the uniform variable from Lemma 2.) Conclude in particular that for every , one has

for all primes outside of a set of exceptional primes of logarithmic size .

Informally, the point of this inequality is that an arbitrary arithmetic function may exhibit correlation with the indicator function of the multiples of for some primes , but cannot exhibit significant correlation with all of these indicators simultaneously, because these indicators are not very correlated to each other. We note however that this inequality only gains a tiny bit over trivial bounds, because the set of primes up to only has logarithmic size by Mertens’ theorems; thus, any asymptotics that are obtained using this inequality will typically have error terms that only improve upon the trivial bound by factors such as .

**Exercise 9 (Elliott’s inequality, logarithmic form)** Let with . Show that for any function , one has

and thus for every , one has

for all primes outside of an exceptional set of primes of logarithmic size .

**Exercise 10** Use Exercise (9) and a duality argument to provide an alternate proof of Exercise 4. (*Hint:* express the left-hand side of (12) as a correlation between and some suitably -normalised arithmetic function .)

As a quick application of Elliott’s inequality, let us establish a weak version of the prime number theorem:

**Proposition 11 (Weak prime number theorem)** For any we have

whenever are sufficiently large depending on .

This estimate is weaker than what one can obtain by existing methods, such as Exercise 56 of Notes 1. However in the next section we will refine this argument to recover the full prime number theorem.

*Proof:* Fix , and suppose that are sufficiently large. From Exercise 9 one has

for all primes outside of an exceptional set of logarithmic size . If we restrict attention to primes then one sees from the integral test that one can replace the sum by and only incur an additional error of . If we furthermore restrict to primes larger than , then the contribution of those that are divisible by is also . For not divisible by , one has . Putting all this together, we conclude that

for all primes outside of an exceptional set of logarithmic size . In particular, for large enough this statement is true for at least one such . The claim then follows.

As another application of Elliott’s inequality, we present a criterion for orthogonality between multiplicative functions and other sequences, first discovered by Katai (with related results also introduced earlier by Daboussi and Delange), and rediscovered by Bourgain, Sarnak, and Ziegler:

**Proposition 12 (Daboussi-Delange-Katai-Bourgain-Sarnak-Ziegler criterion)** Let be a multiplicative function with for all , and let be another bounded function. Suppose that one has

as for any two distinct primes . Then one has

as .

*Proof:* Suppose the claim fails, then there exists (which we can assume to be small) and arbitrarily large such that

By Exercise 8, this implies that

for all primes outside of an exceptional set of logarithmic size . Call such primes “good primes”. In particular, by the pigeonhole principle, and assuming large enough, there exists a dyadic range with which contains good primes.

Fix a good prime in . From (15) we have

We can replace the range by with negligible error. We also have except when is a multiple of , but this latter case only contributes which is also negligible compared to the right-hand side. We conclude that

for every good prime. On the other hand, from Lemma 6 we have

where range over the good primes in . The left-hand side is then , and by hypothesis the right-hand side is for large enough. As and is small, this gives the desired contradiction

**Exercise 13 (Daboussi-Delange theorem)** Let be irrational, and let be a multiplicative function with for all . Show that

as . If instead is rational, show that there exists be a multiplicative function with for which the statement (16) fails. (Hint: use Dirichlet characters and Plancherel’s theorem for finite abelian groups.)

** — 2. An elementary proof of the prime number theorem — **

Define the Mertens function

As shown in Theorem 58 of Notes 1, the prime number theorem is equivalent to the bound

as . We now give a recent proof of this theorem, due to Redmond McNamara (personal communication), that relies primarily on Elliott’s inequality and the Selberg symmetry formula; it is a relative of the standard elementary proof of this theorem due to Erdös and Selberg. In order to keep the exposition simple, we will not arrange the argument in a fashion that optimises the decay rate (in any event, there are other proofs of the prime number theorem that give significantly stronger bounds).

Firstly we see that Elliott’s inequality gives the following weaker version of (17):

**Lemma 14 (Oscillation for Mertens’ function)** If and , then we have

for all primes outside of an exceptional set of primes of logarithmic size .

*Proof:* We may assume as the claim is trivial otherwise. From Exercise 8 applied to and , we have

for all outside of an exceptional set of primes of logarithmic size . Since for not divisible by , the right-hand side can be written as

Since outside of an exceptional set of logarithmic size , the claim follows.

Informally, this lemma asserts that for most primes , which morally implies that for most primes . If we can then locate suitable primes with , thus should then lead to , which should then yield the prime number theorem . The manipulations below are intended to make this argument rigorous.

It will be convenient to work with a logarithmically averaged version of this claim.

**Corollary 15 (Logarithmically averaged oscillation)** If and is sufficiently large depending on , then

for all outside of an exceptional set of logarithmic size .

*Proof:* For each , we have from the previous lemma that

for all outside of an exceptional set of logarithmic size . We then have

so it suffices by Markov’s inequality to show that

But by Fubini’s theorem, the left-hand side may be bounded by

and the claim follows.

Let be sufficiently small, and let be sufficiently large depending on . Call a prime *good* if the bound (18) holds and *bad* otherwise, thus all primes outside of an exceptional set of bad primes of logarithmic size are good. Now we observe that we can make small as long as we can make two good primes multiply to be close to a third:

**Proposition 16** Suppose there are good primes with . Then .

*Proof:* By definition of good prime, we have the bounds

We rescale (20) by to conclude that

We can replace the integration range here from to with an error of if is large enough. Also, since , we have . Thus we have

Combining this with (19), (21) and the triangle inequality (writing as a linear combination of , , and ) we conclude that

This is an averaged version of the claim we need. To remove the averaging, we use the identity (see equation (63) of Notes 1) to conclude that

From the triangle inequality one has

and hence by Mertens’ theorem

From the Brun-Titchmarsh inequality (Corollary 61 of Notes 1) we have

and so from the previous estimate and Fubini’s theorem one has

and hence by (22) (using trivial bounds to handle the region outside of )

Since

we conclude (for large enough) that

and the claim follows.

To finish the proof of the prime number theorem, it thus suffices to locate, for sufficiently large, three good primes with . If we already had the prime number theorem, or even the weaker form that every interval of the form contained primes for large enough, then this would be quite easy: pick a large natural number (depending on , but independent of ), so that the primes up to has logarithmic size (so that only of them are bad, as measured by logarithmic size), and let be random numbers and drawn uniformly from (say) . From the prime number theorem, for each , the interval contains primes. In particular, contains primes, but the expected number of bad primes in this interval is . Thus by Markov’s inequality there would be at least a chance (say) of having at least one good prime in ; similarly there is a chance of having a good prime in , and a chance of having a good prime in . Thus (as an application of the probabilistic method), there exist (deterministic) good primes with the required properties.

Of course, using the prime number theorem here to prove the prime number theorem would be circular. However, we can still locate a good triple of primes using the Selberg symmetry formula

as , where is the second von Mangoldt function

see Proposition 60 of Notes 1. We can strip away the contribution of the primes:

**Exercise 17** Show that

as .

In particular, on evaluating this at and subtracting, we have

whenever is sufficiently large depending on . In particular, for any such , one either has

(or both). Informally, the Selberg symmetry formula shows that the interval contains either a lot of primes, or a lot of semiprimes. The factor of is slightly annoying, so we now remove it. Consider the contribution of those primes to (25) with . This is bounded by

which we can bound crudely using the Chebyshev bound by

which by Mertens theorem is . Thus the contribution of this case can be safely removed from (25). Similarly for those cases when . For the remaining cases we bound . We conclude that for any sufficiently large , either (24) or

In order to find primes with close to , it would be very convenient if we could find a for which (24) and (26) *both* hold. We can’t quite do this directly, but due to the “connected” nature of the set of scales , but we can do the next best thing:

**Proposition 18** Suppose is sufficiently large depending on . Then there exists with such that

*Proof:* We know that every in obeys at least one of (27), (28). Our task is to produce an adjacent pair of , one of which obeys (27) and the other obeys (28). Suppose for contradiction that no such pair exists, then whenever fails to obey (27), then any adjacent must also fail to do so, and similarly for (28). Thus either (27) will fail to hold for all , or (28) will fail to hold for all such . If (27) fails for all , then on summing we have

which contradicts Mertens’ theorem if is large enough because the left-hand side is . Similarly, if (28) fails for all , then

and again Mertens’ theorem can be used to lower bound the left-hand side by (in fact one can even gain an additional factor of if one works things through carefully) and obtain a contradiction.

The above proposition does indeed provide a triple of primes with . If is sufficiently large depending on and less than (say) , so that , this would give us what we need as long as one of the triples consisted only of good primes. The only way this can fail is if either

for some , or if

for some . In the first case, we can sum to conclude that

and in the second case we have

Since the total set of bad primes up to has logarithmic size , we conclude from the pigeonhole principle (and the divergence of the harmonic series ) that for any depending only on , and any large enough, there exists such that neither of (29) and (30) hold. Indeed the set of obeying (29) has logarithmic size , and similarly for (30). Choosing a that avoids both of these scenarios, we then find a good and good with , so that , and then by Proposition 16 we conclude that for all sufficiently large . Sending to zero, we obtain the prime number theorem.

** — 3. Entropy methods — **

In the previous section we explored the consequences of the second moment method, which applies to square-integrable random variables taking values in the real or complex numbers. Now we explore entropy methods, which now apply to random variables which take a finite number of values (equipped with the discrete sigma-algebra), but whose range need not be numerical in nature. (One could extend entropy methods to slightly larger classes of random variables, such as ones that attain a countable number of values, but for our applications finitely-valued random variables will suffice.)

The fundamental notion here is that of the Shannon entropy of a random variable. If takes values in a finite set , its Shannon entropy (or *entropy* for short) is defined by the formula

where ranges over all the possible values of , and we adopt the convention , so that values that are almost surely not attained by do not influence the entropy. We choose here to use the natural logarithm to normalise our entropy (in which case a unit of entropy is known as a “nat“); in the information theory literature it is also common to use the base two logarithm to measure entropy (in which case a unit of entropy is known as a “bit“, which is equal to nats). However, the precise choice of normalisation will not be important in our discussion.

It is clear that if two random variables have the same probability distribution, then they have the same entropy. Also, the precise choice of range set is not terribly important: if takes values in , and is an injection, then it is clear that and have the same entropy:

This is in sharp contrast to moment-based statistics such as the mean or variance, which can be radically changed by applying some injective transformation to the range values.

Informally, the entropy informally measures how “spread out” or “disordered” the distribution of is, behaving like a logarithm of the size of the “essential support” of such a variable; from an information-theoretic viewpoint, it measures the amount of “information” one learns when one is told the value of . Here are some basic properties of Shannon entropy that help support this intuition:

**Exercise 19 (Basic properties of Shannon entropy)** Let be a random variable taking values in a finite set .

- (i) Show that , with equality if and only if is almost surely deterministic (that is to say, it is almost surely equal to a constant ).
- (ii) Show that
with equality if and only if is uniformly distributed on . (Hint: use Jensen’s inequality and the convexity of the map on .)

- (iii) (Shannon-McMillan-Breiman theorem) Let be a natural number, and let be independent copies of . As , show that there is a subset of cardinality with the properties that
and

uniformly for all . (The proof of this theorem will require Stirling’s formula, which you may assume here as a black box; see also this previous blog post.) Informally, we thus see a large tuple of independent samples of approximately behaves like a uniform distribution on values.

One can view Shannon entropy as a generalisation of the notion of cardinality of a finite set (or equivalently, cardinality of finite sets can be viewed as a special case of Shannon entropy); see this previous blog post for an elaboration of this point.

The concept of Shannon entropy becomes significantly more powerful when combined with that of conditioning. Recall that a random variable taking values in a range set can be modeled by a measurable map from a probability space to the range . If is an event in of positive probability, we can then *condition* to the event to form a new random variable on the conditioned probability space , where

is the restriction of the -algebra to ,

is the conditional probability measure on , and is the restriction of to . This random variable lives on a different probability space than itself, so it does not make sense to directly combine these variables (thus for instance one cannot form the sum even when both random variables are real or complex valued); however, one can still form the Shannon entropy of the conditioned random variable , which is given by the same formula

Given another random variable taking values in another finite set , we can then define the conditional Shannon entropy to be the expected entropy of the level sets , thus

with the convention that the summand here vanishes when . From the law of total probability we have

for any , and hence by Jensen’s inequality

for any ; summing we obtain the Shannon entropy inequality

Informally, this inequality asserts that the new information content of can be decreased, but not increased, if one is first told some additional information .

This inequality (33) can be rewritten in several ways:

**Exercise 20** Let , be random variables taking values in finite sets respectively.

- (i) Establish the chain rule
where is the joint random variable . In particular, (33) can be expressed as a subadditivity formula

- (ii) If is a function of , in the sense that for some (deterministic) function , show that .
- (iii) Define the mutual information by the formula
Establish the inequalities

with the first inequality holding with equality if and only if are independent, and the latter inequalities holding if and only if is a function of (or vice versa).

From the above exercise we see that the mutual information is a measure of dependence between and , much as correlation or covariance was in the previous sections. There is however one key difference: whereas a zero correlation or covariance is a consequence but not a guarantee of independence, zero mutual information is *logically equivalent* to independence, and is thus a stronger property. To put it another way, zero correlation or covariance allows one to calculate the average in terms of individual averages of , but zero mutual information is stronger because it allows one to calculate the more general averages in terms of individual averages of , for arbitrary functions taking values into the complex numbers. This increased power of the mutual information statistic will allow us to estimate various averages of interest in analytic number theory in ways that do not seem amenable to second moment methods.

The subadditivity property formula can be conditioned to any event occuring with positive probability (replacing the random variables by their conditioned counterparts ), yielding the inequality

Applying this inequality to the level events of some auxiliary random variable taking values in another finite set , multiplying by , and summing, we conclude the inequality

In other words, the conditional mutual information

between and conditioning on is always non-negative:

One has conditional analogues of the above exercise:

**Exercise 21** Let , , be random variables taking values in finite sets respectively.

- (i) Establish the conditional chain rule
In particular, (36) is equivalent to the inequality

- (ii) Show that equality holds in (36) if and only if are conditionally independent relative to , which means that
for any , , .

- (iii) Show that , with equality if and only if is almost surely a deterministic function of .
- (iv) Show the data processing inequality
for any functions , , and more generally that

- (v) If is an injective function, show that
However, if is not assumed to be injective, show by means of examples that there is no order relation between the left and right-hand side of (40) (in other words, show that either side may be greater than the other). Thus, increasing or decreasing the amount of information that is known may influence the mutual information between two remaining random variables in either direction.

- (vi) If is a function of , and also a function of (thus for some and ), and a further random variable is a function jointly of (thus for some ), establish the submodularity inequality

We now give a key motivating application of the Shannon entropy inequalities. Suppose one has a sequence of random variables, all taking values in a finite set , which are stationary in the sense that the tuples and have the same distribution for every . In particular we will have

and hence by (39)

If we write , we conclude from (34) that we have the concavity property

In particular we have for any , which on summing and telescoping series (noting that ) gives

and hence we have the entropy monotonicity

In particular, the limit exists. This quantity is known as the Kolmogorov-Sinai entropy of the stationary process ; it is an important statistic in the theory of dynamical systems, and roughly speaking measures the amount of entropy produced by this process as a function of a discrete time vairable . We will not directly need the Kolmogorov-Sinai entropy in our notes, but a variant of the entropy monotonicity formula (41) will be important shortly.

In our application we will be dealing with processes that are only asymptotically stationary rather than stationary. To control this we recall the notion of the total variation distance between two random variables taking values in the same finite space , defined by

There is an essentially equivalent notion of this distance which is also often in use:

**Exercise 22** If two random variables take values in the same finite space , establish the inequalities

and for any , establish the inequality

Shannon entropy is continuous in total variation distance as long as we keep the range finite. More quantitatively, we have

**Lemma 23** If two random variables take values in the same finite space , then

with the convention that the error term vanishes when .

*Proof:* Set . The claim is trivial when (since then have the same distribution) and when (from (32)), so let us assume , and our task is to show that

If we write , , and , then

By dividing into the cases and we see that

since , it thus suffices to show that

But from Jensen’s inequality (32) one has

since , the claim follows.

In the converse direction, if a random variable has entropy close to the maximum , then one can control the total variation:

**Lemma 24 (Special case of Pinsker inequality)** If takes values in a finite set , and is a uniformly distributed random variable on , then

Of course, we have , so we may also write the above inequality as

The optimal value of the implied constant here is known to equal , but we will not use this sharp version of the inequality here.

*Proof:* If we write and , and , then we can rewrite the claimed inequality as

Observe that the function is concave, and in fact for all . From this and Taylor expansion with remainder we may write

for some between and . Since is independent of , and , we thus have on summing in

By Cauchy-Schwarz we then have

Since and , the claim follows.

The above lemma does not hold when the comparison variable is not assumed to be uniform; in particular, two non-uniform random variables can have precisely the same entropy but yet have different distributions, so that their total variation distance is positive. There is a more general variant, known as the Pinsker inequality, which we will not use in these notes:

**Exercise 25 (Pinsker inequality)** If take values in a finite set , define the Kullback-Leibler divergence of relative to by the formula

(with the convention that the summand vanishes when vanishes).

- (i) Establish the Gibbs inequality .
- (ii) Establish the Pinsker inequality
In particular, vanishes if and only if have identical distribution. Show that this implies Lemma 24 as a special case.

- (iii) Give an example to show that the Kullback-Liebler divergence need not be symmetric, thus there exist such that .
- (iv) If are random variables taking values in finite sets , and are
*independent*random variables taking values in respectively with each having the same distribution of , show that

In our applications we will need a relative version of Lemma 24:

**Corollary 26 (Relative Pinsker inequality)** If takes values in a finite set , takes values in a finite set , and is a uniformly distributed random variable on that is independent of , then

*Proof:* From direct calculation we have the identity

As is independent of , is uniformly distributed on . From Lemma 24 we conclude

Inserting this bound and using the Cauchy-Schwarz inequality, we obtain the claim.

Now we are ready to apply the above machinery to give a key inequality that is analogous to Elliott’s inequality. Inequalities of this type first appeared in one of my papers, introducing what I called the “entropy decrement argument”; the following arrangement of the inequality and proof is due to Redmond McNamara (personal communication).

**Theorem 27 (Entropy decrement inequality)** Let be a random variable taking values in a finite set of integers, which obeys the approximate stationarity

for some . Let be a collection of distinct primes less than some threshold , and let be natural numbers that are also bounded by . Let be a function taking values in a finite set . For , let denote the -valued random variable

and let denote the -valued random variable

Also, let be a random variable drawn uniformly from , independently of . Then

The factor (arising from an invocation of the Chinese remainder theorem in the proof) unfortunately restricts the usefulness of this theorem to the regime in which all the primes involved are of “sub-logarithmic size”, but once one is in that regime, the second term on the right-hand side of (45) tends to be negligible in practice. Informally, this theorem asserts that for most small primes , the random variables and behave as if they are independent of each other.

*Proof:* We can assume , as the claim is trivial for (the all have zero entropy). For , we introduce the -valued random variable

The idea is to exploit some monotonicity properties of the quantity , in analogy with (41). By telescoping series we have

where we extend (44) to the case. From (38) we have

Now we lower bound the summand on the right-hand side. From multiple applications of the conditional chain rule (37) we have

We now use the approximate stationarity of to derive an approximate monotonicity property for . If , then from (39) we have

Write and

Note that is a deterministic function of and vice versa. Thus we can replace by in the above formula, and conclude that

The tuple takes values in a set of cardinality thanks to the Chebyshev bounds. Hence by two applications of Lemma 23, (43) we have

The first term on the right-hand side is . Worsening the error term slightly, we conclude that

and hence

for any . In particular

which by (47), (48) rearranges to

From (46) we conclude that

Meanwhile, from Corollary 26, (39), (38) we have

The probability distribution of is a function on , which by the Chinese remainder theorem we can identify with a cyclic group where . From (43) we see that the value of this distribution at adjacent values of this cyclic group varies by , hence the total variation distance between this random variable and the uniform distribution on is by Chebyshev bounds. By Lemma 23 we then have

and thus

The claim follows.

We now compare this result to Elliott’s inequality. If one tries to address precisely the same question that Elliott’s inequality does – namely, to try to compare a sum with sampled subsums – then the results are quantitatively much weaker:

**Corollary 28 (Weak Elliott inequality)** Let be an interval of length at least . Let be a function with for all , and let . Then one has

for all primes outside of an exceptional set of primes of logarithmic size .

Comparing this with Exercise 8 we see that we cover a much smaller range of primes ; also the size of the exceptional set is slightly worse. This version of Elliot’s inequality is however still strong enough to recover a proof of the prime number theorem as in the previous section.

*Proof:* We can assume that is small, as the claim is trivial for comparable to . We can also assume that

since the claim is also trivial otherwise (just make all primes up to exceptional, then use Mertens’ theorem). As a consequence of this, any quantity involving in the denominator will end up being completely negligible in practice. We can also restrict attention to primes less than (say) , since the remaining primes between and have logarithmic size .

By rounding the real and imaginary parts of to the nearest multiple of , we may assume that takes values in some finite set of complex numbers of size with cardinality . Let be drawn uniformly at random from . Then (43) holds with , and from Theorem 27 with and (which makes the second term of the right-hand side of (45) negligible) we have

where are the primes up to , arranged in increasing order. By Markov’s inequality, we thus have

for outside of a set of primes of logarithmic size .

Let be as above. Now let be the function

that is to say picks out the unique component of the tuple in which is divisible by . This function is bounded by , and then by (42) we have

The left-hand side is equal to

which on switching the summations and using the large nature of can be rewritten as

Meanwhile, the left-hand side is equal to

which again by switching the summations becomes

The claim follows.

In the above argument we applied (42) with a very specific choice of function . The power of Theorem 27 lies in the ability to select many other such functions , leading to estimates that do not seem to be obtainable purely from the second moment method. In particular we have the following generalisation of the previous estimate:

**Proposition 29 (Weak Elliott inequality for multiple correlations)** Let be an interval of length at least . Let be a function with for all , and let . Let be integers. Then one has

for all primes outside of an exceptional set of primes of logarithmic size .

*Proof:* We allow all implied constants to depend on . As before we can assume that is sufficiently small (depending on ), that takes values in a set of bounded complex numbers of cardinality , and that is large in the sense of (49), and restrict attention to primes up to . By shifting the and using the large nature of we can assume that the are all non-negative, taking values in for some . We now apply Theorem 27 with and conclude as before that

for outside of a set of primes of logarithmic size .

Let be as above. Let be the function

This function is still bounded by , so by (42) as before we have

The left-hand side is equal to

which on switching the summations and using the large nature of can be rewritten as

Meanwhile, the left-hand side is equal to

which again by switching the summations becomes

The claim follows.

There is a logarithmically averaged version of the above proposition:

**Exercise 30 (Weak Elliott inequality for logarithmically averaged multiple correlations)** Let with , let be a function bounded in magnitude by , let , and let be integers. Show that

for all primes outside of an exceptional set of primes of logarithmic size .

When one specialises to multiplicative functions, this lets us dilate shifts in multiple correlations by primes:

**Exercise 31** Let with , let be a multiplicative function bounded in magnitude by , let , and let be nonnegative integers. Show that

for all primes outside of an exceptional set of primes of logarithmic size .

For instance, setting to be the Möbius function, , , and (say), we see that

for all primes outside of an exceptional set of primes of logarithmic size . In particular, for large enough, one can obtain bounds of the form

for various moderately large sets of primes . It turns out that these double sums on the right-hand side can be estimated by methods which we will cover in later series of notes. Among other things, this allows us to establish estimates such as

as , which to date have only been established using these entropy methods (in conjunction with the methods discussed in later notes). This is progress towards an open problem in analytic number theory known as *Chowla’s conjecture*, which we will also discuss in later notes.

### Advances in Combinatorics fully launched

It’s taken longer than we originally intended, but I am very happy to report that Advances in Combinatorics, a new arXiv overlay journal that is run along similar lines to Discrete Analysis, has just published its first five articles, with plenty more in the pipeline.

I am excited by the business model of the journal, which is that its very small running costs (like Discrete Analysis, it uses the Scholastica platform, which charges $10 per submission, as well as a fixed annual charge of $250, and there are a few other costs such as archiving articles with CLOCKSS and having DOIs) are being met, for the next five years in the first instance, by the library of Queens University in Kingston Ontario, who are also providing us with very useful administrative support. My dream would be for other libraries to have the foresight to support similar ventures, since the potential for savings if the stranglehold of the big commercial publishers is broken is huge. I do not underestimate the obstacles, but for a long time I have felt that what we are waiting for is a tipping point, and even quite a small venture can in principle have a significant effect on when that tipping point occurs.

The aim of Advances in Combinatorics is to be a top specialist combinatorics journal. Information about the editorial board, the submission process, and of course the articles themselves, can all be found on the website. Like Discrete Analysis, the journal has Editorial Introductions to each article, with the aim of making the webpage informative and fun to browse. We will be grateful for any feedback, and even more grateful for support in the form of excellent combinatorics papers while we are still getting established.

A final remark is that although I have reached the limit of the number of journals of this type that I can be closely involved in, I would be delighted to hear from anybody who thinks that they can put together a high-quality editorial board in an area that does not have a suitable journal for people who want to publish good papers without supporting the big commercial publishers. I know people who can offer advice about suitable platforms, funding, and similar matters. It would be great if free-to-read free-to-publish journals could cover all of mathematics, but we are still a long way from that at the moment.

### Almost all Collatz orbits attain almost bounded values

I’ve just uploaded to the arXiv my paper “Almost all Collatz orbits attain almost bounded values“, submitted to the proceedings of the Forum of Mathematics, Pi. In this paper I returned to the topic of the notorious Collatz conjecture (also known as the conjecture), which I previously discussed in this blog post. This conjecture can be phrased as follows. Let denote the positive integers (with the natural numbers), and let be the map defined by setting equal to when is odd and when is even. Let be the minimal element of the Collatz orbit . Then we have

**Conjecture 1 (Collatz conjecture)** One has for all .

Establishing the conjecture for all remains out of reach of current techniques (for instance, as discussed in the previous blog post, it is basically at least as difficult as Baker’s theorem, all known proofs of which are quite difficult). However, the situation is more promising if one is willing to settle for results which only hold for “most” in some sense. For instance, it is a result of Krasikov and Lagarias that

for all sufficiently large . In another direction, it was shown by Terras that for almost all (in the sense of natural density), one has . This was then improved by Allouche to , and extended later by Korec to cover all . In this paper we obtain the following further improvement (at the cost of weakening natural density to logarithmic density):

**Theorem 2** Let be any function with . Then we have for almost all (in the sense of logarithmic density).

Thus for instance one has for almost all (in the sense of logarithmic density).

The difficulty here is one usually only expects to establish “local-in-time” results that control the evolution for times that only get as large as a small multiple of ; the aforementioned results of Terras, Allouche, and Korec, for instance, are of this type. However, to get all the way down to one needs something more like an “(almost) global-in-time” result, where the evolution remains under control for so long that the orbit has nearly reached the bounded state .

However, as observed by Bourgain in the context of nonlinear Schrödinger equations, one can iterate “almost sure local wellposedness” type results (which give local control for almost all initial data from a given distribution) into “almost sure (almost) global wellposedness” type results if one is fortunate enough to draw one’s data from an *invariant measure* for the dynamics. To illustrate the idea, let us take Korec’s aforementioned result that if one picks at random an integer from a large interval , then in most cases, the orbit of will eventually move into the interval . Similarly, if one picks an integer at random from , then in most cases, the orbit of will eventually move into . It is then tempting to concatenate the two statements and conclude that for most in , the orbit will eventually move . Unfortunately, this argument does not quite work, because by the time the orbit from a randomly drawn reaches , the distribution of the final value is unlikely to be close to being uniformly distributed on , and in particular could potentially concentrate almost entirely in the exceptional set of that do not make it into . The point here is the uniform measure on is not transported by Collatz dynamics to anything resembling the uniform measure on .

So, one now needs to locate a measure which has better invariance properties under the Collatz dynamics. It turns out to be technically convenient to work with a standard acceleration of the Collatz map known as the *Syracuse map* , defined on the odd numbers by setting , where is the largest power of that divides . (The advantage of using the Syracuse map over the Collatz map is that it performs precisely one multiplication of at each iteration step, which makes the map better behaved when performing “-adic” analysis.)

When viewed -adically, we soon see that iterations of the Syracuse map become somewhat irregular. Most obviously, is never divisible by . A little less obviously, is twice as likely to equal mod as it is to equal mod . This is because for a randomly chosen odd , the number of times that divides can be seen to have a geometric distribution of mean – it equals any given value with probability . Such a geometric random variable is twice as likely to be odd as to be even, which is what gives the above irregularity. There are similar irregularities modulo higher powers of . For instance, one can compute that for large random odd , will take the residue classes with probabilities

respectively. More generally, for any , will be distributed according to the law of a random variable on that we call a *Syracuse random variable*, and can be described explicitly as

where are iid copies of a geometric random variable of mean .

In view of this, any proposed “invariant” (or approximately invariant) measure (or family of measures) for the Syracuse dynamics should take this -adic irregularity of distribution into account. It turns out that one can use the Syracuse random variables to construct such a measure, but only if these random variables stabilise in the limit in a certain total variation sense. More precisely, in the paper we establish the estimate

for any and any . This type of stabilisation is plausible from entropy heuristics – the tuple of geometric random variables that generates has Shannon entropy , which is significantly larger than the total entropy of the uniform distribution on , so we expect a lot of “mixing” and “collision” to occur when converting the tuple to ; these heuristics can be supported by numerics (which I was able to work out up to about before running into memory and CPU issues), but it turns out to be surprisingly delicate to make this precise.

A first hint of how to proceed comes from the elementary number theory observation (easily proven by induction) that the rational numbers

are all distinct as vary over tuples in . Unfortunately, the process of reducing mod creates a lot of collisions (as must happen from the pigeonhole principle); however, by a simple “Lefschetz principle” type argument one can at least show that the reductions

are mostly distinct for “typical” (as drawn using the geometric distribution) as long as is a bit smaller than (basically because the rational number appearing in (3) then typically takes a form like with an integer between and ). This analysis of the component (3) of (1) is already enough to get quite a bit of spreading on (roughly speaking, when the argument is optimised, it shows that this random variable cannot concentrate in any subset of of density less than for some large absolute constant ). To get from this to a stabilisation property (2) we have to exploit the mixing effects of the remaining portion of (1) that does not come from (3). After some standard Fourier-analytic manipulations, matters then boil down to obtaining non-trivial decay of the characteristic function of , and more precisely in showing that

for any and any that is not divisible by .

If the random variable (1) was the sum of independent terms, one could express this characteristic function as something like a Riesz product, which would be straightforward to estimate well. Unfortunately, the terms in (1) are loosely coupled together, and so the characteristic factor does not immediately factor into a Riesz product. However, if one groups adjacent terms in (1) together, one can rewrite it (assuming is even for sake of discussion) as

where . The point here is that after conditioning on the to be fixed, the random variables remain independent (though the distribution of each depends on the value that we conditioned to), and so the above expression is a *conditional* sum of independent random variables. This lets one express the characeteristic function of (1) as an *averaged* Riesz product. One can use this to establish the bound (4) as long as one can show that the expression

is not close to an integer for a moderately large number (, to be precise) of indices . (Actually, for technical reasons we have to also restrict to those for which , but let us ignore this detail here.) To put it another way, if we let denote the set of pairs for which

we have to show that (with overwhelming probability) the random walk

(which we view as a two-dimensional renewal process) contains at least a few points lying outside of .

A little bit of elementary number theory and combinatorics allows one to describe the set as the union of “triangles” with a certain non-zero separation between them. If the triangles were all fairly small, then one expects the renewal process to visit at least one point outside of after passing through any given such triangle, and it then becomes relatively easy to then show that the renewal process usually has the required number of points outside of . The most difficult case is when the renewal process passes through a particularly large triangle in . However, it turns out that large triangles enjoy particularly good separation properties, and in particular afer passing through a large triangle one is likely to only encounter nothing but small triangles for a while. After making these heuristics more precise, one is finally able to get enough points on the renewal process outside of that one can finish the proof of (4), and thus Theorem 2.

### Wolpert’s examples of tiny Weil-Petersson sectional curvatures revisited

In this previous post here (from 2018), I described some “back of the envelope calculations” (based on private conversations with Scott Wolpert) indicating that some sectional curvatures of the Weil–Petersson (WP) metric could be at least exponentially small in terms of the distance to the boundary divisor of Deligne–Mumford compactification.

Very roughly speaking, this heuristic computation went as follows: the WP sectional curvature of any -plane can be written as the sum of three terms; for the -planes considered in the previous post, the main term among those three *seemed* to be a kind of -norm of Beltrami differentials with essentially disjoint supports; finally, this -type norm was shown to be really small once a certain Green propagator is *ignored*.

Last April 2019, I met Scott during an event at Simons Center for Geometry and Physics, and I took the opportunity to tell him that one could perhaps show that the measure of the set of -planes leading to tiny WP curvatures is very small using the real-analyticity of the WP metric.

More concretely, my idea was very simple: since the Grassmannian of -planes tangent to a point is a compact space, the WP sectional curvature defines a real-analytic function , and we dispose of good upper bounds for and all of its derivatives in terms of the distance of to the boundary (see this article here), we can hope to get reasonable estimates for the measure of the sets using the techniques of these articles here and here (which are close in spirit to the classical fact [explained in Lemma 3.2 of Kleinbock–Margulis paper, for instance] that the measure of the sets are small whenever is a polynomial function on whose degree and -norm are bounded).

As it turns out, Scott thought that this strategy made some sense and, in particular, he promised to use my suggestion as a motivation to review his arguments concerning WP sectional curvatures.

After several email exchanges with Howard Masur and I, Scott announced that there were some *mistakes* in the construction of tiny WP sectional curvature: in a nutshell, one should not restrict the analysis to a single “main term” in the formula for WP sectional curvatures as a sum of three expressions, and one can not ignore the effect of the Green propagator. More importantly, Scott made a detailed study of these mistakes which ultimately led him to establish polynomial upper bounds for WP sectional curvatures at the heart of his newest preprint available here.

In this post, we will follow closely Scott’s preprint in order to give an outline of the proof of a polynomial upper bound for WP sectional curvatures:

**Theorem 1 (Wolpert)** *Given two integers and with , there exists a constant with the following property.*If denotes the product of the lengths of the short geodesics of a hyperbolic surface of genus with cusps whose systole is sufficiently small, then the sectional curvatures of the Weil-Petersson metric at are at most

**Remark 1** *As it was pointed out by Scott in his preprint, it is likely that this estimate is not optimal: indeed, one expects that the best exponent should be rather than .*

In what follows, we’ll assume some familiarity with some basic aspects of the geometry of the Weil–Petersson metric (such as those described in these posts here and here).

**1. Weil–Petersson sectional curvatures**

Let be a hyperbolic surface of genus with . If we write , where is the usual hyperbolic plane and is a group of isometries of describing the fundamental group of , then the holomorphic *tangent* space at to the moduli space of Riemann surfaces of genus with punctures is naturally identified with the space of harmonic Beltrami differentials on (and the *cotangent* space is related to quadratic differentials).

In this setting, the Weil–Petersson metric is the Riemannian metric induced by the Hermitian inner product

where and is the hyperbolic area form on .

**Remark 2** *Note that is well-defined: if and are Beltrami differentials, then is a function on .*

The Riemann tensor of the Weil–Petersson metric was computed by Wolpert in 1986:

where and is an operator related to the Laplace–Beltrami operator on .

**Remark 3** *Our choice of notation here differs from Wolpert’s preprint! Indeed, he denotes the Laplace–Beltrami operator by and he writes .*

The Riemann tensor gives access to nice formulas for the sectional curvatures thanks to the work of Bochner. More concretely, given and span a -plane in the real tangent space to at , let us take Beltrami differentials and such that , , and is orthonormal. Then, Bochner showed that the sectional curvature of is

Hence, by Wolpert’s formula for the Riemann tensor of the WP metric, we see that

**2. Spectral theory of **

Wolpert’s formula for the Riemann tensor of the WP metric hints that the spectral theory of plays an important role in the study of the WP sectional curvatures.

For this reason, let us review some key properties of (and we refer to Section 3 of Wolpert’s preprint for more details and references). First, is a positive operator on whose norm is : these facts follow by integration by parts. Secondly, is essentially self-adjoint on , so that is self-adjoint on . Moreover, the maximum principle permits to show that is also a positive operator on with unit norm. Finally, has a positive symmetric integral kernel: indeed,

where the Green propagator is the Poincaré series

associated to an appropriate Legendre function . (Here, stands for the hyperbolic distance on .) For later reference, we recall that has a logarithmic singularity at and whenever is large.

**3. Negativity of the WP sectional curvatures**

Interestingly enough, as it was first noticed by Wolpert in 1986, the spectral features of described above are sufficient to derive the negativity of WP sectional curvatures from Cauchy-Schwarz inequality. More precisely, since is self-adjoint, i.e.,

and its integral kernel is a *real* function, a straightforward computation reveals that the equation (1) for the sectional curvature of a -plane can be rewritten as

If we decompose the function into its real and imaginary parts, say , then we see that

Since is a positive operator, we conclude that and, *a fortiori*,

The non-positivity of the right-hand side of (2) can be established in three steps. First, the positivity of also implies that

Secondly, the fact that has a positive integral kernel allows to apply the Cauchy–Schwarz inequality to get that . Therefore,

Finally, the Cauchy–Schwarz inequality also says that

In particular, , so that it follows from (2) that all sectional curvatures of the WP metric are non-positive, i.e., .

Actually, it is not hard to derive that at this stage: indeed, would force a case of equality in Cauchy-Schwarz inequality and this is not possible in our context because is orthonormal.

**Remark 4** *Philosophically speaking, the “analog” to this argument in the realm of Teichmüller dynamics is Forni’s proof of the spectral gap property for the Lyapunov exponents of the Teichmüller geodesic flow. In fact, after some computations with variational formulas for the so-called Hodge norm, Forni establishes that by ruling out an equality case in a certain Cauchy-Schwarz estimate.*

**4. Reduction of Theorem 1 to bounds on ‘s kernel**

The discussion in the previous section says that small WP sectional curvatures correspond to almost equalities in certain Cauchy-Schwarz inequalities.

Hence, a natural strategy towards the proof of Theorem 1 consists into showing that an almost equality in (3) is impossible. In this direction, Wolpert establishes the following result:

**Theorem 2 (Wolpert)** *There are two constants and with the following property. If we have an almost equality*

*between the terms and in (3), then and can not be almost equal:*

Of course, Theorem 1 is an immediate consequence of Theorem 2 (in view of (2) and the estimate [implied by (3)]).

Thus, it remains only to prove Theorem 2. For this sake, we need further spectral information on , namely, some lower bounds on its the kernel . In order to illustrate this point, let us now show Theorem 2 *assuming* the following statement.

**Proposition 3** *There exists a constant such that*

*whenever and do not belong to the cusp region of .*

**Remark 5** *We recall that the cusp region of is a finite union of portions of which are isometric to a punctured disk (equipped with the hyperbolic metric ).*

For the sake of exposition, let us first establish Theorem 2 when is compact, i.e., , before explaining the extra ingredient needed to treat the general case.

**4.1. Proof of Theorem 2 modulo Proposition 3 when **

Suppose that for a constant to be chosen later. In this regime, our goal is to show that is “big” and is “small”, so that is necessarily “big”.

We start by quickly showing that is “big”. Since and are unitary tangent vectors, it follows from Proposition 3 that

Let us now focus on proving that is “small”. Since (cf. (3)), if we write (where and are the positive and negative parts of the real part of , then we obtain that

Since is positive, we derive that . Thus, if is compact, i.e., , then Proposition 3 says that for all . It follows that

By orthogonality of , we have that , i.e., . By plugging this information into the previous inequality, we obtain the estimate

Next, we observe that (cf. (3)) in order to obtain that

On the other hand, Proposition 3 ensures that for . Since and are unitary tangent vectors, one has for . By inserting this inequality into the previous estimate, we derive that

whenever has a sufficiently small systole.

This bound on can be converted into a bound thanks to Cauchy integral formula. More concrentely, as it is explained in Section 2 of Wolpert’s preprint, after observing that and replacing Beltrami differentials and by the dual objects and (namely, quadratic differentials), we are led to study quartic differentials . By Cauchy integral formula on , one has

On the other hand, if has systole and the cusp region is empty, then the injectivity radius at any is . Thus, there exists an universal constant such that

for all . By plugging this inequality into (7), we conclude that

for all .

Since is a positive operator on with unit norm (cf. Section 2 above) and , we have that the previous inequality implies the following bound on :

for all . By combining this estimate with (7), we conclude that

In summary, (4) and (8) imply that

for the choice of constant . This proves Theorem 2 in the absence of cusp regions.

**4.2. Proof of Theorem 2 modulo Proposition 3 when **

The arguments above for the case also work in the case because the cusp regions carry only a tiny fraction of the mass of the relevant functions, Beltrami differentials, etc.

More precisely, as it is explained in Section 2 of Wolpert’s preprint, if the constant is chosen correctly, then the Cauchy integral formula and the Schwarz lemma can be used to prove that

for all holomorphic quartic differentials .

In particular, we do not lose too much information after truncating , , etc. to and this allows us to repeat the arguments of the case to the corresponding truncated objects , , etc. without any extra difficulty: see Section 5 of Wolpert’s preprint for more details.

**5. Proof of Proposition 3**

Closing this post, let us give an idea of the proof of Proposition 3 (and we refer the reader to Section 4 of Wolpert’s preprint for more details).

Since and (cf. Section 2 above), our task is reduced to give lower bounds on the Poincaré series

For this sake, let us first recall that a hyperbolic surface has thick-thin decomposition: the thick portion is the region where the injectivity radius is bounded away from zero by a uniform constant and the thin portion is the complement of the thick region. Geometrically, the thin region is the disjoint union of the cusp region and a finite number of *collars* around simple closed short geodesics: roughly speaking, a collar consisting of the points at distance of a short simple closed geodesic of length .

We can provide lower bounds on in terms of the behaviours of simple geodesic arcs connecting and on .

More concretely, let be the shortest geodesic connecting and . Since is simple, we have that, for certain adequate choices of the constants defining the collars, one has that can not “back track” after entering a collar, i.e., it must connect the boundaries (rather than going out via the same boundary component). Furthermore, can not go very high into a cusp. Thus, if we decompose according to its visits to the thick region, the collars and the cusps, then the fact that permits to check that it suffices to study the passages of through collars in order to get a lower bound on .

Next, if is a subarc of crossing a collar around a short closed geodesic , then we can apply Dehn twists to to get a family of simple arcs indexed by giving a “contribution” to of

for some constant depending only on the topology of . In this way, the desired result follows by putting all “contributions” together.