Narrative of the Project to compute Kazhdan-Lusztig Polynomials for E8

Monday morning, January 8, 2007, at a few minutes before 9 o'clock, a computer finished writing to disk a file containing all of the Kazhdan-Lusztig polynomials for the split real group G(R) of type E8. The values at 1 of these polynomials are the coefficients in the formulas for the irreducible characters of representations of G(R); so essentially all of those irreducible characters are written down. The polynomial with the biggest coefficient (which is 11,808,808) is

152q^22 + 3472q^21 + 38791q^20 + 293021q^19 + 1370892q^18 + 4067059q^17 + 7964012q^16 + 11159003q^15 + 11808808q^14 + 9859915q^13 + 6778956q^12 + 3964369q^11 + 2015441q^10 + 906567q^9 + 363611q^8 + 129820q^7 + 41239q^6 + 11426q^5 + 2677q^4 + 492q^3 + 61q^2 + 3q

value at 1: 60779787

That's the headline; what's the story? Quite a few of you have spent lots of your professional lives trying to understand representations of reductive groups, so I won't say too much about why that's interesting. A little of the background for this particular story is that it's been known for more than twenty years that the unitary dual of a real reductive group can be found by a finite calculation.

Twenty years ago, actually carrying out such computations for any interesting groups looked completely impossible. But in twenty years we've learned a little, and computers have gotten a lot bigger and faster. In 2002, Jeff Adams had the idea of getting computers really to make such calculations. Of course as mathematicians we want completely general theorems, and it's by no means clear that there is a finite calculation to find the unitary duals of all reductive groups at once. But the work of Dan Barbasch makes it possible to hope that one can treat classical groups with pencil and paper. The exceptional groups are finite in number, so treating them _is_ a finite calculation. That became the standard for measuring the computer problem: is it possible to treat E8?

It was clear that the first problem was to write a program that could work with the Cartan subgroups, maximal compact subgroups, and Weyl groups of real reductive groups. Jeff recruited Fokko du Cloux to do this, and he began to work in 2003. By 2004 his software could handle this structure theory (better than most of us, at least).

The next step was less obvious, but Fokko and Jeff settled on computing Kazhdan-Lusztig polynomials for real groups. Fokko had written the best software in the world to do this for Coxeter groups; the algorithms for real groups are more or less similar in structure (although MUCH more complicated in detail). Getting these polynomials would provide formulas for irreducible characters, and it is a critical step in several computational approaches to classifying unitary representations.

So late in 2004, Fokko began to add to his software code to implement the Kazhdan-Lusztig algorithm for real groups. Since the papers in which this algorithm is written down are (a) nearly impenetrable, and (b) full of small mistakes, and (c) written with no consideration for computational practice, it was clear that this was going to be a long and delicate job. And so it turned out: it took Fokko almost a full year to finish, in November of 2005. Very quickly he and Jeff used the software to compute KL polynomials (and so character tables) for all of the real forms of F4, E6, and E7, and for the non-split form of E8. That left the split form of E8.

So how big a computation is this character table for split E8? Fokko's software immediately told us exactly how many different representations we were talking about (453,060); that's also the number of terms that could appear in the character formulas. So the character table is a square matrix of size 453,060. The number of entries is therefore about 200 billion, or 2 x 10^11.

Fortunately the matrix is upper triangular, so we only need 100 billion entries.

Unfortunately the algorithm proceeds not by calculating entries directly but by calculating polynomials whose values at 1 are the entries. The polynomials have average degree about 11, so each needs about 12 stored coefficients; the total number of coefficients is about 1.2 trillion.

Fortunately many of the coefficients are easily seen to be equal. One needs to store only one representative polynomial for each family of "obviously equal" entries. Fokko's software calculated how many such families there were: a bit more than 6 billion. So we were back down to storing about 70 billion coefficients.

Unfortunately we had no clear idea how big the (non-negative integer) coefficients of these polynomials could be. In the case of split D5, the largest is 5. For split E6, the largest is 27. In the case of split E7, it's 3583. This trend was not encouraging; it seemed clear that the coefficients would exceed 65535, so that they could not be stored in two bytes of computer memory. The next practical size is four bytes.

Fortunately Fokko wrote the software to compute with four-byte integers and to test carefully for numeric overflow throughout the computation. (If overflow happened, the plan was to switch to eight-byte integers and try again.)

Unfortunately, 70 billion 4-byte integers require 280G of RAM. (The nature of the algorithm, which constantly looks at widely distributed results from earlier in the computation, made storing results on disk impractical.) That's a lot of RAM even for rich scientists.

Fortunately, some of the six billion polynomials are zero, and some of them are equal to others "by chance" (that is, not for reasons that are easy to understand). So Fokko wrote the software to store only one copy of each distinct polynomial. He hoped that the number of distinct polynomials might be a few hundred million: so perhaps 6 billion coefficients, requiring 25G of RAM. The indexes keeping track of all these polynomials would occupy another 100G or so: perhaps 120G altogether.
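For the curious, here is that back-of-the-envelope arithmetic collected in one place, as a small C++ sketch; every constant in it is one of the estimates quoted above (the average degree, the number of families, the coefficient size), not something computed from the group.

    #include <iostream>

    int main() {
      // All constants are the estimates quoted in the text, not computed values.
      const double reps = 453060;          // rows (and columns) of the character table
      const double entries = reps * reps;  // about 2 x 10^11 entries
      const double upper = entries / 2;    // upper triangular part: about 10^11
      const double coeffs_each = 12;       // average degree about 11, so about 12 coefficients
      const double families = 6e9;         // "obviously equal" families of entries
      const double bytes_per_coeff = 4;    // 32-bit coefficients

      std::cout << "coefficients, one per entry:  " << upper * coeffs_each << "\n"     // ~1.2e12
                << "coefficients, one per family: " << families * coeffs_each << "\n"  // ~7e10
                << "RAM for those, in GB:         "
                << families * coeffs_each * bytes_per_coeff / 1e9 << "\n";             // about 280
      return 0;
    }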

Unfortunately, we didn't have a computer even with 100G of RAM. We hoped perhaps to buy one using the NSF FRG grant which Jeff had finally pushed through; but that was a year or more away.

Fortunately computer science is often computer art, and even Fokko's work could be improved. Fokko worked on that constantly over the last year, and Marc van Leeuwen began to make serious contributions as well. The two of them rearranged the indexes and the code in some extraordinarily clever ways; Marc's final changes in the last month reduced the sizes of the indexes to about 35G.

Unfortunately, tests running partway through the E8 calculation (done mostly by Birne Binegar, who got a supercomputer in Stillwater, but managed more or less to set fire to it using the atlas software) revealed that Fokko's first hopes about the number of distinct polynomials were too optimistic. Even an optimistic reading of Birne's tests suggested more like 800 million distinct polynomials, meaning 40G or more to hold the coefficients.

Fortunately Dan Barbasch is now chair of the math department at Cornell, and in September he managed to gain access to a machine with 128G of RAM and 128G of swap. He used it to run the E8 computation to the end. The fact that Fokko's overflow tests were never set off showed that all the coefficients really fit in four bytes.

Unfortunately he had no reasonable way to write the results to disk, so they disappeared. Also unfortunately, Dan didn't have the improvements that Fokko and Marc had made to the code: his computation used 224G of memory (half of it swap space) and took twelve days to finish.

Fortunately, by November, Fokko and Marc had trimmed memory use in the code a great deal. Through the persistence of Birne Binegar, and the generosity of William Stein, the atlas group got access to Stein's computer sage at the University of Washington (with 64G of RAM and 75G of swap). On this machine we could finally do some large fraction of the E8 character table computation. By late November, we believed that we could finish E8 with about 150G of RAM.

Unfortunately, 150G is just a little more than sage has, even with swap. Birne and Jeff suggested that we start looking seriously at applying for an NSF grant to buy a machine with perhaps 256G of memory: something on the order of $150,000 or more. I asked a number of people whether they might be able to make use of such a computer.

Fortunately Noam Elkies had a fascinating reply:

A computation that actually uses N bytes of storage must take time _at least_ N. But once N gets as large as 256GB it might not be feasible to spend much _more_ than N time: N*log(N) or N*log^2(N) is certainly OK [e.g. fast integer or polynomial arithmetic, and other applications of the Fast Fourier Transform; also solving f(x)=f(x') by sorting N values of f(x) and finding a consecutive match]; maybe also N^(3/2) [e.g. linear algebra with dense matrices of size sqrt(N), or computing the first N coefficients of modular forms such as Delta without fast arithmetic]; probably not N^2. So there might not be all that much room for making use of such a huge machine, though even O(N) allows for some nontrivial computations [as in the example of exhaustive analysis of endgames in chess and similar games, or -- less frivolously but using the same basic idea -- of the distance distribution in a Cayley graph, as in ].

and in addition

Is it clear that the E8 computation cannot fit into "only" 128 or 64GB -- which would still be significantly larger than neron?

I explained about the polynomial storage:

We know that the coefficients can exceed 2^16 (by computation), and we hope that they don't exceed 2^32. Each polynomial is stored as a vector of 32-bit integers, of size exactly equal to its degree plus one. Assuming an average degree of 19, that's 80 bytes per polynomial.

On November 30, Noam replied

Well 2^32 is less than the product of the ten odd primes less than 32, so unless the computation requires divisions by numbers other than 2 you could reduce this from 80 bytes to something like (5/32)*80 = 12.5 bytes, at the cost of running the computation 9 times (counting once for mod 3*5).
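(For the record, the arithmetic checks out; here is a minimal C++ sketch, mine and not part of atlas, multiplying out the ten odd primes below 32 and comparing with 2^32. A residue mod a prime below 32 needs only 5 bits instead of 32, which is where the factor 5/32 comes from.)

    #include <cstdint>
    #include <iostream>

    int main() {
      // The ten odd primes below 32; their product exceeds 2^32, so a coefficient
      // below 2^32 is determined by its residues modulo these primes.
      const uint64_t primes[] = {3, 5, 7, 11, 13, 17, 19, 23, 29, 31};
      uint64_t product = 1;
      for (uint64_t p : primes) product *= p;

      std::cout << "product of the ten odd primes below 32: " << product << "\n"            // 100280245065
                << "2^32:                                   " << (uint64_t(1) << 32) << "\n" // 4294967296
                << "bytes per polynomial at 5 bits each:    " << 5.0 / 32.0 * 80 << "\n";    // 12.5
      return 0;
    }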

I started to write a reply explaining why modular reduction of the KL algorithm doesn't work, but I had to throw it away. Fokko's code was beautifully compartmentalized, and Marc is amazing, so by December 4 we were getting character table entries mod n for any n up to 256.

On December 6 Marc's modifications of Fokko's code on sage computed about 3/4 of the entries in the character table mod 251, finding almost 700 million distinct polynomials and growing to 108G in size. Since 90% of that was indexes, working on their structure became worthwhile. Marc redesigned the indexes rather completely (replacing 8-byte pointers by four-byte counters several billion times, for example). He also added code to output the answers to compact hard disk files (easily re-readable by the atlas or other software, for subsequent processing).

Meanwhile Birne ran various versions of the code on sage. Among other things he established for certain that there were more than one billion distinct polynomials.

Early on December 19, Marc's modified code began a computation for E8 mod 251, with the possibility of actually writing the result usefully at the end. Essentially it worked, finishing the computation in about 16 hours:

Total elapsed time = 62575s. Finished at l = 64, y = 453059 d_store.size() = 1181642979, prim_size = 3393819659 VmData: 64435824 kB

(The number d_store.size() is the number of distinct KL polynomials stored, and y is the row number in the character table. The number prim_size is the number of not-obviously-equal entries in the character table that turned out to be non-zero.)

Unfortunately, writing the answer to disk took two days. Marc and I went over Marc's output code to see why. Fortunately, we figured it out, and Marc improved the speed. Unfortunately, we found at the same time a bug (he wrote size() in one line where he meant capacity()). The result was that, even though the polynomials were all correctly written to disk, the index files explaining which polynomial was stored where were missing something like half their contents.

Marc fixed things instantly, and on Friday December 22 we started a calculation mod 256 on sage. This did 452,174 out of 453,060 rows of the character table in 14 hours, then sage crashed. We tried again starting late Friday afternoon, and actually finished with good output:

Total elapsed time = 40229s. Finished at l = 64, y = 453059 d_store.size() = 1181642979, prim_size = 3393819896 VmData: 54995416 kB

Just eleven hours this time (joys of multithreading!), and with Marc's adjustments the disk output was fast.

On Saturday December 23 we started a run mod 255. This time sage crashed a third of the way through the computation. There was no one physically present to reboot it (apparently some kind of local holiday in Seattle) so we retired for a bit (still having mod 256 as our only good output files).

Meanwhile Marc was writing code to combine KL polynomials from several moduli m1, m2,... into KL polynomials modulo lcm(m1, m2, ...). We tested that on the first hundred million entries of the E8 character table modulo 253, 255, and 256, and it worked fine. (The biggest coefficient appearing to that point was 175836.) When sage came back up on December 26, we got a character table mod 255 written to disk. At 1 a.m. on the morning of December 27, we started a run mod 253. About halfway through, sage crashed.
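To give a concrete picture of what the combining step does, here is a minimal C++ sketch (my illustration, not Marc's utility): it recovers a single coefficient modulo 16515840 = 253 x 255 x 256 from its residues modulo those three pairwise coprime moduli, using the headline coefficient 11,808,808 only as a test value.

    #include <cstdint>
    #include <iostream>

    // Modular inverse of a mod m, assuming gcd(a, m) == 1 (extended Euclid).
    static int64_t inverse_mod(int64_t a, int64_t m) {
      int64_t t = 0, new_t = 1, r = m, new_r = a % m;
      while (new_r != 0) {
        int64_t q = r / new_r;
        int64_t tmp = t - q * new_t; t = new_t; new_t = tmp;
        tmp = r - q * new_r; r = new_r; new_r = tmp;
      }
      return (t % m + m) % m;
    }

    // Combine residues r[i] mod m[i] (pairwise coprime) into one residue mod prod(m).
    static uint64_t crt(const uint64_t r[], const uint64_t m[], int n) {
      uint64_t M = 1;
      for (int i = 0; i < n; ++i) M *= m[i];
      uint64_t x = 0;
      for (int i = 0; i < n; ++i) {
        uint64_t Mi = M / m[i];
        uint64_t yi = uint64_t(inverse_mod(int64_t(Mi % m[i]), int64_t(m[i])));
        x = (x + (r[i] % m[i]) * Mi % M * yi) % M;
      }
      return x;
    }

    int main() {
      const uint64_t moduli[3] = {253, 255, 256};
      const uint64_t c = 11808808;  // the big coefficient, used here only as a test value
      const uint64_t residues[3] = {c % 253, c % 255, c % 256};
      std::cout << crt(residues, moduli, 3) << "\n";  // prints 11808808
      return 0;
    }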

The experts I consulted assured me that the atlas software couldn't possibly be crashing sage. My own opinions about the causes of the crashes wavered between black helicopters from the NSA and Sasquatch. We decided to keep our hands off sage until we were older and wiser: say for a year.

On Wednesday January 3 we were all one year older, which made perhaps thirty years of additional wisdom counting all the atlas people. This factor of thirty seemed like a suitable margin of safety, so that afternoon we started another computation mod 253. This took twelve hours. By 4 a.m. Thursday morning January 4th we had output for three moduli (253, 255, and 256) with lcm 16515840: bigger (we had some hope) than all the coefficients. Marc took unfair advantage of the time difference in Europe to start running his Chinese remainder theorem utility on the results. Arranging the indexes takes a long time, but that finished in nine hours. There was another speed problem: the first version had a pretty little counter to display the number of the polynomial being written to disk, to allow for monitoring the job. Since more than a billion polynomials are written, this meant writing several billion characters to the display. Turns out that takes a long time. So we started over on Friday morning January 5, with a counter that updates only every 4096 polynomials. This went nicely until sage crashed.

William Stein identified sage's problem as a flaky hard drive. (I personally can never remember which of French pastry and hard drives is supposed to be flaky and which is supposed to be sticky.) So he replaced it with some French pastry, on which our 100G of mod m output files were written in dark chocolate icing. (I'm simplifying the technical details a little.) The fact that he _had_ such a pastry puts him high on my long list of heroes for this saga. He did this more or less instantly, and we restarted the Chinese Remainder calculation late Friday afternoon.

Early Saturday morning, the file of polynomial coefficients mod 16515840 had grown to be about 7 billion bytes larger than it was supposed to be. Since it was a day with a "y" in it, I figured Marc ought to be working, so I sent him off to find out what was up (or down, or whatever). (He was visiting his family in the Netherlands for the weekend, further evidence that he had nothing to do.)

The bug was unbelievably subtle. Since the number of polynomials is about a billion, Marc's Chinese Remainder code represented the index of a polynomial by a four-byte integer (perfectly good up to 4,294,967,295). Unfortunately, this integer at some point needs to be multiplied by 5 (the number of bytes in the index entry for one polynomial); the result is put into an 8-byte integer, where it fits nicely. But when the polynomial number exceeds 858,993,459, and the multiplication is done in four bytes, it overflows. The result is that everything works fine in any reasonable test (like the one we ran with a hundred million polynomials). To complicate Marc's task further, the bug was not present in his most recent version of the code; what I was running on sage was a couple of days older.
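For anyone who wants to see the mechanism, here is a minimal C++ sketch of that kind of overflow (illustrative names only; this is not Marc's code): the multiplication by 5 is carried out in 32-bit arithmetic, so it wraps around before the result ever reaches the 8-byte variable.

    #include <cstdint>
    #include <iostream>

    int main() {
      uint32_t poly_number = 900000000;   // past the danger point 858,993,459

      // Buggy: both operands are 32-bit, so the product wraps modulo 2^32
      // before it is widened to 64 bits by the assignment.
      uint64_t bad = poly_number * 5u;    // 4,500,000,000 - 2^32 = 205,032,704

      // Fixed: widen first, then multiply; the product is now exact.
      uint64_t good = uint64_t(poly_number) * 5u;   // 4,500,000,000

      std::cout << "buggy offset: " << bad << "\n"
                << "fixed offset: " << good << "\n";
      return 0;
    }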

So it took Marc almost twenty hours to find and fix this bug (assuming that he neither slept nor ate nor spoke to his family.)

Marc's analysis showed that the bug came into play only around polynomial number 858 million; so all the coefficients calculated before that should be correct. The largest of these was 11,808,808, at polynomial number 818553156. This convinced me that we would almost certainly find larger coefficients among the 350 million polynomials that were not correctly evaluated the first time; and that very likely we would need a larger modulus than 16515840 to get them all.

So at 6 a.m. on Sunday the 7th I was able to start Marc's current (and correct) Chinese remainder theorem utility, this time adding modulus 251. Of course nothing went wrong (because what could go wrong?), and the last polynomial was written just before 9 a.m. Eastern time on Monday.

I apologize for making this so long, but it has been like nothing else I've ever done. I've left out much more than I put in. Of course the central omission was Fokko's story. Almost all of you know that he was diagnosed with ALS just after finishing the Kazhdan-Lusztig computation software in November of 2005. By February 2006 he had little use of his hands, and by May he was entirely paralyzed below his neck. But he continued to share his fantastic skills and insights into the mathematics and the programming, with Jeff and with Marc and with me, until his death November 10.

This project has introduced me to great mathematicians I barely knew, like Marc van Leeuwen and John Stembridge, and it has shown entirely new depths in people I knew well, like Jeff Adams and Dan Barbasch. So it's a very high bar...but what has been best of all, mathematically and personally, has been spending time with Fokko. It's _still_ the best: every minute I spend on this stuff is a chance to think about him, and that's always good for a smile.

So congratulations and thank you to Fokko, who did this all by himself.

Congratulations and thank you to everyone who helped him - Marc did more of that than anybody, but there were a lot of indispensable people.

I haven't had a _boss_ since I worked in the lumberyard in the summer of 1972, until Jeff. Congratulations and thank you, boss!

So we've got to get going on this code for branching to K, Alfred...

David Vogan
January 8, 2007