Climategate and the Inevitable Opacity of Scientific Software

Shannon Love at Chicago Boyz picked up on an interesting article in the Guardian about scientific software.  I thought I would add my two cents' worth, as I’ve written large scientific computer codes myself.  The quote she included in her post sums up the article pretty well:

Computer code is also at the heart of a scientific issue. One of the key features of science is deniability: if you erect a theory and someone produces evidence that it is wrong, then it falls. This is how science works: by openness, by publishing minute details of an experiment, some mathematical equations or a simulation; by doing this you embrace deniability. This does not seem to have happened in climate research. Many researchers have refused to release their computer programs — even though they are still in existence and not subject to commercial agreements.

Shannon adds,

Keeping scientific software secret destroys reproducibility. If you have two or more programs whose internals are unknown, how do you know why they agree or disagree on their final outputs? Perhaps they disagree because one made an error the other did not or perhaps they agree because they both make the same error. You can never know if you have actually reproduced someone else’s work unless you know exactly how they got the answer they did. There is no compelling reason to keep scientific software secret. In the case of science upon which we base public policy on whose outcomes the lives of millions may depend, such secrecy could be lethal.

In fact, scientists do have reasons for keeping their software secret, although they may not seem compelling to the layman.  For example, good algorithms can form the basis of software toolkits designed for Matlab and similar products.  As such, they can be quite lucrative.  Many scientists make a living by selling the rights to their proprietary software products.  Many of them fear, rightly or wrongly, that others in their field might use their intellectual property to beat them to the punch in publishing important results.  For that matter, some of them may be reluctant to allow others to see their code for the reasons cited by one of Shannon’s commenters:

The worst, most amateurish code I have ever seen is that produced by scientists. Control flow like a bowl of spaghetti, global variables everywhere. A nightmare to understand, as that poor programmer in the Hadley CRU noted in his in-line comments – so, perfect ground for hiding little ‘adjustments’ and ‘tweeks’.

In fact, there are some elegant scientific computer codes, but I’ve seen some pretty lame ones, as well.  It’s true that, in general, computational scientists are not trained as software engineers, and it shows.
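As a purely hypothetical illustration of the pattern that commenter describes, here is the same calculation written twice: once in the global-variable style that makes results hard to reproduce or audit, and once as a self-contained function whose inputs and outputs are explicit. The data and names are invented for the example; nothing here is drawn from any real climate code.

```python
# Hypothetical illustration only: the "spaghetti" style the commenter
# describes (mutable global state) versus a self-contained alternative.
# The readings and the adjustment are made-up numbers.

# Style 1: globals everywhere -- the answer silently depends on state
# set elsewhere, which is exactly where hidden "tweaks" can live.
temps = [14.1, 14.3, 13.9, 14.6]
adjustment = 0.0
result = 0.0

def crunch():
    global result
    total = 0.0
    for t in temps:
        total += t + adjustment   # depends on the global 'adjustment'
    result = total / len(temps)

# Style 2: the same computation as a pure function -- every input is
# visible in the signature, so the result is reproducible by inspection.
def mean_adjusted(readings, adjustment=0.0):
    """Mean of the readings after applying a uniform adjustment."""
    return sum(t + adjustment for t in readings) / len(readings)

crunch()
assert abs(result - mean_adjusted(temps)) < 1e-12
```

Both versions compute the same number; the difference is that in the second, a reviewer can see at a glance what went in.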

One could cite many other plausible reasons for keeping source code secret.  Shannon indulges in a bit of hyperbole when she claims such secrecy is potentially “lethal.”  A lack of food can be lethal, too, but that doesn’t mean farmers are immoral for not giving it away free.

That said, I generally agree with the argument that secrecy destroys reproducibility.  It is possible to let others run scientific codes without revealing the source code, but that can hardly serve as a proof that the code is correctly written and free of bugs.  However, the idea that big scientific codes would be significantly more credible and trustworthy if the source code were freely available is probably too optimistic.

The problem is that scientific software is usually complex, often containing tens or hundreds of thousands of lines of code.  Big packages with lots of modules can run into the millions.  To understand the mathematics implemented in such a code, one must have a good grasp not only of the math used to express the underlying physical theories, but also of the numerical math used to approximate it on the computer.  Often, only a handful of scientists will have enough insight into both to be able to make sense of the source code.  Even for them, reading all those lines of code would be a Herculean task if they hadn’t been involved in the development process from the start.  As a result, nothing is easier for a computational physicist than to snow other scientists, not to mention the general public, about the validity of a large code.  It’s simply impractical to expect that “reproducibility” will work the same way for big scientific software packages as it does for physical experiments.  To a large extent, confidence in a given code must be a matter of trust, based on such things as the reputation of the code developer, demonstrated ability to predict results, results that are not “unphysical,” and so on.  Scientific codes have proven extremely useful in practice, greatly expanding our physical understanding and underpinning the rapid technological progress we have witnessed in recent decades.  However, we must understand their limitations.
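To see how the numerical layer alone can quietly change the answer, consider a toy sketch (my own illustration, not taken from any real scientific package): forward-Euler integration of a simple relaxation equation, dT/dt = -k(T - T_eq), whose exact solution is known.  The same physics, run at two different step sizes, gives visibly different numbers.

```python
import math

# Toy illustration of numerical approximation error: forward Euler on
#   dT/dt = -k * (T - T_eq)
# whose exact solution is T(t) = T_eq + (T0 - T_eq) * exp(-k * t).
# All values are arbitrary, chosen only to make the effect visible.

def euler(T0, T_eq, k, dt, t_end):
    """Integrate the relaxation equation with the forward Euler method."""
    T, t = T0, 0.0
    while t < t_end - 1e-12:
        T += dt * (-k * (T - T_eq))
        t += dt
    return T

T0, T_eq, k, t_end = 300.0, 280.0, 0.5, 4.0
exact = T_eq + (T0 - T_eq) * math.exp(-k * t_end)

coarse = euler(T0, T_eq, k, dt=1.0, t_end=t_end)   # 4 big steps
fine = euler(T0, T_eq, k, dt=0.01, t_end=t_end)    # 400 small steps

# The fine grid tracks the exact solution far more closely than the
# coarse one -- same equation, same physics, different answers.
assert abs(fine - exact) < abs(coarse - exact)
```

A reader who only sees the output has no way of knowing which step size was used, or whether it was small enough; judging that requires exactly the dual fluency in physics and numerics described above.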

Those who would deny the value of scientific computation should look at an MRI scan, or one of the images returned by a deep space probe, or, for that matter, one of the animated movies released by Dreamworks or Lucasfilm.  The amount of number crunching necessary to produce them would boggle your mind.  Those pretty pictures are often created with quite accurate physical models of the absorption, emission, and scattering of light.  The applications of computational models in industry are innumerable.  Obviously, they must be at least somewhat accurate, or the technological and industrial processes that depend on them would fail.

Of course, a prime target of many of the recent aspersions cast on scientific computing is the set of climate models used to study global warming.  According to the Guardian article:

One of the spinoffs from the emails and documents that were leaked from the Climate Research Unit at the University of East Anglia is the light that was shone on the role of program code in climate research. There is a particularly revealing set of “README” documents that were produced by a programmer at UEA apparently known as “Harry”. The documents indicate someone struggling with undocumented, baroque code and missing data – this, in something which forms part of one of the three major climate databases used by researchers throughout the world.

It would not surprise me if this were true.  In any case, climate models must somehow meet the seemingly impossible challenge of dealing with a problem with billions of degrees of freedom, incomplete and occasionally inaccurate input data, and incomplete knowledge of the relevant physics.  No computer, available now or likely to be available any time in the foreseeable future, will be able to solve a “full physics” model of the problem in all its complexity.  Physical approximations, some of them quite crude, are necessary to make the problem even reasonably tractable.  Climatologists are similar to scientists in many other fields in that they tend to gloss over the implications of these approximations.  In many cases, they probably honestly believe their models have more predictive value than is warranted by the underlying assumptions.  In spite of that, and in spite of the fact that they have so often succeeded in shooting themselves in the foot, whether through scientific hubris or pure arrogance, as in the recent IPCC and Climategate affairs, their results should not be dismissed out of hand.

We know that, other things being equal, sunlight that reaches the earth’s surface is reradiated at wavelengths that are more or less strongly absorbed in a given layer of atmosphere in proportion to the concentration of CO2 and other greenhouse gases in that layer.  If the best computational models we have suggest that the result will be a substantial increase in the planet’s average temperature, it seems to me foolhardy to simply ignore them.  Certainly, the models don’t “prove” anything, but, since this is the only planet we have to live on at the moment, surely it is better to be safe than sorry.  If something is true, it will not become false by virtue of the fact that some of those “scientists” who agree it is true have been arrogant and have behaved more after the fashion of an ideological sect than of disinterested seekers after truth.  In my opinion, much of the criticism being directed at environmental scientists in general and climatologists in particular is richly deserved.  However, it is a bad idea to jump off a cliff, even if the people who are telling us it’s a bad idea are arrogant jackasses.  It would be rather unwise to jump off the cliff anyway, just to spite them.
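The mechanism described above can be sketched with the crudest member of the model family in question: a zero-dimensional energy-balance calculation, in which absorbed sunlight balances outgoing infrared radiation.  This is a textbook illustration of my own, not anything resembling a real climate model; the effective emissivity here stands in, very roughly, for greenhouse trapping, and the numbers are standard textbook-style values.

```python
# Zero-dimensional energy-balance sketch (illustrative only):
# absorbed sunlight S*(1 - albedo)/4 balances outgoing infrared
# eps * sigma * T^4, where the effective emissivity eps crudely
# represents greenhouse trapping. Stronger trapping = lower eps.
SIGMA = 5.67e-8   # Stefan-Boltzmann constant, W m^-2 K^-4
S = 1361.0        # solar constant, W m^-2
ALBEDO = 0.3      # planetary albedo (fraction of sunlight reflected)

def equilibrium_temperature(eps):
    """Surface temperature (K) at radiative equilibrium for emissivity eps."""
    absorbed = S * (1.0 - ALBEDO) / 4.0          # averaged over the sphere
    return (absorbed / (eps * SIGMA)) ** 0.25

T_now = equilibrium_temperature(0.612)       # roughly reproduces ~288 K
T_more_ghg = equilibrium_temperature(0.60)   # more trapping -> lower eps

# More greenhouse trapping raises the equilibrium surface temperature.
assert T_more_ghg > T_now
```

Real climate models add circulation, clouds, oceans, feedbacks, and much else on top of this balance; the point of the sketch is only that the basic radiative mechanism is straightforward, settled physics.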

Author: Helian

I am Doug Drake, and I live in Maryland, not far from Washington, DC. I am a graduate of West Point, and I hold a Ph.D. in nuclear engineering from the University of Wisconsin. My blog reflects my enduring fascination with human nature and human morality.
