# Statistics and public health

The word “scientist” conjures up images of wild-haired people in lab coats mixing colourful liquids in conical flasks. I don’t mean to imply that this doesn’t happen, but there’s a little more to the profession of science than that. There are dozens of fields and hundreds of subfields of the sciences, all with very different methods and subject matter. Probably the only thing all these fields have in common is the scientist’s number one activity: bickering over statistics.

When the “Climategate” emails were released to the world, almost three years ago, the public was shocked by the scientists’ (sometimes quite strident) disagreements about statistical methods, margins of error, and uncertainty. They said they were sure, and now they’re talking about uncertainty? But to a scientist, these words have rigorously defined statistical meanings that may or may not correspond with the way we use them in everyday life.

It’s fair to say that unless you have a working knowledge of the statistical concepts involved, you can’t properly understand a scientific study at all. In that spirit, here’s an overview of two study types that are common in epidemiology, the statistical study of the spread of diseases. Whenever you see a report that scientists have identified a possible link between some risk factor and a disease (often cancer), chances are they did either a cohort or case-control study.

The Diesel Exhaust in Miners Study (DEMS), which the Manitoban covered on March 14, is a perfect example. Researchers carried out both types of inquiry to paint a near-complete statistical picture of the effects of diesel exhaust on lung cancer, and found that the connection between the two is quite strong.

Remember the probability formulas you learnt in high school? Those are the basic tools of the epidemiology trade. To find out the probability that someone exposed to diesel will develop lung cancer (in mathematical notation, P(Cancer|Exposure)), you run a cohort study: follow a group of miners exposed to diesel, see how many develop lung cancer, and compare that with the rate among miners with little or no exposure. Cohort studies are generally considered stronger evidence, but the other common design, the case-control study, is cheaper, easier to conduct, and faster.
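To make the cohort arithmetic concrete, here is a minimal Python sketch. The counts are invented for illustration and are not DEMS results, and the function names `risk` and `relative_risk` are my own. It estimates P(Cancer|Exposure) as a simple proportion and compares it with the same proportion in an unexposed comparison group, a figure usually called the relative risk.

```python
def risk(cases: int, group_size: int) -> float:
    """Share of a group that developed the disease, e.g. P(Cancer|Exposure)."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return cases / group_size


def relative_risk(risk_exposed: float, risk_unexposed: float) -> float:
    """How many times more likely the disease is in the exposed group."""
    if risk_unexposed == 0:
        raise ValueError("relative risk is undefined when the unexposed risk is zero")
    return risk_exposed / risk_unexposed


# Invented counts for illustration only (NOT taken from DEMS).
p_cancer_given_exposure = risk(cases=30, group_size=1000)
p_cancer_given_no_exposure = risk(cases=10, group_size=1000)
rr = relative_risk(p_cancer_given_exposure, p_cancer_given_no_exposure)

print(f"P(Cancer|Exposure)  = {p_cancer_given_exposure:.3f}")
print(f"P(Cancer|~Exposure) = {p_cancer_given_no_exposure:.3f}")
print(f"Relative risk       = {rr:.1f}")
```

A real cohort analysis would also account for how long each person was followed and for other risk factors; this sketch only shows the core proportion.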

A case-control study works backwards. It starts with the miners afflicted by lung cancer (the cases), estimates the probability that they were exposed to diesel fumes, P(Exposure|Cancer), and compares it to the same probability for miners who do not have cancer (the controls), P(Exposure|~Cancer). These figures are combined into an “odds ratio.” An odds ratio greater than one suggests that exposure is associated with the disease, and the further above one it climbs, the stronger the link. An odds ratio of about one indicates no link, and an odds ratio below one suggests that exposure goes along with a lower risk.
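The odds ratio itself takes only a few lines to compute. The sketch below is a rough illustration in Python, again with made-up counts rather than DEMS data; the names `odds_ratio` and `describe` and the 0.05 tolerance are assumptions of mine, not part of any standard method. It takes the odds of exposure among cases, divides by the odds of exposure among controls, and reads the result against one.

```python
def odds_ratio(exposed_cases: int, unexposed_cases: int,
               exposed_controls: int, unexposed_controls: int) -> float:
    """Odds of exposure among cases divided by odds of exposure among controls."""
    counts = (exposed_cases, unexposed_cases, exposed_controls, unexposed_controls)
    if min(counts) <= 0:
        raise ValueError("all four counts must be positive")
    # (a/b) / (c/d) rewritten as the cross-product (a*d) / (b*c).
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


def describe(value: float, tolerance: float = 0.05) -> str:
    """Plain-language reading of an odds ratio (the tolerance is arbitrary)."""
    if abs(value - 1.0) <= tolerance:
        return "no apparent link"
    if value > 1.0:
        return "exposure associated with higher odds of disease"
    return "exposure associated with lower odds of disease"


# Invented counts for illustration only (NOT taken from DEMS).
# Cases: 60 exposed, 40 unexposed, so P(Exposure|Cancer) = 0.6
# Controls: 30 exposed, 70 unexposed, so P(Exposure|~Cancer) = 0.3
result = odds_ratio(60, 40, 30, 70)
print(f"Odds ratio = {result:.2f} ({describe(result)})")
```

With these invented counts the odds ratio comes out at 3.5. In a published study the figure is reported with a confidence interval, and researchers look at whether that interval includes one rather than relying on a fixed tolerance like the one used here.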

At its core, this kind of science is pretty simple. So why do projects like DEMS take so long and have such complicated reports?

Well, the explanations above are nice and neat, but as you might expect, the real world is quite a bit messier. Little complications (smoking, exposure to other workplace carcinogens, physical health) have to be controlled for, and when you’re dealing with 12,000 study subjects, the work adds up very quickly. Estimating a worker’s exposure to diesel exhaust is no mean feat, as the four-part article explaining the DEMS exposure assessment scheme will attest.

In fact, accounting for the messiness of the real world is the chief problem of any scientific study. This is why, even after a study has shown a link between a risk factor and a disease, there’s still room for robust debate as scientists point out ways the data could be polluted by “confounding factors,” like smoking in the DEMS studies.

These kinds of studies never reach absolute certainty, but absolute certainty is a lot less common in real life than we’d like. Eventually, when you’ve established strong statistical connections, you get the scientific holy grail: your theory is able to predict the results of experiments in advance. At this point, you’re still not 100 per cent certain, but you’re pretty sure, and in this life that’s as close as anyone can hope to get.

For more on this topic, you may wish to check out Derek Rowntree’s book Statistics Without Tears, a breezy 200-page read aimed at a general audience.