Unlucky numbers: Fighting murder convictions that rest on shoddy … – Science

LEIDEN, THE NETHERLANDS—When a Dutch nurse named Lucia de Berk stood trial for serial murder in 2003, statistician Richard Gill was aware of the case. But he saw no reason to stick his nose into it.
De Berk was a pediatric nurse at Juliana Children’s Hospital in The Hague. In 2001, after a baby died while she was on duty, a colleague told superiors that De Berk had been present at a suspiciously high number of deaths and resuscitations. Hospital staff immediately informed the police. When investigators reexamined records from De Berk’s shifts, they found 10 suspicious incidents. Three other hospitals where De Berk had previously worked added another 10. The probability of such a pattern happening by chance was one in 7 billion, the police said. De Berk was arrested on 13 December 2001, suspected of murdering five children. Newspapers called her a “murder nurse” and an “angel of death.”
Gill, then working as a statistics professor at Leiden University, remembers his wife telling him about a “witch trial” and saying, “They’re using statistics; you should get involved, do something useful.” But Gill knew the statistician working on the case and considered him a decent, careful person. “So I thought I didn’t have to. And anyway, I was obsessed with quantum mechanics,” he says. In 2003, De Berk was found guilty of four murders and three attempted murders and sentenced to life in prison. An appeals court convicted her again in 2004. The Dutch Supreme Court upheld the conviction 2 years later.
It wasn’t until late 2006, when Gill read two whistleblowers’ account of the trial, that he started to look into the case—and became incandescent. Tunnel vision, bad statistics, and poor human intuitions about coincidence had marred the investigation. When Gill ran the numbers himself, he found the string of deaths on De Berk’s watch might well be entirely due to coincidence. Along with fellow statisticians, whistleblowers, and others, Gill campaigned for a retrial that eventually led to De Berk’s exoneration in 2010. Her case is now considered one of the worst miscarriages of justice in the Netherlands.
It also opened a new chapter in Gill’s professional life: He became a leading expert on the statistics of medical murder cases similar to De Berk’s—and a loud, persistent voice warning of the shoddy statistics that are sometimes central to prosecutors’ arguments. “In a normal murder case, you actually have a body which has clearly been murdered,” he says. When there’s only a suspicious cluster of deaths, investigators may assume a murderer is at work and selectively focus on evidence that supports that assumption. People’s intuition of an “impossible coincidence” joins the dots in the evidence.
Gill worked with defense lawyers and campaigned—in vain—to overturn the conviction of British nurse Ben Geen, found guilty in 2006 of two murders and 15 counts of grievous bodily harm. He also helped secure the October 2021 acquittal of nurse Daniela Poggiali, accused of two murders in a high-profile case in Italy. By now, the misuse of statistics has drawn enough attention that prosecutors sometimes insist their evidence is not statistical, Gill says, but often, “hidden statistics” seep into the cracks.
In a report peer reviewed and distributed by the Royal Statistical Society (RSS) in September 2022, Gill and colleagues detailed the statistical missteps in past medical murder trials and made recommendations for how legal systems can do better. Gill hopes the report will help with the case of another British nurse, Lucy Letby, who is now on trial for the alleged murder of seven babies and attempted murder of 10 more in a neonatal unit at the Countess of Chester Hospital.
“Similar issues have arisen across many, many different jurisdictions,” says criminologist William Thompson, professor emeritus at the University of California, Irvine, and a co-author of the RSS report. “The same investigative dynamics play out … the same cognitive biases, and the tunnel vision.” Gill likes to point out such errors with an outspokenness that frequently ruffles feathers, says statistician Peter Grünwald of the Center for Mathematics and Informatics, a friend and colleague who also campaigned for De Berk’s retrial. “He will give very radical opinions. … But somehow he’s a very pleasant person to disagree with.”
GILL HAS NOT ALWAYS been a troublemaker. His career has been defined by long detours into the bowels of arcane mathematical problems. Injustice bothers him—but so does error. He spends a lot of time debating quantum mechanics “crackpots” on the internet.
Gill had a serene childhood in the English countryside. His father, a physicist, spent his career in industry. His mother, Gill discovered after World War II intelligence was declassified in 1974, had been one of the human “computers” who helped crack Germany’s Enigma code at an outstation of Bletchley Park. “I wasn’t surprised,” he says. “I always thought I got my brains as much from her as from my father.”
At the University of Cambridge, where he studied math, it was statistics that most captured his attention. It had “weighty ethical and philosophical implications,” he says. “It was a branch of mathematics that really means something.”
As a student, he was not much of an activist. He says he feels guilty about not speaking up more about injustice when he was young. One incident in particular haunts him: his role as a statistician in a 1970s experiment that severed the front legs of rats to investigate whether bipedalism reshaped their skulls. “What upset me most is that I didn’t have the strength of character to refuse to do that job.”
The study was one of Gill’s first assignments as a statistical consultant at what is now the Center for Mathematics and Informatics in Amsterdam. Newly married, with three children born in quick succession, his most pressing concern was finding and keeping a good job, and the variety of consulting projects at the center fit neatly with his desire to do something practical. He obtained a Ph.D. on the mathematical underpinnings of “survival analysis,” the study of the expected time until an event—such as a mechanical failure or a death in a clinical trial—occurs. Later, statistical problems in quantum mechanics were his main focus.
And then he started to look into the story of Lucia de Berk.
DE BERK’S CASE became famous for a number: one in 342 million. That was the probability that the many “incidents” on her shifts were due to random bad luck, according to Henk Elffers, a law psychologist then at the Netherlands Institute for Crime and Law Enforcement and an expert witness for the prosecution. His figure was less stark than the police’s one in 7 billion, but still very damning.
Elffers’s reasoning was controversial and came under fire from statistical experts during De Berk’s appeals. He had calculated a separate probability for the pattern of deaths on each ward where De Berk worked and then multiplied those probabilities together. Because a product of probabilities shrinks with every added factor, this approach would make any nurse look guiltier with each job change. For example, even a mundane one in 20 chance at one hospital, and the same chance at the next, would transform into a more suspicious one in 400 chance.
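Why the multiplication misleads can be checked with a little arithmetic. The sketch below is an illustration, not anything taken from the court record. It assumes each ward's figure behaves like an independent p-value that is uniformly distributed when nothing unusual is going on. Under that assumption, the chance that the product of k such values is at or below some number c equals c times the sum of (-ln c)^j / j! for j from 0 to k-1, the same relationship that underlies Fisher's method for combining p-values. The function name is hypothetical.

```python
import math


def product_tail_probability(c: float, k: int) -> float:
    """Chance that the product of k independent Uniform(0, 1) p-values is <= c."""
    if not 0.0 < c <= 1.0:
        raise ValueError("c must lie in (0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    log_term = -math.log(c)
    # -ln(product) follows a Gamma(k, 1) distribution; this is its upper tail.
    return c * sum(log_term ** j / math.factorial(j) for j in range(k))


# Two hospitals, each with an unremarkable one-in-20 result (numbers from the text).
naive = (1 / 20) * (1 / 20)
corrected = product_tail_probability(naive, k=2)

print(f"naive product: 1 in {1 / naive:.0f}")
print(f"chance of a product that small: 1 in {1 / corrected:.0f}")
```

Under these assumptions, two one-in-20 results together are about as surprising as a one-in-57 event, not a one-in-400 event, and the gap widens with every additional ward.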
But prosecutors had additional evidence: Investigators had found traces of the heart medication digoxin in the body of one alleged victim and an overdose of the sedative chloral hydrate in another. With this evidence of foul play, a court ruled in De Berk’s first appeal, other deaths could be safely attributed to her with weaker evidence—such as the overall “pattern” of incidents, and her diary, which spoke of her “very great secret” and “compulsion.” The appeal was essentially a retrial and the new court convicted De Berk again, adding three additional murders to her count. De Berk, who suffered a stroke 5 days after her failed second appeal, maintained her innocence throughout.
That might have been the end of the case if it hadn’t been for Metta de Noo, a geriatrician who had inside information. De Noo’s sister-in-law was the head pediatrician at Juliana, where De Berk worked, and had aided the police investigation. But when De Noo examined documents from the case, she found what she believed were flaws in the medical evidence. The infant who had allegedly died of digoxin poisoning had been declining for days after heart surgery. And the hospital had prescribed the maximum dose of chloral hydrate for the other child, allowing additional doses if needed. De Berk had been agitating for doctors to pay attention to the child’s deteriorating condition.
When De Noo asked specialists for support, she met with hostility and ridicule. Her doggedness destroyed her good relationship with her brother and his wife. She eventually turned to Ton Derksen, her other brother and a philosopher of science who had spent his career writing about flaws in reasoning of the type that permeated the De Berk investigation.
With De Noo’s help, Derksen published a bombshell book in 2006: Lucia de B.: Reconstruction of a Miscarriage of Justice. (In the Netherlands, suspects’ last names are commonly withheld to protect their privacy.) Derksen dismantled the figure of one in 342 million, giving a meticulous account of statistical errors, weak medical evidence, and bias in the investigation. For example, investigators examining the “incidents” connected with De Berk had classified deaths and resuscitations as suspicious when she was on duty, and not suspicious when she was off.
The prosecution had also argued that De Berk’s ward had seen a total of five deaths between 1996 and 2001, and all had occurred after De Berk had started working in 1999. But the ward had a different name until 1999, and earlier deaths were excluded, Derksen found. In reality, there were seven deaths in the 3 years before De Berk joined and six in the 3 years after. (De Noo published her own account of the case—and the way it tore her family apart—in 2010.)
Grünwald, then a young assistant professor, brought Derksen’s book to Gill’s attention and asked whether he would join a campaign for De Berk’s case to be reopened. Gill says reading the book made him “absolutely furious” with himself for trusting Elffers and not getting involved earlier. And he was angry that the appeals court had claimed its verdict did not rely on statistics: “Ton Derksen showed that it was soaked in statistics.”
Gill quickly reanalyzed the data himself. In a write-up posted online in January 2007, he reported a much less outlandish probability of one in 100,000—even before removing biases in the data. Gill has refined his analysis over the years, building in complexities such as the fact that nurses could be expected to have different mortality rates based on their skill, choices, and work patterns. In a paper in Chance in 2018, he and colleagues calculated a probability of one in 49.
Cognitive biases can easily lead an investigation astray and have drastic effects on how suspicious a cluster of deaths seems. In this imaginary example drawn from real-world errors, a doctor reports that many deaths seem to occur while Nurse X is on duty. The hospital launches an investigation, reexamining deaths at Nurse X’s ward over the past 2 years. A simple statistical test compares the rate of suspicious deaths when Nurse X is on duty with the rate when she is off. It then calculates the probability of seeing this pattern purely as a result of random chance. The outcome depends greatly on the type of investigation.
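The kind of test this example describes can be sketched in a few lines. The code below uses a one-sided Fisher's exact test, which is an assumption because the text does not name the test, and it treats each shift as having at most one incident. All shift and incident counts are invented for illustration. It compares an investigation that classifies incidents without knowing who was on duty with one in which three borderline incidents shift from the "off duty" column to the "on duty" column.

```python
from math import comb


def fisher_one_sided(on_incidents, on_shifts, off_incidents, off_shifts):
    """One-sided Fisher exact p-value.

    Chance that at least `on_incidents` of all incidents land on the nurse's
    shifts if incidents fall on shifts purely at random.
    """
    total_shifts = on_shifts + off_shifts
    total_incidents = on_incidents + off_incidents
    denominator = comb(total_shifts, total_incidents)
    upper = min(on_shifts, total_incidents)
    tail = sum(
        comb(on_shifts, k) * comb(off_shifts, total_incidents - k)
        for k in range(on_incidents, upper + 1)
    )
    return tail / denominator


# Hypothetical ward: 1,000 shifts in 2 years, Nurse X works 200 of them.
ON_SHIFTS, OFF_SHIFTS = 200, 800

# Blinded review: 12 incidents judged suspicious, 4 on her shifts.
blinded = fisher_one_sided(4, ON_SHIFTS, 8, OFF_SHIFTS)

# Unblinded review of the same records: 3 borderline cases on her shifts are
# now called suspicious and 3 on other shifts are dismissed.
biased = fisher_one_sided(7, ON_SHIFTS, 5, OFF_SHIFTS)

print(f"blinded classification: p = {blinded:.4f}")
print(f"biased classification:  p = {biased:.6f}")
```

The total number of incidents is identical in both runs. Only the labels differ, which is enough to make the same ward history look far less likely to be chance.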
In 2007, convinced of De Berk’s innocence, Gill organized a petition to reopen the case. His quantum mechanics work was “useful after all,” he says, because he persuaded Nobel Prize–winning physicist Gerard ’t Hooft to sign, which generated headlines. But in other ways Gill was less diplomatic. He called some doctors “criminals” and said “outrageous things” to journalists, Grünwald says: “Metta, Ton, and I basically had to hold him back.” Haga Hospital even threatened to sue him after he posted previously unpublished details about the case on his website.
Yet Grünwald says Gill’s cheerful fearlessness was crucial. Many Dutch statisticians knew and liked Elffers, he says. “People … were afraid to say out loud that he was doing something stupid and nonsensical. Richard had no problems with that at all.” (Elffers did not respond to multiple requests for comment.)
The efforts paid off. In 2006, the Commission for the Evaluation of Closed Criminal Cases decided to reconsider the case and appointed a subcommittee to investigate. In a “drab government building” in The Hague, Gill helped explain how bungled statistics had put De Berk in prison. In 2007, the commission recommended reopening the case; in 2008, the Dutch Supreme Court agreed. That same year, the Dutch government suspended De Berk’s sentence and she was released from prison, pending a retrial.
THE MISTAKES in De Berk’s case were far from unique, Gill and others say. “We humans are terribly good at seeing patterns when they’re not there,” says statistician Peter Green, a professor emeritus at the University of Bristol and one of the RSS report’s authors.
Investigators sometimes enhance those patterns by only tallying the evidence that confirms their theory, discarding or not even noticing data that don’t. Even investigators who aim to be unbiased can make minor choices that add up to a skewed picture, Thompson says. “You end up with a piece of evidence that looks extraordinarily unlikely to have occurred by chance. And of course, the problem is it didn’t exactly occur by chance—you kind of helped it along.”
Gill worries this is what led to the 2006 conviction of Geen, who was given 17 life sentences, with a minimum term of 30 years. Prosecutors argued there was a high rate of unexplained respiratory arrests—which are typically rarer than cardiorespiratory arrests—on Geen’s shifts, although they did not try to quantify the probability that this “unusual pattern” occurred by chance. As in De Berk’s case, there was other evidence, including the fact that Geen had in his pocket a syringe containing muscle relaxant when he was arrested. The prosecution argued that he had injected patients with the drug in order to cause respiratory arrest and then play the hero by resuscitating them.
Geen’s defense lawyers challenged the “unusual pattern” in a 2009 appeal, submitting a report by University of Warwick medical statistician Jane Hutton. The appeal judges upheld the conviction. “The judges seemed to be very overconfident that they could detect an unusual pattern without putting in some of the most basic information that you need as a comparison,” Hutton says.
In a 2022 paper published in Laws, Gill and colleagues argued that blinded investigators might have reached different conclusions about Geen’s case. The high rate of respiratory arrests on his shifts was accompanied by a drop in cardiorespiratory arrests, suggesting a bias in how these cases were classified. Compared with data from the same hospital over a wider time period, the deaths and resuscitations on Geen’s shifts do not seem extraordinary, Gill and his co-authors said. He and other statisticians wrote letters of support in 2015 when Geen asked the Criminal Cases Review Commission to look into his case. The request was denied; Geen remains in prison.
EVEN WHEN statistical experts do get involved in a case, they may not be immune to errors of reasoning, as Elffers’s work showed. In the case of Poggiali, the Italian nurse, statisticians wrote that a very high level of statistical significance is a “guarantee” that “there is a causal effect”—in this case between Poggiali being on duty and the deaths. But this is a well-known error of reasoning: “Correlation is not causation,” Green says. Thompson says clusters may have surprising causes that are difficult or impossible to uncover. He points to cases where chemicals leached from equipment or changes in baby formula were at fault.
Gill and his colleagues found that Poggiali’s death rate was higher than her colleagues’, even after various controls, but argued this could be at least partly explained by Poggiali’s long hours: She arrived very early for her shifts and left late, which meant she was present at more death certifications during shift handovers. They also pointed out a statistical flaw in the medical evidence: A toxicologist had said the potassium concentration found in the eye of one alleged victim was unexpectedly high, suggesting potassium chloride poisoning. But this did not take into account any statistical uncertainty in the data on expected levels of potassium, Gill and colleagues wrote in a 2021 paper in Law, Probability and Risk summarizing the findings that had helped secure Poggiali’s acquittal.
The Letby case now in court shows many of the same troubling features as earlier cases, Gill and others say. Letby was moved to clerical duties in 2016 after a series of deaths and resuscitations on her shifts, and first arrested in 2018. She is accused of murdering seven babies and attempting to murder 10 more, using methods such as insulin poisoning and injection of air bubbles.
The similarities go beyond statistics to the way Letby has been vilified. Social media commentary will “make your stomach turn,” Gill says. “People are saying we should bring back hanging, shoot the bitch.” The media have portrayed her as an “evil creature,” says Neil Mackenzie, a lawyer based in Edinburgh, Scotland, who specializes in medical negligence cases and co-authored the RSS report. “I think there’s possibly misogyny in there,” Mackenzie says. “The press loves bad women.”
The RSS report Gill and others published in September does not claim Letby is innocent, in part because public comment on the guilt or innocence of a person standing trial may be considered contempt of court in U.K. legal systems. “We’ve got to have no opinion on this case,” Green says, but “there’s potential here for miscarriage of justice.”
Gill says a deep cognitive bias works against defendants like Letby. People “don’t believe in chance, actually,” he says. “Quantum mechanics has been shouting at us for 100 years that the physical universe is built on randomness. … But we don’t understand this. It upsets us deeply. When a succession of bad things happens, we know there must have been an agent responsible. And so we naturally believe in devils and witches, gods and angels.”
NOT ALL MEDICAL MURDER cases are witch hunts, however. “This is an instance where there actually are some witches,” Thompson says.
In 2000, for example, a British physician named Harold Shipman was convicted of murdering 15 patients over a period of 3 years after an investigation yielded evidence that he had given overdoses of diamorphine—heroin, used in the United Kingdom for severe pain—and falsified the medical records of numerous patients, suggesting they had been sicker than they were to make their deaths appear less suspicious. (Shipman was in one patient’s will but his motives have not become clear.) Shipman, who is suspected of killing hundreds more, was sentenced to life in prison and died by suicide in 2004.
A 5-year government inquiry in the wake of the case identified ways to better protect patients, such as more oversight of death certificates. The case also led statisticians to explore a new question: Could statistics detect real murderers, based on a suspicious pattern of deaths alone?
Cambridge statistician David Spiegelhalter, who gave advice to the panel, believes so. He and his colleagues adapted a method from industrial quality control to compare the rate of death certificates signed by Shipman over time with deaths at other local doctors’ practices. They found they could have identified a worrying pattern in Shipman’s patients 13 years before he was arrested.
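One family of quality-control tools that fits this description is the cumulative sum (CUSUM) chart. The sketch below is a generic Poisson CUSUM, not necessarily the method Spiegelhalter's team used. Every number in it is a made-up assumption: the expected deaths per year, the doubled death rate the chart is tuned to detect, and the alarm threshold.

```python
import math


def poisson_cusum(observed, expected, rate_ratio=2.0, threshold=3.0):
    """Return the index of the first period that triggers an alarm, or None.

    observed:   death counts per period for one doctor
    expected:   counts expected from comparable practices (hypothetical here)
    rate_ratio: the elevated rate, relative to expected, the chart looks for
    threshold:  alarm level for the accumulated evidence (illustrative value)
    """
    log_ratio = math.log(rate_ratio)
    score = 0.0
    for index, (obs, exp) in enumerate(zip(observed, expected)):
        # Poisson log-likelihood ratio: elevated rate versus expected rate.
        evidence = obs * log_ratio - (rate_ratio - 1.0) * exp
        # Evidence accumulates, but the score never drops below zero.
        score = max(0.0, score + evidence)
        if score >= threshold:
            return index
    return None


expected = [5.0] * 10
ordinary_doctor = [4, 6, 5, 3, 7, 5, 4, 6, 5, 5]
worrying_doctor = [5, 4, 9, 10, 11, 12, 10, 11, 9, 12]

print("ordinary doctor alarm at period:", poisson_cusum(ordinary_doctor, expected))
print("worrying doctor alarm at period:", poisson_cusum(worrying_doctor, expected))
```

In this toy data the ordinary doctor never triggers an alarm, while the second doctor does in the fourth year (index 3). An alarm of this kind says nothing about cause. It only marks the point at which a person should start looking at the records.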
Such a system would produce false alarms; more people might die under the care of a doctor or nurse who works with particularly difficult cases, for example. But a robust method would prevent too many misfires, Spiegelhalter says, and a “ping” in the system should never be taken as anything more than a sign that a human should look at the data.
But implementing this kind of routine monitoring would be very complicated, says Bruce Guthrie, professor of general practice at the University of Edinburgh. The kind of data Spiegelhalter and his colleagues used is not routinely collected—it was pieced together as part of the Shipman investigation. And Shipman worked alone, which few family doctors do; many patients are likely to see multiple doctors. Only the “most horrendously prolific murderers” would be likely to show up, Guthrie wrote in an email to Science.
Meanwhile Thompson, Gill, and others are calling for cultural and institutional fixes to prevent unjust convictions. Many lawyers find statistics challenging, Mackenzie says. “This is one of the evils of the ‘two cultures’ myth,” he says: Some students are channeled into scientific subjects, and others into humanities, and “never the twain shall meet.” Niamh Nic Daeid, a forensic science researcher at the University of Dundee, says she routinely encounters anxiety and resistance about statistics. Nic Daeid, Spiegelhalter, and others have produced a range of statistics training materials, including an RSS “primer” and a free online course for lawyers.
But training is not enough, Thompson says, because the biases that underlie errors are “built into our perceptual processes.” Instead, he says, it’s crucial to change investigative procedures. The RSS report recommends that investigators be blinded. For example, pathologists should classify deaths as suspicious or not without knowing which medical personnel were in attendance, adapting the standardized blinding methods used in epidemiology to study disease outbreaks.
But blinding has proved to be a hard sell among forensic scientists, in part because it’s often more challenging than it seems, says Peter Stout, CEO of the Houston Forensic Science Center and a strong advocate of blinding and other measures to improve forensic science. It can mean, for example, that a forensics lab—already strapped for funding and time—needs an extra person to serve as a case manager who screens possibly biasing information from a blinded analyst. And the line between relevant and irrelevant information is not always clear. Decades ago, before opioids were rampant in the United States, Stout and his colleagues spent weeks running every test they could think of on a sample, before an investigator finally told them to look for fentanyl. “Masking created a huge cost,” he says.
Adele Quigley-McBride, a cognitive bias researcher at Duke University, trains analysts in a technique called sequential unmasking. The method gets around the blurred line between relevant and irrelevant information by giving investigators access to increasing amounts of information with each round of analysis. Analysts note their observations and conclusions in each round; if new information changes their opinion, they have to explain why.
GILL WAS AT THE COURT of appeals in Arnhem in 2010 when De Berk’s exoneration was announced. “It was one of the biggest events of my life,” he says. “It was really joyful.” De Berk was immediately rushed off by her lawyers and journalists swarmed Derksen and De Noo. “I bought a marijuana cigarette,” Gill says, and then he took a train to The Hague and went to the beach. “I smoked my joint, and I ate a dish of oysters, and drank some white wine.”
De Berk later received a written apology from the Dutch minister of justice and an undisclosed financial compensation for the 6.5 years she spent in prison. Gill stays in contact with her; she likes his posts on Facebook sometimes. She told Gill she did not want to be interviewed for this story. “She’s managed to put it all far away, and she needs to keep it that way,” he says.
Thirteen years on, Gill, now retired, is watching the Letby case closely, but his obsession with forensic statistics has begun to subside. His retirement projects include a range of statistical kerfuffles with lower stakes, such as the rating of Dutch herring sellers. He has plenty of other things to occupy his attention—winemaking, an amateur distillery, grandchildren. “I think I’ve reached the point where I want to spend more time in the forest picking mushrooms, actually,” he says.
He hopes younger statisticians will feel compelled to help when bad statistics lead to injustice, as he did. “I sensed that in the Lucia case, I could make a difference,” Gill says. “And that therefore I must.”
Next month, a judge in Sydney will hear new expert testimony in a criminal case that has fascinated Australia for 2 decades: that of Kathleen Folbigg, who in 2003 was convicted of the murder of three of her infant children and manslaughter in the death of the fourth.
There is no medical evidence that Folbigg’s children were murdered. Her case rests partly on the vanishingly small chance that unexplained medical tragedy would strike the same family four times. Like some other infanticide cases, it parallels the murder convictions of doctors and nurses based on suspicious clusters of patient deaths. As those cases show, seemingly common-sense statistical assumptions can mislead—with horrifying consequences.
Folbigg’s children all died between 1989 and 1999, at ages between 19 days and 19 months. Her husband reported Folbigg to the police after discovering her diary, in which she had described anger and frustration with her children, and a sense of responsibility for their deaths: “With Sarah all I wanted was her to shut up. And one day she did.”
For each child, doctors found possible, but not definitive, evidence for natural causes of death. Yet taken together, expert witnesses said, the deaths were suspicious, because multiple cases of sudden infant death syndrome (SIDS) within a single family are extremely rare—let alone four of them. The New South Wales Supreme Court sentenced Folbigg to 40 years in prison, reduced to 30 years by a 2005 appeal. A 2019 inquiry upheld her conviction, and a 2021 appeal was dismissed.
Critics say the case rested heavily on the reasoning popularized in the 1990s by British pediatrician Roy Meadow, who asserted that with respect to child deaths, “one is a tragedy, two is suspicious, and three is murder unless there is proof to the contrary.” Pediatrician Susan Beal cited a variation of “Meadow’s law” during a 2003 hearing on what evidence could be admitted in Folbigg’s trial.
Meadow testified in court cases himself as well. But his reputation fell apart after the case of British solicitor Sally Clark, who in 1999 was convicted of murdering her two infant sons. Meadow testified that the chance of two SIDS deaths in a low-risk family like Clark’s was one in 73 million. That calculation assumed SIDS could not have inherited risk factors, statistician Phil Dawid of the University of Cambridge wrote in a report for Clark’s first appeal in 2000. He put the chance of the two deaths at a less outlandish one in 1 million, “or even much higher,” and pointed out that double infanticide is also vanishingly rare. The court should weigh both rare possibilities against each other, he says, along with all the other evidence.
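The effect of the independence assumption is easy to reproduce. The sketch below starts from a single-death risk of one in 8,543, the figure widely reported as the basis of Meadow's calculation (it does not appear in this article), and squares it. It then shows how the answer changes if a second SIDS death is more likely in a family that has already had one. The dependence multipliers are hypothetical, chosen only to show the direction and size of the effect.

```python
def chance_of_two_deaths(single_risk: float, dependence: float = 1.0) -> float:
    """Chance of two SIDS deaths in one family.

    dependence = 1.0 treats the deaths as independent; larger values mean a
    second death is that many times more likely after a first (hypothetical).
    """
    second_risk = min(1.0, single_risk * dependence)
    return single_risk * second_risk


single_risk = 1 / 8543  # widely reported input to the one-in-73-million figure

for dependence in (1, 10, 50):  # 10 and 50 are illustrative assumptions
    p = chance_of_two_deaths(single_risk, dependence)
    print(f"dependence x{dependence}: about 1 in {1 / p:,.0f}")
```

With independence, the result is about one in 73 million. Any shared genetic or environmental risk pushes the figure toward far less extreme odds. Even then, the number is only the chance of two natural deaths, which still has to be weighed against the equally rare alternative of double infanticide.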
Clark lost the appeal, but she was exonerated at a second appeal in 2003, partly because it came to light that pathologist Alan Williams had failed to disclose evidence that one of the babies had Staphylococcus aureus in his spinal fluid, a possible natural cause of death. The appeal judges said Meadow’s statistical evidence—which could have had “a major effect” on the jury—should not have been admitted.
Williams was barred from working for the U.K. Home Office for 3 years and Meadow lost his medical license, a decision later overturned by the U.K. High Court. After the scandal, the attorney general ordered a review of 297 infanticide cases, and decided to drop charges in three cases and review the convictions in 28 others.
There may be exculpatory medical evidence in Folbigg’s case as well. In 2020, a group of researchers led by Peter Schwartz at the Italian Auxological Institute published a paper showing Folbigg’s two daughters both had a newly discovered genetic variant that impairs cells’ ability to regulate calcium, leading to a greatly increased risk of cardiac arrhythmia and sudden death. The paper led the Australian Academy of Science and Folbigg’s lawyers to launch a petition in March 2021, signed by 90 scientists, asking New South Wales Governor Margaret Beazley to pardon Folbigg. Beazley ordered a new inquiry; hearings are due to begin in February.
Clark, despite her vindication, never recovered and died of acute alcohol poisoning in 2007. Her family and the coroner’s office attributed the death to severe distress from the “catastrophic experience.”
© 2023 American Association for the Advancement of Science. All rights reserved. AAAS is a partner of HINARI, AGORA, OARE, CHORUS, CLOCKSS, CrossRef and COUNTER.

