Statistical Models - Theory and Practice

Chapter 1 - Observational Studies and Experiments

Summary

This chapter introduces basic terminology and walks through three studies that show statistical methods in action.

We learn about the difference between randomized controlled experiments and observational studies. In a randomized controlled experiment, subjects are randomly assigned to one of two groups: treatment and control. Members of the treatment group receive the “treatment”, whereas members of the control group do not. Randomization is expected to minimize confounding (other factors affecting our results without our knowing) because the treatment and control groups should be, on average, the same apart from whatever difference the treatment produces.

The three studies described are:

  1. The HIP Trial: A randomized controlled trial that showed the benefits of screening women aged 40-64 using mammography.
  2. Snow’s Cholera Study: A natural experiment in which Snow, after much effort and careful investigation, proved that cholera was spread via the water supply.
  3. Yule’s Pauper Study: An observational study in which Yule tried but, as the book argues, failed to show that policy choices affected the number of poor in England.

In 1, we see the importance of keeping selection bias out of the comparison between treatment and control. Specifically, the researchers evaluating the HIP trial made the wise decision to compare the whole treatment group with the whole control group, rather than biasing their evaluation by excluding the women in the treatment group who refused screening.

In 2, we see how Snow carefully sought out ways to mimic controlled studies in the natural environment in order to eliminate confounding factors. For example, Snow compared cholera rates among people receiving water from two sources, one relatively pure and one contaminated, to argue that water quality was the only plausible explanation for the different cholera rates in the two groups.

In 3, we see how Yule mistook correlation for causation in his results. The author hints at but doesn’t fully explain the difference between association and causal inference, concluding that Yule only showed the former.

Exercises

  1. In the HIP trial (table 1), what is the evidence confirming that treatment has no effect on death from other causes?

    The death rates from “All other” causes show no significant difference between the Treatment and Control groups, which confirms that treatment does not affect death from other causes.

  2. Someone wants to analyze the HIP data by comparing the women who accept screening to the controls. Is this a good idea?

    No, this is a bad idea. If we look at Table 1’s “All other causes” column and compare the rates for the two treatment sub-groups, we see that the women who didn’t accept screening had much higher death rates from other causes. Since screening cannot plausibly explain that gap, it hints at a confounding factor (outside of mammography screenings) that likely influences breast cancer rates as well. For example, the text mentions that women who accepted screening were on the whole more affluent, a factor that affects overall health and could also affect breast cancer risk independent of mammography screenings.

    To summarize, it’s a bad idea because we’d knowingly be treating the two groups as comparable while ignoring confounding factors that make one group’s members, on average, healthier than the other’s.
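
The selection effect described above can be sketched with a small simulation. Everything in it is a made-up assumption for illustration, not HIP data: the share of affluent women, the acceptance probabilities, and the death rates are hypothetical, and screening is given no effect at all on the simulated outcome (death from other causes).

```python
import random

def simulate_arm(n, offered_screening, rng):
    """Return a list of (accepted_screening, died_other_causes) for n women.

    All probabilities below are hypothetical, chosen only for illustration.
    """
    records = []
    for _ in range(n):
        affluent = rng.random() < 0.5
        # Assumption: affluent women are more likely to accept screening.
        accepted = offered_screening and rng.random() < (0.9 if affluent else 0.4)
        # Assumption: death from other causes depends on affluence only,
        # never on screening.
        died = rng.random() < (0.01 if affluent else 0.05)
        records.append((accepted, died))
    return records

def death_rate(records):
    """Fraction of records in which the woman died from other causes."""
    return sum(died for _, died in records) / len(records)

rng = random.Random(0)
treatment = simulate_arm(200_000, True, rng)
control = simulate_arm(200_000, False, rng)
screened = [r for r in treatment if r[0]]
refused = [r for r in treatment if not r[0]]

rate_treatment = death_rate(treatment)
rate_control = death_rate(control)
rate_screened = death_rate(screened)
rate_refused = death_rate(refused)

print(f"whole treatment group: {rate_treatment:.4f}")
print(f"control group:         {rate_control:.4f}")
print(f"screened only:         {rate_screened:.4f}")
print(f"refused only:          {rate_refused:.4f}")
```

Under these assumptions, the whole treatment group and the control group have nearly the same death rate, while the screened-only group looks healthier than the controls even though screening does nothing in the simulation. The gap comes entirely from who chooses to be screened.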

  3. Was Snow’s study of the epidemic of 1853–54 (table 2) a randomized controlled experiment or a natural experiment? Why does it matter that the Lambeth company moved its intake point in 1852? Explain briefly.

    It was a natural experiment that had many of the positive qualities of a randomized controlled experiment. Technically, we can’t prove that the subjects were assigned randomly and without bias, but given that they selected their water providers before the two companies were using different sources, we can say with high confidence that the assignment is effectively random.

    The Lambeth company moving its intake point in 1852 matters for two reasons:

    1. People had chosen their water providers before the move, so the choice was not based on taste or other confounding factors tied to water quality.
    2. The move provided a convenient setup for an observational study, with one group receiving relatively pure water and the other receiving sewage-contaminated water.

  4. Was Yule’s study a randomized controlled experiment or an observational study?

    Yule’s study was an observational study. He was comparing outcomes for groups that differed in many key ways, income being the obvious one. That’s why he had to run a regression on his data.

  5. In equation (2), suppose the coefficient of ΔOut had been –0.755. What would Yule have had to conclude? If the coefficient had been +0.005?

    If the coefficient of ΔOut had been –0.755, Yule would’ve had to conclude that out-relief reduced poverty. If the coefficient had been +0.005, Yule would’ve had to conclude that out-relief had essentially no impact on poverty.

  6. Suppose $$X_1, X_2, \ldots, X_n$$ are independent random variables, with common expectation $$\mu$$ and variance $$\sigma^2$$. Let $$S_n = X_1 + X_2 + \cdots + X_n$$. Find the expectation and variance of $$S_n$$. Repeat for $$S_n/n$$.

    Expectation of $$S_n$$ is $$n\mu$$ and of $$\frac{S_n}{n}$$ is $$\mu$$.

    Variance of $$S_n$$ is $$n\sigma^2$$. Variance of $$\frac{S_n}{n}$$ is $$\frac{\sigma^2}{n}$$. (Note: I got this wrong originally, because I didn’t understand that scaling a random variable by a factor of $$c$$ scales its variance by a factor of $$c^2$$, so dividing by $$n$$ divides the variance by $$n^2$$.)
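
    For reference, here is a short derivation of these results. It uses only linearity of expectation, the fact that variances of independent random variables add, and the scaling rule $$Var(cX) = c^2 Var(X)$$:

    $$E(S_n) = \sum_{i=1}^{n} E(X_i) = n\mu$$

    $$Var(S_n) = \sum_{i=1}^{n} Var(X_i) = n\sigma^2 \quad \text{(by independence)}$$

    $$E\left(\frac{S_n}{n}\right) = \frac{1}{n} E(S_n) = \mu$$

    $$Var\left(\frac{S_n}{n}\right) = \frac{1}{n^2} Var(S_n) = \frac{\sigma^2}{n}$$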

  7. Suppose $$X_1, X_2, \ldots, X_n$$ are independent random variables, with a common distribution: $$P(X_i = 1) = p$$ and $$P(X_i = 0) = 1 - p$$, where $$0 < p < 1$$. Let $$S_n = X_1 + X_2 + \cdots + X_n$$. Find the expectation and variance of $$S_n$$. Repeat for $$S_n/n$$.

    $$E(S_n) = np$$ and $$Var(S_n) = n \, Var(X_i) = np(1 - p)$$, since $$Var(X_i) = E((X_i - E(X_i))^2) = p(1 - p)$$.

    Similarly, $$E(\frac{S_n}{n}) = p$$. $$Var(\frac{X_i}{n}) = \frac{p(1-p)}{n^2}$$ and $$Var(\frac{S_n}{n}) = \frac{p(1-p)}{n}$$.
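
As a sanity check on these formulas, the sketch below computes the mean and variance of $$S_n$$ exactly from the binomial probabilities, with no simulation. The function name `binomial_moments` and the values `n = 20`, `p = 0.3` are arbitrary choices for illustration.

```python
from math import comb

def binomial_moments(n, p):
    """Exact mean and variance of S_n = X_1 + ... + X_n for Bernoulli(p) X_i."""
    # P(S_n = k) for k = 0..n
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
    return mean, var

n, p = 20, 0.3  # arbitrary illustrative values
mean_sn, var_sn = binomial_moments(n, p)

# Dividing S_n by n divides the mean by n and the variance by n^2.
mean_avg, var_avg = mean_sn / n, var_sn / n**2

print(f"E(S_n)     = {mean_sn:.6f}  vs  np        = {n * p:.6f}")
print(f"Var(S_n)   = {var_sn:.6f}  vs  np(1-p)   = {n * p * (1 - p):.6f}")
print(f"E(S_n/n)   = {mean_avg:.6f}  vs  p         = {p:.6f}")
print(f"Var(S_n/n) = {var_avg:.6f}  vs  p(1-p)/n  = {p * (1 - p) / n:.6f}")
```

The two columns should agree up to floating-point rounding.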

  8. What is the law of large numbers?

    The law of large numbers states that, for independent random variables $$X_1, X_2, \ldots, X_n$$ with a common distribution and expectation $$\mu$$, the sample mean $$\frac{S_n}{n}$$ gets closer to $$\mu$$ as the number of samples $$n$$ increases. More precisely, for large $$n$$ the sample mean is very likely to be close to $$\mu$$.
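
A minimal simulation sketch of this, assuming Bernoulli draws with `p = 0.3` and a fixed seed (both arbitrary choices, as is the helper name `running_means`):

```python
import random

def running_means(p, checkpoints, seed=0):
    """Simulate Bernoulli(p) draws and record the sample mean at each checkpoint."""
    rng = random.Random(seed)
    wanted = set(checkpoints)
    successes = 0
    means = {}
    for n in range(1, max(checkpoints) + 1):
        successes += rng.random() < p  # True counts as 1
        if n in wanted:
            means[n] = successes / n
    return means

p = 0.3  # arbitrary illustrative value
means = running_means(p, [10, 100, 1_000, 10_000, 100_000])
for n, mean in means.items():
    print(f"n = {n:>7}: sample mean = {mean:.4f}, |error| = {abs(mean - p):.4f}")
```

The error is not guaranteed to shrink at every step, but it tends to get smaller as `n` grows, in line with $$Var(\frac{S_n}{n}) = \frac{p(1-p)}{n}$$ going to zero.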

  9. Keefe et al (2001) summarize their data as follows:

    Thirty-five patients with rheumatoid arthritis kept a diary for 30 days. The participants reported having spiritual experiences, such as a desire to be in union with God, on a frequent basis. On days that participants rated their ability to control pain using religious coping methods as high, they were much less likely to have joint pain.

    Does the study show that religious coping methods are effective at controlling joint pain? If not, how would you explain the data?

    No. The data could just as easily be explained by reverse causation: people may have been more likely to do their religious coping exercises, and to rate them as effective, on days when they had less joint pain.

  10. According to many textbooks, association is not causation. To what extent do you agree? Discuss briefly.

    Association is often a good indicator that causation may be present, but it’s in no way sufficient for proving causation.