A caveat of ANOVA is that if we reject the null hypothesis we conclude that the means of the populations under consideration are not all the same. However, the test does not tell us which particular means differ, nor how the means relate to one another.

In order to deal with that issue we apply methods called **multiple comparisons**. The problem with multiple comparisons is that the more hypotheses are tested on a particular data set, the more likely it becomes to incorrectly reject a null hypothesis. Thus, methods of multiple comparison require a stricter (lower) significance threshold (\(\alpha\)) for individual comparisons, in order to compensate for the number of inferences being made.

A **family of tests** is the technical term for a series of tests performed on a data set. The family-wise error rate is the probability of making one or more false discoveries, or **Type I errors**, when performing multiple hypothesis tests.

Recall that at a significance level of \(\alpha = 0.05\), the probability of making a Type I error is equal to \(0.05\) or \(5\)%. Consequently, the probability of not making a Type I error is \(1-\alpha = 1 - 0.05 = 0.95\). Moreover, the probability of observing two events that are independent is the product of their probabilities. Thus, if we conduct two independent tests, the probability of not making a Type I error on the first and the second tests is

\[(1-\alpha) \times (1-\alpha) = (1-\alpha)^2\] If \(\alpha = 0.05\), we get a probability of not making a Type I error on the first and the second test of \((1 - \alpha)^2 = (1-0.05)^2 = 0.95^2 \approx 0.902\).

Written more formally, for a family of \(C\) tests, the probability of not making a Type I error for the whole family is

\[(1-\alpha)^C\text{.}\]

Let us now consider \(C=10\) and \(\alpha = 0.05\); that is, we perform 10 comparisons on one data set. Then, the probability of not making a Type I error on the family is

\[(1-\alpha)^C=(1-0.05)^{10} \approx 0.599\] Consequently, the probability of **making one or more Type I errors** on the family of tests is

\[1 - (1-\alpha)^C\] For our example, we find

\[1 - (1-\alpha)^C = 1 - (1-0.05)^{10} \approx 0.401\]

Thus, with \(\alpha = 0.05\) for each of the 10 multiple comparisons, the probability of incorrectly rejecting the null hypothesis is 0.401 or 40.1%.
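The calculation above is easy to reproduce in R. The small helper function below is ours, added for illustration; it simply implements \(1-(1-\alpha)^C\):

```
# probability of making at least one Type I error in C independent
# tests, each conducted at significance level alpha
fwer <- function(alpha, C) 1 - (1 - alpha)^C

fwer(0.05, 1)   # 0.05 (a single test)
fwer(0.05, 2)   # 0.0975
fwer(0.05, 10)  # ~0.4013 (the example above)
```

Note how quickly the family-wise error rate grows with the number of tests.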

To account for that problem there exist several statistical methods. In this section we discuss the **Bonferroni correction** and the **Tukey Multiple-Comparison Method**, also known as **Tukey’s HSD (honest significant difference) test**.

In this section we redo the example from the previous section. There we applied a one-way ANOVA to **test if the mean annual salary of graduates is different among graduates of different study subjects**. However, this time we will make multiple comparisons in order to analyse the relation among all the group means.

Reload the *students* data set (you may download the `students.csv` file here).

`students <- read.csv("http://www.geog.fu-berlin.de/~soga/200/2010_data_sets/students.csv")`

The *students* data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.

From the *students* data set we randomly sample 275 graduated students and reduce the data set to the two variables of interest, the categorical variable `major` and the numeric variable `salary`. For better readability in the forthcoming analysis we replace the major study subject names with corresponding abbreviations.

```
graduated <- subset(students, graduated == 1)
n <- 275
data.idx <- sample(x = 1:nrow(graduated), size = n)
data <- graduated[data.idx, c('major', 'salary')]
# ensure major is a factor (read.csv() no longer converts strings
# to factors by default in R >= 4.0)
data$major <- factor(data$major)
# rename class labels
levels(data$major) <- c('Bio',  # Biology
                        'EcFi', # Economics and Finance
                        'EnS',  # Environmental Sciences
                        'MaSt', # Mathematics and Statistics
                        'PoS',  # Political Science
                        'SoS')  # Social Sciences
```

Further, we conduct a one-way ANOVA hypothesis test in R by applying the `aov()` function.

```
anova <- aov(salary ~ major, data = data)
summary(anova)
```

```
## Df Sum Sq Mean Sq F value Pr(>F)
## major 5 1.448e+10 2.896e+09 43.7 <2e-16 ***
## Residuals 269 1.783e+10 6.628e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The Bonferroni correction compensates for the increased likelihood of incorrectly rejecting a null hypothesis (making a Type I error) due to multiple comparisons by adjusting the significance level \(\alpha\) in the form of

\[\alpha^* = \frac{\alpha}{m}\text{,}\] where \(\alpha^*\) is the adjusted per-comparison significance level and \(m\) corresponds to the number of pairwise comparisons given by

\[m=\frac{k(k-1)}{2}\text{,}\] where \(k\) corresponds to the number of levels of the factor, which is the categorical classification variable.
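For our example with six study majors, the adjusted significance level works out as follows (a short sketch; the variable names are ours):

```
k <- 6                  # number of factor levels (study majors)
m <- k * (k - 1) / 2    # number of pairwise comparisons
alpha <- 0.05

m           # 15
alpha / m   # Bonferroni-adjusted per-comparison level, ~0.00333
```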

As mentioned in the previous section, a one-way analysis of variance is the generalization of the pooled *t*-test to more than two populations. Thus, the R function to perform multiple comparisons is called `pairwise.t.test()`. The `pairwise.t.test()` function takes as arguments a response vector (`x`), a grouping vector or factor (`g`) and a specified method for adjusting *p*-values (`p.adjust.method`). We may type `p.adjust.methods` in the R console to list all available adjustment methods for that function.

`p.adjust.methods`

```
## [1] "holm" "hochberg" "hommel" "bonferroni" "BH"
## [6] "BY" "fdr" "none"
```

First, we conduct a **pairwise t-test without any adjustment**, thus increasing the probability of incorrectly rejecting the null hypothesis.

```
ptt <- pairwise.t.test(x = data$salary, g = data$major , p.adj = "none")
ptt
```

```
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data$salary and data$major
##
## Bio EcFi EnS MaSt PoS
## EcFi 0.53782 - - - -
## EnS 9.9e-13 3.1e-12 - - -
## MaSt 0.92075 0.48823 5.2e-12 - -
## PoS 1.0e-12 3.0e-12 0.93951 5.2e-12 -
## SoS < 2e-16 < 2e-16 0.00056 < 2e-16 0.00081
##
## P value adjustment method: none
```

The pairwise *t*-test shows that at a significance level of 5% the means for 4 combinations are **not** statistically significantly different. These combinations are EcFi-Bio, MaSt-Bio, MaSt-EcFi, PoS-EnS, with *p*-values of 0.538, 0.921, 0.488, 0.94, respectively. For the remaining 11 combinations we reject \(H_0\); thus, for 11 combinations the means are different at a significance level of 5%!

Second, we conduct a **pairwise t-test with the Bonferroni adjustment**. We use the same data as for the unadjusted pairwise *t*-test, but specify `p.adj = "bonferroni"` in the function call.

```
ptt <- pairwise.t.test(x = data$salary, g = data$major, p.adj = "bonferroni")
ptt
```

```
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data$salary and data$major
##
## Bio EcFi EnS MaSt PoS
## EcFi 1.0000 - - - -
## EnS 1.5e-11 4.7e-11 - - -
## MaSt 1.0000 1.0000 7.8e-11 - -
## PoS 1.5e-11 4.6e-11 1.0000 7.9e-11 -
## SoS < 2e-16 < 2e-16 0.0084 < 2e-16 0.0121
##
## P value adjustment method: bonferroni
```

The pairwise *t*-test with the Bonferroni adjustment shows that at a significance level of 5% the means for 4 combinations are **not** statistically significantly different. These combinations are EcFi-Bio, MaSt-Bio, MaSt-EcFi, PoS-EnS, with *p*-values of 1, 1, 1, 1, respectively (in the Bonferroni procedure \(\alpha\) is divided by the number of tests or, equivalently, the *p*-value is multiplied by that number and truncated back to 1 if the result is \(> 1\), and thus not a probability). For the remaining 11 combinations we reject \(H_0\); thus, for 11 combinations the means are different at a significance level of 5%!
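The truncation behaviour is easy to verify with R's `p.adjust()` function, which implements the Bonferroni adjustment as multiplication by the number of tests, capped at 1 (the raw *p*-values below are illustrative, not taken from the students data):

```
p <- c(0.001, 0.02, 0.4)             # three hypothetical raw p-values
p.adjust(p, method = "bonferroni")   # 0.003 0.060 1.000
pmin(1, p * length(p))               # identical by definition
```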

Please note that in this particular sample both approaches lead to the same conclusion: with and without the Bonferroni adjustment we rejected the null hypothesis for the same 11 combinations. The adjusted *p*-values are, however, considerably larger, and in general the Bonferroni correction can change the conclusion for borderline comparisons.

The **Tukey Multiple-Comparison Method**, also known as **Tukey’s HSD (honest significant difference) test** is based on the **studentized range distribution**, sometimes referred to as the **\(q\)-distribution**. The \(q\)-distribution is a right-skewed probability density curve, with two parameters, \(\kappa\) and \(\nu\), describing its shape. These parameters are given by

\[\kappa = k\] and \[\nu = n-k\text{,}\]

where \(n\) is the total number of observations and \(k\) the number of groups/classes.
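R provides the studentized range distribution through the `ptukey()` and `qtukey()` functions. For our example (\(k = 6\) groups and \(n = 275\) observations) the critical value at the 95% level can be looked up directly (a quick sketch):

```
k <- 6
n <- 275
# 95% quantile of the q-distribution with kappa = k and nu = n - k
q.crit <- qtukey(p = 0.95, nmeans = k, df = n - k)
q.crit
ptukey(q.crit, nmeans = k, df = n - k)  # round trip: returns 0.95
```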

Tukey’s test compares the means of every group to the means of every other group. It obtains the confidence interval for each

\[\mu_i-\mu_j\text{.}\]

If the confidence interval for a pairwise comparison includes 0, \(H_0\) is not rejected and the two means are not considered significantly different. For all other pairs, where the confidence interval does not include 0, \(H_0\) is rejected and the means are significantly different.
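As a sanity check, such a confidence interval can be computed by hand from the ANOVA mean square error and the \(q\)-distribution. The sketch below uses simulated data (since the students sample above is drawn randomly) and assumes the Tukey-Kramer form of the interval, which R's `TukeyHSD()` uses to accommodate possibly unequal group sizes:

```
set.seed(1)
d <- data.frame(g = factor(rep(c("A", "B", "C"), each = 20)),
                y = rnorm(60, mean = rep(c(0, 1, 3), each = 20)))
fit <- aov(y ~ g, data = d)

mse <- deviance(fit) / df.residual(fit)  # residual mean square
k   <- nlevels(d$g)
n   <- nrow(d)
gm  <- tapply(d$y, d$g, mean)            # group means
ng  <- table(d$g)                        # group sizes

# Tukey-Kramer interval for mu_B - mu_A
q.crit <- qtukey(0.95, nmeans = k, df = n - k)
half   <- unname(q.crit / sqrt(2) * sqrt(mse * (1 / ng["A"] + 1 / ng["B"])))
dBA    <- unname(gm["B"] - gm["A"])
c(diff = dBA, lwr = dBA - half, upr = dBA + half)

TukeyHSD(fit)$g["B-A", ]  # matches the hand-computed interval
```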

In R, Tukey’s honest significant differences are computed by the `TukeyHSD()` function. The `TukeyHSD()` function expects a model object, specifically a fitted ANOVA model, as input argument. We already calculated such a model above. However, for illustration purposes we build the same model once again and assign it to the variable called `my.anova`. Further, in order to specify the width of the confidence intervals, we provide the confidence level via the `conf.level` argument.

```
my.anova <- aov(salary ~ major, data = data)
alpha <- 0.05
tukey <- TukeyHSD(x = my.anova, conf.level = 1 - alpha)
tukey
```

```
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = salary ~ major, data = data)
##
## $major
## diff lwr upr p adj
## EcFi-Bio 1076.6875 -3933.016 6086.3910 0.9897407
## EnS-Bio -11979.0790 -16569.592 -7388.5661 0.0000000
## MaSt-Bio -158.4205 -4724.592 4407.7511 0.9999986
## PoS-Bio -12105.3247 -16747.203 -7463.4465 0.0000000
## SoS-Bio -18123.4102 -23052.790 -13194.0307 0.0000000
## EnS-EcFi -13055.7665 -18185.374 -7926.1594 0.0000000
## MaSt-EcFi -1235.1080 -6342.943 3872.7275 0.9825011
## PoS-EcFi -13182.0122 -18357.637 -8006.3873 0.0000000
## SoS-EcFi -19200.0977 -24635.062 -13765.1328 0.0000000
## MaSt-EnS 11820.6585 7123.248 16518.0686 0.0000000
## PoS-EnS -126.2457 -4897.280 4644.7890 0.9999996
## SoS-EnS -6144.3312 -11195.521 -1093.1411 0.0073409
## PoS-MaSt -11946.9042 -16694.523 -7199.2852 0.0000000
## SoS-MaSt -17964.9897 -22994.069 -12935.9107 0.0000000
## SoS-PoS -6018.0855 -11116.001 -920.1697 0.0103798
```

The table/output shows the difference between each pair, the 95% confidence intervals and the *p*-value of the pairwise comparisons. Look carefully at the table and you will see that for all those 4 comparisons where the confidence interval includes 0, the *p*-value is higher than the significance level \(\alpha\). If \(p > \alpha\) we do not reject \(H_0\); thus there is no statistically significant difference between the means of the two groups. In contrast, for all pairs where \(p \le \alpha\), we reject \(H_0\) and state that there is a statistically significant difference between the means of these pairs. The `TukeyHSD()` function provides a nice plotting functionality, which visualizes the confidence intervals for each pair.

```
par(mar = c(4,6,4,4))
plot(tukey, las = 1, col = "brown" )
```