# Measures of spread

The five-number summary gives extensive information about where the data in a data set lie, but sometimes it is convenient to characterize a data set with just two numbers. In that case, complementing a measure of location with a measure of spread (or variation) is an appropriate choice. The data sets {10, 30, 50, 70, 90} and {40, 45, 50, 55, 60} both have mean = median = midrange = 50, but they differ in how much the data are spread out. Several statistics characterize the amount of spread:

## Range

The range is the extent of the data set; it is defined as the maximum minus the minimum. For data sets A and B the range is 7 and 8, respectively. For the weights of students the range is 235 - 105 = 130. Note that the range is a single number, the difference between the maximum and minimum, *not* an ordered pair specifying the minimum and maximum. Note also that the range is determined by the extreme individuals alone, hence it does not measure how far a typical data value is from the middle.
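As a quick illustration, the range can be computed in a couple of lines of Python. This is only a sketch: the function name `data_range` is made up for this example, and data set A is taken to be {2, 3, 5, 8, 9}, the values used in the variance calculation later in this section.

```python
def data_range(values):
    """Return the range: a single number, maximum minus minimum."""
    return max(values) - min(values)

# The two data sets from the introduction share the same centre (50)
# but have very different ranges.
wide = [10, 30, 50, 70, 90]
narrow = [40, 45, 50, 55, 60]
data_set_a = [2, 3, 5, 8, 9]  # assumed contents of data set A

print(data_range(wide))        # 80
print(data_range(narrow))      # 20
print(data_range(data_set_a))  # 7
```

Only the two extreme values enter the calculation, which is why the range says nothing about how the remaining values are distributed.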

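You only need to know the maximum and minimum to calculate the midrange and range. Conversely, if you know both the midrange and the range, you can recover the maximum and minimum: the maximum is the midrange plus half the range, and the minimum is the midrange minus half the range.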
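The round trip between (minimum, maximum) and (midrange, range) can be sketched as follows. The function names are hypothetical, and the numbers are the minimum and maximum of the student weights quoted above (105 and 235).

```python
def midrange_and_range(minimum, maximum):
    """Compute the midrange and the range from the two extremes."""
    return (minimum + maximum) / 2, maximum - minimum

def extremes_from(midrange, data_range):
    """Recover (minimum, maximum) from the midrange and the range."""
    half = data_range / 2
    return midrange - half, midrange + half

mid, rng = midrange_and_range(105, 235)
print(mid, rng)                 # 170.0 130
print(extremes_from(mid, rng))  # (105.0, 235.0)
```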

## Inter-quartile range (IQR)

The inter-quartile range is defined as Q3 - Q1. For the weights of students the inter-quartile range is 175 - 130 = 45. Note that the inter-quartile range is a single number and not the ordered pair consisting of the quartiles. [Different definitions for the quartiles will produce different inter-quartile ranges.] Since Q3 is the middle of the data above the median and Q1 is the middle of the data below the median, Q3 - Q1 = (Q3 - Q2) + (Q2 - Q1) is roughly twice the typical distance of a datum from the median. (The semi-interquartile range has been defined as half the inter-quartile range, but is rarely used.)
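The following sketch computes the inter-quartile range using the description given here (Q1 and Q3 as the medians of the data below and above the median), and then again with the `inclusive` method of Python's `statistics.quantiles`, to show that a different quartile definition can give a different answer. The helper name `quartiles_by_halves` is made up for this example, and data set A is assumed to be {2, 3, 5, 8, 9}.

```python
from statistics import median, quantiles

def quartiles_by_halves(values):
    """Q1 and Q3 as the medians of the data below and above the median."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # for odd n the median itself is left out
    upper = ordered[-half:]
    return median(lower), median(upper)

data_set_a = [2, 3, 5, 8, 9]  # assumed contents of data set A

q1, q3 = quartiles_by_halves(data_set_a)
iqr_halves = q3 - q1
print(q1, q3, iqr_halves)             # 2.5 8.5 6.0

# A different quartile definition (linear interpolation, end points included)
q1_inc, _, q3_inc = quantiles(data_set_a, n=4, method="inclusive")
iqr_inclusive = q3_inc - q1_inc
print(q1_inc, q3_inc, iqr_inclusive)  # 3.0 8.0 5.0
```

Neither result is wrong; when reporting an inter-quartile range it helps to say which quartile definition was used.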

Note that knowing the median and the inter-quartile range does not let you calculate the first or third quartile (or the minimum or maximum).

## (Variance and) standard deviation

There are several approaches to measuring the average distance from the mean. A first notion might be $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})$, where there are $n$ data points. However, it is readily verified that this quantity is always 0, hence it is not of much use. Negative distances cancelling out positive distances can be avoided by employing the absolute value: $\frac{1}{n}\sum_{i=1}^{n}|x_i - \bar{x}|$; this quantity is called the mean deviation (or the mean absolute deviation). It is a nice concept, but is not suitable for many mathematical manipulations, hence is not widely employed.
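Both claims are easy to check numerically. The sketch below uses data set A, assumed to be {2, 3, 5, 8, 9} (the values that appear in the variance calculation in this section); the variable names are chosen for this example only.

```python
import math

data_set_a = [2, 3, 5, 8, 9]  # assumed contents of data set A
n = len(data_set_a)
mean = sum(data_set_a) / n    # 5.4

# Signed deviations cancel out: their average is zero (up to rounding error).
mean_signed_deviation = sum(x - mean for x in data_set_a) / n
print(math.isclose(mean_signed_deviation, 0.0, abs_tol=1e-12))  # True

# Absolute deviations do not cancel: this is the mean (absolute) deviation.
mean_abs_deviation = sum(abs(x - mean) for x in data_set_a) / n
print(round(mean_abs_deviation, 2))  # 2.48
```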

Another way to avoid negative summands is to square them. $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ is called the variance, which is denoted by $s^2$. [The reason for dividing by $n-1$ rather than $n$ is that this is the estimate for the variance of a population based on a sample; if we had divided by $n$ we would still have called it the variance, but denoted it by $\sigma^2$, where $\sigma$ is the lowercase Greek letter sigma.] Evaluating this expression for data set A yields ((2-5.4)^2 + (3-5.4)^2 + (5-5.4)^2 + (8-5.4)^2 + (9-5.4)^2)/4 = 9.3. Because the deviations were squared, the variance is expressed in squared units and is not itself a good measure of the average distance from the mean, but its square root, 3.05, is (taking the square root essentially undoes the previous squaring). The square root of the variance is called the standard deviation, and is denoted by $s$. For the weights of students the variance is 881.77, and the standard deviation is 29.69.
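The worked calculation for data set A can be reproduced with a short Python sketch. The function `sample_variance` is written out by hand for illustration; the standard library's `statistics.variance` and `statistics.stdev` use the same n-1 divisor, while `statistics.pvariance` divides by n.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

data_set_a = [2, 3, 5, 8, 9]  # the values used in the calculation above

s_squared = sample_variance(data_set_a)
s = math.sqrt(s_squared)
print(round(s_squared, 2))  # 9.3
print(round(s, 2))          # 3.05

# Cross-check against the standard library.
print(round(statistics.variance(data_set_a), 2))   # 9.3  (divides by n - 1)
print(round(statistics.stdev(data_set_a), 2))      # 3.05
print(round(statistics.pvariance(data_set_a), 2))  # 7.44 (divides by n)
```

Dividing by n instead of n-1 always gives the smaller of the two values, and the difference shrinks as the sample grows.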

Competencies: For the data set {2 5 9 4 6 7 6 8 8}, calculate the variance, standard deviation, range, and inter-quartile range.

Reflection: For the above data set, which of the above statistics best describes the spread of the data?

Challenge: Is the variance always greater than the standard deviation? Is the interquartile range always greater than the standard deviation? When will the variance, standard deviation, range, and interquartile range be equal?