Statistics that Measure Central Tendency
Mean
You have probably used the mean since elementary school,
where it was called the average. The mean
(or average) of a collection of numbers is
computed by adding the numbers and dividing by the number of
numbers. For example, the mean of the numbers 2,3,3,4,5,6
is 23/6=3.8, rounded to the nearest tenth. In formula
form, the mean of n numbers, x1, x2,
..., xn, is given by the sum of the numbers (x's)
divided by n, the number of numbers, or
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n}$$
For a data set presented as numbers
together with the frequency of occurrence of each number, as
in the next table, the computation of the mean is slightly
modified.
Number | Frequency
2 | 2
3 | 6
4 | 7
5 | 3
7 | 3
9 | 2
Add another column consisting of each
number multiplied by the frequency of occurrence of that
number to the table. Then find the sum of this column as shown:
Number | Frequency | Number*Frequency
2 | 2 | 4
3 | 6 | 18
4 | 7 | 28
5 | 3 | 15
7 | 3 | 21
9 | 2 | 18
Sum of (Numbers*Frequencies) = 104
The mean is the (Sum of
Numbers*Frequencies)/(Sum of Frequencies). In the
example the sum of the frequencies is 23, so the mean is
104/23=4.5. In formula form, the mean of numbers x which
occur with frequency f is given by
$$\bar{x} = \frac{\sum x f}{\sum f}$$
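The frequency-table computation can be sketched in a few lines of Python. This is only an illustration: the function name mean_from_frequencies and the list-of-pairs layout are assumptions made for this example, not part of Webstat.

```python
def mean_from_frequencies(table):
    """Mean of values given as (number, frequency) pairs."""
    total = sum(x * f for x, f in table)  # sum of Number*Frequency
    count = sum(f for _, f in table)      # sum of Frequencies
    return total / count

# Frequency table from the example above.
table = [(2, 2), (3, 6), (4, 7), (5, 3), (7, 3), (9, 2)]
print(round(mean_from_frequencies(table), 1))  # 104/23, about 4.5
```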
The mean is easy to compute, and as
mentioned above, you have
probably used it before, but it has one major drawback--an extremely
large or small number will cause a larger than desired
change in the mean. For example the mean of
2,3,4,5, and 6 is 4. However, if another number, say 20, is added to the set, the mean of the new
set of numbers, 2,3,4,5,6, and 20 is now 40/6=6.7.
Certainly the mean should increase but increasing from 4 to
6.7 might be considered to be too much of a change.
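A short Python sketch makes the effect of one extreme value easy to see. The helper name mean is chosen just for this illustration.

```python
def mean(numbers):
    """Add the numbers and divide by how many there are."""
    return sum(numbers) / len(numbers)

base = [2, 3, 4, 5, 6]
with_outlier = base + [20]  # the same numbers plus one large value

print(mean(base))                    # 4.0
print(round(mean(with_outlier), 1))  # 6.7
```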
In presenting housing prices in the newspaper the mean price of a
home is usually not used, simply
because the mean is made too high by the relatively few expensive
homes in a typical community. Median home prices are
used instead of mean home prices. The
next section discusses the median.
At the bottom of this page is a link
to the FOCUS dataset. Open it and under the STAT
menu you will find a choice called Summary
Stats. Use that to find the mean of each
variable in the FOCUS dataset. Other descriptive statistics introduced below are also
computed for each of the variables. Make a
histogram of each variable and see how the descriptive
statistics relate to the shape of the histogram.
Also, you can verify the example computations in these
notes by opening Webstat--push the orange button at
the bottom of this page, select the Clear Data
choice under the Data menu, and type the numbers for
which you want the mean or other statistics into
a column. Once you have the numbers in a column, you
can make any of the Webstat graphs and compute any
numerical statistics on your numbers by selecting
Histogram under the Graphs menu and Summary Stats
under the Stats menu.
Median
The
median of a collection of numbers is, in a certain sense, the
'middle' number of that set. For example the median of
the numbers 2,3,4,5,8 is 4 because 4 is the 'middle' number.
The numbers 2,3,4,5,8,10 don't have a single middle
value. What is the median of them? It is
defined as the average of
the two middle numbers, 4 and 5. The median is then
(4+5)/2=4.5.
The process for computing the median
of a set of n numbers is:
- Sort the numbers and arrange them from
smallest to largest.
- Consider the smallest number to be in position 1, the next number in the sorted list
to be in position 2, the next in position 3, etc.
- The median will be the number in
position (n+1)/2. If (n+1)/2 is a whole number,
the median will be the number lying in that position.
If (n+1)/2 is a fraction, say 7.5, the median will be
the average of the two numbers in positions 7 and 8.
Example: Find the median of the numbers
2,3,1,4,4,5,7,2,3, and 8.
- In sorted order the numbers are
1,2,2,3,3,4,4,5,7,8
- The numbers with their positions are
Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Number | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 5 | 7 | 8
- The median is the number in position
(10+1)/2=5.5. Since 5.5 is not a whole number, the
median is the average of the numbers in positions 5 and
6, or the average of 3 and 4 which equals 3.5. The
median is 3.5.
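The three steps of the median process can be sketched in Python as follows. This is a minimal illustration of the (n+1)/2 position rule; the function name median is chosen for this example.

```python
def median(numbers):
    """Median using the (n+1)/2 position rule."""
    ordered = sorted(numbers)           # step 1: sort smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    position = (n + 1) / 2              # steps 2-3: positions start at 1
    if position.is_integer():
        return ordered[int(position) - 1]
    # Fractional position such as 5.5: average the numbers in positions 5 and 6.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

print(median([2, 3, 1, 4, 4, 5, 7, 2, 3, 8]))  # 3.5
```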
Mode
The
mode is the number that occurs most frequently. For the
set of numbers 2,3,4,5,5,6, the mode is 5. The set of
numbers 2,3,4,5,5,6,6 has two modes, 5 and 6. It is
bimodal. However, when all numbers in a set occur with
the same frequency, the set of numbers has no mode. For
example, the numbers 2,2,3,3,4,4,5,5 have no mode.
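The mode rules above can be sketched in Python. The function name modes is hypothetical, and returning an empty list for "no mode" is a choice made for this illustration. A set containing only one distinct value is treated here as having that value as its mode, which is an assumption the text does not address.

```python
from collections import Counter

def modes(numbers):
    """Return the list of modes, or an empty list when there is no mode."""
    counts = Counter(numbers)
    if not counts:
        return []
    highest = max(counts.values())
    # Several distinct values that all occur equally often: no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(x for x, c in counts.items() if c == highest)

print(modes([2, 3, 4, 5, 5, 6]))        # [5]
print(modes([2, 3, 4, 5, 5, 6, 6]))     # [5, 6]
print(modes([2, 2, 3, 3, 4, 4, 5, 5]))  # []
```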
Quartiles and Percentiles
The
median divides a set of numbers into halves. Quartiles
divide a set of numbers into quarters and percentiles divide a
set of numbers into hundredths. You may have taken achievement tests
in school and received your result in the form of a percentile
score.
If you were told that you were at the 92nd percentile, then
92% of the test scores were equal to or lower than your score and
8% of the test scores were equal to or higher than your score.
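As a rough sketch, the "equal to or lower" reading of a percentile can be computed like this in Python. The function name percentile_rank and the made-up list of 100 scores are assumptions for illustration only; testing companies may use somewhat different conventions.

```python
def percentile_rank(scores, score):
    """Percent of scores that are equal to or lower than the given score."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

# Hypothetical data: 100 test scores, the whole numbers 1 through 100.
scores = list(range(1, 101))
print(percentile_rank(scores, 92))  # 92.0
```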
There are three quartiles for a set of
numbers, the 1st quartile, denoted by Q1, the 2nd quartile
denoted by Q2, and the 3rd quartile denoted by Q3. The
2nd quartile is also usually called the median, and you have seen how
to compute it. The quartiles divide the dataset
into quarters. To compute the 1st quartile, Q1, simply find
the median of all numbers in the dataset that are less than or
equal to the median. To compute the 3rd quartile, Q3,
find the median of all numbers in the dataset that are greater
than or equal to the median.
Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Number | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 5 | 7 | 8
The median of the numbers in the table just
above was found to be the average of the numbers in positions 5 and
6, that is (3+4)/2=3.5. Then the 1st quartile is the
median of the numbers that are less than or equal to 3.5, that
is the median of 1,2,2,3,3. These numbers are sorted and
the positions are the same as in the last table. Since
there are 5 numbers, the median is the
number in position (5+1)/2=3, and this number is 2. Q1=2. The
3rd quartile is the median of the numbers greater than or
equal to 3.5, or the median of 4,4,5,7,8. Again, since
there are 5 numbers here, the median of this set of 5 numbers
is the number in position 3, that is 5. Q3=5.
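The quartile rule used in these notes can be sketched in Python with the standard library's statistics.median. The function name quartiles is chosen for this example. Keep in mind that other software may use a different quartile convention and give slightly different answers.

```python
from statistics import median

def quartiles(numbers):
    """Q1, Q2, Q3 using the rule in these notes."""
    q2 = median(numbers)
    lower = [x for x in numbers if x <= q2]  # numbers less than or equal to the median
    upper = [x for x in numbers if x >= q2]  # numbers greater than or equal to the median
    return median(lower), q2, median(upper)

print(quartiles([2, 3, 1, 4, 4, 5, 7, 2, 3, 8]))  # (2, 3.5, 5)
```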
Resources
A demonstration page for descriptive statistics showing the relationship
between the histogram of a set of numbers and the corresponding descriptive statistics is
found by following this link
to a page designed by Eric Scheide. The following display shows the
page.
Statistics that Measure Variability
Range
The
range of a set of numbers equals the largest number minus the
smallest number. The range of the numbers 3,5,9,9,10,13
is 13-3=10. Like the mean, the range has the disadvantage of
changing by too much when an extremely large or small
value is added to a dataset. The next statistic, the interquartile range,
does not have this drawback.
Interquartile Range (IQR)
The
interquartile range is the third quartile minus the first
quartile, IQR=Q3-Q1. For the set of numbers
1,2,2,3,3,4,4,5,7,8 in the examples above, Q1 was found to be
2 and Q3 was found to be 5. Thus the interquartile range
is 5-2=3. Compare this with the range=8-1=7.
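Here is a small Python sketch of the range and the interquartile range, using the quartile rule from these notes. The function names are chosen for this illustration. The data are the six numbers from the range example and the ten numbers from the median example, and the value 100 is added only to show how each statistic reacts to an extreme value.

```python
from statistics import median

def value_range(numbers):
    """Largest number minus smallest number."""
    return max(numbers) - min(numbers)

def interquartile_range(numbers):
    """Q3 - Q1, with quartiles found by the rule in these notes."""
    med = median(numbers)
    q1 = median([x for x in numbers if x <= med])
    q3 = median([x for x in numbers if x >= med])
    return q3 - q1

range_data = [3, 5, 9, 9, 10, 13]
iqr_data = [2, 3, 1, 4, 4, 5, 7, 2, 3, 8]

print(value_range(range_data))               # 10
print(value_range(range_data + [100]))       # 97
print(interquartile_range(iqr_data))         # 3
print(interquartile_range(iqr_data + [100])) # 3.0
```

In this example one extreme value changes the range from 10 to 97, while the interquartile range stays at 3.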
Standard Deviation
The measure of variability used most often
is called the standard
deviation. The standard deviation is roughly the square root
of the average of the squared deviations from the mean. The formula for
the standard deviation of x1, x2,
..., xn is
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$$
where x-bar is the mean of the numbers.
As an example consider the numbers
2,3,4,5,6. The mean is 4. Then the differences
between each of the numbers and the mean are (2-4)=-2,
(3-4)=-1, (4-4)=0, (5-4)=1, and (6-4)=2, respectively.
The formula indicates that these numbers must be squared and
added. The squares are 4,1,0,1, and 4, and the sum is
10. Finally the formula directs you to divide this sum
by the number of numbers-1, i.e. n-1, and take the square
root. This results in the square root of 10/4 or the
square root of 2.5 which is approximately 1.58.
The square of the standard deviation is
called the variance of the set of numbers. The
variance has the drawback that its units are the square
of the units of the numbers used to
compute it. For example, if the units of the
numbers shown in the last example are inches, the units of
the variance are square inches.
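The worked example for the numbers 2,3,4,5,6 can be followed step by step in Python. This is a minimal sketch with function names chosen for illustration. It divides by n-1, as the example does.

```python
import math

def sample_variance(numbers):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(numbers)
    mean = sum(numbers) / n
    squared_deviations = [(x - mean) ** 2 for x in numbers]
    return sum(squared_deviations) / (n - 1)

def sample_standard_deviation(numbers):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(numbers))

data = [2, 3, 4, 5, 6]
print(sample_variance(data))                      # 2.5
print(round(sample_standard_deviation(data), 2))  # 1.58
```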
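An easier formula for computing the standard
deviation is
$$s = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n-1}}$$
and the easy formula for computing standard deviation for
numbers, x, given along with frequencies, f, is
$$s = \sqrt{\frac{\sum x^2 f - (\sum x f)^2 / n}{n-1}}$$
where n is the sum of the frequencies.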
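The Python sketch below compares two ways of computing the sample standard deviation from a frequency table. The shortcut shown, which uses only the sums of x*f and x*x*f, is one common algebraically equivalent form and is an assumption here, as are the function names. The data are the frequency table from the mean section.

```python
import math

def sd_from_deviations(table):
    """Sample SD from (number, frequency) pairs, using deviations from the mean."""
    n = sum(f for _, f in table)
    mean = sum(x * f for x, f in table) / n
    return math.sqrt(sum(f * (x - mean) ** 2 for x, f in table) / (n - 1))

def sd_shortcut(table):
    """Same quantity from running sums only (assumed shortcut form)."""
    n = sum(f for _, f in table)
    sum_xf = sum(x * f for x, f in table)
    sum_x2f = sum(x * x * f for x, f in table)
    return math.sqrt((sum_x2f - sum_xf ** 2 / n) / (n - 1))

table = [(2, 2), (3, 6), (4, 7), (5, 3), (7, 3), (9, 2)]
print(sd_from_deviations(table))
print(sd_shortcut(table))
```

Both functions should print the same value apart from tiny rounding differences. The shortcut is convenient by hand because it never needs the individual deviations.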
Other Statistics and Displays
Boxplots (Also called Box and Dot or Box and
Whisker Plots)
A boxplot displays the center (as
given by the median) of a dataset, the range, and the
quartiles. The next picture shows two boxplots,
one of the SAT Verbal and the other the SAT Math
scores from the FOCUS dataset.
The white line in the box lies above
the median value for that variable. You can see
that the median SAT Verbal score is around 460 and the
median SAT Math score is about 540. The left
side of the box lies above the 1st quartile and the
right side of the box is positioned above the 3rd
quartile of the variable. So for SAT Math the
first quartile is about 460 while the third quartile
is approximately 590. Since 25% of the data
values are less than the first quartile and 25% of the
data values are greater than the third quartile, the
boxes indicate the range of values in which the middle
50% of the numbers lie. From the above graph you
can see that the middle 50% of the SAT Math values are
more spread out than the middle 50% of SAT Verbal
scores. The horizontal line from the right of
each box stops at the short vertical line
positioned above the largest number for that variable,
and the horizontal line from the left of each box
stops at the short vertical line over the smallest
value for the variable. The distance from the
smallest value to the largest value, the range, is
shown in the graph.
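The five numbers that a boxplot of this kind marks (smallest value, 1st quartile, median, 3rd quartile, largest value) can be computed with a short Python sketch. The function name five_number_summary is chosen for this example, the quartiles follow the rule from these notes, and the data are the ten numbers from the median example rather than the SAT scores.

```python
from statistics import median

def five_number_summary(numbers):
    """Smallest value, Q1, median, Q3, largest value."""
    med = median(numbers)
    q1 = median([x for x in numbers if x <= med])  # left side of the box
    q3 = median([x for x in numbers if x >= med])  # right side of the box
    return min(numbers), q1, med, q3, max(numbers)

print(five_number_summary([2, 3, 1, 4, 4, 5, 7, 2, 3, 8]))  # (1, 2, 3.5, 5, 8)
```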
The boxplot displays variability,
center, and shape of a dataset. In the above
graph of SAT Math and Verbal scores, you can see that
both variables have approximately the same amount of
variability, the center of the SAT Math scores is
greater than the center of the verbal scores, and both
of the variables have an approximately symmetric
shape. The next boxplot of the billionaires92
wealth variable shows a dataset that is strongly
skewed to the right. Even the position of the
median within the box shows a right skew for the
middle 50% of the wealth data.
What is the relationship between the
histogram and the boxplot of a set of numbers?
To experiment with histograms and the corresponding
boxplots open this link.
When the link opens select Relative Frequency in the
left dropdown menu and Boxplot from the right dropdown
menu. Then, by pointing at the axis with your
mouse cursor and clicking, you can add numbers.
The vertical red bars show the histogram of the
numbers that you have added and the horizontal red
display below the histogram shows the boxplot that
goes with the histogram. Try various shaped
histograms and see how the boxplot corresponds with
the histogram.
Standard Scores
Suppose you and a friend are both taking
Statistics 1 but are in different sections. You both take a
midterm examination and wish to compare your performances on the
exam. You received a score of 80 in a section that had a
mean of 76 and a standard deviation of 5, while your friend
received a score of 76 in a section that had a mean of 66 and a
standard deviation of 8. Who performed better? In
order to determine this, the scores need to be placed on the same
footing, that is, modified as if they both came from a test with
the same mean and standard deviation. This can be done by
subtracting the mean of the section and dividing by the standard
deviation of the section. That is (x-mean)/(standard
deviation) is computed for each score. For your score
of 80 this results in (80-76)/5=0.8 while for your friend's score you
get (76-66)/8=1.25. This means that your friend had a better
performance.
The standard score corresponding to a number x, denoted by z,
is given by the next formula:
$$z = \frac{x - \bar{x}}{s}$$
where x is the actual score, x-bar is the mean of the set of numbers,
and s is the standard
deviation of the numbers. The standard score indicates how
many standard deviations above the mean (if z is positive) or below the
mean (if z is negative) the number, x, falls.
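The midterm comparison can be checked with a few lines of Python. The function name standard_score is chosen for this illustration.

```python
def standard_score(x, mean, standard_deviation):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / standard_deviation

yours = standard_score(80, 76, 5)
friend = standard_score(76, 66, 8)

print(yours)           # 0.8
print(friend)          # 1.25
print(friend > yours)  # True
```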
Sample and Population Statistics
All
of the statistics used above apply to samples--they are
called sample statistics. The related statistics
for populations are slightly different. The
following notations and differences in formulas apply:
Descriptive measures for a
population are called parameters of the population
while related measures for a sample are called
statistics of the sample.
- The size of a sample is usually
denoted by n while the size of the population is
given by N.
- The sample mean is written as
x-bar while the population mean is usually denoted
by µ.
- The sample standard deviation is
called s and the population standard deviation is
called σ (sigma).
- The formula for sample standard
deviation is
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$$
but the formula for population
standard deviation is
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
There are two differences.
First, the sample mean is replaced by the
population mean. This isn't surprising.
The second difference is harder to explain: the divisor
for the population standard deviation
is N, while the divisor for the sample standard deviation is
n-1. There is a
good statistical reason for the difference but
that reason will be left to another statistics
course. You should simply use the formula
that is
appropriate for the situation. If you are
told that you have a population, use the second
formula, and for a sample use the first formula.
An easier-to-use formula for population standard
deviation is
$$\sigma = \sqrt{\frac{\sum x^2}{N} - \mu^2}$$
If the numbers are given along
with frequencies the formula to use is
$$\sigma = \sqrt{\frac{\sum x^2 f}{N} - \mu^2}$$
where N is the sum of the
frequencies.
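To see the difference between the two divisors, the sketch below uses Python's standard statistics module: statistics.stdev divides by n-1 (sample) and statistics.pstdev divides by N (population). The function population_sd_shortcut is a hypothetical helper that uses one common shortcut form, the square root of (sum of x squared)/N minus the squared mean. That form is an assumption of this example.

```python
import math
import statistics

data = [2, 3, 4, 5, 6]

s = statistics.stdev(data)       # sample formula, divisor n - 1
sigma = statistics.pstdev(data)  # population formula, divisor N

def population_sd_shortcut(numbers):
    """Population SD from the sum of squares and the mean (assumed shortcut form)."""
    N = len(numbers)
    mu = sum(numbers) / N
    return math.sqrt(sum(x * x for x in numbers) / N - mu ** 2)

print(round(s, 2))                             # 1.58
print(round(sigma, 2))                         # 1.41
print(round(population_sd_shortcut(data), 2))  # 1.41
```

The same five numbers give a slightly smaller value when treated as a population, because the sum of squared deviations is divided by 5 instead of 4.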
Resources
See Section 3.5 in the Weiss textbook.
To work with the entire Focus Database
from within WebStat use the next link.