Tuesday, February 2, 2016

Baseball and Stats: What is average?

Baseball is a numbers game. People love to throw around stats. However, to get a better understanding of stats, you need to understand statistics. In this post, I'd like to discuss averages.

How many home runs did the average Major League team hit in 2015? You can find this information on Baseball Reference.

In everyday language, average means typical. In statistics, we're looking for the single number that best describes an entire set of data. There are several ways to come up with an average.

The most commonly used method is the mean. A mean is calculated by adding up each item in a data set and dividing the sum by the number of items in the set. In 2015, Major League teams had a mean of 163.633 home runs. You can't hit .633 of a home run, so we round up to 164.
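
As a rough sketch, here is how that calculation might look in Python. The list holds only the eight team totals quoted in this post, not the full 30-team data set, so the result differs from the league-wide mean. The names `sample_home_runs` and `mean` are my own.

```python
# Eight team totals quoted in this post (NOT the full 30-team data set).
sample_home_runs = [232, 177, 177, 167, 161, 156, 136, 100]


def mean(values):
    """Add up every item and divide the sum by the number of items."""
    return sum(values) / len(values)


sample_mean = mean(sample_home_runs)
print(sample_mean)         # 163.25
print(round(sample_mean))  # 163
```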

A second method is the median. A median is found by arranging the data in order and then finding the middle item. Since there is an even number of teams, we look at the two middle items and find the midpoint between them, which is simply the mean of the two. For 2015, the median teams are the Red Sox (161 home runs) and the Twins (156). The median comes out to 158.5, which we round up to 159.
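
Here is a minimal sketch of the same steps in Python. The function name `median` and the sample list are my own, and the list contains only eight totals quoted in this post rather than all 30 teams.

```python
def median(values):
    """Sort the data and return the middle item.

    With an even number of items, return the midpoint of the two middle items.
    """
    ordered = sorted(values)
    count = len(ordered)
    middle = count // 2
    if count % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


# Eight team totals quoted in this post (NOT the full 30-team data set).
sample_home_runs = [232, 177, 177, 167, 161, 156, 136, 100]
print(median(sample_home_runs))  # 164.0

# The two middle teams of the full 2015 data set, per the post.
print(median([161, 156]))  # 158.5
```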

Mode is a third method. The mode is the item that appears most often in the data set. A data set can have more than one mode. The modes of our data are 177 (the Mets and the Nationals), 167 (the Reds and the Rays), and 136 (the White Sox and the Giants). Because we have three modes, our data set is called trimodal or multimodal. Note that all three modes would be reported. It wouldn't be right to find an average among them and call it the mode.
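
Finding every mode, rather than just one, could be sketched like this in Python. The function name `modes` and the sample list are my own. The list holds the six mode teams plus the Blue Jays and Braves, not the full data set.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest count, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


# Totals quoted in this post (NOT the full 30-team data set).
sample_home_runs = [232, 177, 177, 167, 167, 136, 136, 100]
print(modes(sample_home_runs))  # [136, 167, 177]
```

Note that this sketch reports all tied values. If every value appeared exactly once, it would return the whole list.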

Finally, there's the mid-range. The mid-range is the point equally distant from the highest number and lowest number in the data set. To find the mid-range, find the mean of the smallest and largest number in the set. For us, that's 232 (the Blue Jays) and 100 (the Braves). We get a mid-range of 166.
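
In Python, a small sketch of the mid-range might look like this (the function name `mid_range` is my own):

```python
def mid_range(values):
    """Return the mean of the smallest and largest values."""
    return (min(values) + max(values)) / 2


# The highest and lowest 2015 team totals quoted above.
print(mid_range([232, 100]))  # 166.0
```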

Each of these methods gives us a different answer (or in the case of the mode, multiple different answers). How can we determine which one best describes the data set?

One method is to look at mean absolute deviation (MAD). To find MAD, you subtract each data point from the average you're examining (usually the mean or median). Next, you add the absolute values of all the results together. The absolute value is the distance of a number from 0. For example, both 5 and -5 have an absolute value of 5. Finally, the total is divided by the number of items in the set. Here are the results for each method:

Mean    Median  Mode 1 (177)  Mode 2 (167)  Mode 3 (136)  Mid-range
25.21   25.03   28.23         25.43         31.5          25.37
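
For anyone who wants to try the MAD calculation, here is a minimal Python sketch. The function name `mean_absolute_deviation` and the sample list are my own. The list holds only eight totals quoted in this post, so the results will not match the 30-team figures above.

```python
def mean_absolute_deviation(values, center):
    """Average distance between each data point and the chosen average."""
    return sum(abs(center - value) for value in values) / len(values)


# Eight team totals quoted in this post (NOT the full 30-team data set).
sample_home_runs = [232, 177, 177, 167, 161, 156, 136, 100]

# Compare two candidate averages for this small sample.
print(mean_absolute_deviation(sample_home_runs, 164))  # 25.0 (the sample's median)
print(mean_absolute_deviation(sample_home_runs, 177))  # 27.5 (the sample's mode)
```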

A smaller MAD means that, overall, the measure fits the data set better. The modes fit the data set the least well, and the mid-range is only slightly better. Next comes the mean. According to MAD, the median best describes the entire set.

A second method is mean squared error (MSE). To find MSE, find the difference between each data point and the average you're examining, then square the difference. Add the results up and divide the sum by the number of items in the data set.

Mean    Median  Mode 1 (177)  Mode 2 (167)  Mode 3 (136)  Mid-range
979.0   1005.3  1157.6        990.3         1742.6        984.6
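
The MSE steps can be sketched in Python as well. The function name `mean_squared_error` and the sample list are my own, and the list again holds only eight totals quoted in this post, so the numbers differ from the 30-team results above.

```python
def mean_squared_error(values, center):
    """Average of the squared differences between each data point and the chosen average."""
    return sum((value - center) ** 2 for value in values) / len(values)


# Eight team totals quoted in this post (NOT the full 30-team data set).
sample_home_runs = [232, 177, 177, 167, 161, 156, 136, 100]

print(mean_squared_error(sample_home_runs, 163.25))  # 1239.9375 (the sample's mean)
print(mean_squared_error(sample_home_runs, 164))     # 1240.5 (the sample's median)
```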

According to MSE, the most accurate measure is the mean, followed by the mid-range and then our second mode (167). After that come the median and the other two modes. What happened? How did the median drop from first place to fourth?

Because MSE squares each difference, larger deviations are penalized more harshly than small deviations. As an example, let's compare the Red Sox's deviation from the mean (about 3 home runs) and the Blue Jays' (about 68 home runs). Let's see how much they each contribute to the final total.

Team        MAD     MSE
Red Sox     0.1     0.3
Blue Jays   2.267   154.13
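
These contributions can be reproduced with a short Python sketch. The function names are my own. It uses the rounded deviations of 3 and 68 home runs and divides by the 30 Major League teams.

```python
TEAM_COUNT = 30  # number of Major League teams


def mad_contribution(deviation, count):
    """One team's share of the mean absolute deviation."""
    return abs(deviation) / count


def mse_contribution(deviation, count):
    """One team's share of the mean squared error."""
    return deviation ** 2 / count


for team, deviation in [("Red Sox", 3), ("Blue Jays", 68)]:
    mad_share = round(mad_contribution(deviation, TEAM_COUNT), 3)
    mse_share = round(mse_contribution(deviation, TEAM_COUNT), 2)
    print(team, mad_share, mse_share)
# Red Sox 0.1 0.3
# Blue Jays 2.267 154.13
```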

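As you can see, the same deviation is valued quite differently between MAD and MSE. The Blue Jays' deviation is about 23 times the size of the Red Sox's, and it counts about 23 times as much toward MAD but more than 500 times as much toward MSE. Which is better? It depends on your point of view. MAD is better for measuring total deviation. MSE is better for detecting large deviations.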
