Monday, February 22, 2016

Baseball Forecasting

As spring training is about to begin, people begin to put out forecasts. How are these forecasts put together? There are three main methods.

One method is the wisdom of the crowds. Fangraphs FAN projections is the purest instance of this method. The method is to have many people estimate something and average the estimates together. The idea is that, while people are going to be wrong, the way they're wrong will form a bell curve. So, by getting many estimates, we can get an estimate close to the real number.
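To see why averaging helps, here's a rough simulation in Python. The "true" value, the size of the typical error, and the function name are all made up for illustration, and the sketch assumes the guesses are centered on the truth, which real crowds may not be.

```python
import random

def crowd_estimate(true_value, n_people, error_sd, seed=0):
    """Average n_people guesses whose errors follow a bell curve (Gaussian noise)."""
    rng = random.Random(seed)
    guesses = [rng.gauss(true_value, error_sd) for _ in range(n_people)]
    return sum(guesses) / len(guesses)

# Hypothetical numbers: a "true" slugging percentage of .500 and
# individual guesses that are typically off by about .050.
one_person = crowd_estimate(0.500, 1, 0.050)
big_crowd = crowd_estimate(0.500, 10_000, 0.050)
print(round(one_person, 3), round(big_crowd, 3))
```

With ten thousand guesses the individual errors mostly cancel, so the crowd's average lands much closer to the true value than a single guess usually does.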

A second method is expert opinion. You get together a group of people who really know baseball and have them come to a conclusion. This is how Sports Illustrated does it.

The final and most common method is statistical projection. Stats can be projected by taking into account past performance, the reliability of the data, and the effect of aging on the player.

As an example, I'll forecast Miguel Cabrera's slugging percentage for 2016. The system I'll be using is the Marcels system, which is the simplest system I know of that will give reasonable results.

First, I need to find Cabrera's slugging percentage for each of the past three years.

2015 - .534
2014 - .524
2013 - .636

Next, we take a weighted average. We assign last year's number a weight of 5, the year before a weight of 4, and the year before that a weight of 3.

wSLG = (5 * SLG15 + 4 * SLG14 + 3 * SLG13)/12

(5 * .534 + 4 * .524 + 3 * .636)/12 = .556
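The same step as a small Python sketch. The function name is my own, and it assumes the seasons are listed most recent first.

```python
def weighted_average(values, weights=(5, 4, 3)):
    """Weighted average of per-season values, most recent season first."""
    total = sum(w * v for w, v in zip(weights, values))
    return total / sum(weights)

# Cabrera's SLG for 2015, 2014, 2013
w_slg = weighted_average([0.534, 0.524, 0.636])
print(round(w_slg, 3))  # 0.556
```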

Next, we take a look at his plate appearances for those years to determine how reliable the information is. Those appearances are once again weighted.

(5 * PA15 + 4 * PA14 + 3 * PA13)/(5 * PA15 + 4 * PA14 + 3 * PA13 +1200)

(5 * 511 + 4 * 685 + 3 * 652)/(5 * 511 + 4 * 685 + 3 * 652+1200) = 0.858
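In Python, the reliability rating could be sketched like this. The names are mine; the 1200 is the constant from the formula above.

```python
def reliability(plate_appearances, weights=(5, 4, 3), regression_pa=1200):
    """Share of the projection that comes from the player's own numbers."""
    weighted_pa = sum(w * pa for w, pa in zip(weights, plate_appearances))
    return weighted_pa / (weighted_pa + regression_pa)

# Cabrera's plate appearances for 2015, 2014, 2013
r = reliability([511, 685, 652])
print(round(r, 3))  # 0.858
```

The more plate appearances a player has, the closer the rating gets to 1. A player with no plate appearances gets a rating of 0.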

Next, we need a weighted league average:

wLgSLG = (5 * LgSLG15 + 4 * LgSLG14 + 3 * LgSLG13)/12

(5*.405+4*.386+3*.396)/12 = .396

Now, we multiply the weighted average by the reliability rating, multiply the weighted league average by 1 minus the reliability rating, and add the two products together:

adjSLG = (wSLG * r) + (wLgSLG * (1 - r))

(.556 * .858) + ((1 - .858) * .396) = .533
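Here's that blend as a short Python sketch (the function name is my own). It uses the rounded figures from above.

```python
def regress_to_league(player_rate, league_rate, r):
    """Blend the player's weighted rate with the league's, using reliability r."""
    return player_rate * r + league_rate * (1 - r)

adj_slg = regress_to_league(0.556, 0.396, 0.858)
print(round(adj_slg, 3))  # 0.533
```

A reliability of 1 would return the player's own rate untouched, and a reliability of 0 would return the league average.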

Next, we need to find the factor to account for his age.

ageFactor = 1 - ((age - 29) * .003)

1 - ((32 - 29) * .003) = .991

Finally, we multiply our adjusted number by our age factor:

final = adjSLG * ageFactor

.991 * .533 = .529 (carrying the unrounded intermediate values; the rounded figures shown here multiply to .528)

There are a few things to note. First, the formula for age factor given above is only for players age 30 and older. For younger players, use the following instead:

ageFactor = 1 + ((29 - age) * .006)
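The two age formulas can be folded into one Python function. This is a sketch: I'm assuming the switch happens at age 29, where both formulas give exactly 1, and the function and parameter names are mine.

```python
def age_factor(age, peak_age=29):
    """Multiplier for aging: small penalty past the peak, larger bonus before it."""
    if age > peak_age:
        return 1 - (age - peak_age) * 0.003
    return 1 + (peak_age - age) * 0.006

print(round(age_factor(32), 3))  # 0.991
print(round(age_factor(25), 3))  # 1.024
```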

The other thing to note is that the system is designed to use three years of stats. For players without three years, use the league average for the stat being examined in place of each missing year. However, don't substitute a league average for the missing plate appearances.
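Putting the steps together, here's a minimal Python sketch of the whole calculation. The function names and the use of None for a missing season are my own choices. For a missing year it plugs in the league average for the stat and counts zero plate appearances, which is one reading of the rule above rather than a definitive one.

```python
WEIGHTS = (5, 4, 3)  # most recent season first

def age_factor(age, peak_age=29):
    if age > peak_age:
        return 1 - (age - peak_age) * 0.003
    return 1 + (peak_age - age) * 0.006

def project(seasons, league_rates, age, regression_pa=1200):
    """seasons: (rate, plate_appearances) per year, most recent first; None if missing."""
    rates, pas = [], []
    for season, league_rate in zip(seasons, league_rates):
        if season is None:
            # Assumed reading: league rate for the stat, zero plate appearances.
            rates.append(league_rate)
            pas.append(0)
        else:
            rates.append(season[0])
            pas.append(season[1])
    w_rate = sum(w * x for w, x in zip(WEIGHTS, rates)) / sum(WEIGHTS)
    w_league = sum(w * x for w, x in zip(WEIGHTS, league_rates)) / sum(WEIGHTS)
    w_pa = sum(w * pa for w, pa in zip(WEIGHTS, pas))
    r = w_pa / (w_pa + regression_pa)
    return (w_rate * r + w_league * (1 - r)) * age_factor(age)

league_slg = [0.405, 0.386, 0.396]  # 2015, 2014, 2013
cabrera = project([(0.534, 511), (0.524, 685), (0.636, 652)], league_slg, age=32)
print(round(cabrera, 3))  # 0.529
```

Because nothing is rounded along the way, this reproduces the .529 for Cabrera. A player with no seasons at all ends up with the weighted league average times his age factor.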

Clearly this system could be improved. Rookies could be evaluated using minor league equivalencies and park factors instead of the league average. The weights given to the various years could probably be fine-tuned. The same is probably true of the formula for age factor. In reality, it'd probably do better to have a table of factors.

Saturday, February 6, 2016

Baseball and Stats: Units of Analysis

A unit of analysis is the unit that's being studied. Looking at different units of analysis can give a different view of things.

As an example, let's compare attendance of MLB games in 2014 and 2015. If I use the team as my unit of analysis, things don't seem too good. 57% of teams had lower attendance in 2015 than they did in 2014. If I look at the Major League as a whole, it looks better. Attendance has increased (not by much, but a little). 

Tuesday, February 2, 2016

Baseball and Stats: What is average?

Baseball is a numbers game. People love to throw around stats. However, to get a better understanding of stats, you need to understand statistics. In this post, I'd like to discuss average.

How many home runs did the average Major League team hit in 2015? You can find this information on Baseball Reference.

In everyday language, average means typical. In statistics, we're looking for the single number that best describes an entire set of data. There are several ways to come up with an average.

The most commonly used method is the mean. A mean is calculated by adding up each item in a data set and dividing the sum by the number of items in the set. In 2015, Major League teams had a mean of 163.663 home runs. You can't hit .663 of a home run, so we round up to 164.

A second method is the median. A median is found by arranging the data in order and then finding the middle item. Since there are an even number of teams, we look at the two middle items and find the midpoint between them. This happens to be the mean of the two. For 2015, the median teams are the Red Sox (161 home runs) and the Twins (156). The median comes out to 158.5, which we round up to 159.

Mode is a third method. The mode is the item that appears most often in the data set. A data set can have more than one mode. The modes of our data are 177 (the Mets and the Nationals), 167 (the Reds and the Rays), and 136 (the White Sox and the Giants). Because we have three modes, our data set is called trimodal or multimodal. Note that all three modes would be reported. It wouldn't be right to find an average among them and call it the mode.

Finally, there's the mid-range. The mid-range is the point equally distant from the highest number and lowest number in the data set. To find the mid-range, find the mean of the smallest and largest number in the set. For us, that's 232 (the Blue Jays) and 100 (the Braves). We get a mid-range of 166.
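Here's a quick Python sketch of all four averages using the standard statistics module (multimode needs Python 3.8 or later). The list holds only the ten team totals quoted in this post, not the full 30-team league, so the mean and median it prints differ from the league figures above. The mid_range helper is my own.

```python
from statistics import mean, median, multimode

# Only the ten team totals quoted in this post, not all 30 teams.
home_runs = [232, 177, 177, 167, 167, 161, 156, 136, 136, 100]

def mid_range(data):
    """Point halfway between the smallest and largest values."""
    return (min(data) + max(data)) / 2

sample_mean = mean(home_runs)
sample_median = median(home_runs)
sample_modes = multimode(home_runs)  # every value tied for most frequent
sample_mid_range = mid_range(home_runs)

print(sample_mean, sample_median, sample_modes, sample_mid_range)
# 160.9 164.0 [177, 167, 136] 166.0
```

Note that multimode returns all three modes as a list instead of picking one or averaging them.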

Each of these methods gives us a different answer (or in the case of the mode, multiple different answers). How can we determine which one best describes the data set?

One method is to look at mean absolute deviation (MAD). To find MAD, you subtract each data point from the average you're examining (usually the mean or median). Next you add the absolute values of all the results together. The absolute value is the distance of that number from 0. For example, both 5 and -5 have an absolute value of 5. Finally, the total is divided by the number of items in the set. Here are the results for each method:

Mean     Median   Mode 1   Mode 2   Mode 3   Mid-range
25.21    25.03    28.23    25.43    31.5     25.37
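MAD is easy to sketch in Python. As an illustration I'm using only the ten team totals quoted in this post, not the full league, so the numbers won't match the table above. The function name is mine.

```python
from statistics import mean, median

def mean_abs_dev(data, center):
    """Average distance of each data point from the chosen center."""
    return sum(abs(x - center) for x in data) / len(data)

# Only the ten team totals quoted in this post, not all 30 teams.
home_runs = [232, 177, 177, 167, 167, 161, 156, 136, 136, 100]

mad_mean = mean_abs_dev(home_runs, mean(home_runs))
mad_median = mean_abs_dev(home_runs, median(home_runs))
print(round(mad_mean, 2), round(mad_median, 2))  # 23.12 23.1
```

Even on this small sample, the median comes out slightly ahead of the mean, just as it does for the full league.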

A smaller MAD means that, overall, the measure fits the data set better. The modes fit the data set the least well. The mid-range isn't much better. Next comes the mean, and the median describes the data set best, according to MAD.

A second method is mean squared error (MSE). To find MSE, find the difference between each data point and the average you're examining, and square the difference. Add the results up and divide the sum by the number of items in the data set.

Mean     Median   Mode 1   Mode 2   Mode 3   Mid-range
979.0    1005.3   1157.6   990.3    1742.6   984.6
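MSE can be sketched the same way in Python. Again, the list is only the ten team totals quoted in this post, so the results won't match the table above, and the function name is my own.

```python
from statistics import mean, median

def mean_squared_error(data, center):
    """Average squared distance of each data point from the chosen center."""
    return sum((x - center) ** 2 for x in data) / len(data)

# Only the ten team totals quoted in this post, not all 30 teams.
home_runs = [232, 177, 177, 167, 167, 161, 156, 136, 136, 100]

mse_mean = mean_squared_error(home_runs, mean(home_runs))
mse_median = mean_squared_error(home_runs, median(home_runs))
print(mse_mean < mse_median)  # True
```

On this sample the mean beats the median under MSE, the reverse of what MAD shows.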

According to MSE, the most accurate measure is the mean (979.0), followed by the mid-range, and then our second mode (167). After that come the median and the other two modes. What happened? How did the median fall from first place to fourth, behind even mode 2?

Because of the way that MSE is calculated, larger deviations are penalized more harshly than small deviations. As an example, let's compare the Red Sox's deviation from the mean (3 home runs) and the Blue Jays' (68 home runs). Let's see how much they each contribute to the final total.

            MAD      MSE
Red Sox     0.1      0.3
Blue Jays   2.267    154.13
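These contributions are just each team's deviation (or squared deviation) divided by the 30 teams. A small Python sketch, with a function name of my own:

```python
def contributions(deviation, n_items):
    """One data point's share of the MAD total and of the MSE total."""
    mad_share = abs(deviation) / n_items
    mse_share = deviation ** 2 / n_items
    return mad_share, mse_share

red_sox = contributions(3, 30)
blue_jays = contributions(68, 30)
print(red_sox)  # (0.1, 0.3)
print(tuple(round(x, 3) for x in blue_jays))  # (2.267, 154.133)
```

The Blue Jays' deviation is about 23 times the size of the Red Sox's, but its MSE contribution is more than 500 times larger.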

As you can see, the same deviation is valued quite differently by MAD and MSE. Which is better? It depends on your point of view. MAD is better for measuring total deviation. MSE is better for detecting large deviations.