### How projection systems work, Part I

Published by Colin Wyers on Saturday, February 9, 2008 at 1:51 AM.

It's probably the most interesting question to the statistically-inclined baseball fan - how well will certain players perform next season? (Followed closely by the almost-as-vexing question, how well did they perform last year?)

To that end, a variety of modeling systems have been developed to project player performance. Some of the good, widely available ones are:

- Dan Szymborski's ZiPS
- Sean Smith's CHONE
- Nate Silver's PECOTA
- SG's CAIRO
- Marcels (Tango refuses to take credit for them)

PECOTA is the best available; ZiPS is the best system you don't have to pay for. CAIRO has the best toys. Marcels is probably good enough. [If you want to really get down to brass tacks on just how good they are, Tango, Silver and Smith have a great discussion on those issues.]

In this post, I'll be referring to none of them in particular. All of them work on the same basic principles, and it's those principles that I'll be discussing. (There are other projection systems that don't work on these principles - or at least, don't talk about what principles they're based on. Let's put it this way - if they're marketed to a fantasy baseball audience, there are some... problems. We won't discuss those systems.)

There's not going to be enough detail here for you to go out and implement your own projection system, and I'm not going to spend a lot of time on the differences between systems. This is just supposed to be a cursory, general overview.

**The simplest method possible**

Assuming you have no time to look up projections or generate your own, what's the best way to project a ballplayer's performance? Generally speaking, last year's numbers are the starting point you want to look at.

Does that work? Sure, I guess. Performance for baseball players is pretty consistent year to year during their peak years, and if you skip batting average and go to stats like OBP and SLG the year-to-year correlation is pretty good. (.36 for OBP and OPS, .38 for SLG, if you were wondering).

So, basically, what I'm telling you is that you can expect that good baseball players will outperform bad baseball players, most of the time. In other words, nothing you didn't already know.

You want more accuracy than that? Well, there are a few things we can do.

**Increase the sample size**

You often hear baseball stats geeks yammering on about "sample size." What that means is - the more information you have, the more likely it is to be accurate.

This is one of those things you already know and understand, whether or not you realize it yet. The odds of a coin toss being heads are right about 50-50 (there's actually about a 1% bonus for whatever side is upright prior to the coin toss in actual practice), but in a series of 10 coin tosses, almost anything CAN happen. In order to make sure you have an even distribution of heads and tails, you need to run thousands of coin tosses.
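To put rough numbers on that intuition, here is a minimal sketch using the standard formula for the spread of an observed rate over independent trials. It assumes an idealized fair coin (ignoring the small upright-side bias mentioned above), and the function name is just an illustration.

```python
import math

def proportion_std_error(p: float, n: int) -> float:
    """One standard deviation of the observed success rate over n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)

# Assumption: an idealized fair coin, p = 0.5.
for n in (10, 100, 1000, 10000):
    se = proportion_std_error(0.5, n)
    print(f"{n:>6} tosses: one standard deviation of the heads rate is {se:.3f}")
```

At 10 tosses one standard deviation is about 0.158, so seeing 3 or 7 heads is unremarkable. At 10,000 tosses it shrinks to 0.005.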

Baseball is a lot more complicated than a simple coin toss, so even in a full season (where a player could accumulate anywhere from 600-700 at-bats) there is a lot of random "noise" to account for. The movie Bull Durham explains it pretty well:

"Twenty-five hits a year in 500 at-bats is fifty points. Okay? There's six months in a season, that's about 25 weeks. You get one extra flare a week—just one, a gork, a ground ball with eyes, a dying quail—just one more dying quail a week and you're in Yankee Stadium!"

So preferably we'd use more than one season of data in our projections.

We have to be careful, though: a baseball player is not an unchanging thing, and if we use too much data - or use it improperly - we can draw the wrong conclusions. For example, here's Barry Bonds, career 1986-2006 (translated into a seasonal basis), compared to how he did in 2007:

| Span | G | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | BA | OBP | SLG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1986-2006 (per season) | 157 | 523 | 118 | 156 | 32 | 4 | 40 | 106 | 28 | 8 | 133 | 82 | .299 | .471 | .608 |
| 2007 | 126 | 340 | 75 | 94 | 14 | 0 | 28 | 66 | 5 | 0 | 132 | 54 | .276 | .480 | .565 |

Some of it looks pretty close, and so we might not notice some of the incongruities right away. The first thing you have to notice is speed; in his career he was pretty speedy, hence the triples and the stolen bases. Old Barry does not hit triples, score runs or steal bases as well as he used to. Another thing Old Barry doesn't do: play as often. Or play baseball with other guys that can hit - that drives down his RBIs.

Barry Bonds is the extreme case, and I know that - I chose him because I figured some things would pop out of his stat line. But this applies to every baseball player - players do not stay the same. Maybe they give up switch hitting, maybe they get a new hitting coach and discover religion, maybe they suffer an injury that limits them in ways long after they've come off the DL. The most recent data is generally the most relevant, and so it should count for more than the older data.

So you use a weighted average, which means that you multiply the stats you're using by a constant depending on which season they're from. If you're using three years of stats, you could use a 5/4/3 weighted average, where you multiply all of a player's 2007 stats by 5, his 2006 stats by 4, and his 2005 stats by 3. Then you divide by 12.
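The 5/4/3 arithmetic can be sketched in a few lines. This is only an illustration of the weighting step described above, not any particular system's code; the function name and the stat lines are made up for the example.

```python
WEIGHTS = (5, 4, 3)  # most recent season first, as in the 5/4/3 example

def weighted_average(seasons, weights=WEIGHTS):
    """Weighted per-season average of counting stats.

    seasons: list of dicts (stat name -> season total), most recent first.
    """
    used = weights[:len(seasons)]  # tolerate players with fewer seasons
    total = sum(used)              # 5 + 4 + 3 = 12 for three seasons
    return {
        stat: sum(w * season[stat] for w, season in zip(used, seasons)) / total
        for stat in seasons[0]
    }

# Made-up stat lines for a hypothetical hitter (not a real player).
seasons = [
    {"AB": 500, "H": 150, "HR": 30},  # 2007, weight 5
    {"AB": 480, "H": 132, "HR": 24},  # 2006, weight 4
    {"AB": 520, "H": 130, "HR": 20},  # 2005, weight 3
]

avg = weighted_average(seasons)
print(avg)
print(f"weighted BA: {avg['H'] / avg['AB']:.3f}")
```

For this made-up hitter the weighted line works out to 139 hits and 25.5 home runs. Computing the batting average from the weighted hits and at-bats, rather than averaging the three batting averages, lets seasons with more at-bats count for more.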

But using Barry brings up another interesting point, so we'll go there next.

**Account for age**

We know that young players are likely to get better, and old players are likely to get worse. But what does that mean, exactly? Let's look at a graph.

That, friends, is the average aging curve of a baseball player. I use wOBA because of my well-documented love affair with all things linear weights... and because Tango already did the work for me.

There's the upswing of the chart, a flattish-looking bit from roughly age 24 to 28, and then begins the downward slope. Projection systems account for this by adjusting projections based upon the age of the player.
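As a rough sketch of what "adjusting for age" might look like in code, here is a piecewise multiplier that follows the shape described above: a boost before the plateau, no change from 24 to 28, and a penalty afterward. The slope values are placeholders chosen for illustration, not numbers taken from the curve or from any published system.

```python
PEAK_START, PEAK_END = 24, 28  # the flattish stretch described above
YOUNG_SLOPE = 0.006            # hypothetical: boost per year below the plateau
OLD_SLOPE = 0.003              # hypothetical: penalty per year above the plateau

def age_multiplier(age: int) -> float:
    """Factor applied to a projected rate for a player of the given age."""
    if age < PEAK_START:
        return 1.0 + YOUNG_SLOPE * (PEAK_START - age)
    if age > PEAK_END:
        return 1.0 - OLD_SLOPE * (age - PEAK_END)
    return 1.0

for age in (22, 26, 34):
    print(age, round(age_multiplier(age), 3))
```

A real system would fit those slopes to data, and might use different curves for different skills.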

**Regress to the mean**

The curve above is a representation of a very large sample size, and as such looks very smooth. A graph of a single ballplayer's career looks a lot less smooth - players have career years, players have off years, players miss time due to injury. When Luis Gonzalez puts up a 57-homer season almost entirely out of the blue, we need to adjust for that.

That's accomplished by regression toward the mean. Now, you'll often hear people, when they speak of baseball players, talk about regression as a negative; it's important to remember that players who perform below the mean regress *upwards*. It's a double-edged sword.

When regressing to the mean, you have to answer two questions:

- What mean do you regress to?
- How much do you regress?

For a simplistic projection system, you'd use the league average as your mean. How much you regress is dependent on playing time - the more playing time a player has, the less you regress.
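One common way to get that behavior is to blend the player's observed rate with a fixed amount of league-average playing time. The sketch below assumes that approach; the league OBP of .330 and the 200-plate-appearance "ballast" are made-up values for illustration, not figures from any of the systems listed above.

```python
def regress_to_mean(observed_rate, playing_time, league_rate, ballast):
    """Blend an observed rate with the league rate.

    ballast is a hypothetical amount of league-average playing time added to
    the player's record; the more real playing time, the less it matters.
    """
    return (observed_rate * playing_time + league_rate * ballast) / (playing_time + ballast)

LEAGUE_OBP = 0.330  # hypothetical league average
BALLAST_PA = 200    # hypothetical regression amount, in plate appearances

small_sample = regress_to_mean(0.400, 100, LEAGUE_OBP, BALLAST_PA)
full_season = regress_to_mean(0.400, 600, LEAGUE_OBP, BALLAST_PA)
below_average = regress_to_mean(0.260, 600, LEAGUE_OBP, BALLAST_PA)

print(f".400 OBP in 100 PA -> {small_sample:.3f}")
print(f".400 OBP in 600 PA -> {full_season:.3f}")
print(f".260 OBP in 600 PA -> {below_average:.3f}")
```

With these made-up inputs, the same .400 OBP is pulled much closer to the league figure after 100 plate appearances than after 600, and the below-average hitter is nudged upward rather than down.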

**What the simple projection ignores**

The simple projection, incidentally, is Marcels. If you want to look into the guts and play around, there it is - every equation you need is spelled out for you.

Everything we've talked about so far assumes that all ballplayers develop pretty much the same way. We know that's not exactly true, and so we can get better accuracy by taking into account more specific information about ballplayers.

We also haven't looked at how players with a large amount of minor-league playing time - and little or no major league experience - are handled.

And we haven't touched upon how you can use batted-ball information to improve your projections.

Oh, and we've left out pitchers. That may be important. We'll get to all of that next time.


Colin,

Thanks for the great explanation.