Fun with regression to the mean

Let's talk a bit about sample size and regression to the mean, shall we? I feel like I owe some actual examples of what I'm talking about given my recent screed on the topic.

Let's go ahead and do a throwdown-to-showdown, between Cubs centerfield options Felix Pie and Reed Johnson. (It's a hot topic of discussion among Cubs fans who aren't too busy pining after Ronny Cedeno.) And let's use OBP as our stat of choice for comparing them for the moment. How much more often will Johnson get on base than Pie?

We're simply going to focus on 2008 production for just a minute, because people seem to be much more excited about 2008 production to date than they are about the latest build of the ZiPS projections, for instance.

Reed Johnson has a .478 OBP in 23 plate appearances; Pie has a .217 OBP in 23 plate appearances. Obviously Johnson is a better choice as a center fielder, just based upon OBP, right?

At 23 plate appearances apiece, we know no such thing. What we do know, on the other hand, is that both of them are major league baseball players, and the central tendency of OBP talent among major league players is roughly .330. That's the mean; now we can regress to the mean. (A great primer on how to do that is available from Sal at Athletics Nation.)

Once we regress Reed Johnson's OBP to the mean, we end up with a .338 OBP. For Pie, once we regress to the mean, we end up with a .321 OBP. A 261-point gap in observed OBP shrinks to 17 points.
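For anyone who wants to poke at the arithmetic, here is a rough sketch of one common way to regress a rate stat: pad the observed plate appearances with some number of league-average plate appearances. The function name and the 300 PA of padding are my own assumptions for illustration, not figures from this post or from the primer, so the results will land near the numbers above without matching them exactly.

```python
def regress_to_mean(observed_rate, plate_appearances, league_mean, padding_pa):
    """Shrink an observed rate toward the league mean.

    Treats the player as having `padding_pa` extra plate appearances at
    exactly the league-average rate, then recomputes the overall rate.
    """
    observed_successes = observed_rate * plate_appearances
    padded_successes = league_mean * padding_pa
    return (observed_successes + padded_successes) / (plate_appearances + padding_pa)


LEAGUE_OBP = 0.330   # "roughly .330", per the post
PADDING_PA = 300     # ASSUMPTION: placeholder value, not taken from the post

if __name__ == "__main__":
    players = {"Reed Johnson": 0.478, "Felix Pie": 0.217}
    for name, obp in players.items():
        regressed = regress_to_mean(obp, 23, LEAGUE_OBP, PADDING_PA)
        print(f"{name}: observed {obp:.3f}, regressed {regressed:.3f}")
```

With that assumed padding, both players end up within 20 points of OBP of each other. The smaller the sample relative to the padding, the harder the estimate gets pulled toward .330, which is the whole point at 23 plate appearances.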

Fun, right?

