About the simplest forecasting system available - that are worthy of the term “projection system,” at least - are the Marcel projections. As simple as they are, they match up very well with the results of the more complex forecasters.
Sal Baxamusa over at The Hardball Times has kindly provided us with Excel spreadsheets to calculate a player’s in-season Marcel. But I wanted the ability to bulk produce forecasts of players for the remainder of the season. So I set about reimplementing Sal’s spreadsheet in SQL.
Data prior to this season was taken from the Baseball Databank. Data from this season was screenscraped from the Baseball Reference league pages. The convenient thing about that is, Baseball Reference player IDs are mapped in the BDB to the BDB IDs.
If you have the Baseball Databank in MySQL, and can handle scraping the data from BBRef yourself, then you can generate your own Marcels like so:
CREATE TABLE batting_pos
AS
SELECT ( CASE WHEN b.playerID is not null
THEN b.playerID
ELSE b.Player END ) AS playerID
, 2008 AS yearID
, 1 AS weight
, b.Ag AS Age
, SUM(b.G) AS G
, SUM(b.H) AS H
, SUM(b.2B) AS 2B
, SUM(b.3B) AS 3B
, SUM(b.HR) AS HR
, SUM(b.BB) AS BB
, SUM(b.SO) AS SO
, SUM(b.IBB) AS IBB
, 0 AS HBP
, SUM(b.SB) AS SB
, SUM(b.CS) AS CS
, SUM(b.AB+b.BB+b.SH+b.SF) AS PA
from ( SELECT * from 7_29_08_batting b
LEFT JOIN ( select Player AS PPlayer, BFP from 7_29_08_pitching ) p
ON b.Player = p.PPlayer
LEFT JOIN ( select bbrefID, playerID from bdb.master ) m
ON b.Player = m.bbrefID ) b
WHERE ( (b.AB+b.BB+b.SH+b.SF) > (b.BFP) OR b.BFP IS NULL )
AND (b.AB+b.BB+b.SH+b.SF) > 0
GROUP BY Player
UNION ALL
SELECT b.playerID
, b.yearID
, ( 5.62 * EXP( -0.00066 * 365 * ( 2008 - b.yearID) ) ) / 5.62 AS weight
, (CASE WHEN m.birthMonth < 7 THEN ( b.yearID - m.BirthYear ) ELSE ( b.yearID - m.BirthYear - 1 ) END) AS Age
, SUM(b.G) AS G
, SUM(b.H) AS H
, SUM(b.2B) AS 2B
, SUM(b.3B) AS 3B
, SUM(b.HR) AS HR
, SUM(b.BB) AS BB
, SUM(b.SO) AS SO
, SUM(b.IBB) AS IBB
, SUM(b.HBP) AS HBP
, SUM(b.SB) AS SB
, SUM(b.CS) AS CS
, SUM(b.PA) AS PA
FROM
( SELECT * from bdb.batting b
LEFT JOIN ( select playerID AS pplayerID, yearID AS pyearID, BFP
from bdb.pitching p WHERE p.yearID > 2004 ) p
ON b.playerID = p.pplayerID AND b.yearID = p.pyearID
WHERE b.yearID > 2004) b, bdb.master m
WHERE b.playerID = m.playerID
AND ( (b.PA) > (b.BFP) OR b.BFP IS NULL )
AND (b.PA) > 0
GROUP BY yearID, playerID;CREATE TABLE average_pos
AS
SELECT yearID
, ( 5.62 * EXP( -0.00066 * 365 * ( 2008 - yearID) ) ) / 5.62 AS weight
, SUM(G) AS G
, SUM(H) AS H
, SUM(2B) AS 2B
, SUM(3B) AS 3B
, SUM(HR) AS HR
, SUM(BB) AS BB
, SUM(SO) AS SO
, SUM(IBB) AS IBB
, SUM(HBP) AS HBP
, SUM(SB) AS SB
, SUM(CS) AS CS
, SUM(PA) AS PA
FROM batting_pos bp
GROUP BY yearID;CREATE TABLE player_league_average
AS
SELECT b.playerID
, (SUM(a.weight)*SUM(a.H)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS H
, (SUM(a.weight)*SUM(a.2B)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS 2B
, (SUM(a.weight)*SUM(a.3B)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS 3B
, (SUM(a.weight)*SUM(a.HR)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS HR
, (SUM(a.weight)*SUM(a.BB)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS BB
, (SUM(a.weight)*SUM(a.SO)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS SO
, (SUM(a.weight)*SUM(a.IBB)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS IBB
, (SUM(a.weight)*SUM(a.HBP)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS HBP
, (SUM(a.weight)*SUM(a.SB)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS SB
, (SUM(a.weight)*SUM(a.CS)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS CS
, (SUM(a.weight)*SUM(a.PA)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS PA
FROM batting_pos b, average_pos a
WHERE b.yearID = a.yearID
GROUP BY playerID;CREATE TABLE player_league_average_prorated
AS
SELECT playerID
, ( H / PA * 214 ) AS H
, ( 2B / PA * 214 ) AS 2B
, ( 3B / PA * 214 ) AS 3B
, ( HR / PA * 214 ) AS HR
, ( BB / PA * 214 ) AS BB
, ( SO / PA * 214 ) AS SO
, ( IBB / PA * 214 ) AS IBB
, ( HBP / PA * 214 ) AS HBP
, ( SB / PA * 214 ) AS SB
, ( CS / PA * 214 ) AS CS
, 214 AS PA
FROM player_league_average;CREATE TABLE player_Age_2008
AS
SELECT playerID
, yearID
, MAX(Age)
, ( CASE
WHEN ( MAX(Age)+(2008-yearID) ) > 29
THEN 1 + ( 29 - ( MAX(Age)+(2008-yearID) ) )*0.003
ELSE 1 + ( 29 - ( MAX(Age)+(2008-yearID) ) )*0.006 END ) AS Curve
FROM batting_pos
GROUP BY playerID;CREATE TABLE player_pa_2008
AS
SELECT playerID, G, PA, (PA/G) AS PA_G, ROUND((PA/G)*52) AS PA_LEFT from batting_pos WHERE yearID = 2008;CREATE TABLE hitter_marcels_2008
AS
SELECT b.playerID
, ROUND(( SUM(a.weight)*SUM(b.H) + w.H ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS H
, ROUND(( SUM(a.weight)*SUM(b.2B) + w.2B ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS 2B
, ROUND(( SUM(a.weight)*SUM(b.3B) + w.3B ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS 3B
, ROUND(( SUM(a.weight)*SUM(b.HR) + w.HR ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS HR
, ROUND(( SUM(a.weight)*SUM(b.BB) + w.BB ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS BB
, ROUND(( SUM(a.weight)*SUM(b.SO) + w.SO ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS SO
, ROUND(( SUM(a.weight)*SUM(b.IBB) + w.IBB ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS IBB
, ROUND(( SUM(a.weight)*SUM(b.HBP) + w.HBP ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS HBP
, ROUND(( SUM(a.weight)*SUM(b.SB) + w.SB ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS SB
, ROUND(( SUM(a.weight)*SUM(b.CS) + w.CS ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS CS
, p.PA_LEFT AS PA
FROM batting_pos b, average_pos a, player_league_average_prorated w, player_Age_2008 c, player_pa_2008 p
WHERE b.yearID = a.yearID
AND b.playerID = w.playerID
AND b.playerID = p.playerID
AND b.playerID = c.playerID
GROUP BY playerID;
I hate publishing code because most of what I write would get me hunted down and burned at the stake in any CS department in the country, but there it is. If you don’t grog SQL, here’s what’s going on here:
- First we combine the data from the Databank with the data from this year, excluding hitters who have had fewer plate appearances than plate appearances against - in other words, pitchers. The other thing we do is compute a weight for each year – more recent seasons are worth more in the projection, and the weight is what governs that.
- We calculate the average of non-pitchers hitting from those seasons.
- Each player is then given a weighted average of the league from those four seasons, prorated out to 214 plate appearances. That’s our regression to the mean component.
- We figure out an aging curve for each player.
- We guesstimate how many plate appearances a player will receive the rest of 2008. If you think you know better than my guesstimate, you’re probably right. It’s the rates that I’m projecting.
- This is the part that actually does a projection – it takes a weighted average of the past four seasons, mixing in the regression component from step three, and prorates it out to our guesstimated playing time.
If you don’t want to put in that amount of effort, well, Mr. Widget, take us home!
I've done a cursory look at the projections, but I haven't done any serious validation. All projections are provided as-is. The good news is, now that I have the code ironed out, all I need is half an hour or so and I can generate Marcels for over 500 players. (Quicker if I could automate the screenscraping – I’ll have to look into that.)
I’m hopeful but not optimistic that I’ll have pitcher projections done similarly by the weekend.
Labels: Projections
Jeff Samardzija in Pitch F/X
4 Comments Published by Colin Wyers on Friday, July 25, 2008 at 11:40 PM.If you want actual, well, good analysis, go over to Harry’s and take a look. He’s been doing this pitch ID stuff a lot longer than I have.
But I think I was able to duplicate one of the graphs from Harry’s page, or at least come close.
I used Mat Kovach’s parser to download data from MLB’s servers. (It seems to work fine for me, but it’s “pre-alpha” and not documented as of yet, so caveat emptor. Also, I Am Not A Programmer, so all code samples that follow are to be taken with more than a hint of salt.)
Then, in MySQL, I ran the following query against the data:
SELECT a.*, p.*
FROM gameday_atbat a, gameday_pitch p
WHERE a.gameid = p.gameid
AND a.num = p.atbat_num
AND a.pitcher = 502188;
Not the prettiest SQL I’ve ever written, and it returns more data than I need, but that’s fine. Then I export the data to a CSV file. There’s one pitch out in the dataset that I remove.
Well, now what? I use GNU R, personally, for all my graphing and K-means clustering needs. Code:
Samardzija <- read.table("C:/Retrosheet/saved queries/pitchfx/Samardzija first start.csv", header=TRUE, sep=",")
cl <- KMeans(model.matrix(~-1 + pfx_x + pfx_z, Samardzija), centers = 3, iter.max = 10, num.seeds = 10)
plot(Samardzija$pfx_z~Samardzija$pfx_x, col=cl$cluster, xlim=c(-20,20), ylim=c(-20,20))
Which produces the following graph:
In fairness to Harry, I cheated – in the second line of the program, I tell the clustering algorithm how many “center” to look for – in this case, how many pitches I want it to look for. I told it three. Why? Because that’s what Harry’s graph shows. I don’t really know how to determine the “right” number of centers as of yet.
Even so, I have one pitch that differs from his – I think he changed that ID manually, but I’m not sure. I can tell you that one cluster is green and one is black, but as far as calling one a splitter and one a slider, that’s something I have to work on.
(That graph, by the way, is ugly, and I know it’s ugly. I know I can make it look better, but in this case it’s a question of how much time I really want to invest in prettying up Pitch F/X graphs before I figure out what it is I’m actually doing with them. It’s called premature optimization.)
Labels: Enhanced Gameday, Jeff Samardzija, Pitch F/X, Pitching
There are two breeds of vanilla, free-as-in-beer zone rating available in the world: STATS and BIS. I already have a dumb projection system for STATS ZR, which could be refined (aging curves and speed/tools scores are the two major refinements I’m musing over.)
But first I wanted to introduce BIS’s RZR into it. And therein lies a dilemma, folks. Here’s the averages for RZR and OZR (OOZ divided by BIZ) over the years available at The Hardball Times:
POS | YEAR | Plays | OOZ | BIZ | RZR | OZR |
1B | 2004 | 4070 | 1783 | 5406 | .753 | .330 |
1B | 2005 | 4343 | 1940 | 5493 | .791 | .353 |
1B | 2006 | 3877 | 2012 | 4851 | .799 | .415 |
1B | 2007 | 4963 | 1048 | 6695 | .741 | .157 |
1B | 2008 | 2871 | 847 | 3815 | .753 | .222 |
1B | Total | 20124 | 7630 | 26260 | .766 | .291 |
2B | 2004 | 9863 | 1203 | 12129 | .813 | .099 |
2B | 2005 | 10403 | 1478 | 12825 | .811 | .115 |
2B | 2006 | 10401 | 1211 | 12679 | .820 | .096 |
2B | 2007 | 10120 | 1412 | 12192 | .830 | .116 |
2B | 2008 | 6313 | 649 | 7693 | .821 | .084 |
2B | Total | 47100 | 5953 | 57518 | .819 | .103 |
SS | 2004 | 9872 | 1919 | 11995 | .823 | .160 |
SS | 2005 | 10484 | 1948 | 12821 | .818 | .152 |
SS | 2006 | 10809 | 1659 | 13218 | .818 | .126 |
SS | 2007 | 10625 | 1912 | 13019 | .816 | .147 |
SS | 2008 | 6353 | 999 | 7627 | .833 | .131 |
SS | Total | 48143 | 8437 | 58680 | .820 | .144 |
3B | 2004 | 6215 | 2074 | 9007 | .690 | .230 |
3B | 2005 | 6813 | 2396 | 9271 | .735 | .258 |
3B | 2006 | 7686 | 1636 | 10880 | .706 | .150 |
3B | 2007 | 7221 | 1717 | 10623 | .680 | .162 |
3B | 2008 | 4444 | 1003 | 6344 | .701 | .158 |
3B | Total | 32379 | 8826 | 46125 | .702 | .191 |
CF | 2004 | 9478 | 2034 | 11905 | .796 | .171 |
CF | 2005 | 10266 | 1963 | 12590 | .815 | .156 |
CF | 2006 | 10316 | 2002 | 11534 | .894 | .174 |
CF | 2007 | 10886 | 1944 | 12264 | .888 | .159 |
CF | 2008 | 5922 | 1583 | 6468 | .916 | .245 |
CF | Total | 46868 | 9526 | 54761 | .856 | .174 |
LF | 2004 | 7710 | 847 | 12242 | .630 | .069 |
LF | 2005 | 8686 | 718 | 13712 | .633 | .052 |
LF | 2006 | 7723 | 1634 | 8971 | .861 | .182 |
LF | 2007 | 8014 | 1614 | 9373 | .855 | .172 |
LF | 2008 | 4475 | 1076 | 5060 | .884 | .213 |
LF | Total | 36608 | 5889 | 49358 | .742 | .119 |
RF | 2004 | 8736 | 781 | 13442 | .650 | .058 |
RF | 2005 | 9181 | 695 | 14161 | .648 | .049 |
RF | 2006 | 8376 | 1686 | 9436 | .888 | .179 |
RF | 2007 | 8418 | 1575 | 9597 | .877 | .164 |
RF | 2008 | 4802 | 1205 | 5321 | .902 | .226 |
RF | Total | 39513 | 5942 | 51957 | .760 | .114 |
(2008 numbers will be slightly different from Studes’ numbers, as these are a few days old.) The projections for infielders are doable. But, as it stands, those outfield numbers are a horror show, taken by themselves.
So before we can make projections based upon RZR data, we first need to normalize it. I’m sure there are better ways than the one I’m using, but I don’t think I’m using the worst way either and it’s very expedient for my needs.
What I’m doing is dividing Plays, OOZ and BIZ by the totals for that season, and then multiplying by the averaged totals of all five years.
And, since I was rather short with the explanation the last time out, I’ll go ahead and spell out what I’m doing in full:
- First, as above, every player’s performance is “normalized” to an average of the past five seasons.
- Then, a weighted average of their past four seasons (05-08) is taken, with the most recent season being given a weight of 5, then 4, then 3, then 2.
- Two weights worth of a full season’s average defensive performance of the season is added as a regression to the mean component.
- 5 + 4 + 3 + 2 + 2 = 16, so everything gets divided by 16. I wouldn’t exactly call it a playing time projection, but it’s a rough guide to how much playing time a player might be expected to receive.
- Plays and Runs above average are figured for a full season’s performance, given the number of chances of the average player at that position from 04 through 08.
And… here are the projections. You can compare them to the STATS ZR projections, if you’d like.
(Note: Currently only players with a Baseball Databank ID who have appeared in 2008 are included in either projection set. The next step is to take the rest of the players in the RZR set, map them to the appropriate STATS ID, and run both projections side by side for all players who played in 2008, and maybe some who haven’t yet but could.)
So what’s next? Like I said before, these could really benefit from aging curves. (While I’m on the topic, Jon Shepherd over at Camden Depot has published RZR aging curves which are worth taking a look at. I have my own ZR aging curves which I should really try and get straightened out.) I really should probably run “projections” for seasons past and see how they match up with what actually happened.
And I want to work on combining data from multiple positions; I’ve done some comparisons of players who have played multiple positions, and my feeling from looking at the data is that in projecting a player’s zone rating, there really isn’t a lot of difference in difficulty in playing the different outfield positions – it’s not really much harder to catch fly balls in center field than it is anywhere else, but there’s a lot more fly balls to catch and so a good fielder is worth a lot more. But that’s worth exploring more, and there are some noteworthy sampling issues in that data; I find it hard to believe that a center fielder is below average as a first baseman defensively, for example. I should rerun this query on the RZR dataset here soon, see what that looks like.
Labels: Defense, Projections
So, you want to talk about a player’s defense?
Remember: a good sabermetrician is like a good hunter when cleaning his kill: he throws away as little as possible, taking care to use most of the animal. We have decades of information about players; why should we ever use only three and a half months worth of data in evaluating a player?
My process is based heavily off of Tango’s Marcels forecasting system; that said, he had nothing to do with this, and screwups in it are mine, not his. (For background on how a projection system works, here’s a decent writeup. If I don’t say so myself.)
Before going any further, I should note that I made this in about two hours. And I also made dinner in those two hours. And I had a side dish. So don’t expect anything on the order of PECOTA as far as complexity goes.
Here’s how it works. Every player’s zone rating data from 2005-2008 (yep, everything pre-All Star break from this year) is thrown into a mixer and weighted. I used a 5/4/3/2 weighting; I have no empirical basis for these weights other than it’s what Marcel uses. Then throw in two season’s worth of the league average for the position. There’s your regression to the mean.
Aging curves are… forthcoming. Maybe. I’m still hashing out the details. (I’ve started work on zone rating based aging curves for fielders, but there are questions about how accurate they are, and before they can be used in a projection system they need to be smoothed out a bit more.)
So, data. Plays and runs above or below average are figured using the Dial method. For that, each player is assumed to have a full season’s worth of chances at the position, not the number of chances used to compute zone rating.
The next step beyond aging curves would probably be to incorporate at least some measure of speed scores into the projection. But I was hungry, and so instead you have the best projection system I could make in two hours, while still making dinner. It’s a start, at least.
(Also, lemme take this chance to plug my hitter and pitcher evaluations on GROTA, if you have an interest in such things regarding the Cubs. Hitter and pitcher projections are next on my plate.)
Labels: Defense, Projections
I'm going to try a little experiment here. I wish my problem was writer's block - instead, there's just so many ideas that I've got running around in my head that I'm having a hard time prioritizing them. And with the All-Star Break coming up, there'll be time to catch up on some things.
So, if you'd like, you can help me figure out what I'll be writing about during the All-Star Break, with this handy widget here:
Isn't it exciting? (It’s also available in the right-hand sidebar.) Here's a link to my Skribit page, if for some reason the widget doesn't do it for you. I may start seeding it with some topics of my own, but for right now I’m simply interested in seeing what – if anything – y’all are interested in. I'd like, obviously, as much participation as possible; that said, asking things like "Why do you hate Christmas, Santa Clause and Ryan Theriot?" aren't likely to be answered immediately. In fact, non-Ryan Theriot suggestions are probably your best bet.