About the simplest forecasting system available - that are worthy of the term “projection system,” at least - are the Marcel projections. As simple as they are, they match up very well with the results of the more complex forecasters.
Sal Baxamusa over at The Hardball Times has kindly provided us with Excel spreadsheets to calculate a player’s in-season Marcel. But I wanted the ability to bulk produce forecasts of players for the remainder of the season. So I set about reimplementing Sal’s spreadsheet in SQL.
Data prior to this season was taken from the Baseball Databank. Data from this season was screenscraped from the Baseball Reference league pages. The convenient thing about that is, Baseball Reference player IDs are mapped in the BDB to the BDB IDs.
If you have the Baseball Databank in MySQL, and can handle scraping the data from BBRef yourself, then you can generate your own Marcels like so:
CREATE TABLE batting_pos
AS
SELECT ( CASE WHEN b.playerID is not null
THEN b.playerID
ELSE b.Player END ) AS playerID
, 2008 AS yearID
, 1 AS weight
, b.Ag AS Age
, SUM(b.G) AS G
, SUM(b.H) AS H
, SUM(b.2B) AS 2B
, SUM(b.3B) AS 3B
, SUM(b.HR) AS HR
, SUM(b.BB) AS BB
, SUM(b.SO) AS SO
, SUM(b.IBB) AS IBB
, 0 AS HBP
, SUM(b.SB) AS SB
, SUM(b.CS) AS CS
, SUM(b.AB+b.BB+b.SH+b.SF) AS PA
from ( SELECT * from 7_29_08_batting b
LEFT JOIN ( select Player AS PPlayer, BFP from 7_29_08_pitching ) p
ON b.Player = p.PPlayer
LEFT JOIN ( select bbrefID, playerID from bdb.master ) m
ON b.Player = m.bbrefID ) b
WHERE ( (b.AB+b.BB+b.SH+b.SF) > (b.BFP) OR b.BFP IS NULL )
AND (b.AB+b.BB+b.SH+b.SF) > 0
GROUP BY Player
UNION ALL
SELECT b.playerID
, b.yearID
, ( 5.62 * EXP( -0.00066 * 365 * ( 2008 - b.yearID) ) ) / 5.62 AS weight
, (CASE WHEN m.birthMonth < 7 THEN ( b.yearID - m.BirthYear ) ELSE ( b.yearID - m.BirthYear - 1 ) END) AS Age
, SUM(b.G) AS G
, SUM(b.H) AS H
, SUM(b.2B) AS 2B
, SUM(b.3B) AS 3B
, SUM(b.HR) AS HR
, SUM(b.BB) AS BB
, SUM(b.SO) AS SO
, SUM(b.IBB) AS IBB
, SUM(b.HBP) AS HBP
, SUM(b.SB) AS SB
, SUM(b.CS) AS CS
, SUM(b.PA) AS PA
FROM
( SELECT * from bdb.batting b
LEFT JOIN ( select playerID AS pplayerID, yearID AS pyearID, BFP
from bdb.pitching p WHERE p.yearID > 2004 ) p
ON b.playerID = p.pplayerID AND b.yearID = p.pyearID
WHERE b.yearID > 2004) b, bdb.master m
WHERE b.playerID = m.playerID
AND ( (b.PA) > (b.BFP) OR b.BFP IS NULL )
AND (b.PA) > 0
GROUP BY yearID, playerID;CREATE TABLE average_pos
AS
SELECT yearID
, ( 5.62 * EXP( -0.00066 * 365 * ( 2008 - yearID) ) ) / 5.62 AS weight
, SUM(G) AS G
, SUM(H) AS H
, SUM(2B) AS 2B
, SUM(3B) AS 3B
, SUM(HR) AS HR
, SUM(BB) AS BB
, SUM(SO) AS SO
, SUM(IBB) AS IBB
, SUM(HBP) AS HBP
, SUM(SB) AS SB
, SUM(CS) AS CS
, SUM(PA) AS PA
FROM batting_pos bp
GROUP BY yearID;CREATE TABLE player_league_average
AS
SELECT b.playerID
, (SUM(a.weight)*SUM(a.H)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS H
, (SUM(a.weight)*SUM(a.2B)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS 2B
, (SUM(a.weight)*SUM(a.3B)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS 3B
, (SUM(a.weight)*SUM(a.HR)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS HR
, (SUM(a.weight)*SUM(a.BB)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS BB
, (SUM(a.weight)*SUM(a.SO)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS SO
, (SUM(a.weight)*SUM(a.IBB)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS IBB
, (SUM(a.weight)*SUM(a.HBP)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS HBP
, (SUM(a.weight)*SUM(a.SB)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS SB
, (SUM(a.weight)*SUM(a.CS)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS CS
, (SUM(a.weight)*SUM(a.PA)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS PA
FROM batting_pos b, average_pos a
WHERE b.yearID = a.yearID
GROUP BY playerID;CREATE TABLE player_league_average_prorated
AS
SELECT playerID
, ( H / PA * 214 ) AS H
, ( 2B / PA * 214 ) AS 2B
, ( 3B / PA * 214 ) AS 3B
, ( HR / PA * 214 ) AS HR
, ( BB / PA * 214 ) AS BB
, ( SO / PA * 214 ) AS SO
, ( IBB / PA * 214 ) AS IBB
, ( HBP / PA * 214 ) AS HBP
, ( SB / PA * 214 ) AS SB
, ( CS / PA * 214 ) AS CS
, 214 AS PA
FROM player_league_average;CREATE TABLE player_Age_2008
AS
SELECT playerID
, yearID
, MAX(Age)
, ( CASE
WHEN ( MAX(Age)+(2008-yearID) ) > 29
THEN 1 + ( 29 - ( MAX(Age)+(2008-yearID) ) )*0.003
ELSE 1 + ( 29 - ( MAX(Age)+(2008-yearID) ) )*0.006 END ) AS Curve
FROM batting_pos
GROUP BY playerID;CREATE TABLE player_pa_2008
AS
SELECT playerID, G, PA, (PA/G) AS PA_G, ROUND((PA/G)*52) AS PA_LEFT from batting_pos WHERE yearID = 2008;CREATE TABLE hitter_marcels_2008
AS
SELECT b.playerID
, ROUND(( SUM(a.weight)*SUM(b.H) + w.H ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS H
, ROUND(( SUM(a.weight)*SUM(b.2B) + w.2B ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS 2B
, ROUND(( SUM(a.weight)*SUM(b.3B) + w.3B ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS 3B
, ROUND(( SUM(a.weight)*SUM(b.HR) + w.HR ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS HR
, ROUND(( SUM(a.weight)*SUM(b.BB) + w.BB ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS BB
, ROUND(( SUM(a.weight)*SUM(b.SO) + w.SO ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS SO
, ROUND(( SUM(a.weight)*SUM(b.IBB) + w.IBB ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS IBB
, ROUND(( SUM(a.weight)*SUM(b.HBP) + w.HBP ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS HBP
, ROUND(( SUM(a.weight)*SUM(b.SB) + w.SB ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS SB
, ROUND(( SUM(a.weight)*SUM(b.CS) + w.CS ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS CS
, p.PA_LEFT AS PA
FROM batting_pos b, average_pos a, player_league_average_prorated w, player_Age_2008 c, player_pa_2008 p
WHERE b.yearID = a.yearID
AND b.playerID = w.playerID
AND b.playerID = p.playerID
AND b.playerID = c.playerID
GROUP BY playerID;
I hate publishing code because most of what I write would get me hunted down and burned at the stake in any CS department in the country, but there it is. If you don’t grog SQL, here’s what’s going on here:
- First we combine the data from the Databank with the data from this year, excluding hitters who have had fewer plate appearances than plate appearances against - in other words, pitchers. The other thing we do is compute a weight for each year – more recent seasons are worth more in the projection, and the weight is what governs that.
- We calculate the average of non-pitchers hitting from those seasons.
- Each player is then given a weighted average of the league from those four seasons, prorated out to 214 plate appearances. That’s our regression to the mean component.
- We figure out an aging curve for each player.
- We guesstimate how many plate appearances a player will receive the rest of 2008. If you think you know better than my guesstimate, you’re probably right. It’s the rates that I’m projecting.
- This is the part that actually does a projection – it takes a weighted average of the past four seasons, mixing in the regression component from step three, and prorates it out to our guesstimated playing time.
If you don’t want to put in that amount of effort, well, Mr. Widget, take us home!
I've done a cursory look at the projections, but I haven't done any serious validation. All projections are provided as-is. The good news is, now that I have the code ironed out, all I need is half an hour or so and I can generate Marcels for over 500 players. (Quicker if I could automate the screenscraping – I’ll have to look into that.)
I’m hopeful but not optimistic that I’ll have pitcher projections done similarly by the weekend.
Labels: Projections
0 Responses to “Marcels for hitters”