About the simplest forecasting system available - that are worthy of the term “projection system,” at least - are the Marcel projections. As simple as they are, they match up very well with the results of the more complex forecasters.

Sal Baxamusa over at The Hardball Times has kindly provided us with Excel spreadsheets to calculate a player’s in-season Marcel. But I wanted the ability to bulk produce forecasts of players for the remainder of the season. So I set about reimplementing Sal’s spreadsheet in SQL.

Data prior to this season was taken from the Baseball Databank. Data from this season was screenscraped from the Baseball Reference league pages. The convenient thing about that is, Baseball Reference player IDs are mapped in the BDB to the BDB IDs.

If you have the Baseball Databank in MySQL, and can handle scraping the data from BBRef yourself, then you can generate your own Marcels like so:

CREATE TABLE batting_pos

AS

SELECT ( CASE WHEN b.playerID is not null

THEN b.playerID

ELSE b.Player END ) AS playerID

, 2008 AS yearID

, 1 AS weight

, b.Ag AS Age

, SUM(b.G) AS G

, SUM(b.H) AS H

, SUM(b.2B) AS 2B

, SUM(b.3B) AS 3B

, SUM(b.HR) AS HR

, SUM(b.BB) AS BB

, SUM(b.SO) AS SO

, SUM(b.IBB) AS IBB

, 0 AS HBP

, SUM(b.SB) AS SB

, SUM(b.CS) AS CS

, SUM(b.AB+b.BB+b.SH+b.SF) AS PA

from ( SELECT * from 7_29_08_batting b

LEFT JOIN ( select Player AS PPlayer, BFP from 7_29_08_pitching ) p

ON b.Player = p.PPlayer

LEFT JOIN ( select bbrefID, playerID from bdb.master ) m

ON b.Player = m.bbrefID ) b

WHERE ( (b.AB+b.BB+b.SH+b.SF) > (b.BFP) OR b.BFP IS NULL )

AND (b.AB+b.BB+b.SH+b.SF) > 0

GROUP BY Player

UNION ALL

SELECT b.playerID

, b.yearID

, ( 5.62 * EXP( -0.00066 * 365 * ( 2008 - b.yearID) ) ) / 5.62 AS weight

, (CASE WHEN m.birthMonth < 7 THEN ( b.yearID - m.BirthYear ) ELSE ( b.yearID - m.BirthYear - 1 ) END) AS Age

, SUM(b.G) AS G

, SUM(b.H) AS H

, SUM(b.2B) AS 2B

, SUM(b.3B) AS 3B

, SUM(b.HR) AS HR

, SUM(b.BB) AS BB

, SUM(b.SO) AS SO

, SUM(b.IBB) AS IBB

, SUM(b.HBP) AS HBP

, SUM(b.SB) AS SB

, SUM(b.CS) AS CS

, SUM(b.PA) AS PA

FROM

( SELECT * from bdb.batting b

LEFT JOIN ( select playerID AS pplayerID, yearID AS pyearID, BFP

from bdb.pitching p WHERE p.yearID > 2004 ) p

ON b.playerID = p.pplayerID AND b.yearID = p.pyearID

WHERE b.yearID > 2004) b, bdb.master m

WHERE b.playerID = m.playerID

AND ( (b.PA) > (b.BFP) OR b.BFP IS NULL )

AND (b.PA) > 0

GROUP BY yearID, playerID;CREATE TABLE average_pos

AS

SELECT yearID

, ( 5.62 * EXP( -0.00066 * 365 * ( 2008 - yearID) ) ) / 5.62 AS weight

, SUM(G) AS G

, SUM(H) AS H

, SUM(2B) AS 2B

, SUM(3B) AS 3B

, SUM(HR) AS HR

, SUM(BB) AS BB

, SUM(SO) AS SO

, SUM(IBB) AS IBB

, SUM(HBP) AS HBP

, SUM(SB) AS SB

, SUM(CS) AS CS

, SUM(PA) AS PA

FROM batting_pos bp

GROUP BY yearID;CREATE TABLE player_league_average

AS

SELECT b.playerID

, (SUM(a.weight)*SUM(a.H)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS H

, (SUM(a.weight)*SUM(a.2B)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS 2B

, (SUM(a.weight)*SUM(a.3B)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS 3B

, (SUM(a.weight)*SUM(a.HR)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS HR

, (SUM(a.weight)*SUM(a.BB)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS BB

, (SUM(a.weight)*SUM(a.SO)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS SO

, (SUM(a.weight)*SUM(a.IBB)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS IBB

, (SUM(a.weight)*SUM(a.HBP)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS HBP

, (SUM(a.weight)*SUM(a.SB)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS SB

, (SUM(a.weight)*SUM(a.CS)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS CS

, (SUM(a.weight)*SUM(a.PA)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS PA

FROM batting_pos b, average_pos a

WHERE b.yearID = a.yearID

GROUP BY playerID;CREATE TABLE player_league_average_prorated

AS

SELECT playerID

, ( H / PA * 214 ) AS H

, ( 2B / PA * 214 ) AS 2B

, ( 3B / PA * 214 ) AS 3B

, ( HR / PA * 214 ) AS HR

, ( BB / PA * 214 ) AS BB

, ( SO / PA * 214 ) AS SO

, ( IBB / PA * 214 ) AS IBB

, ( HBP / PA * 214 ) AS HBP

, ( SB / PA * 214 ) AS SB

, ( CS / PA * 214 ) AS CS

, 214 AS PA

FROM player_league_average;CREATE TABLE player_Age_2008

AS

SELECT playerID

, yearID

, MAX(Age)

, ( CASE

WHEN ( MAX(Age)+(2008-yearID) ) > 29

THEN 1 + ( 29 - ( MAX(Age)+(2008-yearID) ) )*0.003

ELSE 1 + ( 29 - ( MAX(Age)+(2008-yearID) ) )*0.006 END ) AS Curve

FROM batting_pos

GROUP BY playerID;CREATE TABLE player_pa_2008

AS

SELECT playerID, G, PA, (PA/G) AS PA_G, ROUND((PA/G)*52) AS PA_LEFT from batting_pos WHERE yearID = 2008;CREATE TABLE hitter_marcels_2008

AS

SELECT b.playerID

, ROUND(( SUM(a.weight)*SUM(b.H) + w.H ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS H

, ROUND(( SUM(a.weight)*SUM(b.2B) + w.2B ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS 2B

, ROUND(( SUM(a.weight)*SUM(b.3B) + w.3B ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS 3B

, ROUND(( SUM(a.weight)*SUM(b.HR) + w.HR ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS HR

, ROUND(( SUM(a.weight)*SUM(b.BB) + w.BB ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS BB

, ROUND(( SUM(a.weight)*SUM(b.SO) + w.SO ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS SO

, ROUND(( SUM(a.weight)*SUM(b.IBB) + w.IBB ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS IBB

, ROUND(( SUM(a.weight)*SUM(b.HBP) + w.HBP ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS HBP

, ROUND(( SUM(a.weight)*SUM(b.SB) + w.SB ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS SB

, ROUND(( SUM(a.weight)*SUM(b.CS) + w.CS ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS CS

, p.PA_LEFT AS PA

FROM batting_pos b, average_pos a, player_league_average_prorated w, player_Age_2008 c, player_pa_2008 p

WHERE b.yearID = a.yearID

AND b.playerID = w.playerID

AND b.playerID = p.playerID

AND b.playerID = c.playerID

GROUP BY playerID;

I hate publishing code because most of what I write would get me hunted down and burned at the stake in any CS department in the country, but there it is. If you don’t grog SQL, here’s what’s going on here:

- First we combine the data from the Databank with the data from this year, excluding hitters who have had fewer plate appearances than plate appearances against - in other words, pitchers. The other thing we do is compute a weight for each year – more recent seasons are worth more in the projection, and the weight is what governs that.
- We calculate the average of non-pitchers hitting from those seasons.
- Each player is then given a weighted average of the league from those four seasons, prorated out to 214 plate appearances. That’s our regression to the mean component.
- We figure out an aging curve for each player.
- We guesstimate how many plate appearances a player will receive the rest of 2008. If you think you know better than my guesstimate, you’re probably right. It’s the rates that I’m projecting.
- This is the part that actually does a projection – it takes a weighted average of the past four seasons, mixing in the regression component from step three, and prorates it out to our guesstimated playing time.

If you don’t want to put in that amount of effort, well, Mr. Widget, take us home!

I've done a cursory look at the projections, but I haven't done any serious validation. All projections are provided as-is. The good news is, now that I have the code ironed out, all I need is half an hour or so and I can generate Marcels for over 500 players. (Quicker if I could automate the screenscraping – I’ll have to look into that.)

I’m hopeful but not optimistic that I’ll have pitcher projections done similarly by the weekend.

Labels: Projections

## 0 Responses to “Marcels for hitters”

## Links to this post