The Other Fifteen

Another site announcement

0 Comments Published by Colin Wyers on Thursday, August 14, 2008 at 11:45 PM.

There's been quite a bit of radio silence around here recently, and hopefully that'll be corrected soon.

In the meantime, if you're interested in more data on Kosuke Fukudome than you can shake a stick at, well, find a stick and head on over to Goatriders.

Also, I've joined the crew over at Statistically Speaking, along with Brian Cartwright. I'm real excited to be working with Eric, Pizza and Brian. My first post is up, and I should be on a Friday schedule from here on out. If you're interested in things like FIP and BaseRuns, you should definitely take a look.

2008 pitcher Marcels

0 Comments Published by Colin Wyers on Friday, August 1, 2008 at 10:49 PM.

Like the hitters. Witness the power of this fully-operational widget!

I haven't updated my table since the 29th, in case you were wondering. New for the pitcher Marcels is a reliability score; higher is better. None of this is park-adjusted. ERA and FIP-ERA are both provided; I’d be happier with either Component ERA or BaseRuns converted to earned runs, I suspect, but both of those sound like work. (And, in fairness to me, would require a lot more data than I’m capturing now – I suspect I could get more pitching categories if I really, really desperately wanted to, but it would increase the amount of work involved exponentially. Remember – spidering these from Baseball-Reference gives me ID mappings and ages easily, both of which are insanely important to doing the projections.)

Remember: I am not projecting playing time, I am extrapolating playing time.
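As a rough illustration of the reliability score and the playing-time extrapolation, here is a small Python sketch. The two pitchers' innings totals are made up. The 0.999-per-day decay, the 318-inning regression term and the 110 and 52 come from the SQL below; reading 110 and 52 as team games played and team games remaining is an assumption, as are all the function names.

```python
# Sketch only: the innings totals are hypothetical, not real pitchers.
DAILY_DECAY = 0.999        # per-day decay applied to past seasons
REGRESSION_IP = 318        # innings of league average mixed in
TEAM_GAMES_PLAYED = 110    # assumed meaning of the 110 in the SQL
TEAM_GAMES_LEFT = 52       # assumed meaning of the 52 in the SQL


def season_weight(year, current_year=2008):
    """Weight for a season: 1 for the current year, decaying daily before it."""
    return DAILY_DECAY ** ((current_year - year) * 365)


def reliability(ip_by_year):
    """Share of the projection that comes from the pitcher's own record."""
    weighted_ip = sum(season_weight(y) * ip for y, ip in ip_by_year.items())
    return weighted_ip / (weighted_ip + REGRESSION_IP)


def innings_left(ip_2008):
    """Extrapolate the 2008 pace over the remaining games, to the nearest third."""
    return round(ip_2008 / TEAM_GAMES_PLAYED * TEAM_GAMES_LEFT * 3) / 3


starter = {2005: 200.0, 2006: 210.0, 2007: 190.0, 2008: 130.0}
reliever = {2007: 20.0, 2008: 40.0}

print(round(season_weight(2007), 3))    # weight of a season one year back
print(round(reliability(starter), 2))   # long track record, higher score
print(round(reliability(reliever), 2))  # short track record, lower score
print(round(innings_left(130.0), 1))    # extrapolated, not projected
```

The more weighted innings a pitcher has, the closer the score gets to 1 and the less the league-average term matters.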

SQL code:

CREATE TABLE pitching
AS
SELECT ( CASE WHEN p.playerID is not null
        THEN p.playerID
        ELSE p.Player END ) AS playerID
    , 2008 AS yearID
    , 1 AS weight
    , p.Ag AS Age
    , SUM(p.G) AS G
    , SUM(p.GS) AS GS
    , SUM(p.H) AS H
    , SUM(p.ER) AS ER
    , SUM(p.HR) AS HR
    , SUM(p.BB) AS BB
    , SUM(p.SO) AS SO
    , SUM(p.HBP) AS HBP
    , SUM(p.IP) AS IP
    , SUM(p.BFP) AS BFP
from ( SELECT * from 7_29_08_pitching p
    LEFT JOIN ( select Player AS BPlayer, (AB+BB+SH+SF) AS PA from 7_29_08_batting ) b
    ON p.Player = b.BPlayer
    LEFT JOIN ( select bbrefID, playerID from bdb.master ) m
    ON p.Player = m.bbrefID ) p
WHERE ( p.PA < (p.BFP) OR p.PA IS NULL )
    AND p.BFP > 0
GROUP BY Player
UNION ALL
SELECT p.playerID
    , p.yearID
    , POW(0.999,(2008-p.yearID)*365) AS weight
    , (CASE WHEN m.birthMonth < 7 THEN ( p.yearID - m.BirthYear ) ELSE ( p.yearID - m.BirthYear - 1 ) END) AS Age
    , SUM(p.G) AS G
    , SUM(p.GS) AS GS
    , SUM(p.H) AS H
    , SUM(p.ER) AS ER
    , SUM(p.HR) AS HR
    , SUM(p.BB) AS BB
    , SUM(p.SO) AS SO
    , SUM(p.HBP) AS HBP
    , ROUND(SUM(p.IPouts)/3,1) AS IP
    , SUM(p.BFP) AS BFP
FROM
    ( SELECT * from bdb.pitching p
    LEFT JOIN ( select playerID AS bplayerID, yearID AS byearID, (b.AB+b.BB+b.SH+b.SF+b.HBP) AS PA
        from bdb.batting b WHERE b.yearID > 2004 ) b
        ON p.playerID = b.bplayerID AND p.yearID = b.byearID
        WHERE p.yearID > 2004) p, bdb.master m
WHERE p.playerID = m.playerID
    AND ( (p.PA) < (p.BFP) OR p.PA IS NULL )
    AND (p.BFP) > 0
GROUP BY yearID, playerID;

CREATE TABLE average_pitch
AS
SELECT yearID
    , POW(0.999,(2008-p.yearID)*365) AS weight
    , SUM(p.G) AS G
    , SUM(p.GS) AS GS
    , SUM(p.H) AS H
    , SUM(p.ER) AS ER
    , SUM(p.HR) AS HR
    , SUM(p.BB) AS BB
    , SUM(p.SO) AS SO
    , SUM(p.HBP) AS HBP
    , SUM(p.IP) AS IP
    , SUM(p.BFP) AS BFP
FROM pitching p
GROUP BY yearID;

CREATE TABLE pitcher_league_average
AS
SELECT p.playerID
    -- Weight each season's league totals by the decay weight and the player's innings before summing.
    , SUM(a.weight*a.G*p.IP) / SUM(a.weight*p.IP) AS G
    , SUM(a.weight*a.GS*p.IP) / SUM(a.weight*p.IP) AS GS
    , SUM(a.weight*a.H*p.IP) / SUM(a.weight*p.IP) AS H
    , SUM(a.weight*a.ER*p.IP) / SUM(a.weight*p.IP) AS ER
    , SUM(a.weight*a.HR*p.IP) / SUM(a.weight*p.IP) AS HR
    , SUM(a.weight*a.BB*p.IP) / SUM(a.weight*p.IP) AS BB
    , SUM(a.weight*a.SO*p.IP) / SUM(a.weight*p.IP) AS SO
    , SUM(a.weight*a.HBP*p.IP) / SUM(a.weight*p.IP) AS HBP
    , SUM(a.weight*a.BFP*p.IP) / SUM(a.weight*p.IP) AS BFP
    , SUM(a.weight*a.IP*p.IP) / SUM(a.weight*p.IP) AS IP
FROM pitching p, average_pitch a
WHERE p.yearID = a.yearID
GROUP BY playerID;

CREATE TABLE pitcher_league_average_prorated
AS
SELECT playerID
    , ( G / IP * 318 ) AS G
    , ( GS / IP * 318 ) AS GS
    , ( H / IP * 318 ) AS H
    , ( ER / IP * 318 ) AS ER
    , ( HR / IP * 318 ) AS HR
    , ( BB / IP * 318 ) AS BB
    , ( SO / IP * 318 ) AS SO
    , ( HBP / IP * 318 ) AS HBP
    , ( BFP / IP * 318 ) AS BFP
    , 318 AS IP
FROM pitcher_league_average;

CREATE TABLE player_Age_2008_pitching
AS
SELECT playerID
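    , MAX(yearID) AS yearID
    -- Age + (2008 - yearID) is the player's 2008 age on every row, so the result no longer depends on which row MySQL picks.
    , MAX(Age + (2008 - yearID)) AS Age_2008
    , ( CASE
        WHEN MAX(Age + (2008 - yearID)) > 29
        THEN 1 + ( 29 - MAX(Age + (2008 - yearID)) )*0.003
        ELSE 1 + ( 29 - MAX(Age + (2008 - yearID)) )*0.006 END ) AS Curve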
FROM pitching
GROUP BY playerID;

CREATE TABLE player_ip_2008
AS
SELECT playerID, G, IP, (IP/110) AS IP_G, ROUND((IP/110)*52*3)/3 AS IP_LEFT from pitching WHERE yearID = 2008;

CREATE TABLE pitching_marcels_2008
AS
SELECT p.playerID
    -- Weighted totals are sums of weight*stat per season, not a sum of weights times a sum of stats.
    , ROUND(( SUM(p.weight*p.G) + w.G ) / ( SUM(p.weight*p.IP) + 318 ) * i.IP_LEFT ) AS G
    , ROUND(( SUM(p.weight*p.GS) + w.GS ) / ( SUM(p.weight*p.IP) + 318 ) * i.IP_LEFT ) AS GS
    , ROUND(( SUM(p.weight*p.H) + w.H ) / ( SUM(p.weight*p.IP) + 318 ) * i.IP_LEFT / c.Curve) AS H
    , ROUND(( SUM(p.weight*p.ER) + w.ER ) / ( SUM(p.weight*p.IP) + 318 ) * i.IP_LEFT / c.Curve) AS ER
    , ROUND(( SUM(p.weight*p.HR) + w.HR ) / ( SUM(p.weight*p.IP) + 318 ) * i.IP_LEFT / c.Curve) AS HR
    , ROUND(( SUM(p.weight*p.BB) + w.BB ) / ( SUM(p.weight*p.IP) + 318 ) * i.IP_LEFT / c.Curve) AS BB
    , ROUND(( SUM(p.weight*p.SO) + w.SO ) / ( SUM(p.weight*p.IP) + 318 ) * i.IP_LEFT * c.Curve) AS SO
    , ROUND(( SUM(p.weight*p.HBP) + w.HBP ) / ( SUM(p.weight*p.IP) + 318 ) * i.IP_LEFT / c.Curve) AS HBP
    , ROUND(( SUM(p.weight*p.BFP) + w.BFP ) / ( SUM(p.weight*p.IP) + 318 ) * i.IP_LEFT ) AS BFP
    , i.IP_LEFT AS IP
    , SUM(p.weight*p.IP) / ( SUM(p.weight*p.IP) + 318 ) AS R
FROM pitching p, pitcher_league_average_prorated w, player_Age_2008_pitching c, player_ip_2008 i
WHERE p.playerID = w.playerID
    AND p.playerID = i.playerID
    AND p.playerID = c.playerID
GROUP BY playerID;

I’m less confident that I’m implementing Marcels correctly with these, simply because Tango has published less about them. But, outside of any errors I’ve made, the only thing left to do, I think, would be to implement real playing time projections. (I also have to backport the reliability calculation to the hitter projections, but that’s easy enough.) Past that… well, they wouldn’t be Marcels anymore. (Okay, so I should be projecting pitching stats by league as well.)

Later this weekend my intent is to take a couple of players and actually show you the calculations going on behind this code, so that anyone who’s unsure about exactly what’s going on here gets to see the guts of the system.

Labels: Projections

Marcels for hitters

0 Comments Published by Colin Wyers on Thursday, July 31, 2008 at 12:35 AM.

About the simplest forecasting system available, or at least the simplest worthy of the term “projection system,” is Marcel. As simple as the Marcel projections are, they match up very well with the results of the more complex forecasters.

Sal Baxamusa over at The Hardball Times has kindly provided us with Excel spreadsheets to calculate a player’s in-season Marcel. But I wanted the ability to bulk produce forecasts of players for the remainder of the season. So I set about reimplementing Sal’s spreadsheet in SQL.

Data prior to this season was taken from the Baseball Databank. Data from this season was screenscraped from the Baseball Reference league pages. The convenient thing about that is, Baseball Reference player IDs are mapped in the BDB to the BDB IDs.

If you have the Baseball Databank in MySQL, and can handle scraping the data from BBRef yourself, then you can generate your own Marcels like so:

CREATE TABLE batting_pos
AS
SELECT ( CASE WHEN b.playerID is not null
        THEN b.playerID
        ELSE b.Player END ) AS playerID
    , 2008 AS yearID
    , 1 AS weight
    , b.Ag AS Age
    , SUM(b.G) AS G
    , SUM(b.H) AS H
    , SUM(b.2B) AS 2B
    , SUM(b.3B) AS 3B
    , SUM(b.HR) AS HR
    , SUM(b.BB) AS BB
    , SUM(b.SO) AS SO
    , SUM(b.IBB) AS IBB
    , 0 AS HBP
    , SUM(b.SB) AS SB
    , SUM(b.CS) AS CS
    , SUM(b.AB+b.BB+b.SH+b.SF) AS PA
from ( SELECT * from 7_29_08_batting b
    LEFT JOIN ( select Player AS PPlayer, BFP from 7_29_08_pitching ) p
    ON b.Player = p.PPlayer
    LEFT JOIN ( select bbrefID, playerID from bdb.master ) m
    ON b.Player = m.bbrefID ) b
WHERE ( (b.AB+b.BB+b.SH+b.SF) > (b.BFP) OR b.BFP IS NULL )
    AND (b.AB+b.BB+b.SH+b.SF) > 0
GROUP BY Player
UNION ALL
SELECT b.playerID
    , b.yearID
    , ( 5.62 * EXP( -0.00066 * 365 * ( 2008 - b.yearID) ) ) / 5.62 AS weight
    , (CASE WHEN m.birthMonth < 7 THEN ( b.yearID - m.BirthYear ) ELSE ( b.yearID - m.BirthYear - 1 ) END) AS Age
    , SUM(b.G) AS G
    , SUM(b.H) AS H
    , SUM(b.2B) AS 2B
    , SUM(b.3B) AS 3B
    , SUM(b.HR) AS HR
    , SUM(b.BB) AS BB
    , SUM(b.SO) AS SO
    , SUM(b.IBB) AS IBB
    , SUM(b.HBP) AS HBP
    , SUM(b.SB) AS SB
    , SUM(b.CS) AS CS
    , SUM(b.PA) AS PA
FROM
    ( SELECT * from bdb.batting b
    LEFT JOIN ( select playerID AS pplayerID, yearID AS pyearID, BFP
        from bdb.pitching p WHERE p.yearID > 2004 ) p
        ON b.playerID = p.pplayerID AND b.yearID = p.pyearID
        WHERE b.yearID > 2004) b, bdb.master m
WHERE b.playerID = m.playerID
    AND ( (b.PA) > (b.BFP) OR b.BFP IS NULL )
    AND (b.PA) > 0
GROUP BY yearID, playerID;

CREATE TABLE average_pos
AS
SELECT yearID
    , ( 5.62 * EXP( -0.00066 * 365 * ( 2008 - yearID) ) ) / 5.62 AS weight
    , SUM(G) AS G
    , SUM(H) AS H
    , SUM(2B) AS 2B
    , SUM(3B) AS 3B
    , SUM(HR) AS HR
    , SUM(BB) AS BB
    , SUM(SO) AS SO
    , SUM(IBB) AS IBB
    , SUM(HBP) AS HBP
    , SUM(SB) AS SB
    , SUM(CS) AS CS
    , SUM(PA) AS PA
FROM batting_pos bp
GROUP BY yearID;

CREATE TABLE player_league_average
AS
SELECT b.playerID
    -- Weight each season's league totals by the decay weight and the player's plate appearances before summing.
    , SUM(a.weight*a.H*b.PA) / SUM(a.weight*b.PA) AS H
    , SUM(a.weight*a.2B*b.PA) / SUM(a.weight*b.PA) AS 2B
    , SUM(a.weight*a.3B*b.PA) / SUM(a.weight*b.PA) AS 3B
    , SUM(a.weight*a.HR*b.PA) / SUM(a.weight*b.PA) AS HR
    , SUM(a.weight*a.BB*b.PA) / SUM(a.weight*b.PA) AS BB
    , SUM(a.weight*a.SO*b.PA) / SUM(a.weight*b.PA) AS SO
    , SUM(a.weight*a.IBB*b.PA) / SUM(a.weight*b.PA) AS IBB
    , SUM(a.weight*a.HBP*b.PA) / SUM(a.weight*b.PA) AS HBP
    , SUM(a.weight*a.SB*b.PA) / SUM(a.weight*b.PA) AS SB
    , SUM(a.weight*a.CS*b.PA) / SUM(a.weight*b.PA) AS CS
    , SUM(a.weight*a.PA*b.PA) / SUM(a.weight*b.PA) AS PA
FROM batting_pos b, average_pos a
WHERE b.yearID = a.yearID
GROUP BY playerID;

CREATE TABLE player_league_average_prorated
AS
SELECT playerID
    , ( H / PA * 214 ) AS H
    , ( 2B / PA * 214 ) AS 2B
    , ( 3B / PA * 214 ) AS 3B
    , ( HR / PA * 214 ) AS HR
    , ( BB / PA * 214 ) AS BB
    , ( SO / PA * 214 ) AS SO
    , ( IBB / PA * 214 ) AS IBB
    , ( HBP / PA * 214 ) AS HBP
    , ( SB / PA * 214 ) AS SB
    , ( CS / PA * 214 ) AS CS
    , 214 AS PA
FROM player_league_average;

CREATE TABLE player_Age_2008
AS
SELECT playerID
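    , MAX(yearID) AS yearID
    -- Age + (2008 - yearID) is the player's 2008 age on every row, so the result no longer depends on which row MySQL picks.
    , MAX(Age + (2008 - yearID)) AS Age_2008
    , ( CASE
        WHEN MAX(Age + (2008 - yearID)) > 29
        THEN 1 + ( 29 - MAX(Age + (2008 - yearID)) )*0.003
        ELSE 1 + ( 29 - MAX(Age + (2008 - yearID)) )*0.006 END ) AS Curve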
FROM batting_pos
GROUP BY playerID;

CREATE TABLE player_pa_2008
AS
SELECT playerID, G, PA, (PA/G) AS PA_G, ROUND((PA/G)*52) AS PA_LEFT from batting_pos WHERE yearID = 2008;

CREATE TABLE hitter_marcels_2008
AS
SELECT b.playerID
    -- Weighted totals are sums of weight*stat per season, not a sum of weights times a sum of stats.
    , ROUND(( SUM(a.weight*b.H) + w.H ) / ( SUM(a.weight*b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS H
    , ROUND(( SUM(a.weight*b.2B) + w.2B ) / ( SUM(a.weight*b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS 2B
    , ROUND(( SUM(a.weight*b.3B) + w.3B ) / ( SUM(a.weight*b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS 3B
    , ROUND(( SUM(a.weight*b.HR) + w.HR ) / ( SUM(a.weight*b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS HR
    , ROUND(( SUM(a.weight*b.BB) + w.BB ) / ( SUM(a.weight*b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS BB
    , ROUND(( SUM(a.weight*b.SO) + w.SO ) / ( SUM(a.weight*b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS SO
    , ROUND(( SUM(a.weight*b.IBB) + w.IBB ) / ( SUM(a.weight*b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS IBB
    , ROUND(( SUM(a.weight*b.HBP) + w.HBP ) / ( SUM(a.weight*b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS HBP
    , ROUND(( SUM(a.weight*b.SB) + w.SB ) / ( SUM(a.weight*b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS SB
    , ROUND(( SUM(a.weight*b.CS) + w.CS ) / ( SUM(a.weight*b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS CS
    , p.PA_LEFT AS PA
FROM batting_pos b, average_pos a, player_league_average_prorated w, player_Age_2008 c, player_pa_2008 p
WHERE b.yearID = a.yearID
    AND b.playerID = w.playerID
    AND b.playerID = p.playerID
    AND b.playerID = c.playerID
GROUP BY playerID;

I hate publishing code because most of what I write would get me hunted down and burned at the stake in any CS department in the country, but there it is. If you don’t grok SQL, here’s what’s going on here:

1. First we combine the data from the Databank with the data from this year, excluding anyone who has had fewer plate appearances than batters faced - in other words, pitchers. We also compute a weight for each year: more recent seasons count for more in the projection, and the weight is what governs that.
2. We calculate the league average for non-pitchers hitting in those seasons.
3. Each player is then given a weighted average of the league from those four seasons, prorated out to 214 plate appearances. That’s our regression to the mean component.
4. We figure out an aging curve for each player.
5. We guesstimate how many plate appearances a player will receive the rest of 2008. If you think you know better than my guesstimate, you’re probably right. It’s the rates that I’m projecting.
6. This is the part that actually does a projection: it takes a weighted average of the past four seasons, mixes in the regression component from step three, and prorates it out to our guesstimated playing time.
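For anyone who would rather see numbers than SQL, the steps above can be sketched for a single stat in Python. The hitter's seasons, his age, his games played and the league home run rate are all invented for illustration; the decay constant, the 214 plate appearances, the aging constants and the 52 come from the SQL. Treating 52 as the number of games left is an assumption, and the function names are hypothetical.

```python
import math

# Sketch only: the player's lines and the league rate are made up.
REGRESSION_PA = 214


def season_weight(year, current_year=2008):
    """Decay weight from the SQL: 1 for 2008, smaller for older seasons."""
    return math.exp(-0.00066 * 365 * (current_year - year))


def age_curve(age_2008):
    """Aging multiplier: 1 at age 29, 0.6% per year below it, 0.3% above it."""
    if age_2008 > 29:
        return 1 + (29 - age_2008) * 0.003
    return 1 + (29 - age_2008) * 0.006


def project_hr(seasons, league_hr_rate, age_2008, pa_left):
    """seasons: list of (year, PA, HR). Returns projected HR over pa_left."""
    weighted_hr = sum(season_weight(y) * hr for y, pa, hr in seasons)
    weighted_pa = sum(season_weight(y) * pa for y, pa, hr in seasons)
    # Regression to the mean: 214 PA of league-average production.
    rate = (weighted_hr + league_hr_rate * REGRESSION_PA) / (weighted_pa + REGRESSION_PA)
    return rate * pa_left * age_curve(age_2008)


seasons = [(2005, 600, 30), (2006, 620, 28), (2007, 580, 25), (2008, 420, 15)]
games_2008 = 100
pa_left = round(420 / games_2008 * 52)   # 2008 pace over an assumed 52 games left
projected = project_hr(seasons, league_hr_rate=0.027, age_2008=31, pa_left=pa_left)
print(pa_left, round(projected, 1))
```

The projected rate lands between the hitter's own weighted rate and the league rate, which is all the regression component is meant to do.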

If you don’t want to put in that amount of effort, well, Mr. Widget, take us home!

I've done a cursory look at the projections, but I haven't done any serious validation. All projections are provided as-is. The good news is, now that I have the code ironed out, all I need is half an hour or so and I can generate Marcels for over 500 players. (Quicker if I could automate the screenscraping – I’ll have to look into that.)

I’m hopeful but not optimistic that I’ll have pitcher projections done similarly by the weekend.

Labels: Projections

Jeff Samardzija in Pitch F/X

4 Comments Published by Colin Wyers on Friday, July 25, 2008 at 11:40 PM.

If you want actual, well, good analysis, go over to Harry’s and take a look. He’s been doing this pitch ID stuff a lot longer than I have.

But I think I was able to duplicate one of the graphs from Harry’s page, or at least come close.

I used Mat Kovach’s parser to download data from MLB’s servers. (It seems to work fine for me, but it’s “pre-alpha” and not documented as of yet, so caveat emptor. Also, I Am Not A Programmer, so all code samples that follow are to be taken with more than a hint of salt.)

Then, in MySQL, I ran the following query against the data:

SELECT a.*, p.*
FROM gameday_atbat a, gameday_pitch p
WHERE a.gameid = p.gameid
AND a.num = p.atbat_num
AND a.pitcher = 502188;

Not the prettiest SQL I’ve ever written, and it returns more data than I need, but that’s fine. Then I export the data to a CSV file. There’s one pitchout in the dataset, which I remove.

Well, now what? I use GNU R, personally, for all my graphing and K-means clustering needs. Code:

library(Rcmdr)  # KMeans() comes from Rcmdr (RcmdrMisc in later releases)
# Load the exported query results.
Samardzija <- read.table("C:/Retrosheet/saved queries/pitchfx/Samardzija first start.csv", header=TRUE, sep=",")
# Cluster on horizontal and vertical movement, looking for three pitch types.
cl <- KMeans(model.matrix(~-1 + pfx_x + pfx_z, Samardzija), centers = 3, iter.max = 10, num.seeds = 10)
plot(Samardzija$pfx_z~Samardzija$pfx_x, col=cl$cluster, xlim=c(-20,20), ylim=c(-20,20))

Which produces the following graph:

In fairness to Harry, I cheated: in the KMeans call, I tell the clustering algorithm how many “centers” to look for, which in this case means how many pitch types I want it to find. I told it three. Why? Because that’s what Harry’s graph shows. I don’t really know how to determine the “right” number of centers as of yet.
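To show what the clustering step is doing with that number of centers, here is a bare-bones k-means in Python. It is a sketch under several assumptions: the movement values are invented rather than taken from Samardzija's start, the starting centers are chosen by a simple farthest-point rule instead of the random seeds KMeans uses, and printing the within-cluster sum of squares for several values of k is just one common heuristic for picking a count, not something taken from Harry's work.

```python
import math

# Sketch only: these movement values are invented, not Samardzija's pitches.
pitches = [
    (-6.0, 9.5), (-5.5, 10.2), (-6.4, 8.9), (-5.8, 9.8),   # fastball-like
    (3.0, 1.0), (2.6, 0.4), (3.4, 1.5), (2.9, 0.8),        # slider-like
    (-4.0, 3.0), (-3.6, 2.5), (-4.3, 3.4), (-3.9, 2.8),    # splitter-like
]


def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm with a deterministic farthest-point start."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old center if a cluster empties out
                centers[i] = (sum(x for x, _ in members) / len(members),
                              sum(z for _, z in members) / len(members))
    labels = [nearest(p, centers) for p in points]
    return labels, centers


def within_ss(points, labels, centers):
    """Total squared distance from each point to its assigned center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


for k in range(1, 6):
    labels, centers = kmeans(pitches, k)
    print(k, round(within_ss(pitches, labels, centers), 1))
```

On data this tidy the printed total drops sharply up to three centers and barely moves after that; real Pitch F/X data is rarely so cooperative, so treat the bend as a hint rather than an answer.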

Even so, I have one pitch that differs from his – I think he changed that ID manually, but I’m not sure. I can tell you that one cluster is green and one is black, but as far as calling one a splitter and one a slider, that’s something I have to work on.

(That graph, by the way, is ugly, and I know it’s ugly. I know I can make it look better, but in this case it’s a question of how much time I really want to invest in prettying up Pitch F/X graphs before I figure out what it is I’m actually doing with them. It’s called premature optimization.)

Labels: Enhanced Gameday, Jeff Samardzija, Pitch F/X, Pitching

Projecting RZR

0 Comments Published by Colin Wyers on Thursday, July 24, 2008 at 11:21 PM.

There are two breeds of vanilla, free-as-in-beer zone rating available in the world: STATS and BIS. I already have a dumb projection system for STATS ZR, which could be refined (aging curves and speed/tools scores are the two major refinements I’m musing over.)

But first I wanted to introduce BIS’s RZR into it. And therein lies a dilemma, folks. Here are the averages for RZR and OZR (OOZ divided by BIZ) over the years available at The Hardball Times:

POS	YEAR	Plays	OOZ	BIZ	RZR	OZR
1B	2004	4070	1783	5406	.753	.330
1B	2005	4343	1940	5493	.791	.353
1B	2006	3877	2012	4851	.799	.415
1B	2007	4963	1048	6695	.741	.157
1B	2008	2871	847	3815	.753	.222
1B	Total	20124	7630	26260	.766	.291
2B	2004	9863	1203	12129	.813	.099
2B	2005	10403	1478	12825	.811	.115
2B	2006	10401	1211	12679	.820	.096
2B	2007	10120	1412	12192	.830	.116
2B	2008	6313	649	7693	.821	.084
2B	Total	47100	5953	57518	.819	.103
SS	2004	9872	1919	11995	.823	.160
SS	2005	10484	1948	12821	.818	.152
SS	2006	10809	1659	13218	.818	.126
SS	2007	10625	1912	13019	.816	.147
SS	2008	6353	999	7627	.833	.131
SS	Total	48143	8437	58680	.820	.144
3B	2004	6215	2074	9007	.690	.230
3B	2005	6813	2396	9271	.735	.258
3B	2006	7686	1636	10880	.706	.150
3B	2007	7221	1717	10623	.680	.162
3B	2008	4444	1003	6344	.701	.158
3B	Total	32379	8826	46125	.702	.191
CF	2004	9478	2034	11905	.796	.171
CF	2005	10266	1963	12590	.815	.156
CF	2006	10316	2002	11534	.894	.174
CF	2007	10886	1944	12264	.888	.159
CF	2008	5922	1583	6468	.916	.245
CF	Total	46868	9526	54761	.856	.174
LF	2004	7710	847	12242	.630	.069
LF	2005	8686	718	13712	.633	.052
LF	2006	7723	1634	8971	.861	.182
LF	2007	8014	1614	9373	.855	.172
LF	2008	4475	1076	5060	.884	.213
LF	Total	36608	5889	49358	.742	.119
RF	2004	8736	781	13442	.650	.058
RF	2005	9181	695	14161	.648	.049
RF	2006	8376	1686	9436	.888	.179
RF	2007	8418	1575	9597	.877	.164
RF	2008	4802	1205	5321	.902	.226
RF	Total	39513	5942	51957	.760	.114

(2008 numbers will be slightly different from Studes’ numbers, as these are a few days old.) The projections for infielders are doable. But, as it stands, those outfield numbers are a horror show, taken by themselves.

So before we can make projections based upon RZR data, we first need to normalize it. I’m sure there are better ways than the one I’m using, but I don’t think I’m using the worst way either and it’s very expedient for my needs.

What I’m doing is dividing Plays, OOZ and BIZ by the totals for that season, and then multiplying by the averaged totals of all five years.
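A quick sketch of that normalization in Python, using the left field totals from the table above. The two player lines are hypothetical, each built to be league average in his own season, and applying the scaling position by position is an assumption about the method. OOZ would be handled the same way and is left out to keep this short.

```python
# Season totals for left field from the table above: (Plays, BIZ).
LF_TOTALS = {
    2004: (7710, 12242),
    2005: (8686, 13712),
    2006: (7723, 8971),
    2007: (8014, 9373),
    2008: (4475, 5060),
}

# "Averaged totals of all five years": the mean season at the position.
avg_plays = sum(plays for plays, _ in LF_TOTALS.values()) / len(LF_TOTALS)
avg_biz = sum(biz for _, biz in LF_TOTALS.values()) / len(LF_TOTALS)


def normalize(year, plays, biz):
    """Rescale one player-season to the five-year average environment."""
    season_plays, season_biz = LF_TOTALS[year]
    return plays / season_plays * avg_plays, biz / season_biz * avg_biz


# Hypothetical left fielders, each league average in his own season.
old = normalize(2004, plays=126, biz=200)   # raw RZR .630
new = normalize(2007, plays=171, biz=200)   # raw RZR .855
print(round(old[0] / old[1], 3), round(new[0] / new[1], 3))
```

Both come out at roughly the five-year left field RZR of .742, so a league-average fielder looks the same whichever season he played in.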

And, since I was rather short with the explanation the last time out, I’ll go ahead and spell out what I’m doing in full:

1. First, as above, every player’s performance is “normalized” to an average of the past five seasons.
2. Then, a weighted average of their past four seasons (05-08) is taken, with the most recent season being given a weight of 5, then 4, then 3, then 2.
3. Two weights’ worth of a full season of league-average defensive performance at the position is added as a regression to the mean component.
4. 5 + 4 + 3 + 2 + 2 = 16, so everything gets divided by 16. I wouldn’t exactly call it a playing time projection, but it’s a rough guide to how much playing time a player might be expected to receive.
5. Plays and Runs above average are figured for a full season’s performance, given the number of chances of the average player at that position from 04 through 08.
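The weighting and regression steps above work out like this in a short Python sketch. The shortstop's seasons and the league-average season are invented, and treating a missing season as zero plays in zero chances is an assumption about how gaps are handled. Runs above average are left out.

```python
# Sketch only: the player's seasons and the league-average season are made up.
WEIGHTS = {2008: 5, 2007: 4, 2006: 3, 2005: 2}
REGRESSION_WEIGHT = 2


def project(seasons, league_season):
    """seasons: {year: (plays, biz)}, already normalized.
    league_season: (plays, biz) for a full season of average play at the position.
    Returns projected (plays, biz, rzr)."""
    total_weight = sum(WEIGHTS.values()) + REGRESSION_WEIGHT   # 5+4+3+2+2 = 16
    plays = REGRESSION_WEIGHT * league_season[0]
    biz = REGRESSION_WEIGHT * league_season[1]
    for year, weight in WEIGHTS.items():
        season_plays, season_biz = seasons.get(year, (0, 0))  # missing year adds nothing
        plays += weight * season_plays
        biz += weight * season_biz
    return plays / total_weight, biz / total_weight, plays / biz


shortstop = {2005: (340, 400), 2006: (350, 410), 2007: (345, 405), 2008: (210, 245)}
league = (328, 400)   # hypothetical league-average shortstop season, RZR .820
plays, biz, rzr = project(shortstop, league)
print(round(plays, 1), round(biz, 1), round(rzr, 3))
```

The projected RZR falls between the league's .820 and the player's unweighted .853, pulled toward the mean by the two weights of league average.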

And… here are the projections. You can compare them to the STATS ZR projections, if you’d like.

(Note: Currently only players with a Baseball Databank ID who have appeared in 2008 are included in either projection set. The next step is to take the rest of the players in the RZR set, map them to the appropriate STATS ID, and run both projections side by side for all players who played in 2008, and maybe some who haven’t yet but could.)

So what’s next? Like I said before, these could really benefit from aging curves. (While I’m on the topic, Jon Shepherd over at Camden Depot has published RZR aging curves which are worth taking a look at. I have my own ZR aging curves which I should really try and get straightened out.) I really should probably run “projections” for seasons past and see how they match up with what actually happened.

And I want to work on combining data from multiple positions; I’ve done some comparisons of players who have played multiple positions, and my feeling from looking at the data is that in projecting a player’s zone rating, there really isn’t a lot of difference in difficulty in playing the different outfield positions – it’s not really much harder to catch fly balls in center field than it is anywhere else, but there’s a lot more fly balls to catch and so a good fielder is worth a lot more. But that’s worth exploring more, and there are some noteworthy sampling issues in that data; I find it hard to believe that a center fielder is below average as a first baseman defensively, for example. I should rerun this query on the RZR dataset here soon, see what that looks like.

Labels: Defense, Projections

Projecting zone rating

3 Comments Published by Colin Wyers on Thursday, July 17, 2008 at 11:53 PM.

So, you want to talk about a player’s defense?

Remember: a good sabermetrician is like a good hunter when cleaning his kill: he throws away as little as possible, taking care to use most of the animal. We have decades of information about players; why should we ever use only three and a half months worth of data in evaluating a player?

My process is based heavily off of Tango’s Marcels forecasting system; that said, he had nothing to do with this, and screwups in it are mine, not his. (For background on how a projection system works, here’s a decent writeup, if I do say so myself.)

Before going any further, I should note that I made this in about two hours. And I also made dinner in those two hours. And I had a side dish. So don’t expect anything on the order of PECOTA as far as complexity goes.

Here’s how it works. Every player’s zone rating data from 2005-2008 (yep, everything pre-All Star break from this year) is thrown into a mixer and weighted. I used a 5/4/3/2 weighting; I have no empirical basis for these weights other than it’s what Marcel uses. Then throw in two seasons’ worth of the league average for the position. There’s your regression to the mean.

Aging curves are… forthcoming. Maybe. I’m still hashing out the details. (I’ve started work on zone rating based aging curves for fielders, but there are questions about how accurate they are, and before they can be used in a projection system they need to be smoothed out a bit more.)

So, data. Plays and runs above or below average are figured using the Dial method. For that, each player is assumed to have a full season’s worth of chances at the position, not the number of chances used to compute zone rating.

The next step beyond aging curves would probably be to incorporate at least some measure of speed scores into the projection. But I was hungry, and so instead you have the best projection system I could make in two hours, while still making dinner. It’s a start, at least.

(Also, lemme take this chance to plug my hitter and pitcher evaluations on GROTA, if you have an interest in such things regarding the Cubs. Hitter and pitcher projections are next on my plate.)

Labels: Defense, Projections

A little experiment

0 Comments Published by Colin Wyers on Friday, July 11, 2008 at 10:23 PM.

I'm going to try a little experiment here. I wish my problem was writer's block - instead, there's just so many ideas that I've got running around in my head that I'm having a hard time prioritizing them. And with the All-Star Break coming up, there'll be time to catch up on some things.

So, if you'd like, you can help me figure out what I'll be writing about during the All-Star Break, with this handy widget here:

Skribit: Social Suggestions

Isn't it exciting? (It’s also available in the right-hand sidebar.) Here's a link to my Skribit page, if for some reason the widget doesn't do it for you. I may start seeding it with some topics of my own, but for right now I’m simply interested in seeing what, if anything, y’all are interested in. I'd like, obviously, as much participation as possible; that said, questions like "Why do you hate Christmas, Santa Claus and Ryan Theriot?" aren't likely to be answered immediately. In fact, non-Ryan Theriot suggestions are probably your best bet.
