The Other Fifteen

Eighty-five percent of the f---in' world is working. The other fifteen come out here.


Another site announcement

There's been quite a bit of radio silence around here recently, and hopefully that'll be corrected soon.

In the meantime, if you're interested in more data on Kosuke Fukudome than you can shake a stick at, well, find a stick and head on over to Goatriders.

Also, I've joined the crew over at Statistically Speaking, along with Brian Cartwright. I'm real excited to be working with Eric, Pizza and Brian. My first post is up, and I should be on a Friday schedule from here on out. If you're interested in things like FIP and BaseRuns, you should definitely take a look.

2008 pitcher Marcels

Like the hitters. Witness the power of this fully-operational widget!

I haven't updated my table since the 29th, in case you were wondering. New for the pitcher Marcels is a reliability score; higher is better. None of this is park-adjusted. ERA and FIP-ERA are both provided; I’d be happier with either Component ERA or BaseRuns converted to earned runs, I suspect, but both of those sound like work. (And, in fairness to me, would require a lot more data than I’m capturing now – I suspect I could get more pitching categories if I really, really desperately wanted to, but it would increase the amount of work involved exponentially. Remember – spidering these from Baseball-Reference gives me ID mappings and ages easily, both of which are insanely important to doing the projections.)

Remember: I am not projecting playing time, I am extrapolating playing time.

SQL code:

CREATE TABLE pitching
AS
SELECT ( CASE WHEN p.playerID is not null
        THEN p.playerID
        ELSE p.Player END ) AS playerID
    , 2008 AS yearID
    , 1 AS weight
    , p.Ag AS Age
    , SUM(p.G) AS G
    , SUM(p.GS) AS GS
    , SUM(p.H) AS H
    , SUM(p.ER) AS ER
    , SUM(p.HR) AS HR
    , SUM(p.BB) AS BB
    , SUM(p.SO) AS SO
    , SUM(p.HBP) AS HBP
    , SUM(p.IP) AS IP
    , SUM(p.BFP) AS BFP
from ( SELECT * from 7_29_08_pitching p
    LEFT JOIN ( select Player AS BPlayer, (AB+BB+SH+SF) AS PA from 7_29_08_batting ) b
    ON p.Player = b.BPlayer
    LEFT JOIN ( select bbrefID, playerID from bdb.master ) m
    ON p.Player = m.bbrefID ) p
WHERE ( p.PA < (p.BFP) OR p.PA IS NULL )
    AND p.BFP > 0
GROUP BY Player
UNION ALL
SELECT p.playerID
    , p.yearID
    , POW(0.999,(2008-p.yearID)*365) AS weight
    , (CASE WHEN m.birthMonth < 7 THEN ( p.yearID - m.BirthYear ) ELSE ( p.yearID - m.BirthYear - 1 ) END) AS Age
    , SUM(p.G) AS G
    , SUM(p.GS) AS GS
    , SUM(p.H) AS H
    , SUM(p.ER) AS ER
    , SUM(p.HR) AS HR
    , SUM(p.BB) AS BB
    , SUM(p.SO) AS SO
    , SUM(p.HBP) AS HBP
    , ROUND(SUM(p.IPouts)/3,1) AS IP
    , SUM(p.BFP) AS BFP
FROM
    ( SELECT * from bdb.pitching p
    LEFT JOIN ( select playerID AS bplayerID, yearID AS byearID, (b.AB+b.BB+b.SH+b.SF+b.HBP) AS PA
        from bdb.batting b WHERE b.yearID > 2004 ) b
        ON p.playerID = b.bplayerID AND p.yearID = b.byearID
        WHERE p.yearID > 2004) p, bdb.master m
WHERE p.playerID = m.playerID
    AND ( (p.PA) < (p.BFP) OR p.PA IS NULL )
    AND (p.BFP) > 0
GROUP BY yearID, playerID;

CREATE TABLE average_pitch
AS
SELECT yearID
    , POW(0.999,(2008-p.yearID)*365) AS weight
    , SUM(p.G) AS G
    , SUM(p.GS) AS GS
    , SUM(p.H) AS H
    , SUM(p.ER) AS ER
    , SUM(p.HR) AS HR
    , SUM(p.BB) AS BB
    , SUM(p.SO) AS SO
    , SUM(p.HBP) AS HBP
    , SUM(p.IP) AS IP
    , SUM(p.BFP) AS BFP
FROM pitching p
GROUP BY yearID;

CREATE TABLE pitcher_league_average
AS
SELECT p.playerID
    , (SUM(a.weight)*SUM(a.G)*SUM(p.IP)) / (SUM(a.weight)*SUM(p.IP)) AS G
    , (SUM(a.weight)*SUM(a.GS)*SUM(p.IP)) / (SUM(a.weight)*SUM(p.IP)) AS GS
    , (SUM(a.weight)*SUM(a.H)*SUM(p.IP)) / (SUM(a.weight)*SUM(p.IP)) AS H
    , (SUM(a.weight)*SUM(a.ER)*SUM(p.IP)) / (SUM(a.weight)*SUM(p.IP)) AS ER
    , (SUM(a.weight)*SUM(a.HR)*SUM(p.IP)) / (SUM(a.weight)*SUM(p.IP)) AS HR
    , (SUM(a.weight)*SUM(a.BB)*SUM(p.IP)) / (SUM(a.weight)*SUM(p.IP)) AS BB
    , (SUM(a.weight)*SUM(a.SO)*SUM(p.IP)) / (SUM(a.weight)*SUM(p.IP)) AS SO
    , (SUM(a.weight)*SUM(a.HBP)*SUM(p.IP)) / (SUM(a.weight)*SUM(p.IP)) AS HBP
    , (SUM(a.weight)*SUM(a.BFP)*SUM(p.IP)) / (SUM(a.weight)*SUM(p.IP)) AS BFP
    , (SUM(a.weight)*SUM(a.IP)*SUM(p.IP)) / (SUM(a.weight)*SUM(p.IP)) AS IP
FROM pitching p, average_pitch a
WHERE p.yearID = a.yearID
GROUP BY playerID;

CREATE TABLE pitcher_league_average_prorated
AS
SELECT playerID
    , ( G / IP * 318 ) AS G
    , ( GS / IP * 318 ) AS GS
    , ( H / IP * 318 ) AS H
    , ( ER / IP * 318 ) AS ER
    , ( HR / IP * 318 ) AS HR
    , ( BB / IP * 318 ) AS BB
    , ( SO / IP * 318 ) AS SO
    , ( HBP / IP * 318 ) AS HBP
    , ( BFP / IP * 318 ) AS BFP
    , 318 AS IP
FROM pitcher_league_average;

CREATE TABLE player_Age_2008_pitching
AS
SELECT playerID
    , yearID
    , MAX(Age)
    , ( CASE
        WHEN ( MAX(Age)+(2008-yearID) ) > 29
        THEN 1 + ( 29 - ( MAX(Age)+(2008-yearID) ) )*0.003
        ELSE 1 + ( 29 - ( MAX(Age)+(2008-yearID) ) )*0.006 END ) AS Curve
FROM pitching
GROUP BY playerID;

CREATE TABLE player_ip_2008
AS
SELECT playerID, G, IP, (IP/110) AS IP_G, ROUND((IP/110)*52*3)/3 AS IP_LEFT from pitching WHERE yearID = 2008;

CREATE TABLE pitching_marcels_2008
AS
SELECT p.playerID
    , ROUND(( SUM(p.weight)*SUM(p.G) + w.G ) / ( SUM(p.weight)*SUM(p.IP) + 318 ) * i.IP_LEFT ) AS G
    , ROUND(( SUM(p.weight)*SUM(p.GS) + w.GS ) / ( SUM(p.weight)*SUM(p.IP) + 318 ) * i.IP_LEFT ) AS GS
    , ROUND(( SUM(p.weight)*SUM(p.H) + w.H ) / ( SUM(p.weight)*SUM(p.IP) + 318 ) * i.IP_LEFT / c.Curve) AS H
    , ROUND(( SUM(p.weight)*SUM(p.ER) + w.ER ) / ( SUM(p.weight)*SUM(p.IP) + 318 ) * i.IP_LEFT / c.Curve) AS ER
    , ROUND(( SUM(p.weight)*SUM(p.HR) + w.HR ) / ( SUM(p.weight)*SUM(p.IP) + 318 ) * i.IP_LEFT / c.Curve) AS HR
    , ROUND(( SUM(p.weight)*SUM(p.BB) + w.BB ) / ( SUM(p.weight)*SUM(p.IP) + 318 ) * i.IP_LEFT / c.Curve) AS BB
    , ROUND(( SUM(p.weight)*SUM(p.SO) + w.SO ) / ( SUM(p.weight)*SUM(p.IP) + 318 ) * i.IP_LEFT * c.Curve) AS SO
    , ROUND(( SUM(p.weight)*SUM(p.HBP) + w.HBP ) / ( SUM(p.weight)*SUM(p.IP) + 318 ) * i.IP_LEFT / c.Curve) AS HBP
    , ROUND(( SUM(p.weight)*SUM(p.BFP) + w.BFP ) / ( SUM(p.weight)*SUM(p.IP) + 318 ) * i.IP_LEFT ) AS BFP
    , i.IP_LEFT AS IP
    , (SUM(p.weight)*SUM(p.IP)) / ( SUM(p.weight)*SUM(p.IP) + 318 ) AS R
FROM pitching p, pitcher_league_average_prorated w, player_Age_2008_pitching c, player_ip_2008 i
WHERE p.playerID = w.playerID
    AND p.playerID = i.playerID
    AND p.playerID = c.playerID
GROUP BY playerID;

I’m less confident that I’m implementing Marcels correctly with these, simply because Tango has published less about them. But, outside of any errors I’ve made, the only thing left to do – I think - would be to implement real playing time projections. (I also have to backport the reliability calculation to the hitter projections, but that’s easy enough.) Past that… well, they wouldn’t be Marcels anymore. (Okay, so I should be projecting pitching stats by league as well.)
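To make the arithmetic a little more concrete, here is a minimal Python sketch of the rate and reliability step for one pitcher and one stat. It is not a line-for-line translation of the SQL: it assumes each season's totals are multiplied by that season's weight before being summed, and it assumes a league-average rate is handed in. The function names and the sample pitcher are made up; the 0.999-per-day decay and the 318 innings of league average come from the queries above.

```python
def season_weight(year, current_year=2008, daily_decay=0.999):
    """Weight for a season: decays by 0.999 per day, 365 days per year back."""
    return daily_decay ** ((current_year - year) * 365)


def project_rate(seasons, league_rate_per_ip, regression_ip=318.0):
    """Return (projected stat per inning, reliability) for one pitcher.

    seasons: list of (year, stat_total, innings) tuples.
    league_rate_per_ip: league-average stat per inning (assumed to be given).
    """
    weighted_stat = sum(season_weight(y) * stat for y, stat, _ in seasons)
    weighted_ip = sum(season_weight(y) * ip for y, _, ip in seasons)
    # Blend the pitcher's weighted record with 318 innings of league average.
    rate = (weighted_stat + league_rate_per_ip * regression_ip) / (weighted_ip + regression_ip)
    # Reliability: share of the blend that comes from the pitcher himself.
    reliability = weighted_ip / (weighted_ip + regression_ip)
    return rate, reliability


# Hypothetical pitcher: (year, strikeouts, innings pitched)
seasons = [(2008, 100, 120.0), (2007, 150, 200.0), (2006, 140, 190.0)]
so_per_ip, reliability = project_rate(seasons, league_rate_per_ip=0.75)
print(round(so_per_ip, 3), round(reliability, 3))
```

A pitcher with no innings at all comes out as exactly league average with a reliability of zero, which is the behavior you want from the regression component.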

Later this weekend I intend to take a couple of players and actually show you the calculations going on behind this code, so that anyone who’s unsure exactly what’s going on here gets to see the guts of the system.


Marcels for hitters

The Marcel projections are about the simplest forecasting system available, at least among those worthy of the term “projection system.” As simple as they are, they match up very well with the results of the more complex forecasters.

Sal Baxamusa over at The Hardball Times has kindly provided us with Excel spreadsheets to calculate a player’s in-season Marcel. But I wanted the ability to bulk produce forecasts of players for the remainder of the season. So I set about reimplementing Sal’s spreadsheet in SQL.

Data prior to this season was taken from the Baseball Databank. Data from this season was screenscraped from the Baseball Reference league pages. The convenient thing about that is, Baseball Reference player IDs are mapped in the BDB to the BDB IDs.

If you have the Baseball Databank in MySQL, and can handle scraping the data from BBRef yourself, then you can generate your own Marcels like so:

CREATE TABLE batting_pos
AS
SELECT ( CASE WHEN b.playerID is not null
        THEN b.playerID
        ELSE b.Player END ) AS playerID
    , 2008 AS yearID
    , 1 AS weight
    , b.Ag AS Age
    , SUM(b.G) AS G
    , SUM(b.H) AS H
    , SUM(b.2B) AS 2B
    , SUM(b.3B) AS 3B
    , SUM(b.HR) AS HR
    , SUM(b.BB) AS BB
    , SUM(b.SO) AS SO
    , SUM(b.IBB) AS IBB
    , 0 AS HBP
    , SUM(b.SB) AS SB
    , SUM(b.CS) AS CS
    , SUM(b.AB+b.BB+b.SH+b.SF) AS PA
from ( SELECT * from 7_29_08_batting b
    LEFT JOIN ( select Player AS PPlayer, BFP from 7_29_08_pitching ) p
    ON b.Player = p.PPlayer
    LEFT JOIN ( select bbrefID, playerID from bdb.master ) m
    ON b.Player = m.bbrefID ) b
WHERE ( (b.AB+b.BB+b.SH+b.SF) > (b.BFP) OR b.BFP IS NULL )
    AND (b.AB+b.BB+b.SH+b.SF) > 0
GROUP BY Player
UNION ALL
SELECT b.playerID
    , b.yearID
    , ( 5.62 * EXP( -0.00066 * 365 * ( 2008 - b.yearID) ) ) / 5.62 AS weight
    , (CASE WHEN m.birthMonth < 7 THEN ( b.yearID - m.BirthYear ) ELSE ( b.yearID - m.BirthYear - 1 ) END) AS Age
    , SUM(b.G) AS G
    , SUM(b.H) AS H
    , SUM(b.2B) AS 2B
    , SUM(b.3B) AS 3B
    , SUM(b.HR) AS HR
    , SUM(b.BB) AS BB
    , SUM(b.SO) AS SO
    , SUM(b.IBB) AS IBB
    , SUM(b.HBP) AS HBP
    , SUM(b.SB) AS SB
    , SUM(b.CS) AS CS
    , SUM(b.PA) AS PA
FROM
    ( SELECT * from bdb.batting b
    LEFT JOIN ( select playerID AS pplayerID, yearID AS pyearID, BFP
        from bdb.pitching p WHERE p.yearID > 2004 ) p
        ON b.playerID = p.pplayerID AND b.yearID = p.pyearID
        WHERE b.yearID > 2004) b, bdb.master m
WHERE b.playerID = m.playerID
    AND ( (b.PA) > (b.BFP) OR b.BFP IS NULL )
    AND (b.PA) > 0
GROUP BY yearID, playerID;

CREATE TABLE average_pos
AS
SELECT yearID
    , ( 5.62 * EXP( -0.00066 * 365 * ( 2008 - yearID) ) ) / 5.62 AS weight
    , SUM(G) AS G
    , SUM(H) AS H
    , SUM(2B) AS 2B
    , SUM(3B) AS 3B
    , SUM(HR) AS HR
    , SUM(BB) AS BB
    , SUM(SO) AS SO
    , SUM(IBB) AS IBB
    , SUM(HBP) AS HBP
    , SUM(SB) AS SB
    , SUM(CS) AS CS
    , SUM(PA) AS PA
FROM batting_pos bp
GROUP BY yearID;

CREATE TABLE player_league_average
AS
SELECT b.playerID
    , (SUM(a.weight)*SUM(a.H)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS H
    , (SUM(a.weight)*SUM(a.2B)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS 2B
    , (SUM(a.weight)*SUM(a.3B)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS 3B
    , (SUM(a.weight)*SUM(a.HR)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS HR
    , (SUM(a.weight)*SUM(a.BB)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS BB
    , (SUM(a.weight)*SUM(a.SO)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS SO
    , (SUM(a.weight)*SUM(a.IBB)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS IBB
    , (SUM(a.weight)*SUM(a.HBP)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS HBP
    , (SUM(a.weight)*SUM(a.SB)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS SB
    , (SUM(a.weight)*SUM(a.CS)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS CS
    , (SUM(a.weight)*SUM(a.PA)*SUM(b.PA)) / (SUM(a.weight)*SUM(b.PA)) AS PA
FROM batting_pos b, average_pos a
WHERE b.yearID = a.yearID
GROUP BY playerID;

CREATE TABLE player_league_average_prorated
AS
SELECT playerID
    , ( H / PA * 214 ) AS H
    , ( 2B / PA * 214 ) AS 2B
    , ( 3B / PA * 214 ) AS 3B
    , ( HR / PA * 214 ) AS HR
    , ( BB / PA * 214 ) AS BB
    , ( SO / PA * 214 ) AS SO
    , ( IBB / PA * 214 ) AS IBB
    , ( HBP / PA * 214 ) AS HBP
    , ( SB / PA * 214 ) AS SB
    , ( CS / PA * 214 ) AS CS
    , 214 AS PA
FROM player_league_average;

CREATE TABLE player_Age_2008
AS
SELECT playerID
    , yearID
    , MAX(Age)
    , ( CASE
        WHEN ( MAX(Age)+(2008-yearID) ) > 29
        THEN 1 + ( 29 - ( MAX(Age)+(2008-yearID) ) )*0.003
        ELSE 1 + ( 29 - ( MAX(Age)+(2008-yearID) ) )*0.006 END ) AS Curve
FROM batting_pos
GROUP BY playerID;

CREATE TABLE player_pa_2008
AS
SELECT playerID, G, PA, (PA/G) AS PA_G, ROUND((PA/G)*52) AS PA_LEFT from batting_pos WHERE yearID = 2008;

CREATE TABLE hitter_marcels_2008
AS
SELECT b.playerID
    , ROUND(( SUM(a.weight)*SUM(b.H) + w.H ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS H
    , ROUND(( SUM(a.weight)*SUM(b.2B) + w.2B ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS 2B
    , ROUND(( SUM(a.weight)*SUM(b.3B) + w.3B ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS 3B
    , ROUND(( SUM(a.weight)*SUM(b.HR) + w.HR ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS HR
    , ROUND(( SUM(a.weight)*SUM(b.BB) + w.BB ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS BB
    , ROUND(( SUM(a.weight)*SUM(b.SO) + w.SO ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS SO
    , ROUND(( SUM(a.weight)*SUM(b.IBB) + w.IBB ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS IBB
    , ROUND(( SUM(a.weight)*SUM(b.HBP) + w.HBP ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS HBP
    , ROUND(( SUM(a.weight)*SUM(b.SB) + w.SB ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS SB
    , ROUND(( SUM(a.weight)*SUM(b.CS) + w.CS ) / ( SUM(a.weight)*SUM(b.PA) + 214 ) * p.PA_LEFT * c.Curve) AS CS
    , p.PA_LEFT AS PA
FROM batting_pos b, average_pos a, player_league_average_prorated w, player_Age_2008 c, player_pa_2008 p
WHERE b.yearID = a.yearID
    AND b.playerID = w.playerID
    AND b.playerID = p.playerID
    AND b.playerID = c.playerID
GROUP BY playerID;

I hate publishing code because most of what I write would get me hunted down and burned at the stake in any CS department in the country, but there it is. If you don’t grok SQL, here’s what’s going on here:

  1. First we combine the data from the Databank with the data from this year, excluding anyone with fewer plate appearances as a hitter than batters faced as a pitcher (in other words, pitchers). The other thing we do is compute a weight for each year: more recent seasons are worth more in the projection, and the weight is what governs that.
  2. We calculate the average of non-pitchers hitting from those seasons.
  3. Each player is then given a weighted average of the league from those four seasons, prorated out to 214 plate appearances. That’s our regression to the mean component.
  4. We figure out an aging curve for each player.
  5. We guesstimate how many plate appearances a player will receive the rest of 2008. If you think you know better than my guesstimate, you’re probably right. It’s the rates that I’m projecting.
  6. This is the part that actually does a projection – it takes a weighted average of the past four seasons, mixing in the regression component from step three, and prorates it out to our guesstimated playing time.

If you don’t want to put in that amount of effort, well, Mr. Widget, take us home!

I've done a cursory look at the projections, but I haven't done any serious validation. All projections are provided as-is. The good news is, now that I have the code ironed out, all I need is half an hour or so and I can generate Marcels for over 500 players. (Quicker if I could automate the screenscraping – I’ll have to look into that.)

I’m hopeful but not optimistic that I’ll have pitcher projections done similarly by the weekend.


Jeff Samardzija in Pitch F/X

If you want actual, well, good analysis, go over to Harry’s and take a look. He’s been doing this pitch ID stuff a lot longer than I have.

But I think I was able to duplicate one of the graphs from Harry’s page, or at least come close.

I used Mat Kovach’s parser to download data from MLB’s servers. (It seems to work fine for me, but it’s “pre-alpha” and not documented as of yet, so caveat emptor. Also, I Am Not A Programmer, so all code samples that follow are to be taken with more than a hint of salt.)

Then, in MySQL, I ran the following query against the data:

SELECT a.*, p.*
FROM gameday_atbat a, gameday_pitch p
WHERE a.gameid = p.gameid
    AND a.num = p.atbat_num
    AND a.pitcher = 502188;

Not the prettiest SQL I’ve ever written, and it returns more data than I need, but that’s fine. Then I export the data to a CSV file. There’s one pitchout in the dataset, which I remove.

Well, now what? I use GNU R, personally, for all my graphing and K-means clustering needs. Code:

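# KMeans() is not base R; it comes with Rcmdr (RcmdrMisc in newer releases).
library(Rcmdr)
Samardzija <- read.table("C:/Retrosheet/saved queries/pitchfx/Samardzija first start.csv", header=TRUE, sep=",")
# Cluster on horizontal (pfx_x) and vertical (pfx_z) movement, asking for three groups.
cl <- KMeans(model.matrix(~-1 + pfx_x + pfx_z, Samardzija), centers = 3, iter.max = 10, num.seeds = 10)
plot(Samardzija$pfx_z~Samardzija$pfx_x, col=cl$cluster, xlim=c(-20,20), ylim=c(-20,20))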

Which produces the following graph:

[Figure samardzija_072508: scatter plot of pfx_z against pfx_x, points colored by k-means cluster]

In fairness to Harry, I cheated – in the second line of the program, I tell the clustering algorithm how many “centers” to look for – in this case, how many pitches I want it to look for. I told it three. Why? Because that’s what Harry’s graph shows. I don’t really know how to determine the “right” number of centers as of yet.
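
To show what the centers and num.seeds arguments are doing, here is a bare-bones k-means in Python. It is a teaching sketch and not the Rcmdr implementation: the function names are mine, the movement values are made up, and it only handles two dimensions. Running it for several values of k and watching where the within-cluster sum of squares stops dropping sharply is one common rough guide to the number of centers, though it is a heuristic and not what Harry did.

```python
import random


def dist2(p, c):
    """Squared Euclidean distance between two (x, z) points."""
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: dist2(p, centers[j]))


def kmeans(points, k, iter_max=10, num_seeds=10, rng_seed=0):
    """Plain k-means; returns (labels, centers, within_ss) for the best start."""
    rng = random.Random(rng_seed)
    best = None
    for _ in range(num_seeds):
        centers = rng.sample(points, k)  # random starting centers
        for _ in range(iter_max):
            labels = [nearest(p, centers) for p in points]
            updated = []
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                if members:
                    updated.append((sum(x for x, _ in members) / len(members),
                                    sum(z for _, z in members) / len(members)))
                else:
                    updated.append(centers[j])  # empty cluster: keep old center
            if updated == centers:
                break  # converged
            centers = updated
        labels = [nearest(p, centers) for p in points]
        within_ss = sum(dist2(p, centers[lab]) for p, lab in zip(points, labels))
        # Keep whichever random start fits the data most tightly.
        if best is None or within_ss < best[2]:
            best = (labels, centers, within_ss)
    return best


# Made-up (pfx_x, pfx_z) movement values forming three loose groups.
pitches = [(-6.0, 9.0), (-5.5, 9.5), (-6.5, 8.5), (-5.8, 9.2),
           (3.0, 1.0), (3.5, 0.5), (2.5, 1.5), (3.2, 0.8),
           (-4.0, 2.0), (-4.5, 2.5), (-3.5, 1.5), (-4.2, 2.2)]

for k in range(1, 6):
    labels, centers, within_ss = kmeans(pitches, k)
    print(k, round(within_ss, 2))
```

Trying several random starts matters because a single unlucky start can lump two pitch types together and never recover.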

Even so, I have one pitch that differs from his – I think he changed that ID manually, but I’m not sure. I can tell you that one cluster is green and one is black, but as far as calling one a splitter and one a slider, that’s something I have to work on.

(That graph, by the way, is ugly, and I know it’s ugly. I know I can make it look better, but in this case it’s a question of how much time I really want to invest in prettying up Pitch F/X graphs before I figure out what it is I’m actually doing with them. It’s called premature optimization.)


Projecting RZR

There are two breeds of vanilla, free-as-in-beer zone rating available in the world: STATS and BIS. I already have a dumb projection system for STATS ZR, which could be refined (aging curves and speed/tools scores are the two major refinements I’m musing over).

But first I wanted to introduce BIS’s RZR into it. And therein lies a dilemma, folks. Here are the averages for RZR and OZR (OOZ divided by BIZ) over the years available at The Hardball Times:

POS  YEAR   Plays   OOZ    BIZ   RZR   OZR
1B   2004    4070  1783   5406  .753  .330
1B   2005    4343  1940   5493  .791  .353
1B   2006    3877  2012   4851  .799  .415
1B   2007    4963  1048   6695  .741  .157
1B   2008    2871   847   3815  .753  .222
1B   Total  20124  7630  26260  .766  .291
2B   2004    9863  1203  12129  .813  .099
2B   2005   10403  1478  12825  .811  .115
2B   2006   10401  1211  12679  .820  .096
2B   2007   10120  1412  12192  .830  .116
2B   2008    6313   649   7693  .821  .084
2B   Total  47100  5953  57518  .819  .103
SS   2004    9872  1919  11995  .823  .160
SS   2005   10484  1948  12821  .818  .152
SS   2006   10809  1659  13218  .818  .126
SS   2007   10625  1912  13019  .816  .147
SS   2008    6353   999   7627  .833  .131
SS   Total  48143  8437  58680  .820  .144
3B   2004    6215  2074   9007  .690  .230
3B   2005    6813  2396   9271  .735  .258
3B   2006    7686  1636  10880  .706  .150
3B   2007    7221  1717  10623  .680  .162
3B   2008    4444  1003   6344  .701  .158
3B   Total  32379  8826  46125  .702  .191
CF   2004    9478  2034  11905  .796  .171
CF   2005   10266  1963  12590  .815  .156
CF   2006   10316  2002  11534  .894  .174
CF   2007   10886  1944  12264  .888  .159
CF   2008    5922  1583   6468  .916  .245
CF   Total  46868  9526  54761  .856  .174
LF   2004    7710   847  12242  .630  .069
LF   2005    8686   718  13712  .633  .052
LF   2006    7723  1634   8971  .861  .182
LF   2007    8014  1614   9373  .855  .172
LF   2008    4475  1076   5060  .884  .213
LF   Total  36608  5889  49358  .742  .119
RF   2004    8736   781  13442  .650  .058
RF   2005    9181   695  14161  .648  .049
RF   2006    8376  1686   9436  .888  .179
RF   2007    8418  1575   9597  .877  .164
RF   2008    4802  1205   5321  .902  .226
RF   Total  39513  5942  51957  .760  .114

(2008 numbers will be slightly different from Studes’ numbers, as these are a few days old.) The projections for infielders are doable. But, as it stands, those outfield numbers are a horror show, taken by themselves.
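
The outfield jump is easy to see by recomputing the two rates from the league totals. A short Python check, using the left field rows for 2005 and 2006 copied from the table (the function names are mine):

```python
def rzr(plays, biz):
    """Revised zone rating: plays made per ball in zone."""
    return plays / biz


def ozr(ooz, biz):
    """Out-of-zone plays per ball in zone, as defined in the text."""
    return ooz / biz


# League totals for left field, copied from the table: (Plays, OOZ, BIZ)
left_field = {2005: (8686, 718, 13712), 2006: (7723, 1634, 8971)}

for year, (plays, ooz_plays, biz) in sorted(left_field.items()):
    print(year, round(rzr(plays, biz), 3), round(ozr(ooz_plays, biz), 3))
# prints:
# 2005 0.633 0.052
# 2006 0.861 0.182
```

Balls in zone fell by about a third from one year to the next while out-of-zone plays more than doubled, which looks much more like a change in the zones than a change in the fielders.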

So before we can make projections based upon RZR data, we first need to normalize it. I’m sure there are better ways than the one I’m using, but I don’t think I’m using the worst way either and it’s very expedient for my needs.

What I’m doing is dividing Plays, OOZ and BIZ by the totals for that season, and then multiplying by the averaged totals of all five years.
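
Here is that normalization as a small Python sketch, applied to balls in zone for left fielders. It reflects my reading of the sentence above, the function name is made up, and the 300-BIZ player is hypothetical; the league totals are the left field BIZ column from the table.

```python
def normalize(player_value, season_total, totals_by_year):
    """Rescale a player's count to the average league total across all years."""
    average_total = sum(totals_by_year.values()) / len(totals_by_year)
    return player_value / season_total * average_total


# League BIZ totals for left field, copied from the table.
lf_biz = {2004: 12242, 2005: 13712, 2006: 8971, 2007: 9373, 2008: 5060}

# Hypothetical left fielder with 300 balls in zone in each of two seasons.
print(round(normalize(300, lf_biz[2005], lf_biz), 1))
print(round(normalize(300, lf_biz[2006], lf_biz), 1))
```

The same 300 raw chances count for less in 2005 than in 2006 after normalizing, because the league as a whole was credited with far more balls in zone in 2005.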

And, since I was rather short with the explanation the last time out, I’ll go ahead and spell out what I’m doing in full:

  1. First, as above, every player’s performance is “normalized” to an average of the past five seasons.
  2. Then, a weighted average of their past four seasons (05-08) is taken, with the most recent season being given a weight of 5, then 4, then 3, then 2.
  3. Two weights’ worth of a full season of league-average defensive performance is added as a regression to the mean component.
  4. 5 + 4 + 3 + 2 + 2 = 16, so everything gets divided by 16. I wouldn’t exactly call it a playing time projection, but it’s a rough guide to how much playing time a player might be expected to receive.
  5. Plays and Runs above average are figured for a full season’s performance, given the number of chances of the average player at that position from 04 through 08.
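
Steps 2 through 4 can be sketched in Python like so. This is my own illustration with an invented function name and an invented shortstop; it assumes the inputs have already been normalized, and that a missing season simply contributes nothing while the divisor stays at 16.

```python
def project_counts(seasons, league_average, weights=(5, 4, 3, 2), regression_weight=2):
    """Weighted average of up to four seasons plus league-average ballast.

    seasons: per-season (plays, biz) tuples, most recent first.
    league_average: (plays, biz) for a full season of an average fielder.
    """
    divisor = sum(weights) + regression_weight  # 5 + 4 + 3 + 2 + 2 = 16
    plays = regression_weight * league_average[0]
    biz = regression_weight * league_average[1]
    for weight, (season_plays, season_biz) in zip(weights, seasons):
        plays += weight * season_plays
        biz += weight * season_biz
    return plays / divisor, biz / divisor


# Hypothetical shortstop, 2008 first: (plays, balls in zone)
seasons = [(150, 180), (320, 390), (300, 370), (280, 350)]
plays, biz = project_counts(seasons, league_average=(330, 400))
print(plays, biz, round(plays / biz, 3))
# prints: 259.375 316.875 0.819
```

Because the divisor is fixed, a player with only one season on record ends up with a small projected number of chances, most of them league average, which is the rough playing time guide mentioned in step 4.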

And… here are the projections. You can compare them to the STATS ZR projections, if you’d like.

(Note: Currently only players with a Baseball Databank ID who have appeared in 2008 are included in either projection set. The next step is to take the rest of the players in the RZR set, map them to the appropriate STATS ID, and run both projections side by side for all players who played in 2008, and maybe some who haven’t yet but could.)

So what’s next? Like I said before, these could really benefit from aging curves. (While I’m on the topic, Jon Shepherd over at Camden Depot has published RZR aging curves which are worth taking a look at. I have my own ZR aging curves which I should really try and get straightened out.) I really should probably run “projections” for seasons past and see how they match up with what actually happened.

And I want to work on combining data from multiple positions. I’ve done some comparisons of players who have played multiple positions, and my feeling from looking at the data is that, in projecting a player’s zone rating, there really isn’t a lot of difference in difficulty between the different outfield positions: it’s not really much harder to catch fly balls in center field than it is anywhere else, but there are a lot more fly balls to catch, and so a good fielder is worth a lot more. But that’s worth exploring more, and there are some noteworthy sampling issues in that data; I find it hard to believe that a center fielder is below average as a first baseman defensively, for example. I should rerun this query on the RZR dataset soon and see what that looks like.


Projecting zone rating

So, you want to talk about a player’s defense?

Remember: a good sabermetrician is like a good hunter when cleaning his kill: he throws away as little as possible, taking care to use most of the animal. We have decades of information about players; why should we ever use only three and a half months’ worth of data in evaluating a player?

My process is based heavily off of Tango’s Marcels forecasting system; that said, he had nothing to do with this, and screwups in it are mine, not his. (For background on how a projection system works, here’s a decent writeup. If I don’t say so myself.)

Before going any further, I should note that I made this in about two hours. And I also made dinner in those two hours. And I had a side dish. So don’t expect anything on the order of PECOTA as far as complexity goes.

Here’s how it works. Every player’s zone rating data from 2005-2008 (yep, everything pre-All Star break from this year) is thrown into a mixer and weighted. I used a 5/4/3/2 weighting; I have no empirical basis for these weights other than it’s what Marcel uses. Then throw in two seasons’ worth of the league average for the position. There’s your regression to the mean.

Aging curves are… forthcoming. Maybe. I’m still hashing out the details. (I’ve started work on zone rating based aging curves for fielders, but there are questions about how accurate they are, and before they can be used in a projection system they need to be smoothed out a bit more.)

So, data. Plays and runs above or below average are figured using the Dial method. For that, each player is assumed to have a full season’s worth of chances at the position, not the number of chances used to compute zone rating.
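
As I understand the Dial method, the conversion is a difference in zone rating times a number of chances, and then a run value per play. A minimal Python sketch, with invented function names, an invented shortstop, and a placeholder run value that is an assumption of mine and not a published figure:

```python
def plays_above_average(player_zr, league_zr, full_season_chances):
    """Extra plays made versus an average fielder given the same chances."""
    return (player_zr - league_zr) * full_season_chances


def runs_above_average(plays, runs_per_play):
    """Convert plays to runs with a per-play run value for the position."""
    return plays * runs_per_play


# Hypothetical shortstop; every number here is made up for illustration.
RUNS_PER_PLAY = 0.8  # placeholder, not a published value
plays = plays_above_average(player_zr=0.850, league_zr=0.820, full_season_chances=400)
runs = runs_above_average(plays, RUNS_PER_PLAY)
print(round(plays, 1), round(runs, 1))
```

Using a full season of chances for everyone puts part-timers and regulars on the same scale, which is what makes the projected totals comparable across players.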

The next step beyond aging curves would probably be to incorporate at least some measure of speed scores into the projection. But I was hungry, and so instead you have the best projection system I could make in two hours, while still making dinner. It’s a start, at least.

(Also, lemme take this chance to plug my hitter and pitcher evaluations on GROTA, if you have an interest in such things regarding the Cubs. Hitter and pitcher projections are next on my plate.)


A little experiment

I'm going to try a little experiment here. I wish my problem was writer's block - instead, there are just so many ideas running around in my head that I'm having a hard time prioritizing them. And with the All-Star Break coming up, there'll be time to catch up on some things.

So, if you'd like, you can help me figure out what I'll be writing about during the All-Star Break, with this handy widget here:

Isn't it exciting? (It’s also available in the right-hand sidebar.) Here's a link to my Skribit page, if for some reason the widget doesn't do it for you. I may start seeding it with some topics of my own, but for right now I’m simply interested in seeing what – if anything – y’all are interested in. I'd like, obviously, as much participation as possible; that said, questions like "Why do you hate Christmas, Santa Claus and Ryan Theriot?" aren't likely to be answered immediately. In fact, non-Ryan Theriot suggestions are probably your best bet.