Jeff Samardzija in Pitch F/X
Published by Colin Wyers on Friday, July 25, 2008 at 11:40 PM.
If you want actual, well, good analysis, go over to Harry’s and take a look. He’s been doing this pitch ID stuff a lot longer than I have.
But I think I was able to duplicate one of the graphs from Harry’s page, or at least come close.
I used Mat Kovach’s parser to download data from MLB’s servers. (It seems to work fine for me, but it’s “pre-alpha” and not documented as of yet, so caveat emptor. Also, I Am Not A Programmer, so all code samples that follow are to be taken with more than a hint of salt.)
Then, in MySQL, I ran the following query against the data:
SELECT a.*, p.*
FROM gameday_atbat a, gameday_pitch p
WHERE a.gameid = p.gameid
AND a.num = p.atbat_num
AND a.pitcher = 502188;
Not the prettiest SQL I’ve ever written, and it returns more data than I need, but that’s fine. Then I exported the data to a CSV file. There was one pitchout in the dataset, which I removed.
Well, now what? I use GNU R, personally, for all my graphing and K-means clustering needs. Code:
library(RcmdrMisc)  # KMeans() is R Commander's k-means wrapper (in RcmdrMisc; older Rcmdr releases included it directly)
Samardzija <- read.table("C:/Retrosheet/saved queries/pitchfx/Samardzija first start.csv", header=TRUE, sep=",")
# Cluster on horizontal (pfx_x) and vertical (pfx_z) movement; centers = 3 asks for three pitch types
cl <- KMeans(model.matrix(~-1 + pfx_x + pfx_z, Samardzija), centers = 3, iter.max = 10, num.seeds = 10)
# Plot vertical against horizontal movement, colored by assigned cluster
plot(Samardzija$pfx_z~Samardzija$pfx_x, col=cl$cluster, xlim=c(-20,20), ylim=c(-20,20))
Which produces the following graph:
In fairness to Harry, I cheated: in the KMeans call, I tell the clustering algorithm how many “centers” to look for, which in this case means how many pitch types I want it to find. I told it three. Why? Because that’s what Harry’s graph shows. I don’t yet know how to determine the “right” number of centers.
Even so, I have one pitch that differs from his – I think he changed that ID manually, but I’m not sure. I can tell you that one cluster is green and one is black, but as far as calling one a splitter and one a slider, that’s something I have to work on.
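One way to start putting names on clusters is to look at each cluster's average movement and velocity side by side. What follows is only a rough sketch, not the method Harry uses: the data is synthetic (invented numbers standing in for the CSV), it uses base R's kmeans() rather than KMeans(), and it assumes the export includes a start_speed column alongside pfx_x and pfx_z.

```r
# Synthetic stand-in for the exported CSV. Every number here is made up
# for illustration; none of it is real Pitch F/X data.
set.seed(1)
pitches <- data.frame(
  pfx_x       = c(rnorm(30, -6, 1), rnorm(15, 3, 1), rnorm(15, -3, 1)),
  pfx_z       = c(rnorm(30, 9, 1),  rnorm(15, 1, 1), rnorm(15, 3, 1)),
  start_speed = c(rnorm(30, 95, 1), rnorm(15, 84, 1), rnorm(15, 87, 1))
)

# Cluster on movement only, as in the post (base R kmeans, 10 random starts).
fit <- kmeans(pitches[, c("pfx_x", "pfx_z")], centers = 3, nstart = 10)
pitches$cluster <- fit$cluster

# Per-cluster averages of movement and velocity, plus a pitch count.
summary_by_cluster <- aggregate(cbind(pfx_x, pfx_z, start_speed) ~ cluster,
                                data = pitches, FUN = mean)
summary_by_cluster$n <- as.vector(table(pitches$cluster))
print(summary_by_cluster)
```

The cluster numbers (and therefore the plot colors) are arbitrary and can change from run to run, so any naming has to come from the averages, not from the color.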
(That graph, by the way, is ugly, and I know it’s ugly. I know I can make it look better, but in this case it’s a question of how much time I really want to invest in prettying up Pitch F/X graphs before I figure out what it is I’m actually doing with them. It’s called premature optimization.)
Good stuff. I know you said there's no documentation, but do you have any links to more info about that parser?
James -
Mat wrote up a short post on the RetroSQL Yahoo Group; I believe you have to subscribe to access it. I seriously recommend joining to anyone who's interested in this sort of work.
Short version: you need to install MySQL and Tcl (use Tcl version 8.4.19). Setting up an SQL database is worth a whole topic of its own.
Colin,
For k-means, you can always try sweeping the number of centers from 1 to 5 (or some other reasonable upper bound) and evaluating the fit with a metric such as the mean distance of each point to the center of its cluster.
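A minimal sketch of that sweep in R, for anyone who wants to try it. It uses base R's kmeans() in place of KMeans(), and the movement data is synthetic (three invented clusters, not real pitches); the helper name mean_dist_to_center is made up for this example.

```r
# Synthetic stand-in for the pitch movement data: three made-up clusters.
# These numbers are invented for illustration only.
set.seed(42)
movement <- rbind(
  cbind(rnorm(40, mean = -6, sd = 1), rnorm(40, mean = 9, sd = 1)),
  cbind(rnorm(20, mean = 3,  sd = 1), rnorm(20, mean = 1, sd = 1)),
  cbind(rnorm(20, mean = -3, sd = 1), rnorm(20, mean = 3, sd = 1))
)
colnames(movement) <- c("pfx_x", "pfx_z")

# Hypothetical helper: fit k clusters, then return the mean Euclidean
# distance from each point to the center of the cluster it was assigned to.
mean_dist_to_center <- function(x, k) {
  fit <- kmeans(x, centers = k, iter.max = 50, nstart = 10)
  assigned <- fit$centers[fit$cluster, , drop = FALSE]
  mean(sqrt(rowSums((x - assigned)^2)))
}

# Sweep the number of centers from 1 to 5 and tabulate the metric.
ks <- 1:5
fit_metric <- sapply(ks, function(k) mean_dist_to_center(movement, k))
print(data.frame(centers = ks, mean_distance = round(fit_metric, 2)))
```

The mean distance tends to keep shrinking as centers are added, so the smallest value is not the answer. The thing to look for is the "elbow", the point where adding another center stops buying much improvement.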