The Other Fifteen

Eighty-five percent of the f---in' world is working. The other fifteen come out here.


Jeff Samardzija in Pitch F/X

If you want actual, well, good analysis, go over to Harry’s and take a look. He’s been doing this pitch ID stuff a lot longer than I have.

But I think I was able to duplicate one of the graphs from Harry’s page, or at least come close.

I used Mat Kovach’s parser to download data from MLB’s servers. (It seems to work fine for me, but it’s “pre-alpha” and not documented as of yet, so caveat emptor. Also, I Am Not A Programmer, so all code samples that follow are to be taken with more than a hint of salt.)

Then, in MySQL, I ran the following query against the data:

-- Every pitch thrown by pitcher 502188, joined to its at-bat record
SELECT a.*, p.*
FROM gameday_atbat a
JOIN gameday_pitch p
    ON a.gameid = p.gameid
    AND a.num = p.atbat_num
WHERE a.pitcher = 502188;

Not the prettiest SQL I’ve ever written, and it returns more data than I need, but that’s fine. Then I export the data to a CSV file. There’s one pitchout in the dataset, which I remove.

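Well, now what? I use GNU R, personally, for all my graphing and K-means clustering needs. (KMeans, with a capital K, comes from the Rcmdr package rather than base R, so that has to be loaded first.) Code: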

Samardzija <- read.table("C:/Retrosheet/saved queries/pitchfx/Samardzija first start.csv", header=TRUE, sep=",")
cl <- KMeans(model.matrix(~-1 + pfx_x + pfx_z, Samardzija), centers = 3, iter.max = 10, num.seeds = 10)
plot(Samardzija$pfx_z~Samardzija$pfx_x, col=cl$cluster, xlim=c(-20,20), ylim=c(-20,20))

Which produces the following graph:

[Figure: samardzija_072508]

In fairness to Harry, I cheated: in the second line of the program, I tell the clustering algorithm how many “centers” to look for, which in this case means how many pitch types I want it to find. I told it three. Why? Because that’s what Harry’s graph shows. I don’t really know how to determine the “right” number of centers as of yet.
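
One common heuristic for picking the number of centers is to fit k-means over a range of values and watch how quickly the total within-cluster sum of squares falls. What follows is only a sketch under assumptions: it uses base R's kmeans() rather than the KMeans() call above, and the movement values are made up to stand in for the real CSV.

```r
# Made-up movement data: three synthetic pitch types, 30 pitches each.
# These numbers are illustrative only, not Samardzija's actual pitches.
set.seed(42)
pitches <- data.frame(
  pfx_x = c(rnorm(30, -6, 1), rnorm(30, 2, 1), rnorm(30, -4, 1)),
  pfx_z = c(rnorm(30, 9, 1), rnorm(30, 1, 1), rnorm(30, 3, 1))
)

# Fit k-means for k = 1..6, restarting each fit from 10 random starts,
# and record the total within-cluster sum of squares for each k.
ks <- 1:6
wss <- sapply(ks, function(k) {
  fit <- kmeans(pitches[, c("pfx_x", "pfx_z")], centers = k, nstart = 10)
  fit$tot.withinss
})

# The "elbow" is the k after which extra centers stop buying much.
print(data.frame(k = ks, wss = round(wss, 1)))
```

Plotting wss against ks (for example with plot(ks, wss, type = "b")) makes the elbow easier to spot. It is still a judgment call rather than a test, so it narrows the choice of centers more than it settles it.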

Even so, I have one pitch that differs from his – I think he changed that ID manually, but I’m not sure. I can tell you that one cluster is green and one is black, but as far as calling one a splitter and one a slider, that’s something I have to work on.
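
A possible first step toward naming the clusters is to summarize each one by its average movement and velocity. The sketch below assumes base R's kmeans() and a velocity column called start_speed; both the column name and the data are assumptions for illustration, not values from the actual export.

```r
# Made-up data: movement plus a hypothetical start_speed column (mph).
set.seed(1)
pitches <- data.frame(
  pfx_x = c(rnorm(30, -6, 1), rnorm(30, 2, 1), rnorm(30, -4, 1)),
  pfx_z = c(rnorm(30, 9, 1), rnorm(30, 1, 1), rnorm(30, 3, 1)),
  start_speed = c(rnorm(30, 95, 1), rnorm(30, 84, 1), rnorm(30, 86, 1))
)

# Cluster on movement only, as in the post, then attach the cluster labels.
fit <- kmeans(pitches[, c("pfx_x", "pfx_z")], centers = 3, nstart = 10)
pitches$cluster <- fit$cluster

# Average movement and velocity per cluster, plus how many pitches fell in each.
summary_by_cluster <- aggregate(
  cbind(pfx_x, pfx_z, start_speed) ~ cluster,
  data = pitches, FUN = mean
)
summary_by_cluster$n <- as.vector(table(pitches$cluster))
print(summary_by_cluster)
```

The fastest cluster is the obvious fastball candidate. Telling the remaining two apart still takes some knowledge of how each pitch type tends to move, so this only organizes the evidence.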

(That graph, by the way, is ugly, and I know it’s ugly. I know I can make it look better, but in this case it’s a question of how much time I really want to invest in prettying up Pitch F/X graphs before I figure out what it is I’m actually doing with them. It’s called premature optimization.)


4 Responses to “Jeff Samardzija in Pitch F/X”

  1. Anonymous

    Good stuff. I know you said there's no documentation, but do you have any links to more info about that parser?  

  2. Colin Wyers

    James -

    Mat wrote up a short post on the RetroSQL Yahoo Group; I believe you have to subscribe to access it. I seriously recommend joining to anyone who's interested in this sort of work.

    Short version: you need to install MySQL and Tcl (use Tcl version 8.4.19). Setting up an SQL database is worth a whole topic of its own.

  3. Jason

    This comment has been removed by the author.  

  4. Anonymous

    Colin,

    For k-means, you can always try sweeping the number of centers from 1 to 5 (or some other reasonable upper bound) and evaluating the fit with a metric such as the mean distance from each point to the center of its cluster.
