Does Soriano hit better with the bases empty?

This came up for discussion on Tango's blog a while back, and it's been sort of sticking in the back of my mind.

Here's Tango:

His wOBA are: .379 with bases empty and .344 with men on base.  IIRC, the difference for the average player is a 5 point drop or so.  I’m sure someone can correct me.  But, he’s got a 35 point difference here (based on almost 3000 PA with bases empty and 2000 with men on base).  One standard deviation is roughly a 15 point difference, so we see here a difference of around 2 standard deviations. 

While that doesn’t necessarily mean that Soriano definitely prefers to bat with bases empty, it points very strongly toward that.
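To spell out the arithmetic behind Tango's "around 2 standard deviations", here is a minimal sketch in Python. The figures are the ones he quotes; the function name is made up for illustration, and the 5-point typical drop and 15-point SD are his rough estimates, not measured values.

```python
def split_z_score(observed_diff, typical_diff, sd):
    """How many standard deviations a split sits from the typical split."""
    return (observed_diff - typical_diff) / sd

# Tango's figures, in wOBA points (1 point = .001).
soriano_diff = round((0.379 - 0.344) * 1000)  # bases empty minus men on
z = split_z_score(soriano_diff, typical_diff=5, sd=15)
print(f"{soriano_diff} point split, {z:.1f} SD from the typical split")
```

Measured against the typical 5-point drop instead of zero, the 35-point gap is a 30-point excess, which is where the 2 SD comes from.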

MGL's reply:

That is dead wrong. Sort of.  We don’t really care how many SD’s someone is off. By the time someone points something out to us, OF COURSE it is going to be unusual.  That is the classic example of selective sampling or cherry picking.  Take any distribution that is completely random (no skill whatsoever).  Within that distribution, 5% will be off the mean by more than 2 SD’s.  Well, someone can and will point those 5% (or 1%, or .1% if the sample is large enough) players out, and say, “Well, these guys are way off the mean - something must be going on!”

Until we figure out how much, if any, skill, there is, those SD’s mean NOTHING (because they are cherry picked, as is Soriano’s)!

If there is little or no skill for players batting with runners and without, which I suspect is the case, as with most ‘splits’, then the # of SD from the mean means nothing, since everyone gets regressed 100% (or near 100%).  Even if there is a little skill, most of that 2 SD will get regressed toward the mean.
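MGL's point about selective sampling can be illustrated with a small simulation. This is only a sketch under loud assumptions: every simulated player has exactly the same true ability in both situations, each plate appearance is treated as a simple success or failure (a crude stand-in for wOBA), and the .350 success rate is an arbitrary choice. The PA counts are the ones Tango quotes for Soriano.

```python
import math
import random

def share_beyond_two_sd(n_players=1000, pa_empty=3000, pa_men_on=2000,
                        p_success=0.35, seed=0):
    """Share of no-skill players whose split exceeds 2 SD by chance alone."""
    rng = random.Random(seed)
    # Standard deviation of the difference between the two observed rates
    # when the true rate is identical in both situations.
    sd = math.sqrt(p_success * (1 - p_success) * (1 / pa_empty + 1 / pa_men_on))
    extreme = 0
    for _ in range(n_players):
        empty = sum(rng.random() < p_success for _ in range(pa_empty)) / pa_empty
        men_on = sum(rng.random() < p_success for _ in range(pa_men_on)) / pa_men_on
        if abs(empty - men_on) > 2 * sd:
            extreme += 1
    return extreme / n_players

share = share_beyond_two_sd()
print(f"{share:.1%} of no-skill players show a split beyond 2 SD")
```

Roughly one player in twenty lands beyond 2 SD even though, by construction, nobody in the simulation has any split skill at all. Those are exactly the players who would get pointed out.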

I didn't really have anything to add to the discussion at that point, but it stuck with me. I've never been fully convinced that Soriano's weaker numbers in RBI situations reflect his true talent level, but I've never really had an argument against the idea either, and the benefit derived from moving Soriano down in the lineup didn't seem worth the risk.

Then Dave Pinto chimed in:

Since 2000, which represents all but a few PA of his career, Soriano is ever so slightly worse with a man on first than with the bases empty. So the idea that a man on first bothers him doesn't really hold water. However, I believe most batters do better with a man on first. In the National League in 2007, a man on first added sixteen points to a player's batting average, twelve points to a player's on base average and eighteen points to his slugging percentage.

That's pretty much the same split data. Tango used the Retrosheet data on Baseball Reference, and I think Pinto uses BIS data, but otherwise the numbers are essentially the same.

Then Chuck gets in on the act:

This suggests that something else is up with Soriano. It's not sample size that is the problem as Soriano has 1,676 at bats with runners on base. What it suggests is that his concentration is bothered when men are on base.

In other words, if the situation isn't all about him, he's not the same player.

Statistical evidence of selfishness.

Statistical evidence of selfishness. Really?

Or, put another way - what do Soriano's splits really tell us?

That's the same data Pinto used, but with some additional stats added. They are walks per PA, strikeouts per PA, Isolated Slugging (slugging minus batting average) and Batting Average on Balls in Play, or BABIP: (H-HR)/(AB-K-HR+SF).
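For anyone who wants to compute these from raw counting stats, here is a rough sketch in Python. The function name and the sample line are invented for illustration; the sample numbers are not Soriano's.

```python
def rate_stats(pa, ab, h, hr, tb, bb, k, sf):
    """Compute the four extra rate stats from a line of counting stats."""
    avg = h / ab
    slg = tb / ab
    return {
        "BB/PA": bb / pa,
        "K/PA": k / pa,
        "ISO": slg - avg,                        # slugging minus batting average
        "BABIP": (h - hr) / (ab - k - hr + sf),  # hits on balls in play
    }

# A made-up batting line, NOT Soriano's actual numbers.
stats = rate_stats(pa=550, ab=500, h=150, hr=30, tb=280, bb=40, k=100, sf=5)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Note that the BABIP denominator adds sacrifice flies back in: they are balls in play that don't count as at-bats.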

So what do we see here? Soriano's strikeouts go up a bit, his walks increase by more than a bit, and his power numbers go down by a bit. It's the walk rate that interests me - if he's walking more, then he should be getting on base more. But he's not, because his batting average drops.

Batting average is subject to much more randomness than most other things we look at in baseball. That's because, when a player walks or strikes out, he's only interacting with the pitcher; when he gets a base hit, he's interacting with the defense as well, and that introduces a lot of variables.

It's the drop in BABIP that interests me the most. BABIP is subject to even more randomness than batting average, because we remove strikeouts and home runs, the two components of batting average a player has the most control over. So what's causing the big change in BABIP for Soriano?

For that, batted ball data is the next place I looked. The following is Retrosheet data, 2007 only:

There's a lot going on here - we've added a few new stat categories, for one. FB, GB, LD and Pop are flyball, groundball, line drive and popup; they're followed by percentages.

The first thing is that these numbers seem to take on about the same shape as Soriano's career numbers - small spike in K rate, larger spike in walk rate, big change in BABIP.

Soriano's line drive percentage remains very consistent, which is very informative; line drive rate is probably the best predictor of future BABIP that we have. In fact, using our expected BABIP formula (.120 plus LD%) it looks like Soriano's change in BABIP rates between the splits is nothing but a fluke.
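As a quick sketch of that simple formula, here it is in Python, applied to the two line drive rates I quote in the comments below (.192 with runners on, .186 with the bases empty). The function name is just for illustration.

```python
def expected_babip(ld_rate):
    """Simple expected BABIP: .120 plus line drive rate."""
    return 0.120 + ld_rate

# Line drive rates as quoted in the comments on this post.
x_runners_on = expected_babip(0.192)
x_bases_empty = expected_babip(0.186)
gap_points = (x_runners_on - x_bases_empty) * 1000

print(f"runners on:  {x_runners_on:.3f}")
print(f"bases empty: {x_bases_empty:.3f}")
print(f"gap: {gap_points:.0f} points")
```

By this formula the two splits should produce nearly identical BABIPs, with the runners-on figure a hair higher if anything.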

But Soriano is also a more extreme fly ball hitter when men are on base in front of him - can we use that data? Dave Studeman has a somewhat more technical expected BABIP formula which uses K rate and FB%, which I'll call xBABIP to differentiate between it and the simpler formula. Again, nothing to suggest that such a radical difference in BABIP is attributable to anything Soriano is doing.

My feeling right now - call it a hunch - is that we're seeing two things going on here at once. Pitchers will try to pitch around Soriano when men are on base in front of him, driving up his walk rates and giving him fewer good pitches to hit. At the same time, we have an entirely random variance in BABIP that exaggerates the impact of the former tendency.

I have a ways to go before I can validate that hunch, though. Next steps:

  1. Get the rest of Soriano's career data from the Retrosheet event files, and see if that makes a difference.
  2. Take a look at the major league average in those figures, to see how Soriano compares.
  3. Take a look and see if players like Soriano - high slugging, low OBP - tend to see their walk rates go up with men on base ahead of them.
  4. Take a look at pitch-by-pitch data and see if Soriano sees more balls and fewer strikes with men on base.

That's a tall order, admittedly. And my SQL-fu isn't what I'd like it to be, so expect the going to be slow at best.

[An aside: I took a look at Soriano’s top comps from his 2007 PECOTA (I don’t have 2008), and three of them exhibited the same tendency with their walk rates in their careers. (Andre Dawson, Jeff Kent, and Leon Wagner - Paul Blair didn’t, but he looks like a real odd comp to me.) Consider it anecdotal evidence for #3, if you'd like.]

8 Responses to “Does Soriano hit better with the bases empty?”

  1. # Anonymous pmayo

    Would it be useful to examine the pitch f/x data for Sori's plate appearances with men on base? Perhaps the reason for the anomalies in his BABIP, the slight uptick in walks, and slight increase in K is that Sori is seeing different sorts of pitches with men on base. Perhaps I missed this in the post (admittedly I mostly skimmed it...) but what happens to Sori's LD% in these PA's?

  2. # Anonymous Maddog

    Nice work, Colin. Good to see someone finally dig into it and see what is going on rather than seeking confirmation of one's opinion.

  3. # Blogger Samael2681

    Interesting post, it seems that what we all believed true most likely wasn't. I would like to see if Soriano could produce out of another spot in the lineup, and this could open up other opportunities as far as lineup construction goes as well. If Soriano can bat 2-hole, why not D-Lee? Something along the lines of a leadoff guy (obviously not Theriot)-D. Lee-Ramirez-Soriano-Fukudome would be an interesting combination to investigate.  

  4. # Blogger Colin Wyers

    pmayo - LD% is 0.192 with runners on, 0.186 with bags empty.

    Pitch F/X data would be great to have for something like this; it's something I'm working on, although I don't know if I'd get my hopes up - it's not my area of expertise.  

  5. # Anonymous DeRoMyHero


    Could you please explain why HRs are removed from BABIP? After all, the batter has made contact and kept the ball fair. HRs are good, aren't they?  

  6. # Blogger Colin Wyers

    A home run is the single-most valuable thing a hitter can do.

    But BABIP isn't a measure of a hitter's value; it's a measure of a hitter's ability.

    We remove walks, strikeouts and home runs because that's strictly a battle between hitter and pitcher. What BABIP tries to answer is how a hitter interacts with the defense - how often a ball in play becomes a hit or an out.

    By doing that, it gives us an idea of how often a hitter is getting "robbed" of base hits, or how often he's hitting grounders with eyes that are just barely hits. We're measuring two things: how well he hits the ball, and how lucky he is.

    From the batted ball data, we have an idea that Soriano is still hitting the ball well with men on base in front of him, and so it would appear that at least part of Soriano's splits is the result of nothing more or less than random chance.

  7. # Anonymous Anonymous

    Two quick thoughts...

    First, while it doesn't explain all of the BABIP variance some of it could be explained by the higher fb rates. We don't have to explain all of the variance with a single explanation.

    Second, Soriano is fast enough that he could beat out infield hits with the bases empty that instead lead to force-outs with men on base, which would cover some more of the BABIP variance.

  8. # Blogger Colin Wyers

    Anon - Fly ball data is used to calculate xBABIP.

    Soriano's speed might play into it - that's an area where knowing what's "normal" for these splits for certain kinds of players might be useful.  

