Baseball Analysis 2

June 01, 2017

I found a baseball hitting data set that had more detailed statistics, including batting average, slugging percentage, and categorical data about what positions each player played during the 2015 season.

The parameters for this analysis with Cedar were normalized data and 10 cluster bins. Lens 1 was eccentricity with an exponent of 2, using 12 partitions with 50% overlap, not equalized. Lens 2 was density with a point-wise Gaussian kernel of width 1, also using 12 partitions with 50% overlap, not equalized.
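For readers unfamiliar with these lens settings, here is a minimal sketch of what the two lenses and the overlapping partitions usually mean in the Mapper literature. It is not Cedar's code. The function names are hypothetical, the formulas are the common textbook definitions, and Cedar's exact normalization may differ. I also assume that "not equalized" means intervals of equal length rather than intervals holding equal numbers of points.

```python
import math

def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length numeric rows."""
    return [[math.dist(p, q) for q in points] for p in points]

def eccentricity(dist, exponent=2):
    """L^p eccentricity: ((1/N) * sum_j d(i, j)^p)^(1/p).
    Rows far from everything else get large values."""
    n = len(dist)
    return [(sum(d ** exponent for d in row) / n) ** (1.0 / exponent)
            for row in dist]

def gaussian_density(dist, width=1.0):
    """Unnormalized Gaussian kernel density: sum_j exp(-d(i, j)^2 / (2 * width^2)).
    Rows in crowded regions get large values."""
    return [sum(math.exp(-(d * d) / (2.0 * width * width)) for d in row)
            for row in dist]

def overlapping_cover(values, n_intervals=12, overlap=0.5):
    """Cover [min, max] with n equal-length intervals whose neighbours
    share a fraction `overlap` of their length (assumed meaning of
    'not equalized')."""
    lo, hi = min(values), max(values)
    # n intervals of length L, stepped by L * (1 - overlap), span
    # L * (1 + (n - 1) * (1 - overlap)) in total.
    length = (hi - lo) / (1 + (n_intervals - 1) * (1 - overlap))
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]

# Made-up toy points (not the baseball data): three close rows and one outlier.
toy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
d = pairwise_distances(toy)
lens1 = eccentricity(d, exponent=2)
lens2 = gaussian_density(d, width=1.0)
print(lens1, lens2)
print(overlapping_cover(lens1, n_intervals=12, overlap=0.5))
```

In a Mapper pipeline, each point falls into one or two intervals per lens, and clustering is then run separately inside each pair of lens intervals. The outlier in the toy data gets the highest eccentricity and the lowest density, which is why these two lenses tend to pull unusual players out toward the ends of the graph.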

The conventional wisdom of baseball would say that pitchers are the least effective hitters, and that first and third basemen and some outfielders are typically higher impact or power hitters.

When we color by players who played pitcher during the 2015 season, we see that the circled group consists only of pitchers. These pitchers averaged fewer than one plate appearance over the entire season. This matches the conventional wisdom that pitchers are typically low-impact hitters.

The next circled group also consisted primarily of pitchers, along with some position players. These are still considered “low-impact” hitters: they averaged less than 1 hit over 7.28 plate appearances.

The next group consisted mostly of position players, with some pitchers as well. These players averaged more plate appearances than the previous two groups, about 33.5, to go along with 6 hits. These players are more likely to be pinch hitters or substitutes for everyday players.

Finally, we see a structure similar to what we saw in the first analysis, with role players, transitional players, and high-impact players along a spire. The high-impact players came from a mixture of positions, but first basemen and right and left fielders were more heavily represented than other positions, which supports the idea that more effective hitters often play those positions on defense.

Below is the same plot as above, but labeled with the number of points per cluster. I had to rearrange the nodes a little so I could zoom in enough for the numbers to appear on the nodes.

A Kolmogorov-Smirnov (KS) test comparing the long “snake” to the rest of the data gave the following results. Plate appearances (PA), at bats (AB), hits (H), runs scored (R), runs batted in (RBI), strike-outs (SO), two-base hits (X2B), and total bases (TB) had the highest KS statistics, all above 0.8. These players had more chances to hit (plate appearances) and thus produced higher totals in the other statistics.
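To make the KS statistic concrete, here is a minimal sketch of how a two-sample KS statistic can be computed and used to rank columns. It is not the code or tool used for this post. The function names and the toy rows are made up for illustration and are not the 2015 data. The statistic is the largest vertical gap between the two groups' empirical distribution functions, so it ranges from 0 (identical samples) to 1 (no overlap at all).

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in a + b:  # the gap can only change at an observed value
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

def rank_columns(group, rest, columns):
    """Return (column, KS statistic) pairs, largest statistic first."""
    scores = {
        c: ks_statistic([row[c] for row in group], [row[c] for row in rest])
        for c in columns
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up rows standing in for "snake" players and everyone else.
snake = [{"PA": 500, "HR": 3}, {"PA": 620, "HR": 25}, {"PA": 410, "HR": 0}]
rest = [{"PA": 2, "HR": 0}, {"PA": 30, "HR": 1}, {"PA": 8, "HR": 0}]

for column, stat in rank_columns(snake, rest, ["PA", "HR"]):
    print(column, round(stat, 3))
```

In the toy rows, PA separates the two groups completely while HR overlaps, so PA ranks first. If SciPy is available, `scipy.stats.ks_2samp` computes the same statistic and also reports a p-value.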

The key statistics for a player to move “up” along the snake towards being a “high-impact” player are plate appearances (PA), hits (H), runs batted in (RBI), home runs (HR), two-base hits (X2B), and runs scored (R). The high-impact players have higher numbers in all of these categories, and the numbers decrease as you move along the “snake” towards the role players.

Plate appearances (PA), at bats (AB), hits (H), and total bases (TB) best differentiate the two groups of low-impact players. A KS test comparing the two groups gave KS statistics of 0.5827164 for PA, 0.5631305 for AB, 0.5448056 for H, and 0.5681337 for TB. That being said, the two groups differed by only 25 plate appearances, 5 hits, and 8 total bases. This is a small difference over the course of an entire season, so I still considered both of these groups to be “low impact”.

Paul Soma Undergraduate Research Blog