In the summer of 2021, the league-wide MLB batting average sat at an all-time low of .236. Feeling pressure to ‘spice things up a bit,’ MLB began what has been called a ‘crackdown’ on sticky substances used by pitchers.

These substances, such as Spider Tack, gave pitchers a literally superhuman ability to grip a baseball, which let them put incredible amounts of spin on their pitches and, consequently, get more movement. To quote SI on the matter:

The use of Spider Tack and such gooey grip aids had grown so ridiculous that players could hear pitches ripping off the sticky fingers of pitchers, like Velcro.

https://www.si.com/mlb/2023/05/12/mlb-crackdown-on-pitchers-sticky-substances-making-the-game-fairer#:~:text=236%2C%20the%20league%20announced%20a,fingers%20of%20pitchers%2C%20like%20Velcro.

Now… there is far more to the story than just this. There were, and are, many opinions on MLB’s decision to do this, but that is beside the point.

As a matter of curiosity, I set out to create a tool that would allow me (or anyone) to visually examine data that might indicate whether or not a pitcher was using the sticky stuff at the time of the crackdown. To do this, it looks at key Statcast metrics, including spin rate and velocity, around the June 1 ban. My hypothesis is that pitch data for pitchers who did use the sticky stuff would be markedly different when comparing pre-ban and post-ban subsets.

The results are pretty cool, to say the least.

Before we get into the interesting stuff, let’s look at two awesome Python libraries I used to make this happen…

pybaseball

pybaseball is a Python package for baseball data analysis. This package scrapes Baseball Reference, Baseball Savant, and FanGraphs so you don’t have to. The package retrieves statcast data, pitching stats, batting stats, division standings/team records, awards data, and more. Data is available at the individual pitch level, as well as aggregated at the season level and over custom time periods. See the docs for a comprehensive list of data acquisition functions.

Install it the usual way via pip.

In my code, I pull two distinct datasets: pitch-by-pitch data for all pre-ban pitches and for all post-ban pitches. Note that the code splits the two at June 11, 2021, rather than June 1.

Each of these datasets is saved alongside the .py file so that you do not need to re-download it whenever you want to run the code. Here’s how it happens… (note: I assume a working knowledge of pandas, which I insist is pronounced like the bear and not pon-DOS)

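    import pandas as pd
    import pybaseball as pyb


    def load_data():  # wrapper name is illustrative
        """Return (pre_load, post_load), downloading each dataset only if it is not cached."""
        try:
            pre_load = pd.read_csv('pre_ban_data.csv')
        except FileNotFoundError:
            print("\nPre-ban data file was not found. Loading now... Please be patient.")
            pre_load = pyb.statcast(start_dt="2021-04-01", end_dt="2021-06-11")
            print("\nSaving pre-ban data to csv...")
            # index=False keeps a stray index column out of the cached file
            pre_load.to_csv('pre_ban_data.csv', index=False)

        try:
            post_load = pd.read_csv('post_ban_data.csv')
        except FileNotFoundError:
            print("\nPost-ban data file was not found. Loading now... Please be patient.")
            post_load = pyb.statcast(start_dt="2021-06-11", end_dt="2021-10-03")
            print("\nSaving post-ban data to csv...")
            post_load.to_csv('post_ban_data.csv', index=False)

        # Return outside any `finally` so unexpected errors are not swallowed.
        return pre_load, post_load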

This section of code checks for the existence of the archived downloads and downloads them if they are not there. It returns two dataframes, pre-ban data and post-ban data.
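If you end up caching more than two pulls, the same check-then-download pattern could be factored into a small helper. What follows is only a sketch: `load_or_download` and its `fetch` argument are hypothetical names chosen for illustration, not part of pybaseball or of the original script.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_download(path, fetch: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Read `path` if it exists; otherwise call `fetch()`, cache the result, and return it.

    `fetch` is any zero-argument function returning a DataFrame (hypothetical
    interface, chosen so the helper does not depend on a particular data source).
    """
    path = Path(path)
    if path.exists():
        return pd.read_csv(path)
    df = fetch()
    # index=False so the cached file reads back with the same columns
    df.to_csv(path, index=False)
    return df
```

With pybaseball imported as `pyb`, the pre-ban pull would then look something like `load_or_download("pre_ban_data.csv", lambda: pyb.statcast(start_dt="2021-04-01", end_dt="2021-06-11"))`.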

Each of these dataframes has a ton of columns, which you’ll have to research on your own. We care about pitcher name, pitch type, pitch movement, spin rate, and velocity, as well as some location data. The data also contains a TON of cool hitter-related columns!

Here’s a little sample with some summary stats:

       release_speed  release_spin_rate      spin_axis          pfx_x  \
count  272189.000000      272189.000000  272189.000000  272189.000000   
mean       88.823912        2275.859656     175.302951      -0.119085   
std         5.997288         342.054212      71.114300       0.878124   
min        35.700000          43.000000       0.000000      -2.470000   
25%        84.600000        2102.000000     134.000000      -0.860000   
50%        89.800000        2297.000000     198.000000      -0.180000   
75%        93.600000        2480.000000     220.000000       0.580000   
max       103.200000        3722.000000 

Neat, eh!?
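A summary like the one above can be produced by narrowing the dataframe to the pitch columns of interest and calling `describe()`. The sketch below assumes the usual Statcast column names (`release_speed`, `release_spin_rate`, `pfx_x`, and so on); `PITCH_COLUMNS` and `summarize` are hypothetical names, so check them against the columns in your own CSV.

```python
import pandas as pd

# Assumed Statcast column names for the fields discussed above.
PITCH_COLUMNS = [
    "player_name", "pitch_type",
    "release_speed", "release_spin_rate", "spin_axis",
    "pfx_x", "pfx_z",                  # horizontal / vertical movement
    "release_pos_x", "release_pos_z",  # release point
]


def summarize(df: pd.DataFrame, columns=PITCH_COLUMNS) -> pd.DataFrame:
    """Summary stats for whichever of `columns` are present in `df`."""
    present = [c for c in columns if c in df.columns]
    # describe() reports only the numeric columns by default
    return df[present].describe()
```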

Next, we ask for the user to input a player name. We do some validation, filter our data based on player name, do some prep work on column names, etc., then we are ready to jump into bokeh!
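The validation and filtering step might look roughly like this. It is a sketch under one assumption: that the pitcher's name lives in a `player_name` column formatted as "Last, First". `filter_pitcher` is a hypothetical name, not the function from the original script.

```python
import pandas as pd


def filter_pitcher(df: pd.DataFrame, name: str) -> pd.DataFrame:
    """Return the rows for one pitcher, matching case-insensitively.

    Accepts "First Last" or "Last, First". Assumes `player_name` stores
    "Last, First"; names with suffixes (e.g. "Jr.") need the comma form.
    """
    name = name.strip()
    if "," not in name:
        # Treat the final word as the last name: "Jose Quintana" -> "Quintana, Jose"
        first, _, last = name.rpartition(" ")
        name = f"{last}, {first}"
    mask = df["player_name"].str.lower() == name.lower()
    if not mask.any():
        raise ValueError(f"No pitches found for {name!r}")
    return df[mask]
```

Raising on an empty match keeps a misspelled name from quietly producing blank plots further down the line.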

bokeh

Bokeh is a Python interactive visualization library that is designed for creating expressive and interactive web-based visualizations. Developed by the Bokeh Development Team, Bokeh enables users to generate dynamic and aesthetically appealing plots, charts, and dashboards for data analysis and presentation. It supports various output formats, including HTML, making it convenient for web-based applications. Bokeh emphasizes interactivity, allowing users to create responsive visualizations with tools such as pan, zoom, and hover. The library is particularly useful for creating data visualizations in fields like data science, scientific research, and analytics, where conveying complex information in an accessible and interactive manner is essential.

As we move forward, please keep in mind that I am very much a novice bokeh practitioner. It is a REALLY cool open source tool with a decent learning curve but tremendous capabilities in the field of data visualization.

I wanted to examine a few key data points:

  • pitch movement
  • spin rate
  • release point
  • velocity

To this end, I made some creative scatter plots. Using well-positioned and scaled images from a known baseball simulation, I was able to provide visual perspective and reference points for the strike zone and release points. Movement is shown on a basic x/y axis. Spin rate and velo comparisons are shown on side-by-side box and whisker plots.
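To give a feel for the movement plot, here is a minimal bokeh sketch that overlays pre-ban and post-ban movement on one x/y scatter. It is not the code behind the grids in this post: `movement_plot` is a hypothetical name, the background images are left out, and it assumes movement is stored in the Statcast `pfx_x` and `pfx_z` columns (in feet).

```python
import pandas as pd
from bokeh.plotting import figure


def movement_plot(pre: pd.DataFrame, post: pd.DataFrame, title: str = "Pitch movement"):
    """Overlay pre-ban and post-ban pitch movement on a single scatter plot."""
    p = figure(
        title=title,
        x_axis_label="horizontal movement, pfx_x (ft)",
        y_axis_label="vertical movement, pfx_z (ft)",
    )
    p.scatter(pre["pfx_x"], pre["pfx_z"], size=5, alpha=0.4,
              color="steelblue", legend_label="pre-ban")
    p.scatter(post["pfx_x"], post["pfx_z"], size=5, alpha=0.4,
              color="firebrick", legend_label="post-ban")
    # Clicking a legend entry toggles that period on and off
    p.legend.click_policy = "hide"
    return p
```

Passing the returned figure to `bokeh.plotting.show` opens it in the browser as an interactive HTML page.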

I want to look at some results and some of the inferences they might lead me to…

Let’s look at three intriguing plots:

  • Jose Quintana, a very good soft-tosser for the Mets,
  • Trevor Bauer, who, according to my models, had the BEST and LEAST LIKELY good season in 2020, and
  • Shohei Ohtani, baseball’s $700m (make that $460m) man.

First! Sorry for some minor quality of life issues. This is a 97% solution, not 100%.

Second! I am not accusing anyone of anything. I’m just checking out the data in a neat way. MLB is already monitoring this (see the SI link above for more info).

Third! Pay particular attention to the box plots for spin rate. Draw your own conclusions… outside of this one: looks like Quintana altered his delivery during the first months of the year, seen in the change in release point. Seems to have worked for him thus far.
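For anyone who wants a number to go with the box plots, the median spin rate per pitch type can be compared across the two periods. This is a rough sketch, assuming Statcast's `pitch_type` and `release_spin_rate` columns; `spin_rate_shift` is a hypothetical name, and a drop in the median is a data point to look at, not proof of anything.

```python
import pandas as pd


def spin_rate_shift(pre: pd.DataFrame, post: pd.DataFrame,
                    column: str = "release_spin_rate") -> pd.DataFrame:
    """Median of `column` per pitch type before and after the ban, plus the change."""
    pre_med = pre.groupby("pitch_type")[column].median()
    post_med = post.groupby("pitch_type")[column].median()
    out = pd.DataFrame({"pre_ban": pre_med, "post_ban": post_med})
    out["change"] = out["post_ban"] - out["pre_ban"]
    # Drop pitch types that were thrown in only one of the two periods
    return out.dropna()
```

Run it on a single pitcher's pre-ban and post-ban rows to get one line per pitch type.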

If it is interesting to you, I also used an earlier iteration of this as my final project of Harvard’s CS50 Python course (highly recommended). Here is my video from that course:

Alright, that is all I have! Hope this was useful to you. This is merely the tip of the iceberg when it comes to neat things you can do with bokeh and pybaseball (or any data, really).

Comment a pitcher and I will post a grid for them 🙂
