Project
MLB Batter vs Pitcher
Part 1 Proposal submission
In this Moneyball era, baseball has not only shown the world the value of analytics; it remains a sports pioneer through its predictive methodologies. Sports are no longer all about might, strength, and power; they are about understanding your team to a numerical degree. Yet beyond the scope of your own team, it is about better understanding your opponent. From the New England Patriots filming their opponents to try to predict what play will be called, to breaking into hotel and locker rooms to steal playbooks, teams have a long, rich tradition of trying to predict what their opponents will do in any given situation. But there is a more discreet, dare I say ethical, way to predict what your competition will do: analyze it.
Picture yourself standing in the batter’s box: bases loaded, bottom of the 9th, 2 outs, full count, and you’re down by 3. There is just one pitch left, just one more chance for you to be either the hero or the goat. Now imagine that in this moment you know with statistical significance what pitch (pitch type) your opponent will throw your way. Cue the predictive analytics.
I propose that with a handful of strategic variables you can predict pitch type. Put another way, at any given point in an at-bat, you can predict what pitch a pitcher will throw.
The strategic business objective for this information is simple yet priceless. Basically, this knowledge can put a player on base, and “runs” win games. With this information a batter could have additional confidence about when to swing for the fences, aim for the gap, or simply watch the ball go by.
The database comes from Kaggle
(https://www.kaggle.com/pschale/mlb-pitch-data-20152018#games.csv - MLB Pitch Data 2015-2018).
*note: the information above should satisfy Part b of the Proposal submission request, as you can verify the dataset from the link and see below how I will parse the data.
Three datasets needed - “pitches”, “atbats”, “player_names”
(corresponding data exists in 3 different tables - multiple joins will be needed)
Total dataset case size - 10K+
Projected dataset case size for project - 3,000 - 3,500
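The two joins described above can be sketched as follows. This is only an illustration in Python (the project itself will use SAS), and the sample rows and the helper name join_tables are hypothetical; only the column names and the join keys (id = pitcher_id, ab_id = ab_id) come from the proposal.

```python
# Hypothetical sample rows; the real rows would come from the Kaggle CSV files.
player_names = [
    {"id": 453286, "first_name": "Max", "last_name": "Scherzer"},
    {"id": 446372, "first_name": "Corey", "last_name": "Kluber"},
]
atbats = [
    {"ab_id": 1, "pitcher_id": 453286, "inning": 1, "stand": "L", "top": True},
    {"ab_id": 2, "pitcher_id": 446372, "inning": 1, "stand": "R", "top": False},
]
pitches = [
    {"ab_id": 1, "pitch_type": "FF", "b_count": 0, "s_count": 0},
    {"ab_id": 1, "pitch_type": "SL", "b_count": 0, "s_count": 1},
    {"ab_id": 2, "pitch_type": "SI", "b_count": 0, "s_count": 0},
]


def join_tables(player_names, atbats, pitches):
    """Inner-join pitches -> atbats (on ab_id) -> player_names (on pitcher_id = id)."""
    names_by_id = {p["id"]: p for p in player_names}
    atbats_by_id = {a["ab_id"]: a for a in atbats}
    joined = []
    for pitch in pitches:
        atbat = atbats_by_id.get(pitch["ab_id"])
        if atbat is None:
            continue  # pitch with no matching at-bat is dropped (inner join)
        player = names_by_id.get(atbat["pitcher_id"])
        if player is None:
            continue  # at-bat with no matching pitcher name is dropped
        joined.append({**player, **atbat, **pitch})
    return joined


rows = join_tables(player_names, atbats, pitches)
for row in rows:
    print(row["last_name"], row["ab_id"], row["pitch_type"])
```

Each output row is one pitch carrying its at-bat context and the pitcher's name, which is the shape the model would need.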
Strategic (independent) Variables required for project -
1. Pitch Count (includes two variable sets)
a. Ball Count b_count
b. Strike Count s_count
2. Outs (the number of outs before each pitch is thrown)
3. Pitch Number
4. On Base (includes three variables + weighted sets)
a. 1st Base on_1b
b. 2nd Base on_2b (multiplied by 2 to add weight to variable)
c. 3rd Base on_3b (multiplied by 3 to add weight to variable)
i. These three variables are binary - on base > True / False
1. Will decide once the model is created whether the weighting dramatically affects the results
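A minimal sketch of the weighted on-base variable, assuming the three weighted flags are summed into a single score (the proposal states the weights 1, 2, and 3 but not how they are combined, so the sum and the function name on_base_score are assumptions):

```python
from itertools import product


def on_base_score(on_1b, on_2b, on_3b):
    """Sum the binary on-base flags using the proposal's weights of 1, 2, and 3."""
    return 1 * int(on_1b) + 2 * int(on_2b) + 3 * int(on_3b)


# Score every possible base state (each flag is True or False).
scores = {state: on_base_score(*state) for state in product([False, True], repeat=3)}
for state, score in scores.items():
    print(state, score)
```

Under this assumed sum, the eight base states produce only seven distinct scores, because runners on 1st and 2nd score the same as a lone runner on 3rd. That is one concrete thing to check when comparing the weighted version against the three separate binary flags.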
Descriptive Variables
1. Player Names – id (first and last name)
2. Atbats
a. Pitcher_id
b. Inning
c. Stand (which side the batter hits from)
d. Top (binary top of inning = True / bottom of inning = False)
e. Ab_id
Target Variables
Pitch Type (primary):
CH - Changeup
CU - Curveball
EP - Eephus*
FC - Cutter
FF - Four-seam Fastball
FO - Pitchout (also PO)*
FS - Splitter
FT - Two-seam Fastball
IN - Intentional ball
KC - Knuckle curve
KN - Knuckleball
PO - Pitchout (also FO)*
SC - Screwball*
SI - Sinker
SL - Slider
UN - Unknown*
* these pitch types occur rarely
Type (could reduce to just S or B to make binary):
S (strike)
B (ball)
X (in play)
Train & Validation
Even though the initial dataset is massive (over 10,000 cases), paring it down to the necessary variables (mentioned above) will reduce the dataset to roughly 3,200+ cases. This is because I am going to focus on one pitcher. 3,200 is a rough estimate: take a single season of 162 games and assume a pitcher throws 81 pitches a game (a perfect game = 9 innings x 3 batters an inning x 3 strikes per batter) in a 4-pitcher rotation, which gives 162/4 x 81 ≈ 3,280.
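The estimate above can be checked with a few lines of arithmetic. This is just a worked version of the same back-of-the-envelope assumptions (162 games, a 4-pitcher rotation, 81 pitches per start); the variable names are mine.

```python
games_per_season = 162
rotation_size = 4

# Minimum pitches in a 9-inning start: 9 innings x 3 batters x 3 strikes each.
pitches_per_game = 9 * 3 * 3

starts_per_season = games_per_season / rotation_size   # 40.5 starts
estimated_pitches = starts_per_season * pitches_per_game

print(pitches_per_game, starts_per_season, estimated_pitches)
```

Since 81 pitches per start is the theoretical minimum for nine innings, this works as a rough planning number rather than a forecast of real pitch counts.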
Realistically, I will look at one of the top 5 most active pitchers within this dataset.
| pitcher_id | f_name | l_name | # of ab_ids |
| 453286 | Max | Scherzer | 3450 |
| 446372 | Corey | Kluber | 3373 |
| 519144 | Rick | Porcello | 3328 |
| 519242 | Chris | Sale | 3240 |
| 500779 | Jose | Quintana | 3234 |
For example: pitcher_id “453286” = Max Scherzer, who accounts for 3,450 ab_ids in this dataset. Each ab_id is tied to a pitcher_id.
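A ranking like the one above could be produced by counting ab_ids per pitcher_id. The sketch below uses a handful of hypothetical atbats rows, and the helper name most_active_pitchers is made up for illustration; it is not how the counts in the table were actually generated.

```python
from collections import Counter

# Hypothetical atbats rows: one entry per at-bat (ab_id), tagged with its pitcher_id.
atbats = [
    {"ab_id": 1, "pitcher_id": 453286},
    {"ab_id": 2, "pitcher_id": 453286},
    {"ab_id": 3, "pitcher_id": 453286},
    {"ab_id": 4, "pitcher_id": 446372},
    {"ab_id": 5, "pitcher_id": 446372},
    {"ab_id": 6, "pitcher_id": 519144},
]


def most_active_pitchers(atbats, top_n=5):
    """Return (pitcher_id, number of ab_ids) pairs, most active first."""
    counts = Counter(row["pitcher_id"] for row in atbats)
    return counts.most_common(top_n)


ranking = most_active_pitchers(atbats)
for pitcher_id, n_atbats in ranking:
    print(pitcher_id, n_atbats)
```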
To that end, and in light of what the professor said in class (October 3rd, t:29min), a smaller dataset should be split 70/30 into train/validation. Therefore, with a rough estimate of 3,200 cases, Train = 2,240 and Validation = 960. If the variable count ends up being higher, a 60/40 split may work better.
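As a sketch of the split sizes, assuming a simple random shuffle of the cases (the proposal does not say how cases will be assigned, so the shuffle, the seed, and the function name split_cases are assumptions):

```python
import random


def split_cases(cases, train_percent=70, seed=42):
    """Shuffle the cases reproducibly and split them into train and validation lists."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_percent // 100  # integer math avoids float rounding
    return shuffled[:n_train], shuffled[n_train:]


cases = list(range(3200))  # stand-in for roughly 3,200 pitch records
train, validation = split_cases(cases)
print(len(train), len(validation))

train_60, validation_40 = split_cases(cases, train_percent=60)
print(len(train_60), len(validation_40))
```

With 3,200 cases, the 70/30 split gives 2,240 and 960 as stated above, and the 60/40 alternative gives 1,920 and 1,280.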
Data Mining Techniques
Since this is my first real introduction to data mining techniques for building a predictive approach, I am going to stick to what is being taught in class. A decision tree could be interesting, but I cannot completely see how helpful it would be in a situation as fluid as a live game. A scatterplot with linear regression, along with a heat map and contour plots, could be a great way to view the predictive model.
Software
I have to stick with what I know, which is not much but is growing with each class. Thus, the SAS software suite will be used, along with Excel.
No team members; I am going it alone.
Data model (tables, keys, and joins)
Player_names: id (pkey), First_name, Last_name
atbats: ab_id (pkey), pitcher_id (skey), inning, p_throws, stand, top
pitches: ab_id, pitch_type, type, b_count, s_count, outs, pitch_num, on_1b, on_2b, on_3b
Joins: Player_names.id = atbats.pitcher_id; atbats.ab_id = pitches.ab_id