
MLB Batter vs Pitcher

Part 1 Proposal submission

In this Moneyball era, baseball has not only shown the world the value of analytics, it remains a sports pioneer through its predictive methodologies. Sports is no longer all about might, strength, and power; it is about understanding your team to a numerical degree. Beyond your own team, it is about better understanding your opponent. From the New England Patriots filming their opponents to try to predict what play will be called, to breaking into hotel and locker rooms to steal playbooks, teams have a long, rich tradition of trying to predict what their opponents will do in any given situation. But there is a more discreet, dare I say ethical, way to predict what your competition will do: analyze it.

Picture yourself standing in the batter's box: bases loaded, bottom of the 9th, 2 outs, full count, and you’re down by 3. There is just one pitch left, just one more chance for you to be either the hero or the goat. Now imagine that in this moment you know, with statistical significance, what pitch (pitch type) your opponent will throw your way. Cue the predictive analytics.

I propose that with a handful of strategic variables you can predict pitch type. Put another way, at any given point in an at bat, you can predict what pitch a pitcher will throw. 

The strategic business objective for this information is simple yet priceless. This knowledge can put a player on base, and runs win games. With this information a batter could have additional confidence about when to swing for the fences, aim for the gap, or simply watch the ball go by.

The database comes from Kaggle

(https://www.kaggle.com/pschale/mlb-pitch-data-20152018#games.csv - MLB Pitch Data 2015-2018). 

*note: the information above should satisfy Part b of the Proposal submission request, as you can verify the dataset from the link and see below how I will parse the data.

Three datasets needed - “pitches”, “atbats”, “player_names”

(corresponding data exists in 3 different tables - multiple joins will be needed)

Total dataset case size - 10K+

Projected dataset case size for project - 3,000 - 3,500

Strategic (independent) Variables required for project - 

1. Pitch Count (includes two variable sets)

a. Ball Count b_count

b. Strike Count s_count

2. Outs (the number of outs before each pitch is thrown)

3. Pitch Number

4. On Base (includes three variables + weighted sets)

a. 1st Base on_1b

b. 2nd Base on_2b (multiplied by 2 to add weight to variable)

c. 3rd Base on_3b (multiplied by 3 to add weight to variable)

i. These three variables are binary: runner on base = True / False

1. Once the model is created, I will decide whether to keep the weights, depending on whether they dramatically affect the results
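The weighting described in item 4 could be sketched as follows. This is only an illustration under assumptions: the function name `base_state_score` and the `weighted` switch are hypothetical, and the weights 1, 2, 3 are the ones proposed in the list above.

```python
def base_state_score(on_1b: bool, on_2b: bool, on_3b: bool, weighted: bool = True) -> int:
    """Combine the three binary on-base flags into a single number.

    With weighted=True, 2nd base counts double and 3rd base counts triple,
    as proposed above. With weighted=False every occupied base counts as 1.
    """
    weights = (1, 2, 3) if weighted else (1, 1, 1)
    flags = (on_1b, on_2b, on_3b)
    return sum(w * int(f) for w, f in zip(weights, flags))


# Bases loaded: 1 + 2 + 3
print(base_state_score(True, True, True))    # 6
# Runner on 3rd only
print(base_state_score(False, False, True))  # 3
# Runners on 1st and 2nd: 1 + 2
print(base_state_score(True, True, False))   # 3
```

Note that a single weighted sum cannot tell "runner on 3rd only" apart from "runners on 1st and 2nd" (both score 3), which is one reason to compare the weighted and unweighted versions, or to keep the three flags separate, once the model exists.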

Descriptive Variables

1. Player Names – id (first and last name)

2. Atbats

a. Pitcher_id

b. Inning

c. Stand (which side the batter hits from)

d. Top (binary top of inning = True / bottom of inning = False)

e. Ab_id

Target Variables

Pitch Type (primary)

Type (could reduce to just S or B to make binary)

CH - Changeup

CU - Curveball

EP - Eephus*

FC - Cutter

FF - Four-seam Fastball

FO - Pitchout (also PO)*

FS - Splitter

FT - Two-seam Fastball

IN - Intentional ball

KC - Knuckle curve

KN - Knuckleball

PO - Pitchout (also FO)*

SC - Screwball*

SI - Sinker

SL - Slider

UN - Unknown*

* these pitch types occur rarely

S (strike)

B (ball)

X (in play)
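Reducing Type to a binary target could look like the sketch below. How to treat X (in play) is not decided in this proposal; dropping those rows is an assumption made here only for illustration, and the helper name `to_binary_type` is hypothetical.

```python
from typing import Optional


def to_binary_type(type_code: str) -> Optional[int]:
    """Return 1 for a strike (S), 0 for a ball (B), None for anything else (e.g. X)."""
    mapping = {"S": 1, "B": 0}
    return mapping.get(type_code.strip().upper())


# Made-up sequence of Type codes, for illustration only
observed = ["S", "B", "X", "B", "S"]

# Assumption: rows that are neither S nor B are dropped from the binary target
binary_targets = [t for t in map(to_binary_type, observed) if t is not None]
print(binary_targets)  # [1, 0, 0, 1]
```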

Train & Validation

Even though the initial dataset is massive (over 10,000 cases), paring it down to the necessary variables (mentioned above) will reduce it to roughly 3,200+ cases, because I am going to focus on one pitcher. 3,200 is a rough estimate: take a single season of 162 games, assume a pitcher throws 81 pitches a game (a perfect game = 9 innings x 3 batters an inning x 3 strikes per batter), and assume a 4-pitcher rotation, which gives 162/4 x 81, or about 3,280 pitches.
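The back-of-the-envelope estimate above can be checked in a few lines. The variable names are only illustrative; the figures are the ones stated in the paragraph.

```python
# Pitches in an idealized game: every batter strikes out on three pitches
innings = 9
batters_per_inning = 3
strikes_per_batter = 3
pitches_per_game = innings * batters_per_inning * strikes_per_batter

# Starts per season for one pitcher in a 4-pitcher rotation
games_per_season = 162
rotation_size = 4
starts_per_season = games_per_season / rotation_size

estimated_pitches = starts_per_season * pitches_per_game
print(pitches_per_game)   # 81
print(starts_per_season)  # 40.5
print(estimated_pitches)  # 3280.5
```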

Realistically, I will look at one of the top 5 most active pitchers within this dataset.

pitcher_id    f_name    l_name      # of ab_ids
453286        Max       Scherzer    3450
446372        Corey     Kluber      3373
519144        Rick      Porcello    3328
519242        Chris     Sale        3240
500779        Jose      Quintana    3234

For example: pitcher_id “453286” = Max Scherzer, who has 3,450 ab_ids in this dataset. There is an ab_id tied to each pitcher_id.

To that end, and in light of what the professor said in class (October 3rd t:29min), a smaller dataset should be set up at 70/30 - train/validation. Therefore, with a rough estimate of 3,200 cases, Train = 2,240 and Validation = 960. If the variable count ends up being higher, a 60/40 split may work better.
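A 70/30 split of this kind could be sketched as below. The project itself will use SAS, so this is only an illustration of the arithmetic; the function name `train_validation_split`, the fixed seed, and the use of a simple random shuffle are all assumptions.

```python
import random


def train_validation_split(cases, train_fraction=0.7, seed=42):
    """Shuffle the cases reproducibly, then cut them into train and validation parts."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the roughly 3,200 cases estimated above
train, validation = train_validation_split(range(3200))
print(len(train), len(validation))  # 2240 960

# The 60/40 alternative mentioned above
train_60, validation_40 = train_validation_split(range(3200), train_fraction=0.6)
print(len(train_60), len(validation_40))  # 1920 1280
```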

Data Mining Techniques

Since this is my first real introduction to data mining techniques for building a predictive approach, I am going to stick to what is being taught in class. A decision tree could be interesting, but I cannot completely see how helpful it could be in a situation as fluid as a live game. A scatterplot with linear regression, along with a heat map and contour plots, could be a great way to view the predictive model.

Software

I have to stick with what I know, which is not much but is growing with each class. Thus the SAS software suite will be used, along with Excel.

No team members; I am going this alone.

Player_names

id (pkey)

First_name

Last_name

atbats

ab_id (pkey)

pitcher_id (skey)

inning

p_throws

stand

top

pitches

ab_id

pitch_type

type

b_count

s_count

outs

pitch_num

on_1b

on_2b

on_3b

Join keys:

player_names.id = atbats.pitcher_id

atbats.ab_id = pitches.ab_id
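The two joins over the three tables listed above could be sketched as follows. The actual work will be done in SAS, so this is only an illustration of the logic. The sample rows (ab_id values, pitch values) are made up, the function name `join_pitches` is hypothetical, and only a subset of the columns is shown; the two pitcher ids and names come from the proposal's pitcher table.

```python
# Tiny made-up samples that follow the three table layouts listed above
player_names = [
    {"id": 453286, "first_name": "Max", "last_name": "Scherzer"},
    {"id": 446372, "first_name": "Corey", "last_name": "Kluber"},
]
atbats = [
    {"ab_id": 1, "pitcher_id": 453286, "inning": 1, "stand": "L", "top": True},
    {"ab_id": 2, "pitcher_id": 446372, "inning": 1, "stand": "R", "top": False},
]
pitches = [
    {"ab_id": 1, "pitch_type": "FF", "type": "S", "b_count": 0, "s_count": 0, "pitch_num": 1},
    {"ab_id": 1, "pitch_type": "SL", "type": "B", "b_count": 0, "s_count": 1, "pitch_num": 2},
    {"ab_id": 2, "pitch_type": "SI", "type": "X", "b_count": 0, "s_count": 0, "pitch_num": 1},
]


def join_pitches(pitches, atbats, player_names, pitcher_id=None):
    """Join pitches -> atbats on ab_id, then atbats -> player_names on pitcher_id = id."""
    atbat_by_id = {ab["ab_id"]: ab for ab in atbats}
    player_by_id = {p["id"]: p for p in player_names}
    joined = []
    for pitch in pitches:
        atbat = atbat_by_id.get(pitch["ab_id"])
        if atbat is None:
            continue  # pitch without a matching at bat (inner-join behaviour)
        if pitcher_id is not None and atbat["pitcher_id"] != pitcher_id:
            continue  # keep only the pitcher of interest
        player = player_by_id.get(atbat["pitcher_id"], {})
        joined.append({
            **pitch,
            **atbat,
            "first_name": player.get("first_name"),
            "last_name": player.get("last_name"),
        })
    return joined


# Keep only one pitcher, as planned for the project
scherzer_rows = join_pitches(pitches, atbats, player_names, pitcher_id=453286)
print(len(scherzer_rows))  # 2
```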