project

profilenirmak
PROJECTC.docx

PROJECT C:

· Read all documents in module

· Build data set with 1 Y dependent variable, 7 X independent quantitative variables, 2 X independent binary variables, and 1 X independent categorical variable.

· Run the multiple regression test on the Full Dataset.

· Correct any error messages.

· use "2020 Directions for Multiple Regression Test" to run the data and get to the Final Model

· Create Slides (Google version of PowerPoint) presentation

· Follow the step by step directions of "Project C Slides Directions"

Directions for Running Multiple Regression Test 2020

How to Move/Copy individual Sheets in Excel:

In Excel, your entire project is called the Workbook, or Book for short, and each tab in the Workbook is called a Spreadsheet, or Sheet for short.

Any time you want to make a change/edit/delete to the project C data set, rename and copy the individual Sheet you are working on before you make the change, then make the changes to the copy you just created. This ensures that you stay organized and that every change you make is recorded.

To do this, right-click on “Sheet1” at the bottom of your Book, then select “Rename” and name it something appropriate (Short names are better). Next, right-click on your newly named tab and select “Move or Copy…”. One here, you will click on the checkbox at the bottom that says, “Create a copy” and select where you want the copy to go, ( “(move to end)” is usually best) and click “OK.”  Repeat these steps every time you need to make a change/edit/delete to the data.

Process 1: Building the Dataset

Use what you have learned from the video lessons, the 2020 Excel tips, as well as the advice from Professor Decker and Emily to build your dataset. You will need 20 data points, 1 Y dependent variable, and 10 X independent variables: 7 quantitative, 2 binary, 1 categorical (11 variables total). The dataset with all 20 data points and 11 variables is called the “Full Dataset.”

Tips for building the data set:

· Do not use a topic about sports.

· USE GOLDMINE

· If you choose counties for your 20 data points, pick ones with populations over 80,000.

· The Y dependent variable is your most important decision, this is what your entire project is about (try to pick something other than population or area for this variable).

· Your 7 quantitative variables should be rates/percents (nothing should be 0). However, please do not pick percent female or male. Additionally, you can have the total population listed as a quantitative variable (this will be the only total allowed).  

· The binary variables answer a yes or no question, where 1=yes and 0=no. Your chosen variable must have at least three 1’s and at least three 0’s. 

· The categorical variable also answers a yes or no question, but these are broken into 3 groups with a reference level THAT IS NEVER PART OF YOUR MODEL. The reference level is chosen by you, just make sure you keep track of what you chose and why. 

· Please refer to “Multiple Regression Data Rules 2020” if you have any other questions about the original dataset. 

STOP NOW! EMAIL PROFESSOR DECKER AND EMILY! YOU MUST GET YOUR DATASET APPROVED BEFORE MOVING ON TO PROCESS 2! 

Process 2: Seek and Destroy Collinear Variables

Collinear variables are two variables that are correlated, so they should have a low p-value when they are run together in a simple regression test. Even one pair of colinear variables will ruin the study. Collinear variables must be avoided at all costs!

  

· Consider any p-value less than 0.10 to indicate that the variables are collinear.  

An easy way to do this is start with Independent X Variable 1 and use it to run a simple regression test against another independent variable that you think is collinear. For example, Independent X Variable 2. If the regression test’s p-value is less than 0.10, delete one of the two independent x variables that you tested. (REMEMBER, if you make any changes/edits/deletes create a copy of your sheet!)

· Test at least 5 pairs of variables. Choose which pairs to test by looking for any pairs that you think might have significant correlation. However, if you have any reason to believe there are other pairs of variables that correlate, test them too!

· The dataset after the collinear deletions is the “MC-free dataset,” even if no deletions are made. MC-free stands for multicollinearity-free because multicollinearity is a measure of how collinear the variables are in a multiple regression test. Your MC-free dataset must have at least 6 independent X variables. If you have less than 6, add new variables, but test them for being collinear to the old variables.

STOP NOW! EMAIL PROFESSOR DECKER AND EMILY! YOU MUST GET YOUR DATASET APPROVED BEFORE MOVING ON TO PROCESS 3! 

Process 3: Eliminating Insignificant Variables

· Run all the variables in the MC-free dataset in a multiple regression correlation test and delete the variables with the highest p-values until you have a total of 6 X variables remaining (If you begin this process with 6 variables, move on to the next bullet point).

· Next, run another multiple regression correlation test and delete the variable with the highest p-value. This will leave you with 5 X variables. 

· Lastly, run one more multiple regression correlation test and delete the variable with the highest p-value. This will leave you with exactly 4 independent X variables (or your “Significant Data Set”).  

STOP NOW! EMAIL PROFESSOR DECKER AND EMILY! YOU MUST GET YOUR DATASET APPROVED BEFORE MOVING ON TO PROCESS 4! 

Process 4: Finding a Final Model 

· A superior strategy for building a multiple regression model is to test all possible combinations of variables and choose the combination that has approximately the highest adjusted r2, but fewest number of variables.  

· This means that the best model has the highest adjusted r2 but if two or more models have similar adjusted r2 numbers, then choose the model with the least number of variables. If two models have the exact same number of variables, then choose the model with strictly the largest adjusted r2. (Adjusted r2 values are approximately the same if they are within 0.05).

Conduct 15 multiple regression tests; one test for each possible combination of the four remaining independent variables (V1, V2, V3, and V4).  Below is all the possible combinations of tests you need to do:

1. V1, V2, V3, V4

2. V1, V2, V3

3. V1, V2, V4

4. V1, V3, V4

5. V2, V3, V4

6. V1, V2

7. V1, V3

8. V1, V4

9. V2, V3

10. V2, V4

11. V3, V4

12. V1

13. V2

14. V3

15. V4

· Find the model with the highest adjusted r2 and any models that have adjusted r2 within 0.05 of the highest one. From those models, choose the one with the least number of variables.  If two models are tied for the least number of variables, choose the one with the highest r2 from those two.

· Your chosen model’s dataset is known as your “Final Model.”

Data Rules for Multiple Regression – Set 4A for Project C

Excel analyzes a data set in multiple regression by dividing the data into every possible combination of “boxes” (groups) based on what levels the data points are in for qualitative variables and the magnitude of their quantitative variables. It then calculates what the value of the dependent variables would be for each box. Problems arise when identical boxes are created because it makes the independent variables dependent on each other resulting in collinear variables.

Violating these rules will cause an error message in the p-value on your analysis print out.  One error message will ruin your project!  Contact the professor for help immediately if you cannot fix an error message in your print out.

The examples are for a model of real estate where the dependent variable is the price of the homes.

Rule #1: Data points may not have a value of zero for quantitative variables.

Reason and Solution: Zero is a very low number when compared to the values of the other data points.  This makes data points with zeros major outliers. The outlier will ruin the calculation. If only one of your data points is zero, remove it as an outlier. If you have several zeros, convert the quantitative variable to a qualitative variable by coding the data points that have values that are not zero as “1” and the data points that have values that are zero as “0.”

Example:  If some homes have an HOA fee of a few hundred dollars and some homes do not have an HOA so there is no HOA fee, make this variable qualitative by having homes with HOA fees coded as “1” and homes without HOA fees coded as “0” instead of entering the HOA fees as their quantitative numbers where homes without HOA fees entered as zeros.

Rule #2: Quantitative variables for rates cannot be complements or each other (add to 100%) and one quantitative variable cannot be determined by an algorithm (formula) of other quantitative variables.

Reason and Solution: Independent variables must be independent of each other. If one independent variable can be calculated based on other independent variables, then they are perfectly collinear. You must remove at least one of these variables and you can keep the rest.

Example: You cannot have the rates of all the races for a county because all these rates must sum to 100%. Remove at least one of the races with a high percent; removing a race with a low percent will still make the sum of the rest almost 100%. 

Rule #3: Every column of qualitative variables must have at least three 1’s and at least three 0’s.

Reason and Solution: If all data points have something (or don’t have something), then there is no information to be gained from this variable as all the data points would go in the same box. There is no point in including a variable like this and the variable should be removed.

Example: If all houses have pools, then having a pool does not play into the decision of which house to buy, since all the houses are the same with respect to pools. Use the same reasoning if no houses have pools. In both cases, there is nothing to be gained from having a pool as a qualitative variable, so it should not be included.

Rule #4:  Two columns of binary variables cannot be identical and cannot be perfect opposites (where one is 1 the other is 0 OR where one is 0 the other is 1).

Reason and Solution: Having the exact same columns will always put both variables together in the same box. These variables are redundant to each other and including the second one will give no more information than the first one. This can be solved by removing one of the redundant variables.

Example: If every house that has a pool has a screen porch and every house that has a screen porch has a pool, then the two features can be combined into one feature: pool with screen porch.  Since you can’t consider one feature without the other, there is no point in listing them as separate.

Rule #5: All data points of one binary variable cannot be in (or not in) one level of a categorical variable and vice versa.

Reason and Solution: If all (or none) of the data points of one qualitative variable go into the same box of another qualitative variable, then you have a box within a box and make it impossible to make independent calculations. You can try to add more data points (at least three) that are not (or are in) the binary variable but are in the level of the categorical variable. If you cannot do that, you will have to remove the binary variable.

Example:  If the categorical variable of location (by zip code) has three levels (three different zip codes), and all the homes in one zip code have a pool (or do not have a pool) then this makes “pool” collinear with one level of zip code.  You can add at least three homes that are in that zip code that have pools (or do not have pools). If you cannot do that then you will have to remove pools.

Rule #6: Don’t forget that the reference level of a categorical variable does not have a column.

Reason and Solution: The other levels of a categorical variable are all compared to the reference level.  To avoid having the reference level compared to itself, it does not get a column. To code for a data point in the reference level, enter zeros for the other levels.

Example: If the three levels of a location categorical variable are North, Downtown, and South and the homes in South are more expensive, then only have columns for North and Downtown with no column for South. If a data point is in South, code it as a 0 for North and 0 for Downtown.  If it is not in the North and not in Downtown, then it must be in the South.

Ideas for Data

Independent Variables

· Profession

· People

· Schools

· Foods

· Products (cars, clothes, movies)

· Real Estate (houses, condos, apartments)

· Politicians

Quantitative Rates

· Population

· Area

· Social rates (crime, literacy, graduation, health, poverty, etc.)

· Cost, profits, revenue

· Age

· Height/Weight

· Price (food, cars, product, services)

· Food properties (calories, fat, protein, etc.)

· Income

· Temperature

Binary/Categorical

· Presence of institutions (university [possibly with minimum enrollment], hospital, stadium)

· Capitals

· Laws 

· Presence of sports team

· Geographic area (region, state, country, continent)

· Conference or division

· Climate

· Education Level

· Brands of products

· Food groups

· Type of product (SUV, truck, sedan, etc.)

· Language

· Government type

· Political party of winner of election

Project C Multiple Regression Slides Presentation Checklist

Make a step by step presentation of how you developed your multiple regression model. 

Use this model to make real world and relevant predictions.

Evaluate, analyze, and critique the results.

Follow the procedure in PowerPoint and videos 4A Multiple Regression. However, there have been some 2020 changes, so please be careful to note them, as them have been explained below and denoted with *2020

The final project should be done completely in Slides (Google’s version of PowerPoint). You will put your Slides link in the dropbox and upload your Excel file in the dropbox as well.

If a part is in the Excel file, but not in the Slides presentation, it will not be considered part of the project and will not be graded. Please put everything in Slides; the Excel file is only for the professor’s reference.

Be sure to cite all sources of data and research.

Criteria

To be clear: It is unacceptable to leave out any processes. If you have trouble with a process, reach out to the professor or TA.  

Expectations:

· Directions are followed 100%. Students will reach out to the professor or TA immediately if there is a question about directions.

· Students read and follow all documents posted in the Project C module.

· All parts will receive a thoughtful and appropriate response. Students should use the checklist on the next page to make sure all parts are complete.

· Slides presentation is proofread, virtually free of errors, and is visually appealing.

· Students act as a professional business manager by treating the project as if it is a high priority task given by a boss in a business.

· Students collaborate with the professor or TA by e-mail, office hours, or help hours.

· Students may work with other students to help each other succeed.

· Students prioritize completion of this project in their busy schedule, do not procrastinate, and meet all deadlines.

·

Parts to Complete: Checklist

· Slides should not be cluttered with writing; less is more when deciding how much writing to put on one slide

· Students should meet all requirements for what to put on a slide but should keep writing concise, answering questions in as short a way as possible that still meets requirements; less is more!

· Do not copy any written content from the internet.  Students may take information, numbers, facts, and photographs from the internet without citation, but all written sentences should be written originally by students, not copied or paraphrased.

Introduction:

· Why you chose the topic

· How it is important 

· Why you chose your one dependent variable

· Explain what ALL variables mean, with units 

· Why you chose your alpha

· Your predictions on the results before building the model

Building “Full Dataset”

· At least 20 data points

· Exactly 7 independent quantitative variables

· Exactly 2 binary categorical variables

· Exactly 1 categorical variable with at least three levels

. Explain why the reference level was chosen

Seek and Destroy Collinear Variables and Outliers to create “MC-free Dataset”

· Run at least 5 pairs of independent variables through simple correlation tests to determine if any pairs are collinear *2020

· Check for outlier data points; if necessary, delete outlier

· Write 1 paragraph on outliers

. The existence, or absence of, outliers

. Your decision to remove them or not

· Must have at least 6 variables left

Eliminate variables to make “Significant Dataset”

· Run Excel data analysis multiple regression rest

· Delete variable with highest p-value *2020

· Repeat above two steps until there are exactly 4 variables left *2020

Determining “Final Model”

· Run 15 multiple regression tests on all possible combinations of including or excluding final 4 variables in model *2020

· Identify the model with highest adjusted r2 and any models with adjusted  r2 within 0.05 of the highest *2020

· From the models in the above step, choose the model with the least number of variables as “Final Model.”  If models are tied for the least number of variables, choose the model with the highest r2 from those. *2020

Checklist continues on next page →

Analysis based on the “Final Model”

· List significant variables in final model

· Interpret r2

· Explain the coefficients of the significant variables

· Give regression equation

· Use the regression equation to make  3 predictions for data points that are not in the model but were from the large online data set from where you choose your data points (for example, if your data points are states, make predictions for states you did not use in your Full Dataset.)  If there are no data points left that are not in the model, make up 3 data points that would have some significance and make predictions but you will not need to find residuals for these.

· Find residuals for the data points above

Conclusions based on the “Final Model.”

· If, in the end, you have a good, fair, or poor model

· The strength of your model to make predictions

· How your model can be used in the real world to make decisions

· Your reflection on why the variables were, or were not, significant

· If your final model was influenced by outliers

· You reflection on your choice of removing outliers

· What you learned about the topic

· What you learned about statistics

· How you can (or cannot) apply what you learned to your life, current career, or future  career