Assignment 1
10. Discriminant Analysis and Logistic Regression
2
CONTENT
Discriminant Analysis
Logistic Regression
Two cousins of multiple regression
3
Regression vs. Discriminant Analysis
Regression
Y = a + b1X1 + b2X2 + …
Dependent variable Y: interval/ratio variable
Examples: attitude toward brand, sales volume
Independent variables X: interval/ratio variables
Examples: knowledge about brand, expenditure on advertising and sales promotion
b: the impact of each independent variable (X) on the dependent variable (Y)
Discriminant analysis
Y (0/1) = a + b1X1 + b2X2 + …
Discriminant function
Dependent variable Y: nominal variable
0: non-user of a service/product
1: user of a service/product
Independent variables X: interval/ratio variables
Examples: age, years of education
b: the impact of each independent variable on separating users from non-users
4
Discriminant Analysis
Usage (0/1) = a + b1Age + b2Years of Education
Canonical discriminant function coefficients: b1 and b2
The impact of age and years of education on separating users from non-users
Hypothesis Test
Null hypothesis for b1
H0: Age cannot separate users from non-users.
Null hypothesis for b2
H0: Years of education cannot separate users from non-users.
Statistical decision
Fail to reject H0: Cannot separate
Reject H0: Can separate
5
[Figure: Cable TV User Analysis. Scatter plot with axes Age (X1) and Years of Education (X2); 0 = Non-user, 1 = User. A cut-off line divides the User region from the Non-user region.]
Discriminant function: Y(0/1) = X1
6
[Figure: Cable TV User Analysis. Scatter plot with axes Age (X1) and Years of Education (X2); 0 = Non-user, 1 = User. A cut-off line divides the User region from the Non-user region.]
Discriminant function: Y(0/1) = X2
7
[Figure: Cable TV User Analysis. Scatter plot with axes Age (X1) and Years of Education (X2); 0 = Non-user, 1 = User. A cut-off line divides the User region from the Non-user region.]
Discriminant function: Y(0/1) = a + b1X1 + b2X2
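The cut-off line in these plots can be sketched numerically. The following is a minimal illustration, assuming Fisher's two-group linear discriminant with equal group sizes. The data values and the helper names (`mean_vector`, `scatter_matrix`, `classify`) are hypothetical and are not taken from the sample file; SPSS may scale its reported coefficients differently, and here the constant a is folded into the cut-off.

```python
# Hypothetical (age, years of education) cases, invented for illustration.
non_users = [(22, 12), (24, 14), (25, 11), (23, 13), (26, 12)]  # Y = 0
users = [(28, 12), (30, 13), (27, 11), (31, 14), (29, 12)]      # Y = 1

def mean_vector(rows):
    """Mean of each of the two predictors."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(2)]

def scatter_matrix(rows, center):
    """2x2 matrix of summed squares and cross-products around the group mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - center[0], r[1] - center[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

m0, m1 = mean_vector(non_users), mean_vector(users)
s0, s1 = scatter_matrix(non_users, m0), scatter_matrix(users, m1)

# Pooled within-group covariance matrix.
dof = len(non_users) + len(users) - 2
pooled = [[(s0[i][j] + s1[i][j]) / dof for j in range(2)] for i in range(2)]

# Weights b = pooled^-1 * (m1 - m0), using the closed-form 2x2 inverse.
det = pooled[0][0] * pooled[1][1] - pooled[0][1] * pooled[1][0]
diff = [m1[0] - m0[0], m1[1] - m0[1]]
b1 = (pooled[1][1] * diff[0] - pooled[0][1] * diff[1]) / det
b2 = (-pooled[1][0] * diff[0] + pooled[0][0] * diff[1]) / det

# Cut-off: the discriminant score halfway between the two group means.
cutoff = b1 * (m0[0] + m1[0]) / 2 + b2 * (m0[1] + m1[1]) / 2

def classify(age, education):
    """Return 1 (user) if the score lies above the cut-off, else 0 (non-user)."""
    return 1 if b1 * age + b2 * education > cutoff else 0

print("b1 (age):", round(b1, 3), "b2 (education):", round(b2, 3))
print("predicted for age 30, 12 years of education:", classify(30, 12))
```

In this made-up sample the users are older on average, so the weight on age comes out positive and older cases fall on the user side of the cut-off.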
8
How good is the discriminant function?
Rule of thumb: at least 75% of the cases should be correctly classified.
Cable TV User Analysis: classification matrix
                     Predicted Non-user    Predicted User
Actual Non-user              7                   1
Actual User                  3                   9
Correctly classified: (7 + 9)/20 = 16/20 = 80%
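The percentage correctly classified can be computed directly from the classification matrix. The sketch below is illustrative only: the function names are hypothetical, and the label lists are constructed to reproduce the slide's counts (7 and 9 correct, 1 and 3 incorrect). The placement of the two off-diagonal counts is an assumption, but the hit rate depends only on the diagonal.

```python
def classification_matrix(actual, predicted):
    """Count cases by (actual, predicted) group; groups are coded 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def hit_rate(matrix):
    """Share of cases on the diagonal, i.e. correctly classified."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

# Labels arranged to match the slide's counts (off-diagonal placement assumed).
actual = [0] * 8 + [1] * 12
predicted = [0] * 7 + [1] * 1 + [0] * 3 + [1] * 9

matrix = classification_matrix(actual, predicted)
rate = hit_rate(matrix)
satisfactory = rate >= 0.75  # the 75% rule of thumb from the slide

print(matrix, f"{rate:.0%}", "satisfactory" if satisfactory else "unsatisfactory")
```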
9
Discriminant Analysis
Sample data
10_Discriminant Logistic
Y: Cable TV Usage
0: Non-user
1: User
X1: Age
X2: Years of Education
10
Data View
11
Variable View
12
13
Analyze -> Classify -> Discriminant
14
Select Cable TV Usage (Y) from the variable list, then click on the top arrow to enter it into Grouping Variable.
15
With Cable TV Usage (Y) now in Grouping Variable, click on Define Range.
16
On Discriminant An…, enter 0 in Minimum, which refers to non-user in Y, and 1 in Maximum, which refers to User in Y. Next, click on Continue to close this window.
17
Select Age (X1) from the variable list, and click on the second arrow from the top to enter it into Independent(s).
18
Select Years of Education (X2) from the variable list, and click on the second arrow from the top again to enter it into Independent(s).
19
All independent variables are entered into Independent(s). Then click on Statistics.
20
Statistics
Means
Average age of non-users
Average age of users
Average years of education of non-users
Average years of education of users
ANOVA
Do users and non-users differ in age?
Can age be used to separate users from non-users?
Do users and non-users differ in years of education?
Can years of education be used to separate users from non-users?
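What the Means and Univariate ANOVA options compute can be sketched as follows. This is a minimal illustration on hypothetical data, not the sample file; the function name `f_statistic` is an assumption. It computes only the group means and the F statistic. The Sig. value that SPSS reports additionally requires the F distribution, which is left out here.

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of the cases around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical data, invented for illustration.
age_non_users, age_users = [22, 24, 25, 23, 26], [28, 30, 27, 31, 29]
edu_non_users, edu_users = [12, 14, 11, 13, 12], [12, 13, 11, 14, 11]

print("Mean age:", mean(age_non_users), "vs", mean(age_users))
print("Mean education:", mean(edu_non_users), "vs", mean(edu_users))

f_age = f_statistic([age_non_users, age_users])
f_edu = f_statistic([edu_non_users, edu_users])
print("F (age):", round(f_age, 2), " F (education):", round(f_edu, 2))
```

A large F means the group means differ by much more than the spread within groups, which is the pattern behind a small Sig. value.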
21
On Discriminant Analysis…, select Means and Univariate ANOVA. Then click on Continue to close the window.
22
Click on Classify.
23
Classify
Summary
How many non-users are correctly identified as non-users?
How many users are correctly identified as users?
24
On Discriminant Analysis: Classific…, select Summary Table. Then click on Continue to close the window.
25
Click on OK to run discriminant analysis.
26
SPSS gives you more outputs than you need. You only need the following tables to make the interpretation. The first table is Group Statistics, and it reports the averages of age and years of education of users and non-users.
27
The second table is Tests of Equality of Group Means. Users and non-users differ in age (Sig. = .018 < .05); in other words, age can be used to separate users from non-users. Conversely, users and non-users do not differ significantly in years of education (Sig. = .564 > .05); in other words, years of education cannot be used to separate users from non-users.
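The decision rule applied to this table can be written out as a short sketch. The Sig. values are the ones reported above; the function name `can_separate` and the dictionary names are hypothetical.

```python
ALPHA = 0.05  # significance level used throughout the slides

def can_separate(sig, alpha=ALPHA):
    """Return True when H0 is rejected, i.e. the variable separates the groups."""
    return sig <= alpha

# Sig. values reported in Tests of Equality of Group Means.
sig_values = {"Age": 0.018, "Years of Education": 0.564}
decisions = {name: can_separate(sig) for name, sig in sig_values.items()}

for name, rejected in decisions.items():
    verdict = "reject H0 (can separate)" if rejected else "fail to reject H0 (cannot separate)"
    print(f"{name}: Sig. = {sig_values[name]:.3f} -> {verdict}")
```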
28
The third table is Standardized Canonical Discriminant Function Coefficients. Since we know from the previous slide that years of education cannot be used to separate users from non-users, the focus is on age. The coefficient is positive, which indicates a positive relationship between age and usage: the older a person is, the more likely that person is to be a user.
29
The last table is Classification Results. 80% of the cases are classified correctly, which exceeds the 75% benchmark and indicates that the discriminant analysis is successful.
Overall interpretation of the four tables: The average age of non-users is 25, and that of users is 28. The average years of education is 12.50 for non-users and 11.70 for users. Whether an individual is a user or non-user of cable TV is determined by age: the higher the age, the higher the likelihood of being a cable TV user. 80% of the cases are classified correctly, which indicates that the discriminant analysis is successful.
30
An additional note about Standardized Canonical Discriminant Function Coefficients: suppose, hypothetically, that we ran another discriminant analysis separating users and non-users of YouTube. In this hypothetical analysis, years of education again cannot be used to separate users from non-users, so the focus is again on age. This time the coefficient is negative, which indicates a negative relationship between age and usage: the younger a person is, the more likely that person is to be a user of YouTube.
31
Regression/Discriminant Analysis/Logistic Regression
Regression
Y = a + b1X1 + b2X2 + …
Discriminant analysis
Y (0/1) = a + b1X1 + b2X2 + …
Logistic regression/logit regression
Y (0/1) = a + b1X1 + b2X2 + …
X: covariates
b: regression coefficients
The impact of the independent variables on separating users from non-users
Null hypothesis for b1
H0: X1 cannot separate users from non-users.
Null hypothesis for b2
H0: X2 cannot separate users from non-users.
Statistical decision
Fail to reject H0: Cannot separate
Reject H0: Can separate
Compared to discriminant analysis
Easier to run
But…
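The logistic regression line on this slide, Y (0/1) = a + b1X1 + b2X2, is shorthand. In the standard formulation of binary logistic regression, the linear combination models the log-odds of being a user rather than the 0/1 value itself. A sketch for two covariates, where p (the probability that Y = 1) is notation introduced here and not used on the slides:

```latex
\ln\!\left(\frac{p}{1-p}\right) = a + b_1 X_1 + b_2 X_2
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(a + b_1 X_1 + b_2 X_2)}}
```

A positive b1 therefore means that a larger X1 raises the probability of being a user, and a negative b1 lowers it.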
32
Logistic Regression
Sample data
10_Discriminant Logistic
Y: Cable TV Usage
0: Non-user
1: User
X1: Age
X2: Years of Education
33
Analyze -> Regression -> Binary Logistic
34
Select Cable TV Usage (Y) from the variable list, then click on the top arrow to enter it into Dependent.
35
Cable TV Usage (Y) now in Dependent. Select Age (X1) from the variable list, and click on the second arrow from the top to enter it into Covariates.
36
Select Years of Education (X2) from the variable list, and click on the second arrow from the top again to enter it into Covariates.
37
All independent variables are entered into Covariates. Then click on OK to run logistic regression.
38
SPSS gives you more outputs than you need. You only need the following tables to make the interpretation. One is Classification Table, and it reports that 70% of the cases are correctly classified.
39
The other is Variables in the Equation. Age can separate users from non-users (Sig. = .030 < .05). The coefficient (B) of Age is positive, which indicates a positive relationship between age and usage of cable TV. Years of education cannot separate users from non-users (Sig. = .313 > .05), so its coefficient is disregarded.
Interpretation: Whether an individual is a user or non-user of cable TV is determined by age: the higher the age, the higher the likelihood of being a cable TV user. 70% of the cases are classified correctly, which falls below the 75% benchmark and indicates that the logistic regression analysis is not very satisfactory.
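How a positive coefficient on age translates into predicted probabilities and a classification rate can be sketched as below. Everything numeric here is hypothetical: the slides do not report the B values, so the coefficients `A`, `B_AGE`, and `B_EDU`, the ten cases, and the 0.5 cut-off are assumptions chosen for illustration, and the function names are invented.

```python
import math

# Hypothetical coefficients, invented for illustration (not SPSS output).
A, B_AGE, B_EDU = -7.5, 0.3, -0.05

def predicted_probability(age, education):
    """Probability of being a user under the logistic model."""
    z = A + B_AGE * age + B_EDU * education
    return 1.0 / (1.0 + math.exp(-z))

def predicted_group(age, education, cutoff=0.5):
    """Classify as user (1) when the predicted probability reaches the cut-off."""
    return 1 if predicted_probability(age, education) >= cutoff else 0

# Hypothetical cases: (age, years of education, actual group).
cases = [
    (22, 12, 0), (24, 14, 0), (25, 11, 0), (29, 13, 0), (23, 12, 0),
    (28, 12, 1), (30, 13, 1), (24, 11, 1), (31, 14, 1), (26, 12, 1),
]

correct = sum(predicted_group(age, edu) == y for age, edu, y in cases)
rate = correct / len(cases)

print(f"P(user | age 20) = {predicted_probability(20, 12):.2f}")
print(f"P(user | age 30) = {predicted_probability(30, 12):.2f}")
print(f"Correctly classified: {correct}/{len(cases)} = {rate:.0%}")
```

With a positive coefficient on age, the predicted probability rises with age, and cases are counted as correctly classified when the predicted group matches the actual group.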
40
Regression/Discriminant Analysis/Logistic Regression
Regression
Y = a + b1X1 + b2X2 + …
Discriminant analysis
Y (0/1) = a + b1X1 + b2X2 + …
Logistic regression/logit regression
Y (0/1) = a + b1X1 + b2X2 + …
X: covariates
b: regression coefficients
The impact of the independent variables on separating users from non-users
Null hypothesis for b1
H0: X1 cannot separate users from non-users.
Null hypothesis for b2
H0: X2 cannot separate users from non-users.
Statistical decision
Fail to reject H0: Cannot separate
Reject H0: Can separate
Compared to discriminant analysis
Easier to run
But
Does not report group statistics
Small sample size leads to underperformance
41
Summary of Statistical Analyses
Discriminant Analysis (Slides 3-30)
Dependent variable: nominal
Independent variable: interval/ratio
Null hypothesis: significance value > .05 -> fail to reject the null hypothesis; significance value ≤ .05 -> reject the null hypothesis
Classification rate: percentage correctly classified ≥ 75% -> satisfactory; percentage correctly classified < 75% -> unsatisfactory
Logistic Regression (Slides 31-40)
Dependent variable: nominal
Independent variable: interval/ratio
Null hypothesis: significance value > .05 -> fail to reject the null hypothesis; significance value ≤ .05 -> reject the null hypothesis
Classification rate: percentage correctly classified ≥ 75% -> satisfactory; percentage correctly classified < 75% -> unsatisfactory
Pro: Easier to run; Con: No group statistics, underperformance due to small sample size (Slide 40)
44
WRAP-UP
Differences and similarities among Regression/Discriminant Analysis/Logistic Regression
SPSS procedures for Discriminant Analysis/Logistic Regression