
10. Discriminant Analysis and Logistic Regression

2

CONTENT

Discriminant Analysis

Logistic Regression

Two cousins of multiple regression

3

Regression vs. Discriminant Analysis

Regression

Y = a + b1X1 + b2X2 + …

Dependent variable Y: interval/ratio variables

Attitude toward brand/Sales volume

Independent variable X: Interval/ratio variables

Knowledge about brand/Expenditure on advertising, sales promotion

b: the impact of the independent variables (Xs) on (determining) the dependent variable (Y)

Discriminant analysis

Y (0/1) = a + b1X1 + b2X2 + …

Discriminant function

Dependent variable Y: nominal variable

0: nonuser of a service/product

1: user of a service/product

Independent variable X: Interval/ratio variables

age, years of education

b: the impact of the independent variables on determining who is a user and who is a non-user (separating users from non-users)

4

Discriminant Analysis

Usage (0/1) = a + b1Age + b2Years of Education

Canonical discriminant function coefficients: b1 and b2

The impact of age and years of education on separating users from non-users

Hypothesis Test

Null hypothesis for b1

H0: Age cannot separate users from non-users.

Null hypothesis for b2

H0: Years of education cannot separate users from non-users.

Statistical decision

Do not reject H0: Cannot separate

Reject H0: Can separate
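To make the discriminant function Usage (0/1) = a + b1Age + b2Years of Education more concrete, the sketch below fits a two-group linear discriminant function by hand. It is only an illustration under stated assumptions: the ten cases are hypothetical (they are not the 10_Discriminant Logistic data), the two groups are given equal prior weight, and the coefficients are unstandardized, so they will not match the scale of the canonical coefficients SPSS reports.

```python
# Hypothetical (age, years of education) cases; NOT the course data set.
non_users = [(22, 12), (24, 13), (25, 12), (26, 14), (28, 11)]
users = [(27, 12), (29, 11), (30, 13), (31, 12), (33, 12)]


def mean_vector(rows):
    """Mean of each of the two variables."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, mean):
    """2x2 within-group sums of squares and cross-products."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


m0 = mean_vector(non_users)
m1 = mean_vector(users)
s0 = scatter(non_users, m0)
s1 = scatter(users, m1)

# Pooled within-group covariance matrix.
dof = len(non_users) + len(users) - 2
pooled = [[(s0[i][j] + s1[i][j]) / dof for j in range(2)] for i in range(2)]

# Invert the 2x2 pooled matrix.
det = pooled[0][0] * pooled[1][1] - pooled[0][1] * pooled[1][0]
inv = [[pooled[1][1] / det, -pooled[0][1] / det],
       [-pooled[1][0] / det, pooled[0][0] / det]]

# b = inverse(pooled) * (mean of users - mean of non-users).
diff = [m1[0] - m0[0], m1[1] - m0[1]]
b = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
     inv[1][0] * diff[0] + inv[1][1] * diff[1]]

# Constant a puts the cut-off (score = 0) midway between the group means.
midpoint = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
a = -(b[0] * midpoint[0] + b[1] * midpoint[1])


def classify(age, education):
    """Return 1 (user) if the discriminant score is above the cut-off, else 0."""
    score = a + b[0] * age + b[1] * education
    return 1 if score > 0 else 0


print("a =", round(a, 3), "b1 =", round(b[0], 3), "b2 =", round(b[1], 3))
print(classify(30, 12), classify(25, 12))
```

In this made-up sample the coefficient on age is positive, so older cases score higher and fall on the user side of the cut-off.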

5

Cable TV User Analysis

[Figure: scatter plot of cases on two axes, Age (X1) and Years of Education (X2). 0 = Non-user, 1 = User. A cut-off line separates the User cluster from the Non-user cluster.]

Discriminant function

Y(0/1) = X1

6

Cable TV User Analysis

[Figure: scatter plot of cases on two axes, Age (X1) and Years of Education (X2). 0 = Non-user, 1 = User. A cut-off line separates the User cluster from the Non-user cluster.]

Discriminant function

Y(0/1) = X2

7

Cable TV User Analysis

[Figure: scatter plot of cases on two axes, Age (X1) and Years of Education (X2). 0 = Non-user, 1 = User. A cut-off line separates the User cluster from the Non-user cluster.]

Discriminant function

Y(0/1) = a + b1X1 + b2X2

8

How good is the discriminant function?

At least 75% of the cases should be correctly classified

Cable TV User Analysis

Correctly classified: 16/20 = 80%

Classification matrix (Actual vs. Predicted; groups: Non-user, User)

Cell counts: 7, 1, 3, 9 (the correctly classified cells are 7 and 9, which sum to 16)
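The 16/20 = 80% figure can be reproduced with a few lines of Python. This is a minimal sketch: the diagonal counts (7 and 9) follow from the slide because they are the only pair that sums to 16, but which off-diagonal cell holds the 1 and which holds the 3 is an assumption here; it does not change the hit rate.

```python
# Keys are (actual, predicted). Off-diagonal placement of 1 and 3 is assumed.
matrix = {
    ("non-user", "non-user"): 7,
    ("non-user", "user"): 1,
    ("user", "non-user"): 3,
    ("user", "user"): 9,
}

total = sum(matrix.values())
correct = sum(count for (actual, predicted), count in matrix.items()
              if actual == predicted)
hit_rate = correct / total

# The slides' rule of thumb: at least 75% correctly classified.
satisfactory = hit_rate >= 0.75

print(f"{correct}/{total} = {hit_rate:.0%}, satisfactory: {satisfactory}")
```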

9

Discriminant Analysis

Sample data

10_Discriminant Logistic

Y: Cable TV Usage

0: Non-user

1: User

X1: Age

X2: Years of Education

10

Data View

11

Variable View

12

13

Analyze -> Classify -> Discriminant

14

Select Cable TV Usage (Y) from the variable list, then click on the top arrow to enter it into Grouping Variable.

15

With Cable TV Usage (Y) now in Grouping Variable, click on Define Range.

16

On Discriminant An…, enter 0 in Minimum, which refers to non-user in Y, and 1 in Maximum, which refers to User in Y. Next, click on Continue to close this window.

17

Select Age (X1) from the variable list, and click on the second arrow from the top to enter it into Independent(s).

18

Select Years of Education (X2) from the variable list, and click on the second arrow from the top again to enter it into Independent(s).

19

All independent variables are entered into Independent(s). Then click on Statistics.

20

Statistics

Means

Average age of non-users

Average age of users

Average years of education of non-users

Average years of education of users

ANOVA

Do users and non-users differ in age?

Can age be used to separate users from non-users?

Do users and non-users differ in years of education?

Can years of education be used to separate users from non-users?

21

On Discriminant Analysis…, select Means and Univariate ANOVA. Then click on Continue to close the window.

22

Click on Classify.

23

Classify

Summary

How many non-users are correctly identified as non-users?

How many users are correctly identified as users?

24

On Discriminant Analysis: Classific…, select Summary Table. Then click on Continue to close the window.

25

Click on OK to run discriminant analysis.

26

SPSS gives you more output than you need. You only need the following tables to interpret the results. The first table is Group Statistics, which reports the average age and average years of education of users and non-users.

27

The second table is Tests of Equality of Group Means. Users and non-users differ in age (Sig. = .018 < .05). In other words, age can be used to separate users from non-users. Conversely, users and non-users do not differ in years of education (Sig. = .564 > .05). In other words, years of education cannot be used to separate users from non-users.
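For readers who want to see what sits behind a Tests of Equality of Group Means row, the sketch below computes Wilks' Lambda and the one-way ANOVA F statistic for one variable across two groups. The data are hypothetical, not the course file, and the sketch stops short of the Sig. value, which requires the F distribution and is left to SPSS.

```python
def equality_of_means(group0, group1):
    """Return (Wilks' Lambda, F, df1, df2) for one variable and two groups."""
    all_values = group0 + group1
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)

    ss_within = 0.0
    for group in (group0, group1):
        group_mean = sum(group) / len(group)
        ss_within += sum((x - group_mean) ** 2 for x in group)

    wilks = ss_within / ss_total      # close to 1 means the groups barely differ
    df1 = 1                           # number of groups minus 1
    df2 = len(all_values) - 2         # number of cases minus number of groups
    f_value = (1 - wilks) / wilks * df2 / df1
    return wilks, f_value, df1, df2


# Hypothetical values for illustration only.
age_non_users = [22, 24, 25, 26, 28]
age_users = [27, 29, 30, 31, 33]
edu_non_users = [12, 13, 12, 14, 11]
edu_users = [12, 11, 13, 12, 12]

wilks_age, f_age, _, _ = equality_of_means(age_non_users, age_users)
wilks_edu, f_edu, _, _ = equality_of_means(edu_non_users, edu_users)

print("Age:       Lambda =", round(wilks_age, 3), "F =", round(f_age, 3))
print("Education: Lambda =", round(wilks_edu, 3), "F =", round(f_edu, 3))
```

In this made-up sample, age has a small Lambda and a large F, while education has a Lambda near 1 and a small F, mirroring the pattern in the SPSS table described above.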

28

The third table is Standardized Canonical Discriminant Function Coefficients. Since we know from the previous slide that years of education cannot be used to separate users from non-users, the focus is on age. The coefficient is positive, which indicates a positive relationship between age and usage: the older a person is, the more likely that person is to be a user.

29

The last table is Classification Results. 80% of the cases are classified correctly, which meets the 75% benchmark and indicates the discriminant analysis is successful.

Overall interpretation of the four tables: The average age of non-users is 25, and that of users is 28. The average years of education of non-users is 12.50, and that of users is 11.70. Whether an individual is a user or non-user of cable TV is determined by age: the older a person is, the more likely that person is to be a cable TV user. 80% of the cases are classified correctly, which indicates the discriminant analysis is successful.

30

An additional note about Standardized Canonical Discriminant Function Coefficients. Suppose we ran another discriminant analysis, this time separating users and non-users of YouTube. In this hypothetical analysis, years of education again cannot be used to separate users from non-users, so the focus is again on age. But the coefficient is negative, which indicates a negative relationship between age and usage: the younger a person is, the more likely that person is to be a user of YouTube.

31

Regression/Discriminant Analysis/Logistic Regression

Regression

Y = a + b1X1 + b2X2 + …

Discriminant analysis

Y (0/1) = a + b1X1 + b2X2 + …

Logistic regression/logit regression

Y (0/1) = a + b1X1 + b2X2 + …

X: covariates

b: regression coefficients

The impact of the independent variables on separating users from non-users

Null hypothesis for b1

H0: X1 cannot separate users from non-users.

Null hypothesis for b2

H0: X2 cannot separate users from non-users.

Statistical decision

Do not reject H0: Cannot separate

Reject H0: Can separate

Compared to discriminant analysis

Easier to run

But…

32

Logistic Regression

Sample data

10_Discriminant Logistic

Y: Cable TV Usage

0: Non-user

1: User

X1: Age

X2: Years of Education

33

Analyze -> Regression -> Binary Logistic

34

Select Cable TV Usage (Y) from the variable list, then click on the top arrow to enter it into Dependent.

35

Cable TV Usage (Y) now in Dependent. Select Age (X1) from the variable list, and click on the second arrow from the top to enter it into Covariates.

36

Select Years of Education (X2) from the variable list, and click on the second arrow from the top again to enter it into Covariates.

37

All independent variables are entered into Covariates. Then click on OK to run logistic regression.

38

SPSS gives you more output than you need. You only need the following tables to interpret the results. One is the Classification Table, which reports that 70% of the cases are correctly classified.

39

The other is Variables in the Equation. Age can separate users from non-users (Sig. = .030 < .05). The coefficient (B) of Age is positive, which indicates a positive relationship between age and usage of cable TV. Years of education cannot separate users from non-users (Sig. = .313 > .05), and its coefficient is disregarded.

Interpretation: Whether an individual is a user or non-user of cable TV is determined by age: the older a person is, the more likely that person is to be a cable TV user. 70% of the cases are classified correctly, which falls below the 75% benchmark, so the logistic regression analysis is not very satisfactory.
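The slides write logistic regression in the same shorthand as the other two models. In practice the linear part a + b1X1 + b2X2 is passed through the logistic function to give a probability of being a user, and a case is classified by comparing that probability with a cut-off. The sketch below illustrates why a positive B for Age means older people are more likely to be users. The coefficient values are hypothetical (they are not taken from the SPSS output), and the 0.5 cut-off is an assumption.

```python
import math


def probability_of_use(age, education, a, b_age, b_edu):
    """Logistic function applied to the linear predictor a + b1*age + b2*edu."""
    linear = a + b_age * age + b_edu * education
    return 1 / (1 + math.exp(-linear))


# Hypothetical coefficients: positive for age, zero for years of education
# (education is disregarded because it is not significant).
A, B_AGE, B_EDU = -8.0, 0.3, 0.0

p_young = probability_of_use(22, 12, A, B_AGE, B_EDU)
p_old = probability_of_use(32, 12, A, B_AGE, B_EDU)

# Assumed cut-off of 0.5: classify as user (1) at or above it, else non-user (0).
predicted_young = 1 if p_young >= 0.5 else 0
predicted_old = 1 if p_old >= 0.5 else 0

print(round(p_young, 3), predicted_young)
print(round(p_old, 3), predicted_old)
```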

40

Regression/Discriminant Analysis/Logistic Regression

Regression

Y = a + b1X1 + b2X2 + …

Discriminant analysis

Y (0/1) = a + b1X1 + b2X2 + …

Logistic regression/logit regression

Y (0/1) = a + b1X1 + b2X2 + …

X: covariates

b: regression coefficients

The impact of the independent variables on separating users from non-users

Null hypothesis for b1

H0: X1 cannot separate users from non-users.

Null hypothesis for b2

H0: X2 cannot separate users from non-users.

Statistical decision

Do not reject H0: Cannot separate

Reject H0: Can separate

Compared to discriminant analysis

Easier to run

But

Does not report group statistics

Small sample size leads to underperformance

PIC: Pictures Illustrating Concepts

41

Statistical Analyses

Discriminant Analysis

Dependent variable: Nominal

Independent variable: Interval/Ratio

Null hypothesis: significance value > .05, do not reject; significance value ≤ .05, reject

Classification rate: percentage correctly classified ≥ 75%, Satisfactory; percentage correctly classified < 75%, Unsatisfactory

Logistic Regression

Dependent variable: Nominal

Independent variable: Interval/Ratio

Null hypothesis: significance value > .05, do not reject; significance value ≤ .05, reject

Classification rate: percentage correctly classified ≥ 75%, Satisfactory; percentage correctly classified < 75%, Unsatisfactory

Pro: Easier to run; Con: No group statistics, underperformance due to small sample size


Discriminant Analysis: Slides 3-30

Logistic Regression: Slides 31-40

Pro and con of logistic regression: Slide 40
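The two decision rules used throughout the deck (the .05 significance level and the 75% classification benchmark) can be written as two small functions. This is only a sketch of the rules as the slides state them; the function names are made up for illustration, and the values passed in are the ones reported in the SPSS output described earlier.

```python
def variable_decision(sig, alpha=0.05):
    """Reject H0 (variable can separate the groups) when Sig. <= alpha."""
    return "can separate" if sig <= alpha else "cannot separate"


def model_decision(percent_correct, benchmark=75.0):
    """Judge the analysis by its percentage of correctly classified cases."""
    return "satisfactory" if percent_correct >= benchmark else "unsatisfactory"


# Discriminant analysis results reported in the slides.
print("Age:", variable_decision(0.018))                  # Sig. = .018
print("Years of education:", variable_decision(0.564))   # Sig. = .564
print("Discriminant analysis:", model_decision(80))      # 80% correct

# Logistic regression results reported in the slides.
print("Age:", variable_decision(0.030))                  # Sig. = .030
print("Years of education:", variable_decision(0.313))   # Sig. = .313
print("Logistic regression:", model_decision(70))        # 70% correct
```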

44

WRAP-UP

Differences and similarities among Regression/Discriminant Analysis/Logistic Regression

SPSS procedures for Discriminant Analysis/Logistic Regression