Data analytics

INSD5120ExploratoryDataAnalysis-Part2.pdf

• (Very) quick introduction to categorical data analysis
  ◦ Contingency tables
  ◦ Notation
• Descriptive statistics
  ◦ Difference in proportions
  ◦ Relative risk
  ◦ Odds ratio

• Understand the notation used in contingency table analysis
• Be able to calculate and interpret applicable summary statistics for categorical data
• Be able to use multi-layer tables to control for confounding variables

• Categorical data is derived from assigning observations of individuals to nominal categories based on qualitative properties, or from observations of quantitative variables grouped within specified intervals
• Examples include demographic data, medical treatments and outcomes, political party affiliation, and survey responses
• Categorical data is often summarized in contingency tables or cross-tabulations
• A contingency table or cross-tabulation is a tabular representation of the frequency counts for levels of categorical variables

Contingency table notation (cell counts):

             Y Level 1    Y Level 2    Total
X Level 1    n11          n12          n1+
X Level 2    n21          n22          n2+
Total        n+1          n+2          n++


Count: SES by "What Political Party Are You Affiliated?"

SES       Democrat   Republican   Independent   DK/NA   Total
Low             30           12            29      16      87
Middle         208          215           178      46     647
High            89           90            68      15     262
Total          327          317           275      77     996
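As a small illustration of the n_i+, n_+j and n_++ notation, the sketch below recomputes the margins of the SES-by-party count table. The variable names are chosen here for illustration and are not part of the slides.

```python
# Cell counts n_ij from the SES-by-party table (rows: SES, columns: party).
ses_levels = ["Low", "Middle", "High"]
parties = ["Democrat", "Republican", "Independent", "DK/NA"]
counts = [
    [30, 12, 29, 16],
    [208, 215, 178, 46],
    [89, 90, 68, 15],
]

row_totals = [sum(row) for row in counts]        # n_i+
col_totals = [sum(col) for col in zip(*counts)]  # n_+j
grand_total = sum(row_totals)                    # n_++

for level, total in zip(ses_levels, row_totals):
    print(f"{level:<8}{total}")
print(dict(zip(parties, col_totals)))
print("n++ =", grand_total)
```

The totals should agree with the Total row and column of the table.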

• We are often less interested in the absolute counts than in the relationship between the cell proportions
• To analyze cell proportions properly, we need to know the experimental design and the relationship between the variables
  ◦ All variables can be considered response variables
  ◦ One or more response variables with one or more explanatory variables
    ▪ Prospective study
    ▪ Retrospective study

• Proportion notation (two variables)
  ◦ {πij} gives the joint distribution
  ◦ {πi+} and {π+j} represent the marginals (row or column proportions)
  ◦ {πj|i} is the conditional distribution of Y given level i of X

Contingency table notation (cell proportions, conditional proportions in parentheses):

             Y Level 1     Y Level 2     Total
X Level 1    π11 (π1|1)    π12 (π2|1)    π1+
X Level 2    π21 (π1|2)    π22 (π2|2)    π2+
Total        π+1           π+2           1


SES by "What Political Party Are You Affiliated?" (counts and percentages)

SES      Statistic         Democrat   Republican   Independent    DK/NA    Total
Low      Count                   30           12            29       16       87
         % within SES         34.5%        13.8%         33.3%    18.4%   100.0%
         % within party        9.2%         3.8%         10.5%    20.8%     8.7%
         % of total            3.0%         1.2%          2.9%     1.6%     8.7%
Middle   Count                  208          215           178       46      647
         % within SES         32.1%        33.2%         27.5%     7.1%   100.0%
         % within party       63.6%        67.8%         64.7%    59.7%    65.0%
         % of total           20.9%        21.6%         17.9%     4.6%    65.0%
High     Count                   89           90            68       15      262
         % within SES         34.0%        34.4%         26.0%     5.7%   100.0%
         % within party       27.2%        28.4%         24.7%    19.5%    26.3%
         % of total            8.9%         9.0%          6.8%     1.5%    26.3%
Total    Count                  327          317           275       77      996
         % within SES         32.8%        31.8%         27.6%     7.7%   100.0%
         % within party      100.0%       100.0%        100.0%   100.0%   100.0%
         % of total           32.8%        31.8%         27.6%     7.7%   100.0%
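The three kinds of percentages in this table correspond to the joint proportions πij ("% of total"), the conditional proportions given the row ("% within SES") and the conditional proportions given the column ("% within party"). A minimal sketch of how they could be computed from the counts, with illustrative variable names:

```python
# Cell counts for SES (rows: Low, Middle, High) by party
# (columns: Democrat, Republican, Independent, DK/NA).
counts = [
    [30, 12, 29, 16],
    [208, 215, 178, 46],
    [89, 90, 68, 15],
]

n = sum(sum(row) for row in counts)
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

# Joint distribution: pi_ij = n_ij / n_++  ("% of total")
joint = [[c / n for c in row] for row in counts]

# Conditional on the row: pi_j|i = n_ij / n_i+  ("% within SES")
cond_given_row = [[c / rt for c in row] for row, rt in zip(counts, row_totals)]

# Conditional on the column: n_ij / n_+j  ("% within party")
cond_given_col = [[c / ct for c, ct in zip(row, col_totals)] for row in counts]

# Low-SES Democrats, as a percentage under each definition.
print(round(100 * joint[0][0], 1))
print(round(100 * cond_given_row[0][0], 1))
print(round(100 * cond_given_col[0][0], 1))
```

Each row of `cond_given_row` sums to 1, which is why the "% within SES" lines end in 100.0%.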

• Prospective study: subjects either select or are selected for treatment groups, and then the response is studied
  ◦ Experimental
    ▪ Subjects are randomly allocated to treatment groups
  ◦ Observational
    ▪ Subjects self-select their treatment group
• The principal aim is to compare the conditional distribution of the response for different levels of the explanatory variable(s)

[Figure: subjects divided into Treatment and Control groups]

• Findings from the Aspirin Component of the Ongoing Physicians' Health Study

           Heart Attack   No Attack
Placebo             188      10,845
Aspirin             104      10,933

Placebo -> Heart Attack:
\[ \frac{188}{188 + 10845} = \frac{188}{11033} = .017 = 1.7\% \]

Aspirin -> Heart Attack:
\[ \frac{104}{104 + 10933} = \frac{104}{11037} = .0094 = .94\% \]


• Retrospective study: given the response, look back at the levels of possible explanatory variables
  ◦ Observational studies
• Typically "over-sample" from the response level of interest
  ◦ If the overall population proportion in each response level is known, Bayes' theorem can be used to estimate the conditional distribution in the direction of interest

• England-Wales 1968-1972 study on heart attacks and oral contraceptive use

                               Heart Attack
Oral Contraceptive Practice    Yes     No
Used                            23     34
Never used                      35    132
Total                           58    166

\[ \frac{23}{23 + 34} = .404 \qquad \frac{35}{35 + 132} = .21 \]

It is not appropriate to compare the "risks" for this study:
• Column marginals are fixed by design
• Column marginals do not reflect population marginal proportions
• The odds ratio can be used to estimate the desired relative risk value

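The Bayes' theorem step mentioned for retrospective studies can be sketched with the oral contraceptive counts. The study itself only estimates P(exposure | outcome); to turn this around we need the population proportion of heart attacks, which the slides do not give. The value `assumed_prevalence` below is purely hypothetical and is there only to make the arithmetic concrete.

```python
# Retrospective counts: the column totals (heart attack yes/no) are fixed by design.
used = {"yes": 23, "no": 34}
col_total = {"yes": 58, "no": 166}

# What the design estimates directly: P(used | heart attack status).
p_used_given = {k: used[k] / col_total[k] for k in col_total}
p_never_given = {k: 1 - p_used_given[k] for k in col_total}

# HYPOTHETICAL population proportion with a heart attack (not from the study).
assumed_prevalence = 0.001
prior = {"yes": assumed_prevalence, "no": 1 - assumed_prevalence}


def risk_given_exposure(p_exposure_given):
    """Bayes' theorem: P(heart attack = yes | exposure group)."""
    numerator = p_exposure_given["yes"] * prior["yes"]
    denominator = sum(p_exposure_given[k] * prior[k] for k in prior)
    return numerator / denominator


risk_used = risk_given_exposure(p_used_given)
risk_never = risk_given_exposure(p_never_given)
print(risk_used, risk_never, risk_used / risk_never)
```

Under a small assumed prevalence, the resulting ratio of risks comes out close to the cross-product ratio (23 x 132)/(35 x 34) of the table, which is the sense in which the odds ratio can stand in for the relative risk here.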

• Comparing proportions for binary responses
  ◦ Difference of proportions
  ◦ Relative risk
  ◦ Odds ratios
• I x J tables
  ◦ No completely satisfactory way to summarize association
  ◦ Pairs of odds ratios
  ◦ Concentration coefficient
  ◦ Uncertainty coefficient

• Independence
  ◦ X and Y both response variables: πij = πi+π+j, for all i, j
    ▪ That is, πj|i = π+j, for all i, j
  ◦ X explanatory, Y response: πj|i = πj|h, for each j, for all i, h

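The independence condition πij = πi+π+j can be checked descriptively by comparing each observed joint proportion with the product of its margins. The sketch below does this for the SES-by-party counts from the political affiliation example; the helper name `max_independence_gap` is made up for illustration, and this is a descriptive comparison, not a formal test.

```python
def max_independence_gap(table):
    """Largest |pi_ij - pi_i+ * pi_+j| over all cells of a count table."""
    n = sum(sum(row) for row in table)
    row_p = [sum(row) / n for row in table]        # pi_i+
    col_p = [sum(col) / n for col in zip(*table)]  # pi_+j
    return max(
        abs(table[i][j] / n - row_p[i] * col_p[j])
        for i in range(len(table))
        for j in range(len(table[0]))
    )


# SES (Low, Middle, High) by party (Democrat, Republican, Independent, DK/NA).
ses_by_party = [
    [30, 12, 29, 16],
    [208, 215, 178, 46],
    [89, 90, 68, 15],
]

# A constructed table whose rows are exactly proportional (independent).
proportional = [
    [10, 20],
    [30, 60],
]

print(max_independence_gap(ses_by_party))
print(max_independence_gap(proportional))
```

A gap of zero (up to floating-point error) corresponds to exact independence in the sample; how large a gap matters is a question for inference, which these slides do not cover.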

• Difference of proportions for a binary response variable
  ◦ Generally, compare the response for different explanatory levels
    ▪ πj|i - πj|h
• The difference lies between -1 and 1
  ◦ Independence when the difference equals 0 for all i, h and response levels j

• Suppose we view party affiliation as the "explanatory variable"
• Difference in proportions viewing the Iraq War as worth the cost: 91.8% - 44.3% = 47.5 percentage points

• What is a large difference?
  ◦ A small difference may be more important when πj|i and πj|h are both near 0 or 1 than when they are around .5

(.1 - .01) = .09, and .1 is 10 times larger than .01
(.5 - .41) = .09, but .5 is only 1.2 times larger than .41

           Heart Attack   No Attack
Placebo             188      10,845
Aspirin             104      10,933

Difference of proportions (placebo - aspirin): .017 - .0094 = .0076

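The difference of proportions for the aspirin data can be reproduced directly from the counts. This is a minimal sketch; the function name `proportion` is illustrative.

```python
def proportion(events, non_events):
    """Conditional proportion of events within one row of a 2x2 table."""
    return events / (events + non_events)


# Rows of the aspirin table: (heart attack, no attack).
p_placebo = proportion(188, 10845)
p_aspirin = proportion(104, 10933)

difference = p_placebo - p_aspirin
print(round(p_placebo, 4), round(p_aspirin, 4), round(difference, 4))
```

Working from the unrounded proportions gives the same .0076 as the slide's rounded arithmetic.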

• Relative risk (RR) is used when the relative difference between proportions is more relevant than the absolute difference
• RR = πj|i / πj|h
• A relative risk of 1 corresponds to independence
• RR usually cannot be calculated directly from retrospective studies

           Heart Attack   No Attack
Placebo             188      10,845
Aspirin             104      10,933

\[ RR = \frac{.0094}{.017} = .553 \]

If you take aspirin regularly, you have .553 times the risk of having a heart attack compared to those who do not.

(1 - .553) x 100 = 44.7% less risk

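A short sketch of the relative risk calculation for the aspirin data, using the raw counts instead of the rounded proportions. The function names are illustrative.

```python
def risk(events, non_events):
    """Proportion of a row that experienced the event."""
    return events / (events + non_events)


def relative_risk(row_a, row_b):
    """Risk in row_a divided by risk in row_b; each row is (events, non_events)."""
    return risk(*row_a) / risk(*row_b)


aspirin = (104, 10933)
placebo = (188, 10845)

rr = relative_risk(aspirin, placebo)
percent_reduction = (1 - rr) * 100
print(round(rr, 3), round(percent_reduction, 1))
```

Swapping the two arguments gives the reciprocal, so it matters which group is treated as the reference.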

• Does the age at which a woman has her first child affect the risk of developing breast cancer?

First child at age 25 or older   Breast cancer   No breast cancer   Total
Yes                              31 (1.9%)       1597 (98.1%)       1628 (100%)
No                               65 (1.43%)      4475 (98.57%)      4540 (100%)
Total                            96 (1.56%)      6072 (98.44%)      6168 (100%)

Risk for women having first child at 25 or older = .019 or 1.9%
Risk for women having first child before 25 = .0143 or 1.43%
Relative risk = .019/.0143 = 1.33
Increased risk = 33%


• For a 2x2 table:
  ◦ In row 1, odds of being in column 1 instead of column 2: O1 = π1|1 / π2|1
  ◦ In row 2, odds of being in column 1 instead of column 2: O2 = π1|2 / π2|2
  ◦ Odds ratio: O1/O2

Odds for women having first child at 25 or older = 31/1597 = .019/.981 = .0194

Odds for women having first child before 25 = 65/4475 = .0143/.9857 = .0145

Odds ratio = .0194/.0145 = 1.34

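The odds and odds ratio for the breast cancer example can be recomputed from the cell counts, alongside the relative risk for comparison. This is a sketch with illustrative names; working from unrounded counts gives an odds ratio of about 1.336, which rounds to the 1.34 shown.

```python
# Rows: first child at 25 or older (yes / no).
# Each row is (breast cancer, no breast cancer).
older = (31, 1597)
younger = (65, 4475)


def odds(events, non_events):
    """Odds of the event within one row: events per non-event."""
    return events / non_events


def risk(events, non_events):
    """Proportion of the row with the event."""
    return events / (events + non_events)


odds_ratio = odds(*older) / odds(*younger)
relative_risk = risk(*older) / risk(*younger)

print(round(odds(*older), 4), round(odds(*younger), 4))
print(round(odds_ratio, 2), round(relative_risk, 2))
```

Because breast cancer is rare in both rows, the odds ratio and the relative risk are nearly equal.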

• The odds ratio takes values > 0
  ◦ Sometimes look at the log odds ratio
• Invariant to interchanging rows and columns
  ◦ Unnecessary to specify a response variable
    ▪ Unlike for relative risk
• Equally valid for retrospective, prospective and cross-sectional studies
• Odds ratio = RR x (1-π1|2)/(1-π1|1)
  ◦ When the probability of the outcome of interest is small, regardless of row condition, the odds ratio can be used as an estimate of relative risk
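The relation between the odds ratio and the relative risk follows from the definitions of O1 and O2, using the fact that in a 2x2 table π2|i = 1 - π1|i and taking RR = π1|1 / π1|2. A short derivation, writing θ for the odds ratio:

```latex
\begin{align*}
\theta &= \frac{O_1}{O_2}
        = \frac{\pi_{1|1} / \pi_{2|1}}{\pi_{1|2} / \pi_{2|2}} \\
       &= \frac{\pi_{1|1}}{\pi_{1|2}} \cdot \frac{\pi_{2|2}}{\pi_{2|1}}
        = RR \cdot \frac{1 - \pi_{1|2}}{1 - \pi_{1|1}}
\end{align*}
```

When both π1|1 and π1|2 are close to 0, the second factor is close to 1, so θ ≈ RR. In the breast cancer example, 1.33 x (.9857/.981) ≈ 1.34.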

                               Heart Attack
Oral Contraceptive Practice    Yes     No
Used                            23     34
Never used                      35    132
Total                           58    166

\[ \text{Odds ratio } \theta = \frac{23 \times 132}{35 \times 34} = 2.55 \approx RR \]

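The cross-product form of the odds ratio, and its invariance to interchanging rows and columns, can be checked on the oral contraceptive table. A minimal sketch; the function name `odds_ratio` is illustrative.

```python
import math


def odds_ratio(table):
    """Cross-product ratio (n11 * n22) / (n12 * n21) of a 2x2 table."""
    (n11, n12), (n21, n22) = table
    return (n11 * n22) / (n12 * n21)


# Rows: used / never used. Columns: heart attack yes / no.
oc_table = [[23, 34], [35, 132]]

theta = odds_ratio(oc_table)

# Interchange the roles of rows and columns (transpose the table).
transposed = [list(col) for col in zip(*oc_table)]
theta_transposed = odds_ratio(transposed)

print(round(theta, 2), round(theta_transposed, 2), round(math.log(theta), 3))
```

The transposed table gives the same value, which is why the odds ratio does not require choosing a response variable and remains usable when the column totals are fixed by design.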

• Assess baseline risk
  ◦ Example: Men who drink 16 ounces of beer a day are three times more likely to develop rectal cancer
• Know the time period of risk
  ◦ Risks accumulate with time
    ▪ Example: 1 in 9 women will develop breast cancer over their lifetime. But the annual risk for women in their 30s is 1 in 3700, and for women in their 70s it is 1 in 235
• Investigate confounding factors
  ◦ Example: Older cars are almost 6 times as likely to be stolen as newer cars


Simpson's paradox: a phenomenon in which a trend appears in different groups of data but disappears or reverses when these groups are combined

• Survival rates for a standard and a new treatment across two hospitals

All Patients
            Survive     Die    Total
Standard        505     595     1100
New             195     905     1100
Total           700    1500     2200

• Grouped data from both hospitals
  ◦ Risk of dying with standard treatment = 595/1100 = .54
  ◦ Risk of dying with new treatment = 905/1100 = .82
  ◦ Relative risk = .54/.82 = .66

• Hospital A:
  ◦ Risk of dying with standard treatment = 95/100 = .95
  ◦ Risk of dying with new treatment = 900/1000 = .90
  ◦ Relative risk = .95/.90 = 1.06

Hospital A
            Survive     Die    Total
Standard          5      95      100
New             100     900     1000
Total           105     995     1100

• Hospital B:
  ◦ Risk of dying with standard treatment = 500/1000 = .5
  ◦ Risk of dying with new treatment = 5/100 = .05
  ◦ Relative risk = .5/.05 = 10.0

Hospital B
            Survive     Die    Total
Standard        500     500     1000
New              95       5      100
Total           595     505     1100

• When the data are combined, we lose the information that the patients in Hospital A had BOTH a higher overall death rate AND a higher likelihood of receiving the new treatment
• It is often misleading to summarize information over groups, especially if subjects were not randomly assigned to groups
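The reversal in the hospital example can be reproduced by computing the relative risk of dying (standard versus new treatment) within each hospital and again after pooling the counts. A minimal sketch with illustrative names:

```python
# (survive, die) counts for each treatment within each hospital.
hospitals = {
    "A": {"Standard": (5, 95), "New": (100, 900)},
    "B": {"Standard": (500, 500), "New": (95, 5)},
}


def death_risk(survive, die):
    return die / (survive + die)


def relative_risk(table):
    """Risk of dying with the standard treatment relative to the new one."""
    return death_risk(*table["Standard"]) / death_risk(*table["New"])


# Pool the two hospitals by adding counts cell by cell.
pooled = {
    arm: tuple(sum(h[arm][k] for h in hospitals.values()) for k in (0, 1))
    for arm in ("Standard", "New")
}

for name, table in hospitals.items():
    print(name, round(relative_risk(table), 2))
print("pooled", round(relative_risk(pooled), 2))
```

Within each hospital the relative risk is above 1 (the new treatment does better), yet the pooled value falls below 1, because most new-treatment patients were in the hospital with the higher death rate.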

• Over a given number of years, the University of California, Berkeley admitted 44% of all men who applied to any one of six graduate programs and only 30% of women who applied
• Clearly there is gender discrimination in graduate admission at Berkeley, right?
• Admission data for individual majors

         Men                               Women
Major    No. applicants    No. admitted    No. applicants    No. admitted
A                   825             512               108              89
B                   560             353                25              17
C                   325             120               593             202
D                   417             138               375             131
E                   191              53               393              94
F                   373              22               341              24


• Admission rates for individual majors

Major    Men admission rate    Women admission rate
A                      0.62                    0.82
B                      0.63                    0.68
C                      0.37                    0.34
D                      0.33                    0.35
E                      0.28                    0.24
F                      0.06                    0.07

• Women are admitted at higher rates than men for most majors
• Generally, greater numbers of women than men apply to the most selective majors

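The per-major and pooled admission rates can both be derived from the applicant counts, which makes the source of the apparent bias visible. A sketch with illustrative names, using the counts from the admission table:

```python
# (applicants, admitted) by major and gender.
admissions = {
    "A": {"men": (825, 512), "women": (108, 89)},
    "B": {"men": (560, 353), "women": (25, 17)},
    "C": {"men": (325, 120), "women": (593, 202)},
    "D": {"men": (417, 138), "women": (375, 131)},
    "E": {"men": (191, 53), "women": (393, 94)},
    "F": {"men": (373, 22), "women": (341, 24)},
}


def rate(applicants, admitted):
    return admitted / applicants


def pooled_rate(gender):
    """Admission rate after adding applicants and admits over all majors."""
    applicants = sum(m[gender][0] for m in admissions.values())
    admitted = sum(m[gender][1] for m in admissions.values())
    return admitted / applicants


# Majors where the women's admission rate exceeds the men's.
women_higher = [
    major for major, m in admissions.items()
    if rate(*m["women"]) > rate(*m["men"])
]

print(round(pooled_rate("men"), 3), round(pooled_rate("women"), 3))
print(women_higher)
```

The pooled rates favor men even though women have the higher rate in four of the six majors, because the pooled figures weight each major by how many of each gender applied to it.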

• Results of a Florida study of whether the race of a homicide defendant affects the likelihood that the defendant would receive the death penalty

                  Death Penalty
Defendant Race    No         Yes        Grand Total
Black             149        17         166
White             141        19         160
Grand Total       290        36         326

                  Death Penalty
Defendant Race    No         Yes        Grand Total
Black             89.76%     10.24%     100.00%
White             88.13%     11.88%     100.00%
Grand Total       88.96%     11.04%     100.00%
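As a worked instance of the earlier summary statistics, the sketch below computes the proportion sentenced to death by defendant race, together with the relative risk and odds ratio (Black versus White defendants), from the marginal table only. It does not use the victim-race layer, so it describes the unadjusted association and nothing more. Names are illustrative.

```python
# (no death penalty, death penalty) by defendant race.
black = (149, 17)
white = (141, 19)


def proportion_yes(no, yes):
    return yes / (no + yes)


p_black = proportion_yes(*black)
p_white = proportion_yes(*white)

difference = p_black - p_white
relative_risk = p_black / p_white
odds_ratio = (black[1] * white[0]) / (black[0] * white[1])

print(round(p_black, 4), round(p_white, 4))
print(round(difference, 4), round(relative_risk, 3), round(odds_ratio, 3))
```

In this unadjusted table all three summaries point slightly toward White defendants; as the Simpson's paradox examples show, such a marginal comparison can change once a third variable is taken into account.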

• Adding an additional variable layer: Victim Race

[Tables: Death Penalty by defendant race, layered by Victim Race; cell values not recovered]

  • Exploratory Data Analysis: Part 2
  • Overview
  • Learning Objectives
  • Categorical Data Analysis
  • Contingency Tables
  • Example: Political Affiliation versus SES
  • Cell Proportions
  • Example: Political Affiliation versus SES
  • Prospective Study
  • Example: Aspirin a Day versus Heart Attack
  • Retrospective Study
  • Example: Early forms of Oral Contraceptives
  • Descriptive Statistics
  • Difference of Proportions
  • Interpreting Difference of Proportions
  • Relative Risk
  • Relative Risk: Example
  • Odds Ratio
  • Odds Ratio
  • Odds Ratio and Relative Risk
  • Interpreting Risks and Odds
  • Simpson’s Paradox
  • Example
  • Risks at Each Hospital
  • What is Going On?
  • Another Example: College Admission Bias
  • Example: College Admission Bias
  • Example: College Admission Bias
  • Example: Death Penalty Sentences
  • Example: Death penalty sentences