Data analytics
- (Very) quick introduction to categorical data analysis
  - Contingency tables
  - Notation
- Descriptive statistics
  - Difference in proportions
  - Relative risk
  - Odds ratio
- Understand the notation used in contingency table analysis
- Be able to calculate and interpret applicable summary statistics for categorical data
- Be able to use multi-layer tables to control for confounding variables
Categorical data is derived from assigning observations of individuals to nominal categories based on qualitative properties, or from observations of quantitative variables grouped within specified intervals
Examples include demographic data, medical treatments and outcomes, political party affiliation, and survey responses
Categorical data is often summarized in contingency tables or cross tabulations
A contingency table or cross-tabulation is a tabular representation of the frequency counts for levels of categorical variables
                 Y
X            Y Level 1   Y Level 2   Total
X Level 1    n11         n12         n1+
X Level 2    n21         n22         n2+
Total        n+1         n+2         n++
Count: SES by What Political Party Are You Affiliated?

SES      Democrat   Republican   Independent   DK/NA   Total
Low            30           12            29      16      87
Middle        208          215           178      46     647
High           89           90            68      15     262
Total         327          317           275      77     996
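As a quick check on the notation, the marginal totals n_i+, n_+j and n_++ can be recomputed from the cell counts. This is a minimal sketch in plain Python; the function name `margins` is an illustrative choice, not part of the original slides.

```python
# Cell counts from the SES-by-party table above (rows: Low, Middle, High;
# columns: Democrat, Republican, Independent, DK/NA).
counts = [
    [30, 12, 29, 16],
    [208, 215, 178, 46],
    [89, 90, 68, 15],
]


def margins(table):
    """Return (row totals n_i+, column totals n_+j, grand total n_++)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals, sum(row_totals)


row_totals, col_totals, grand_total = margins(counts)
print("n_i+ :", row_totals)
print("n_+j :", col_totals)
print("n_++ :", grand_total)
```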
- We are often less interested in the absolute counts than in the relationships among the cell proportions
- To analyze cell proportions properly, we need to know the experimental design and the relationship between the variables
  - All variables can be considered response variables
  - One or more response variables with one or more explanatory variables
    ▪ Prospective study
    ▪ Retrospective study
Proportion notation (two variables)
- {πij} gives the joint distribution
- {πi+} and {π+j} represent the marginals (row or column proportions)
- {πj|i} is the conditional distribution of Y given level i of X
                 Y
X            Y Level 1     Y Level 2     Total
X Level 1    π11 (π1|1)    π12 (π2|1)    π1+
X Level 2    π21 (π1|2)    π22 (π2|2)    π2+
Total        π+1           π+2           1
SES by What Political Party Are You Affiliated?

SES                       Democrat   Republican   Independent    DK/NA    Total
Low      Count                  30           12            29       16       87
         % within SES        34.5%        13.8%         33.3%    18.4%   100.0%
         % within Q.K         9.2%         3.8%         10.5%    20.8%     8.7%
         % of Total           3.0%         1.2%          2.9%     1.6%     8.7%
Middle   Count                 208          215           178       46      647
         % within SES        32.1%        33.2%         27.5%     7.1%   100.0%
         % within Q.K        63.6%        67.8%         64.7%    59.7%    65.0%
         % of Total          20.9%        21.6%         17.9%     4.6%    65.0%
High     Count                  89           90            68       15      262
         % within SES        34.0%        34.4%         26.0%     5.7%   100.0%
         % within Q.K        27.2%        28.4%         24.7%    19.5%    26.3%
         % of Total           8.9%         9.0%          6.8%     1.5%    26.3%
Total    Count                 327          317           275       77      996
         % within SES        32.8%        31.8%         27.6%     7.7%   100.0%
         % within Q.K       100.0%       100.0%        100.0%   100.0%   100.0%
         % of Total          32.8%        31.8%         27.6%     7.7%   100.0%

Q.K is the party affiliation item ("In politics, would you say you are a").
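The three kinds of percentages in this table correspond to the conditional distribution within rows, the conditional distribution within columns, and the joint distribution. The sketch below recomputes them from the counts in plain Python; the variable names are illustrative only.

```python
# Counts by SES level; columns are Democrat, Republican, Independent, DK/NA.
counts = {
    "Low": [30, 12, 29, 16],
    "Middle": [208, 215, 178, 46],
    "High": [89, 90, 68, 15],
}

n_total = sum(sum(row) for row in counts.values())
col_totals = [sum(col) for col in zip(*counts.values())]

# Conditional distribution of party given SES ("% within SES").
within_row = {ses: [n / sum(row) for n in row] for ses, row in counts.items()}
# Conditional distribution of SES given party ("% within Q.K").
within_col = {
    ses: [n / c for n, c in zip(row, col_totals)] for ses, row in counts.items()
}
# Joint distribution ("% of Total").
joint = {ses: [n / n_total for n in row] for ses, row in counts.items()}

for ses in counts:
    print(ses, [f"{100 * p:.1f}%" for p in within_row[ses]])
```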
- Subjects either select or are selected for treatment groups (for example, treatment versus control), and then the response is studied
  - Experimental
    ▪ Subjects are randomly allocated to treatment groups
  - Observational
    ▪ Subjects self-select their treatment group
- The principal aim is to compare the conditional distribution of the response across levels of the explanatory variable(s)
Findings from the Aspirin Component of the Ongoing Physicians’ Health Study

            Heart Attack   No Attack
Placebo              188      10,845
Aspirin              104      10,933

Placebo -> Heart Attack: 188 / (188 + 10,845) = 188 / 11,033 = .017 = 1.7%
Aspirin -> Heart Attack: 104 / (104 + 10,933) = 104 / 11,037 = .0094 = .94%
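In a prospective study the row totals are what the design fixes, so the quantity of interest is the conditional distribution of the response within each treatment row. A minimal sketch, with `conditional_distribution` as an illustrative helper name:

```python
def conditional_distribution(row_counts):
    """Turn one row of counts into the conditional distribution of the response."""
    total = sum(row_counts)
    return [n / total for n in row_counts]


# Counts are (heart attack, no attack), taken from the table above.
placebo = conditional_distribution([188, 10845])
aspirin = conditional_distribution([104, 10933])

print(f"Placebo -> heart attack: {placebo[0]:.4f}")
print(f"Aspirin -> heart attack: {aspirin[0]:.4f}")
```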
- Given the response, look back at the levels of possible explanatory variables
- Observational studies
- Typically “over-sample” from the response level of interest
- If the overall population proportion in each response level is known, Bayes' theorem can be used to estimate the conditional distribution in the direction of interest
England-Wales 1968-1972 study on heart attacks and oral contraceptive use

Oral Contraceptive      Heart Attack
Practice                Yes     No
Used                     23     34
Never used               35    132
Total                    58    166

Used: 23 / (23 + 34) = .404
Never used: 35 / (35 + 132) = .21

Not appropriate to compare the “risks” for this study
- Column marginals are fixed by design
- Column marginals do not reflect population marginal proportions
- Can use the odds ratio to estimate the desired relative risk value
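To illustrate how Bayes' theorem reverses the conditioning, the sketch below starts from the column-conditional proportions that the retrospective design does support and combines them with a population heart attack proportion. That population proportion (`p_mi = 0.001`) is a purely hypothetical value chosen for illustration; it does not come from the study, and the function name is likewise an assumption.

```python
def reverse_conditional(p_x_given_y, p_x_given_not_y, p_y):
    """Bayes' theorem: P(Y | X) from P(X | Y), P(X | not Y) and P(Y)."""
    numerator = p_x_given_y * p_y
    return numerator / (numerator + p_x_given_not_y * (1 - p_y))


# Proportions of contraceptive users within each heart attack column.
p_used_given_mi = 23 / 58
p_used_given_no_mi = 34 / 166

# HYPOTHETICAL population proportion of heart attacks (not from the study).
p_mi = 0.001

risk_used = reverse_conditional(p_used_given_mi, p_used_given_no_mi, p_mi)
risk_never = reverse_conditional(1 - p_used_given_mi, 1 - p_used_given_no_mi, p_mi)
relative_risk = risk_used / risk_never

print(f"P(heart attack | used)       = {risk_used:.5f}")
print(f"P(heart attack | never used) = {risk_never:.5f}")
print(f"Estimated relative risk      = {relative_risk:.2f}")
```

Under this assumed small population proportion, the estimated relative risk comes out close to the cross-product ratio of the table, in line with the note that the odds ratio can stand in for the relative risk here.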
- Comparing proportions for binary responses
  - Difference of proportions
  - Relative risk
  - Odds ratios
- I x J tables
  - No completely satisfactory way to summarize association
  - Pairs of odds ratios
  - Concentration coefficient
  - Uncertainty coefficient
- Independence
  - X and Y response: πij = πi+π+j, for all i,j
    ▪ That is, πj|i = π+j, for all i,j
  - X explanatory, Y response: πj|i = πj|h, for each j, for all i,h
                 Y
X            Y Level 1     Y Level 2     Total
X Level 1    π11 (π1|1)    π12 (π2|1)    π1+
X Level 2    π21 (π1|2)    π22 (π2|2)    π2+
Total        π+1           π+2           1
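The independence condition πij = πi+π+j can be checked cell by cell. The sketch below measures the largest departure from it; both joint distributions are made-up numbers for illustration, and `max_independence_gap` is a hypothetical helper name.

```python
def max_independence_gap(joint):
    """Largest |pi_ij - pi_i+ * pi_+j| over all cells of a joint distribution."""
    row_margins = [sum(row) for row in joint]
    col_margins = [sum(col) for col in zip(*joint)]
    return max(
        abs(joint[i][j] - row_margins[i] * col_margins[j])
        for i in range(len(joint))
        for j in range(len(joint[0]))
    )


# Made-up joint distributions, for illustration only.
independent = [[0.12, 0.28], [0.18, 0.42]]  # margins (.4, .6) and (.3, .7)
dependent = [[0.30, 0.10], [0.10, 0.50]]    # margins (.4, .6) and (.4, .6)

print(f"{max_independence_gap(independent):.3f}")  # essentially zero
print(f"{max_independence_gap(dependent):.3f}")
```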
- Binary response variable
- Generally, compare the response for different explanatory levels
  ▪ πj|i - πj|h
- The difference lies between -1 and 1
- Independence when the difference equals 0 for all i,h and response levels j

Example: 91.8% - 44.3% = 47.5 percentage points
- Suppose we view party affiliation as the “explanatory variable”
- Difference in the proportions viewing the Iraq War as worth the cost
- What is a large difference?
- A small difference may be more important when πj|i and πj|h are both near 0 or 1 than when they are around .5
  ▪ (.1 - .01) = .09, and .1 is 10 times as large as .01
  ▪ (.5 - .41) = .09, but .5 is only 1.2 times as large as .41
            Heart Attack   No Attack
Placebo              188      10,845
Aspirin              104      10,933

Difference of proportions (placebo - aspirin): .017 - .0094 = .0076
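The point about identical absolute differences carrying different weight can be seen by computing the difference and the ratio side by side. A small sketch, where `compare` is an illustrative helper:

```python
def compare(p1, p2):
    """Return (difference, ratio) of two proportions."""
    return p1 - p2, p1 / p2


# Same absolute difference, very different relative size.
for p1, p2 in [(0.10, 0.01), (0.50, 0.41)]:
    diff, ratio = compare(p1, p2)
    print(f"{p1:.2f} vs {p2:.2f}: difference = {diff:.2f}, ratio = {ratio:.2f}")

# Aspirin study: heart attack proportions for placebo and aspirin.
placebo_risk = 188 / (188 + 10845)
aspirin_risk = 104 / (104 + 10933)
aspirin_diff, _ = compare(placebo_risk, aspirin_risk)
print(f"Aspirin study difference = {aspirin_diff:.4f}")
```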
- Used when the relative difference between proportions is more relevant than the absolute difference
- RR = πj|i / πj|h
- A relative risk of 1 corresponds to independence
- Usually cannot be directly calculated from retrospective studies
            Heart Attack   No Attack
Placebo              188      10,845
Aspirin              104      10,933

RR = .0094 / .017 = .553

If you take aspirin regularly, you have .553 times the risk of having a heart attack compared with those who do not
(1 - .553) x 100 = 44.7% less risk
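The relative risk and the corresponding percentage reduction can be reproduced directly from the counts. This is a minimal sketch; the function name and argument order are illustrative.

```python
def relative_risk(events_a, total_a, events_b, total_b):
    """Risk in group A divided by risk in group B."""
    return (events_a / total_a) / (events_b / total_b)


# Aspirin group relative to placebo group, using the counts in the table above.
rr = relative_risk(104, 104 + 10933, 188, 188 + 10845)
reduction_pct = (1 - rr) * 100

print(f"RR = {rr:.3f}")
print(f"Risk reduction = {reduction_pct:.1f}%")
```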
Does the age at which a woman has her first child affect the risk of developing breast cancer?

First child at age 25 or older   Breast cancer   No breast cancer   Total
yes                              31 (1.9%)       1597 (98.1%)       1628 (100%)
no                               65 (1.43%)      4475 (98.57%)      4540 (100%)
Total                            96 (1.56%)      6072 (98.44%)      6168 (100%)

- Risk for women having first child at 25 or older = .019 or 1.9%
- Risk for women having first child before 25 = .0143 or 1.43%
- Relative risk = .019/.0143 = 1.33
- Increased risk = 33%
For a 2x2 table:
- In row 1, odds of being in column 1 instead of column 2: O1 = π1|1 / π2|1
- In row 2, odds of being in column 1 instead of column 2: O2 = π1|2 / π2|2
- Odds ratio: O1/O2

- Odds for women having first child at 25 or older = 31/1597 = .019/.981 = .0194
- Odds for women having first child before 25 = 65/4475 = .0143/.9857 = .0145
- Odds ratio = .0194/.0145 = 1.34
First child at age 25 or older   Breast cancer   No breast cancer   Total
yes                              (31) 1.9%       (1597) 98.1%       (1628) 100%
no                               (65) 1.43%      (4475) 98.57%      (4540) 100%
Total                            (96) 1.56%      (6072) 98.44%      (6168) 100%
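The odds and the odds ratio for the breast cancer table can be recomputed from the counts, alongside the relative risk for comparison. A short sketch with illustrative names:

```python
def odds(events, non_events):
    """Odds of the event: count with the event over count without it."""
    return events / non_events


odds_25_plus = odds(31, 1597)    # first child at 25 or older
odds_under_25 = odds(65, 4475)   # first child before 25
odds_ratio = odds_25_plus / odds_under_25

# Relative risk from the same table, for comparison.
relative_risk = (31 / 1628) / (65 / 4540)

print(f"Odds (25 or older) = {odds_25_plus:.4f}")
print(f"Odds (before 25)   = {odds_under_25:.4f}")
print(f"Odds ratio         = {odds_ratio:.2f}")
print(f"Relative risk      = {relative_risk:.2f}")
```

Because breast cancer is rare in both rows, the odds ratio and the relative risk come out nearly equal.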
- Takes values > 0
  - Sometimes look at the log odds ratio
- Invariant to interchanging rows and columns
  - Unnecessary to specify the response variable
    ▪ Unlike for relative risk
- Equally valid for retrospective, prospective and cross-sectional studies
- Odds ratio = RR x (1-π1|2)/(1-π1|1)
  - When the probability of the outcome of interest is small, regardless of row condition, the odds ratio can be used as an estimate of relative risk
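The relationship between the odds ratio and the relative risk follows in one step from the definitions of the row odds, using π2|i = 1 - π1|i for a binary response. The symbol θ for the odds ratio is used here as a convenient label.

```latex
\theta
  = \frac{O_1}{O_2}
  = \frac{\pi_{1|1} / (1 - \pi_{1|1})}{\pi_{1|2} / (1 - \pi_{1|2})}
  = \frac{\pi_{1|1}}{\pi_{1|2}} \times \frac{1 - \pi_{1|2}}{1 - \pi_{1|1}}
  = RR \times \frac{1 - \pi_{1|2}}{1 - \pi_{1|1}}
```

When π1|1 and π1|2 are both close to 0, the second factor is close to 1, so θ ≈ RR. With the breast cancer figures, (1 - .0143)/(1 - .019) ≈ 1.005, which is why the odds ratio of 1.34 is so close to the relative risk of 1.33.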
Oral Contraceptive      Heart Attack
Practice                Yes     No
Used                     23     34
Never used               35    132
Total                    58    166

Odds ratio θ = (23 x 132) / (35 x 34) = 2.55 ≈ RR
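For a 2x2 table the odds ratio reduces to the cross-product ratio, and it is unchanged when rows and columns are interchanged. The sketch below checks both on the oral contraceptive data; `cross_product_ratio` is an illustrative name.

```python
def cross_product_ratio(table):
    """Odds ratio of a 2x2 table [[a, b], [c, d]] as (a * d) / (b * c)."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


# Rows: used, never used; columns: heart attack yes, no.
table = [[23, 34], [35, 132]]
# Interchange rows and columns (transpose the table).
transposed = [list(col) for col in zip(*table)]

theta = cross_product_ratio(table)
theta_transposed = cross_product_ratio(transposed)

print(f"Odds ratio            = {theta:.2f}")
print(f"Odds ratio, transposed = {theta_transposed:.2f}")
```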
- Assess baseline risk
  - Example: Men who drink 16 ounces of beer a day are three times more likely to develop rectal cancer
- Know the time period of risk
  - Risks accumulate with time
    ▪ Example: 1 in 9 women will develop breast cancer over their lifetime, but the annual risk for women in their 30s is 1 in 3700 and for women in their 70s is 1 in 235
- Investigate confounding factors
  - Example: Older cars are almost 6 times as likely to be stolen as newer cars
Phenomenon in which a trend appears in different groups of data but disappears or reverses when these groups are combined
Survival rates for a standard and a new treatment across two hospitals

All Patients
            Survive    Die   Total
Standard        505    595    1100
New             195    905    1100
Total           700   1500    2200

Grouped data from both hospitals:
- Risk of dying with standard treatment = 595/1100 = .54
- Risk of dying with new treatment = 905/1100 = .82
- Relative risk = .54/.82 = .66

Hospital A
            Survive    Die   Total
Standard          5     95     100
New             100    900    1000
Total           105    995    1100

Hospital A:
- Risk of dying with standard treatment = 95/100 = .95
- Risk of dying with new treatment = 900/1000 = .90
- Relative risk = .95/.90 = 1.06

Hospital B
            Survive    Die   Total
Standard        500    500    1000
New              95      5     100
Total           595    505    1100

Hospital B:
- Risk of dying with standard treatment = 500/1000 = .5
- Risk of dying with new treatment = 5/100 = .05
- Relative risk = .5/.05 = 10.0
- When the data are combined, we lose the information that patients in Hospital A had BOTH a higher overall death rate AND a higher likelihood of receiving the new treatment
- It is often misleading to summarize information over groups, especially if subjects were not randomly assigned to groups
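The reversal can be reproduced by computing the relative risk of dying (standard versus new treatment) within each hospital and again after pooling the counts. This is a minimal sketch using the counts from the hospital tables; the data layout and function names are illustrative.

```python
# (survive, die) counts from the hospital tables above.
data = {
    "Hospital A": {"Standard": (5, 95), "New": (100, 900)},
    "Hospital B": {"Standard": (500, 500), "New": (95, 5)},
}


def death_risk(survive, die):
    """Proportion of patients who died."""
    return die / (survive + die)


def relative_risk(group):
    """Risk of dying with the standard treatment relative to the new one."""
    return death_risk(*group["Standard"]) / death_risk(*group["New"])


# Pool the two hospitals by adding counts within each treatment arm.
pooled = {
    arm: tuple(sum(h[arm][k] for h in data.values()) for k in range(2))
    for arm in ("Standard", "New")
}

rr_by_hospital = {name: relative_risk(group) for name, group in data.items()}
rr_pooled = relative_risk(pooled)

for name, rr in rr_by_hospital.items():
    print(f"{name}: RR = {rr:.2f}")
print(f"Pooled: RR = {rr_pooled:.2f}")
```

Within each hospital the relative risk is above 1 (the standard treatment looks worse), yet the pooled value falls below 1.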
Over a given number of years the University of California, Berkeley admitted 44% of all men who applied to any one of six graduate programs and only 30% of women who applied
Clearly there is gender discrimination in graduate admission at Berkeley – right?
Admission data for individual majors

                     Men                             Women
Major   No. applicants   No. admitted   No. applicants   No. admitted
A                  825            512              108             89
B                  560            353               25             17
C                  325            120              593            202
D                  417            138              375            131
E                  191             53              393             94
F                  373             22              341             24
Admission rates for individual majors

          Men              Women
Major     Admission rate   Admission rate
A         0.62             0.82
B         0.63             0.68
C         0.37             0.34
D         0.33             0.35
E         0.28             0.24
F         0.059            0.07

- Women are admitted at higher rates than men for most majors
- Generally, greater numbers of women than men apply to the most selective majors
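The same comparison of per-group and aggregated rates can be run on the admission counts. The sketch below recomputes the per-major rates and the overall rates from the admission data for individual majors; names are illustrative.

```python
# (applicants, admitted) per major, from the admission data above.
men = {"A": (825, 512), "B": (560, 353), "C": (325, 120),
       "D": (417, 138), "E": (191, 53), "F": (373, 22)}
women = {"A": (108, 89), "B": (25, 17), "C": (593, 202),
         "D": (375, 131), "E": (393, 94), "F": (341, 24)}


def rate(applicants, admitted):
    """Admission rate for one major."""
    return admitted / applicants


def overall_rate(by_major):
    """Admission rate after pooling all majors."""
    applicants = sum(a for a, _ in by_major.values())
    admitted = sum(x for _, x in by_major.values())
    return admitted / applicants


majors_favoring_women = [m for m in men if rate(*women[m]) > rate(*men[m])]

print(f"Overall rate, men:   {overall_rate(men):.3f}")
print(f"Overall rate, women: {overall_rate(women):.3f}")
print("Majors where women have the higher rate:", majors_favoring_women)
```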
Results of a Florida study of whether the race of a homicide defendant affects the likelihood that the defendant would receive the death penalty

                  Death Penalty
Defendant Race    No     Yes    Grand Total
Black             149    17     166
White             141    19     160
Grand Total       290    36     326

                  Death Penalty
Defendant Race    No        Yes       Grand Total
Black             89.76%    10.24%    100.00%
White             88.13%    11.88%    100.00%
Grand Total       88.96%    11.04%    100.00%

Adding an additional variable layer: Victim Race
[Figure: Death Penalty tables layered by Victim Race]
- Exploratory Data Analysis, Part 2
- Overview
- Learning Objectives
- Categorical Data Analysis
- Contingency Tables
- Example: Political Affiliation versus SES
- Cell Proportions
- Example: Political Affiliation versus SES
- Prospective Study
- Example: Aspirin a Day versus Heart Attack
- Retrospective Study
- Example: Early forms of Oral Contraceptives
- Descriptive Statistics
- Difference of Proportions
- Interpreting Difference of Proportions
- Relative Risk
- Relative Risk: Example
- Odds Ratio
- Odds Ratio
- Odds Ratio and Relative Risk
- Interpreting Risks and Odds
- Simpson’s Paradox
- Example
- Risks at Each Hospital
- What is Going On?
- Another Example: College Admission Bias
- Example: College Admission Bias
- Example: College Admission Bias
- Example: Death Penalty Sentences
- Example: Death penalty sentences