Data analytics
Why EDA?
Data types
Summary statistics
▪ Central tendency
▪ Dispersion
▪ Distribution shape
Relative position
Outliers
Data visualizations
A crucial first step in any analysis project is to gain an intuitive understanding of the data
EDA is an approach and philosophy to data analysis that postpones model building/machine learning by first allowing the data to reveal its underlying structure through descriptive statistics and visualizations
EDA is the connection point between data collection and formal analysis
Appreciate the importance and techniques for EDA
Understand definitions and uses of summary statistics
Appreciate the differences, advantages and disadvantages between measures
Be able to characterize data sets based on summary statistics and visualizations
Different types of concepts are represented with different types of data
▪ Level of measurement determines the kinds of analyses that can be performed with the data
Discrete & Continuous
▪ Discrete data can only assume values within some finite or countable set (e.g., counts)
▪ Continuous data can take any value within some interval
Scale | Definition | Examples | Descriptive Statistics
Nominal | Non-ordered categories | Race, gender, marital status | Percentages, mode
Ordinal | Ordered relation between categories | Attitudes, social class | Percentiles, median
Interval | Ordered relation, equality of differences | Temperature | Range, mean, standard deviation
Ratio | Ordered relation, equality of differences, absolute zero | Elapsed time, costs, number of customers | All of above, coefficient of variation
Summarizing a set of interval data typically involves
describing three main attributes
▪ Central tendency
▪ Dispersion
▪ Shape
[Figure: frequency histogram of a sample data set; vertical axis: Frequency]
Measures of central tendency provide a focal point for
making decisions based on the data
▪ Mean (average)
▪ Median
▪ Mode
▪ Trimmed means
Mean is the arithmetic average of a data set
▪ Data: $x_1, x_2, \ldots, x_n$
▪ Average: $(x_1 + x_2 + \cdots + x_n)/n$
▪ Can be applied to ratio and interval data
A dataset may represent a sample (subset) of elements from a population or the entire population of interest
▪ Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
▪ Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Mean serves as a measure of central tendency since it is the value that balances positive and negative deviations
▪ Data: 5, 9, 14, 16; mean = 11
▪ Deviations: 5 - 11 = -6, 9 - 11 = -2, 14 - 11 = 3, 16 - 11 = 5
▪ The deviations sum to zero: -6 - 2 + 3 + 5 = 0
The mean is sensitive to outlier values in the data set
▪ Mean can change substantially because of a few very large or small data points
▪ Mean is sensitive to data collection/recording errors
▪ Mean is not a robust estimator of central tendency
Always check integrity of data before continuing with analysis
▪ Check reasonableness of max and min
▪ Construct histogram and/or boxplots
When possible, plot data in the order that it was collected to help spot outliers and identify possible data collection errors
[Figure: data values plotted in collection order; horizontal axis: Data, vertical axis: Value]
Data: 162, 166, 158, 154, 147, 150, 141, 233, 278, 288, 148, 152, 149, 265, 212, 154, 148, 158, 150, 137, 142, 149, 148, 145, 143, 152
mean = 170.35
mean without outliers = 150.14
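The two means quoted above can be reproduced with a short Python sketch. The cutoff of 200 used to flag outliers is an assumption chosen by eye for this data set; the slides do not state which rule was used.

```python
from statistics import mean

# Values transcribed from the example data above, in collection order.
values = [162, 166, 158, 154, 147, 150, 141, 233, 278, 288, 148, 152, 149,
          265, 212, 154, 148, 158, 150, 137, 142, 149, 148, 145, 143, 152]

# Assumption: any value above 200 is treated as an outlier (illustrative cutoff).
OUTLIER_CUTOFF = 200
without_outliers = [v for v in values if v <= OUTLIER_CUTOFF]

mean_all = mean(values)
mean_without_outliers = mean(without_outliers)
print(round(mean_all, 2), round(mean_without_outliers, 2))
```

Five of the 26 points exceed the cutoff, and removing them moves the mean by about 20 units.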
Median is that value such that half the data is less than the
median and half is greater
▪ Can be applied to ratio, interval and ordinal data
▪ Data: 101.7, 100.6, 92.3, 91.6, 94.3, 93.7, 108.8, 92.3, 110.6, 100.2, 104.3
▪ Sorted: 91.6, 92.3, 92.3, 93.7, 94.3, 100.2, 100.6, 101.7, 104.3, 108.8, 110.6
▪ Median = 100.2 (the 6th of the 11 sorted values)
Median is a more robust measure of central tendency than mean
[Figure: data values plotted in collection order; horizontal axis: Data, vertical axis: Value]
median with outliers = 151
median without outliers = 149
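As a rough check on the robustness claim, the sketch below computes the median of the same 26 values with and without the large points. The cutoff of 200 is again an assumption made for illustration, not a rule stated in the slides.

```python
from statistics import median

# Same example values as in the mean-with-outliers illustration.
values = [162, 166, 158, 154, 147, 150, 141, 233, 278, 288, 148, 152, 149,
          265, 212, 154, 148, 158, 150, 137, 142, 149, 148, 145, 143, 152]

# Assumption: values above 200 are treated as outliers (illustrative cutoff).
without_outliers = [v for v in values if v <= 200]

median_all = median(values)                   # average of the 13th and 14th sorted values
median_without_outliers = median(without_outliers)
print(median_all, median_without_outliers)
```

The median shifts by only 2 units when the outliers are dropped, compared with roughly 20 units for the mean.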
Trimmed mean is the arithmetic mean after excluding the smallest and greatest x% of the data
▪ More robust to outliers than standard mean
▪ Typically eliminate smallest/greatest 5% or 10%
Example data (sorted): 48, 65, 68, 70, 70, 71, 73, 74, 78, 83, 85, 87, 87, 90, 92, 93, 94, 98, 99, 101, 109, 111, 112, 114, 115, 115, 117, 118, 129, 133, 138, 144
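A minimal sketch of a trimmed mean applied to the 32 values above. The function name `trimmed_mean` and the choice to round the number of trimmed points down are assumptions; statistical packages may handle fractional counts differently.

```python
from statistics import mean

def trimmed_mean(data, proportion):
    """Mean after dropping the smallest and largest `proportion` of the values."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(data)
    k = int(len(ordered) * proportion)   # points cut from each end, rounded down
    return mean(ordered[k:len(ordered) - k])

scores = [48, 65, 68, 70, 70, 71, 73, 74, 78, 83, 85, 87, 87, 90, 92, 93,
          94, 98, 99, 101, 109, 111, 112, 114, 115, 115, 117, 118, 129, 133,
          138, 144]

plain_mean = mean(scores)
trimmed_10 = trimmed_mean(scores, 0.10)   # drops 3 values from each end
print(plain_mean, trimmed_10)
```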
Use median if ordinal data
If ratio or interval data, can calculate mean and median
Check data integrity
▪ Examine max and min
▪ Plot data
▪ If analyzing data without outliers, report and explain the outliers
▪ Use median or trimmed means if robust measure needed
Many studies involve studying the difference between population means
▪ So using the mean may be dictated by objective of study
If data is unimodal and fairly symmetric
▪ Mean is approximately equal to median
▪ Mean is a reasonable measure of central tendency
[Figure: unimodal, symmetric frequency histogram; vertical axis: Frequency]
If data is unimodal and asymmetric
▪ Median is better measure of central tendency
▪ Often calculate and report both median and mean
▪ Difference between mean and median indicative of asymmetry
[Figure: unimodal, asymmetric frequency histogram; vertical axis: Frequency]
Household income*
▪ Mean = $87,200, median = $46,700
Net worth*
▪ Mean = $534,500, median = $81,200
If data is not unimodal
▪ Then there is not a central tendency to the data
▪ Neither mean nor median provides a good summary of the data set
▪ Analyze data for distinct groups
▪ Identify groups and consider determining characteristics for each group separately
[Figure: frequency histogram with more than one mode; vertical axis: Frequency]
Time series data is collected periodically over some time interval
Types of time series
▪ Stationary processes
  ▪ Data varies around some central value with approximately same variation over time
  ▪ Standard mean or median can be used as central tendency for stationary time series
▪ Nonstationary processes
  ▪ Data has trend and/or changes in variation over time
  ▪ Moving averages can be used to provide a (moving) central tendency value for nonstationary time series
[Figure: two time series plots of Revenue by Quarter (quarters 1 to 20)]
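The idea of a moving central tendency can be sketched with a simple moving average. The revenue figures below are hypothetical values invented for illustration (they are not read from the plots), and the window of 4 quarters is an arbitrary choice.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive values in the series."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Hypothetical quarterly revenue with an upward trend (illustrative only).
revenue = [100, 120, 110, 140, 150, 145, 170, 190]
smoothed = moving_average(revenue, window=4)
print(smoothed)
```

Each smoothed value summarizes the level of the series over its own window, so the summary follows the trend instead of collapsing it into one overall mean.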
Central tendency measures can be misleading or non-informative if there is not a “central tendency” in the data
▪ Multi-modal
▪ U-shaped distributions
▪ Uniform distributions
▪ Highly skewed
[Figure: four histograms, each with N = 1000]
▪ BIMODAL: Mean = 50.0, Std. Dev = 11.51
▪ USHAPE: Mean = 3.1, Std. Dev = 1.85
▪ UNIFORM: Mean = 10.6, Std. Dev = 5.65
▪ SKEWED: Mean = .19, Std. Dev = .19
Any single number summary may not adequately represent data and may hide differences between data sets
▪ Data set A: 2, 50, 100, 150, 198
▪ Data set B: 98, 99, 100, 101, 102
Measures of dispersion provide ways to quantify the amount of variation within a data set
Dispersion measures also provide context to evaluate significance of departures from central tendency
Types of measures
▪ Range
▪ Standard deviation
▪ IQR
The more spread out or dispersed the data, the larger the range, SD and IQR
The more concentrated or homogeneous the data, the smaller the range, SD, and IQR
Range: max - min
▪ Data set A: 2, 50, 100, 150, 198; range = 198 - 2 = 196
▪ Data set B: 98, 99, 100, 101, 102; range = 102 - 98 = 4
Standard deviation: root mean square difference from the mean
▪ Data: $x_1, x_2, \ldots, x_n$
▪ Mean: $m = (x_1 + x_2 + \cdots + x_n)/n$
▪ $SD = \sqrt{\frac{(x_1 - m)^2 + (x_2 - m)^2 + \cdots + (x_n - m)^2}{n}}$
If the dataset represents a sample from a population, the dispersion (sample standard deviation) of the sample can be used to estimate the dispersion (population standard deviation) of the population
▪ The sample SD, based on the unbiased estimator of the population variance, is $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
▪ If the dataset is the entire population, the population SD is $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
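The difference between the two divisors can be seen in a small sketch using Python's `statistics` module, where `stdev` divides by n - 1 and `pstdev` divides by N. The four data values are reused from the earlier mean example purely for illustration.

```python
from statistics import pstdev, stdev

data = [5, 9, 14, 16]   # mean 11, squared deviations sum to 74

sample_sd = stdev(data)        # sqrt(74 / 3), treats data as a sample
population_sd = pstdev(data)   # sqrt(74 / 4), treats data as the whole population
print(round(sample_sd, 3), round(population_sd, 3))
```

The sample SD is always the larger of the two, and the gap shrinks as n grows.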
Simple example
▪ Data set A: 2, 50, 100, 150, 198; m = 100
  $SD = \sqrt{\frac{(2-100)^2 + (50-100)^2 + (100-100)^2 + (150-100)^2 + (198-100)^2}{5}} \approx 69.6$
▪ Data set B: 98, 99, 100, 101, 102; m = 100
  $SD \approx 1.4$
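The figures in the simple example can be checked with a few lines of Python. The helper name `describe` is an assumption introduced for this sketch; `pstdev` is used because the example divides by n.

```python
from statistics import mean, pstdev

def describe(data):
    """Return (mean, range, SD) with the SD computed using divisor n."""
    return mean(data), max(data) - min(data), pstdev(data)

set_a = [2, 50, 100, 150, 198]
set_b = [98, 99, 100, 101, 102]

mean_a, range_a, sd_a = describe(set_a)
mean_b, range_b, sd_b = describe(set_b)
print(mean_a, range_a, round(sd_a, 1))
print(mean_b, range_b, round(sd_b, 1))
```

Both sets share the same mean, so only the range and SD reveal how differently they are spread.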
While the form of standard deviation is not particularly intuitive, many data sets can be characterized using just the mean and SD
▪ If the values of the data set are approximately normally distributed, then
  ▪ ~68% of the data will be within 1 SD unit of mean, ~95% will be within 2 SD units and nearly all will be within 3 SD units
Important to note that both range and standard deviation are sensitive to outliers
[Figure: histogram of approximately normal data; horizontal axis in SD units from -3.00 to 3.00, vertical axis: Count]
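The 68/95/nearly-all percentages come from the normal distribution itself and can be recovered from its cumulative distribution function. The sketch below uses `statistics.NormalDist` from the standard library; the helper name `share_within` is an assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Fraction of a normal distribution lying within k SD units of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

Real data will only approximate these shares, and only to the extent that the data is close to normal.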
A robust measure of dispersion is the interquartile range (IQR)
The IQR specifies the range over which the middle 50% of the data is spread
▪ Q1 or 25th percentile: value such that 25% of data is less than it, and 75% greater
▪ Q3 or 75th percentile: value such that 75% of data is less than it, and 25% greater
▪ IQR = Q3 - Q1
▪ Like the median, the IQR is less sensitive to outliers since it is based on relative ranking of data points as opposed to their actual values
Example
▪ Data: 1, 98, 99, 100, 100, 100, 102, 102, 104
▪ Data: 95, 98, 99, 100, 100, 100, 102, 102, 104
▪ In both cases Q1 = 98.5, Q3 = 102, and IQR = 102 - 98.5 = 3.5
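The IQR example can be reproduced with `statistics.quantiles`. Quartile conventions differ between tools; the "exclusive" method is used here because it matches the Q1 and Q3 shown in the example, which is an observation about this data rather than a statement of the method the slides used.

```python
from statistics import quantiles

def iqr(data):
    """Interquartile range using the 'exclusive' quartile convention."""
    q1, _median, q3 = quantiles(data, n=4, method="exclusive")
    return q3 - q1

with_low_outlier = [1, 98, 99, 100, 100, 100, 102, 102, 104]
without_outlier = [95, 98, 99, 100, 100, 100, 102, 102, 104]

print(iqr(with_low_outlier), iqr(without_outlier))
```

Replacing 95 with 1 changes the range from 9 to 103 but leaves the IQR untouched.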
Suppose a constant, c, is added to each data value
▪ What is the new mean and standard deviation?
Suppose each data value is multiplied by a constant c
▪ What is the new mean and standard deviation?
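One way to explore these two questions is numerically. Adding c shifts the mean by c and leaves the SD unchanged, while multiplying by c scales the mean by c and the SD by |c|. The sketch below checks this on a small illustrative data set with an arbitrarily chosen constant.

```python
from math import isclose
from statistics import mean, pstdev

data = [5, 9, 14, 16]
c = -2   # illustrative constant; negative to show the role of |c|

shifted = [x + c for x in data]
scaled = [x * c for x in data]

# Adding c: mean moves by c, SD is unchanged.
shift_ok = (isclose(mean(shifted), mean(data) + c)
            and isclose(pstdev(shifted), pstdev(data)))
# Multiplying by c: mean is multiplied by c, SD by |c|.
scale_ok = (isclose(mean(scaled), c * mean(data))
            and isclose(pstdev(scaled), abs(c) * pstdev(data)))
print(shift_ok, scale_ok)
```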
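Often summary measures are given for groups of data
▪ Means: $m_1, m_2, \ldots, m_n$
▪ SD’s: $SD_1, SD_2, \ldots, SD_n$
▪ Frequencies: $f_1, f_2, \ldots, f_n$
Then statistics are needed for the data aggregated together
▪ Total frequency: $F = f_1 + f_2 + \cdots + f_n$
▪ Aggregate mean: $M = \frac{f_1}{F} m_1 + \frac{f_2}{F} m_2 + \cdots + \frac{f_n}{F} m_n$
▪ Aggregate SD: $SD = \sqrt{\frac{f_1}{F}\left(SD_1^2 + m_1^2\right) + \frac{f_2}{F}\left(SD_2^2 + m_2^2\right) + \cdots + \frac{f_n}{F}\left(SD_n^2 + m_n^2\right) - M^2}$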
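A sketch of the aggregation formulas, checked against the statistics of the pooled raw data. It assumes each group SD was computed with divisor n (population form), which is what makes the aggregate SD formula exact; the function name `aggregate` and the two groups are illustrative.

```python
from math import isclose, sqrt
from statistics import mean, pstdev

def aggregate(means, sds, freqs):
    """Combine group means, SDs (divisor n) and frequencies into overall mean and SD."""
    total = sum(freqs)
    agg_mean = sum(f * m for f, m in zip(freqs, means)) / total
    second_moment = sum(f * (sd ** 2 + m ** 2)
                        for f, m, sd in zip(freqs, means, sds)) / total
    return agg_mean, sqrt(second_moment - agg_mean ** 2)

group_1 = [5, 9, 14, 16]
group_2 = [98, 99, 100, 101, 102]
groups = [group_1, group_2]

agg_mean, agg_sd = aggregate([mean(g) for g in groups],
                             [pstdev(g) for g in groups],
                             [len(g) for g in groups])
pooled = group_1 + group_2
print(isclose(agg_mean, mean(pooled)), isclose(agg_sd, pstdev(pooled)))
```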
The Wisconsin Breast Cancer data consists of attribute information on fine-needle aspirated breast tissue from 569 women. Information on each sample includes
▪ Diagnosis (M = malignant, B = benign)
▪ Summary statistics (columns 3-32) on ten features computed from each cell nucleus in the sample
  ▪ Radius, texture, perimeter, area, smoothness, compactness (perimeter^2 / area - 1.0), concavity, concave points, symmetry, and fractal dimension
  ▪ For each feature, the mean, standard deviation, and "worst" or largest (mean of the three largest values) were computed for each sample, resulting in 30 measurements.
  ▪ For example, in the dataset, column 3 is Mean Radius, column 13 is Radius SD, column 23 is Worst Radius
Find summary statistics (mean, median, SD, IQR) for each attribute feature – overall and with respect to diagnosis
Construct parallel boxplots for each attribute feature, comparing differences between diagnoses
From the summary statistics and boxplots, suggest (and justify) guidelines for automating the diagnosis using the attribute features
The National Science Foundation’s Higher Education Research and Development Survey is the primary source of information on R&D expenditures at U.S. colleges and universities
▪ The data in HERD FY16 gives research expenditures (dollars in thousands) for FY’s 2007 – 2016
Give summary statistics (mean, median, SD, IQR) and boxplots on R&D expenditures across higher education institutions for each fiscal year
Analyze and describe the distribution of FY16 R&D expenditures among institutions. For example, are R&D expenditures uniformly distributed across universities, or do a few universities account for the majority of expenditures?
Identify FY2015 – 2016 overall rankings and percentiles of Texas universities. Which Texas universities moved up in rankings and which moved down from FY2015 to FY2016?
Based on the last five fiscal years, project R&D expenditures and overall rankings for Texas universities in FY17.
An important aspect of EDA is examining the relative position of select data points within the entire data set
▪ Standard units
▪ Percentile
Translating a data point into standard units indicates the position of the data relative to the mean with respect to standard deviation units
▪ The z-score of a data point $x$ is given by $z = \frac{x - m}{SD}$
A z-score greater than 0 indicates the data point is greater than the mean
A z-score less than 0 indicates the data point is less than the mean
A z-score equal to 0 indicates the data point is equal to the mean
A z-score between –1 and 1 indicates that the data point is a fairly typical value
A z-score greater than ~2 or less than ~–2 indicates an atypical value
[Figure: histogram with horizontal axis in standard units from -3.00 to 3.00 and vertical axis Count; the central region is labeled "Typical"]
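A minimal sketch of the z-score calculation, using the mean and the SD with divisor n to match the definition of SD given earlier. The function name `z_score` and the data set are illustrative.

```python
from statistics import mean, pstdev

def z_score(x, data):
    """Position of x relative to the mean of `data`, in SD units."""
    return (x - mean(data)) / pstdev(data)

data = [2, 50, 100, 150, 198]   # mean 100, SD about 69.6

z_at_mean = z_score(100, data)
z_high = z_score(198, data)
z_low = z_score(2, data)
print(z_at_mean, round(z_high, 2), round(z_low, 2))
```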
Percentiles give a way to compare the relative position of multiple data points
▪ Removes potential confusion introduced by scale of data
Percentile
▪ The pth percentile is that value such that p% of data are less than and (100-p)% are greater than
▪ $\text{percentile of } x = \frac{\#\text{ data points less than } x}{\#\text{ data points}} \times 100$
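The percentile of a data point can be sketched as a count of smaller values. The strict "less than" comparison follows the definition above; some tools count ties differently, so treat this as one convention among several. The function name `percentile_of` is an assumption.

```python
def percentile_of(x, data):
    """Percentage of data points strictly less than x."""
    below = sum(1 for value in data if value < x)
    return 100 * below / len(data)

data = [2, 50, 100, 150, 198]
print(percentile_of(150, data), percentile_of(2, data))
```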
Intuitively, outlier values are atypical data points very different from the central tendency of the data set
No one absolute definition for data points to be classified as outliers
▪ Z-score greater than ~2.5 or 3; less than ~-2.5 or –3
▪ More than 1.5 times the IQR above the 75th percentile, or 1.5 times the IQR below the 25th percentile
▪ Extreme points are more than 3 times the IQR above the 75th percentile, or 3 times the IQR below the 25th percentile
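The IQR-based rules can be sketched as a small classifier. The function name `classify_outliers` and the "exclusive" quartile convention are assumptions; other quartile conventions can move the fences slightly.

```python
from statistics import quantiles

def classify_outliers(data):
    """Split data into (mild, extreme) outliers using 1.5*IQR and 3*IQR fences."""
    q1, _median, q3 = quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for x in data:
        distance = max(q1 - x, x - q3)   # distance outside the Q1..Q3 box
        if distance > 3 * iqr:
            extreme.append(x)
        elif distance > 1.5 * iqr:
            mild.append(x)
    return mild, extreme

print(classify_outliers([1, 98, 99, 100, 100, 100, 102, 102, 104]))
print(classify_outliers([95, 98, 99, 100, 100, 100, 102, 102, 104]))
```

With Q1 = 98.5 and IQR = 3.5, the value 1 lies far beyond the 3 × IQR fence, while 95 stays inside the 1.5 × IQR fence.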
Symmetry
▪ Symmetrical distributions have means and medians that are approximately equal
▪ Skewed distributions have means and medians that are substantially different
[Figure: symmetric frequency histogram] Mean = Median
Positive or right-skewed
[Figure: right-skewed frequency histogram] Median < Mean
Negative or left-skewed
[Figure: left-skewed frequency histogram] Mean < Median
Kurtosis concerns how heavy the tails of the distribution are, typically using the normal distribution as the reference distribution (excess kurtosis)
▪ Leptokurtic (K > 0)
▪ Mesokurtic (K = 0)
▪ Platykurtic (K < 0)
[Figure: three frequency histograms illustrating the kurtosis cases; vertical axis: Frequency]
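Skewness and excess kurtosis can be computed from central moments. The sketch below uses the simple moment-based (population) definitions; statistical packages often apply small-sample corrections, so their values may differ slightly. All function names here are assumptions.

```python
from statistics import mean

def central_moment(data, order):
    """Average of (x - mean)**order over the data."""
    m = mean(data)
    return sum((x - m) ** order for x in data) / len(data)

def skewness(data):
    """Positive for right-skewed data, negative for left-skewed, 0 if symmetric."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Kurtosis relative to the normal distribution (K = 0 for normal)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 10]
print(skewness(symmetric), skewness(right_skewed), excess_kurtosis(symmetric))
```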
Visualization technique to represent relative proportion of data within distinct categories or intervals
▪ Frequency histogram (bar charts)
▪ Relative frequency histograms
▪ Density histograms
Considerations
▪ Continuous or discrete data
▪ Endpoint convention
▪ Horizontal scale
▪ Number of intervals
Example: Number of lawyers in Frisco law firms
Number of Lawyers | Frequency | Relative Frequency
1 | 11 | 0.44
2 | 7 | 0.28
3 | 4 | 0.16
4 | 2 | 0.08
5 | 1 | 0.04
[Figure: frequency histogram; horizontal axis: # of Lawyers, vertical axis: Frequency]
relative frequency = frequency / number of observations in the data set
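The relative frequency column follows directly from the frequency column, as this short sketch shows using the counts from the table.

```python
# Frequencies from the law firm example: number of lawyers -> number of firms.
frequency = {1: 11, 2: 7, 3: 4, 4: 2, 5: 1}

total = sum(frequency.values())   # number of observations in the data set
relative_frequency = {k: count / total for k, count in frequency.items()}

for lawyers, share in relative_frequency.items():
    print(lawyers, frequency[lawyers], round(share, 2))
```

The relative frequencies sum to 1, which is what lets histograms of data sets with different sizes be compared on the same scale.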
Most visualization tools employ “best practice” rules for generating histograms
▪ Often useful to vary groups for partitioning data
▪ Typically keep the number of classes between 5 and 20 (rule of thumb: average of ~5 data points per group)
Weights of a sample of UNT students
Five-number summary provides a useful snapshot of distribution shape
▪ Minimum, Q1, median, Q3 and maximum
▪ Can be used to detect asymmetries in distribution
▪ Useful in comparing distributions
▪ Visualized with boxplots
Basic boxplots
▪ Vertical scale includes min and max values
▪ Box from Q1 to Q3
▪ Indication of median – usually solid line at the median
▪ Whiskers from Q1 to min and from Q3 to max
Data set 1: 1, 98, 99, 100, 100, 100, 102, 102, 104, 200
Data set 2: 37, 44, 55, 69, 100, 100, 125, 152, 157, 161
[Figure: basic boxplots of the two data sets; vertical scale from 1 to 200]
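The five-number summaries behind the two boxplots can be sketched as follows. The quartiles use the "exclusive" convention of `statistics.quantiles`, which is an assumption; the plotting tool used for the slides may place Q1 and Q3 slightly differently.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the 'exclusive' quartile convention."""
    q1, med, q3 = quantiles(data, n=4, method="exclusive")
    return min(data), q1, med, q3, max(data)

set_1 = [1, 98, 99, 100, 100, 100, 102, 102, 104, 200]
set_2 = [37, 44, 55, 69, 100, 100, 125, 152, 157, 161]

summary_1 = five_number_summary(set_1)
summary_2 = five_number_summary(set_2)
print(summary_1)
print(summary_2)
```

Both sets have a median of 100, but the first has a very narrow box with long whiskers while the second has a wide box, a contrast the summaries make visible without a plot.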
Vertical scale includes min and max
Box from Q1 to Q3
Line at the median
Whiskers from Q1 to Q1-(1.5)IQR and from Q3 to Q3+(1.5)IQR
Each mild outlier between (1.5)IQR and 3IQR from Q1 and Q3 is marked by an open circle
Each extreme outlier further than 3IQR from Q1 and Q3 is marked by a filled circle
Symmetric distribution
▪ Distance from Q1 to median approximately same as median to Q3
▪ Distance from minimum to Q1 approximately same as from Q3 to maximum
[Figure: symmetric frequency histogram with min, Q1, median, Q3, and max marked]
Positive (right) skewed
▪ Distance from Q1 to median smaller than distance from median to Q3
▪ Distance from minimum to Q1 smaller than distance from Q3 to maximum
[Figure: right-skewed frequency histogram with min, Q1, median, Q3, and max marked]
Negative (left) skewed
▪ Distance from Q1 to median larger than distance from median to Q3
▪ Distance from minimum to Q1 larger than distance from Q3 to maximum
[Figure: left-skewed frequency histogram with min, Q1, median, Q3, and max marked]
Area Mean