Data analytics

INSD5120ExploratoryDataAnalysis-Part1.pdf

 Why EDA?

 Data types

 Summary statistics

▪ Central tendency

▪ Dispersion

▪ Distribution shape

 Relative position

 Outliers

 Data visualizations

 A crucial first step in any analysis project is to gain an intuitive understanding of the data

 EDA is an approach and philosophy to data analysis that postpones model building/machine learning by first allowing the data to reveal its underlying structure through descriptive statistics and visualizations

 EDA is the connection point between data collection and formal analysis

 Appreciate the importance and techniques for EDA

 Understand definitions and uses of summary statistics

 Appreciate the differences, advantages and disadvantages between measures

 Be able to characterize data sets based on summary statistics and visualizations

 Different types of concepts are represented with different types of data

▪ Level of measurement determines the kinds of analyses that can be performed with the data

 Discrete & Continuous

▪ Discrete data can only assume values from a finite or countably infinite set (e.g., counts)

▪ Continuous data can take any value within some interval

Scale    | Definition                                               | Examples                                 | Descriptive statistics
Nominal  | Non-ordered categories                                   | Race, gender, marital status             | Percentages, mode
Ordinal  | Ordered relation between categories                      | Attitudes, social class                  | Percentiles, median
Interval | Ordered relation, equality of differences                | Temperature                              | Range, mean, standard deviation
Ratio    | Ordered relation, equality of differences, absolute zero | Elapsed time, costs, number of customers | All of above, coefficient of variation

 Summarizing a set of interval data typically involves describing three main attributes

▪ Central tendency

▪ Dispersion

▪ Shape

[Figure: histogram; vertical axis: Frequency]

 Measures of central tendency provide a focal point for making decisions based on the data

▪ Mean (average)

▪ Median

▪ Mode

▪ Trimmed means

 Mean is the arithmetic average of a data set

▪ Data: $x_1, x_2, \ldots, x_n$

▪ Average: $\dfrac{x_1 + x_2 + \cdots + x_n}{n}$

▪ Can be applied to ratio and interval data

 A dataset may represent a sample (subset) of elements from a population or the entire population of interest

▪ Sample mean: $\bar{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$

▪ Population mean: $\mu = \dfrac{1}{N} \sum_{i=1}^{N} x_i$

 Mean serves as a measure of central tendency since it is the value that balances positive and negative deviations

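▪ Example data: 5, 9, 14, 16; average = 11

▪ Deviations from the mean: 5 - 11 = -6, 9 - 11 = -2, 14 - 11 = 3, 16 - 11 = 5

▪ The negative and positive deviations cancel: -6 - 2 + 3 + 5 = 0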
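
The balancing property can be checked directly. The following is a minimal sketch using the four-value example from the slide; the variable names are illustrative only.

```python
# Deviations from the mean always sum to zero (up to rounding).
data = [5, 9, 14, 16]

mean = sum(data) / len(data)            # arithmetic average
deviations = [x - mean for x in data]   # signed distance of each point from the mean

print("mean:", mean)
print("deviations:", deviations)
print("sum of deviations:", sum(deviations))
```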

 The mean is sensitive to outlier values in the data set

▪ Mean can change substantially because of a few very large or small data points

▪ Mean is sensitive to data collection/recording errors

▪ Mean is not a robust estimator of central tendency

 Always check integrity of data before continuing with analysis

▪ Check reasonableness of max and min

▪ Construct histogram and/or boxplots

 When possible, plot data in the order that it was collected to help spot outliers and identify possible data collection errors

[Figure: data values plotted in collection order; horizontal axis: Data, vertical axis: Value]

▪ Data: 162, 166, 158, 154, 147, 150, 141, 233, 278, 288, 148, 152, 149, 265, 212, 154, 148, 158, 150, 137, 142, 149, 148, 145, 143, 152

▪ Mean = 170.35

▪ Mean without outliers = 150.14
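
The two means quoted for this data set can be reproduced with a short script. This is a sketch, not the slide author's procedure: the cutoff of 200 used to separate outliers is an assumption, chosen because dropping the five values above it gives the quoted 150.14.

```python
from statistics import mean

values = [162, 166, 158, 154, 147, 150, 141, 233, 278, 288, 148, 152, 149,
          265, 212, 154, 148, 158, 150, 137, 142, 149, 148, 145, 143, 152]

CUTOFF = 200  # assumed threshold; the slides do not state how outliers were chosen

without_outliers = [v for v in values if v <= CUTOFF]

mean_all = mean(values)
mean_clean = mean(without_outliers)

print("mean, all data:        ", round(mean_all, 2))
print("mean, outliers removed:", round(mean_clean, 2))
```

Five of the 26 points move the mean by about 20 units, which is the sensitivity the slide describes.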

 Median is the value such that half the data is less than the median and half is greater

▪ Can be applied to ratio, interval and ordinal data

▪ Example data: 101.7 100.6 92.3 91.6 94.3 93.7 108.8 92.3 110.6 100.2 104.3

▪ Sorted: 91.6 92.3 92.3 93.7 94.3 100.2 100.6 101.7 104.3 108.8 110.6

▪ Median = 100.2 (the 6th of the 11 sorted values)

 Median is a more robust measure of central tendency than mean

[Figure: data values plotted in collection order; horizontal axis: Data, vertical axis: Value]

▪ Median with outliers = 151

▪ Median without outliers = 149
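
For comparison with the mean, the medians can be computed on the same 26 values. As before, this is only a sketch, and the cutoff of 200 for outliers is an assumption rather than something stated in the slides.

```python
from statistics import median

values = [162, 166, 158, 154, 147, 150, 141, 233, 278, 288, 148, 152, 149,
          265, 212, 154, 148, 158, 150, 137, 142, 149, 148, 145, 143, 152]

CUTOFF = 200  # assumed threshold for treating a value as an outlier

without_outliers = [v for v in values if v <= CUTOFF]

median_all = median(values)               # average of the two middle values (n is even)
median_clean = median(without_outliers)   # middle value (n is odd)

print("median, all data:        ", median_all)
print("median, outliers removed:", median_clean)
```

The median shifts by only 2 units when the outliers are removed, while the mean shifts by about 20.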

 Trimmed mean is the arithmetic mean after excluding the smallest and greatest x% of the data

▪ More robust to outliers than standard mean

▪ Typically eliminate smallest/greatest 5% or 10%

▪ Example data (sorted): 48, 65, 68, 70, 70, 71, 73, 74, 78, 83, 85, 87, 87, 90, 92, 93, 94, 98, 99, 101, 109, 111, 112, 114, 115, 115, 117, 118, 129, 133, 138, 144
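
A trimmed mean can be sketched in a few lines. The helper name `trimmed_mean` is hypothetical, and rounding the number of trimmed points down to a whole number is one common convention assumed here; the slides do not specify one.

```python
def trimmed_mean(data, proportion):
    """Mean after dropping the lowest and highest `proportion` of the points.

    The count dropped from each end is floor(n * proportion) (an assumed convention).
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(data)
    k = int(len(ordered) * proportion)        # points to drop from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


scores = [48, 65, 68, 70, 70, 71, 73, 74, 78, 83, 85, 87, 87, 90, 92, 93,
          94, 98, 99, 101, 109, 111, 112, 114, 115, 115, 117, 118, 129, 133,
          138, 144]

print("ordinary mean:   ", round(trimmed_mean(scores, 0.0), 2))
print("5% trimmed mean: ", round(trimmed_mean(scores, 0.05), 2))
print("10% trimmed mean:", round(trimmed_mean(scores, 0.10), 2))
```

With 32 points, 10% trimming drops three values from each end and 5% trimming drops one.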

 Use median if ordinal data

 If ratio or interval data, can calculate mean and median

 Check data integrity

▪ Examine max and min

▪ Plot data

▪ If analyzing data without outliers, report and explain outliers

▪ Use median or trimmed means if robust measure needed

 Many studies involve studying the difference between population means

▪ So using the mean may be dictated by objective of study

 If data is unimodal and fairly symmetric

▪ Mean is approximately equal to median

▪ Mean is a reasonable measure of central tendency

[Figure: histogram of unimodal, fairly symmetric data; vertical axis: Frequency]

 If data is unimodal and asymmetric

▪ Median is better measure of central tendency

▪ Often calculate and report both median and mean

▪ Difference between mean and median indicative of asymmetry

[Figure: histogram of unimodal, asymmetric data; vertical axis: Frequency]

 Household income*

▪ Mean = $87,200, median = $46,700

 Net worth*

▪ Mean = $534,500, median = $81,200

 If data is not unimodal

▪ Then there is not a central tendency to the data

▪ Neither mean nor median provides a good summary of the data set

▪ Analyze data for distinct groups

▪ Identify groups and consider determining characteristics for each group separately

[Figure: histogram of data that is not unimodal; vertical axis: Frequency]

 Time series data is collected periodically over some time interval

 Types of time series

▪ Stationary processes

▪ Data varies around some central value with approximately same variation over time

▪ Standard mean or median can be used as central tendency for stationary time series

▪ Nonstationary processes

▪ Data has trend and/or changes in variation over time

▪ Moving averages can be used to provide a (moving) central tendency value for nonstationary time series
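
A simple (unweighted) moving average is one way to obtain a moving central tendency value. The sketch below makes several assumptions: the function name `moving_average`, the window length of 4 quarters, and the revenue figures are all hypothetical and are not taken from the slides.

```python
def moving_average(series, window):
    """Simple moving average: mean of each run of `window` consecutive values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


# Hypothetical quarterly revenue with an upward trend (illustrative values only).
revenue = [100, 112, 125, 131, 150, 163, 178, 190]

smoothed = moving_average(revenue, window=4)
for quarter, value in enumerate(smoothed, start=4):
    print(f"quarters {quarter - 3}-{quarter}: {value:.2f}")
```

A longer window smooths more but reacts more slowly to changes in the trend.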

[Figure: two time series plots; horizontal axis: Quarter (1 to 20), vertical axis: Revenue]

 Central tendency measures can be misleading or non-informative if there is not a “central tendency” in the data

▪ Multi-modal

▪ U-shaped distributions

▪ Uniform distributions

▪ Highly skewed

[Figure: four histograms, each with N = 1000]

▪ BIMODAL: Mean = 50.0, Std. Dev = 11.51

▪ USHAPE: Mean = 3.1, Std. Dev = 1.85

▪ UNIFORM: Mean = 10.6, Std. Dev = 5.65

▪ SKEWED: Mean = 0.19, Std. Dev = 0.19

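 Any single number summary may not adequately represent data and may hide differences between data sets

▪ Data set 1: 2, 50, 100, 150, 198

▪ Data set 2: 98, 99, 100, 101, 102

▪ Both data sets have mean 100, yet their spreads are very different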

 Measures of dispersion provide ways to quantify the amount of variation within a data set

 Dispersion measures also provide context to evaluate significance of departures from central tendency

 Types of measures

▪ Range

▪ Standard deviation

▪ IQR

 The more spread out or dispersed the data, the larger the range, SD and IQR

 The more concentrated or homogeneous the data, the smaller the range, SD, and IQR

 Range: max - min

▪ Data set 1 (2, 50, 100, 150, 198): 198 - 2 = 196

▪ Data set 2 (98, 99, 100, 101, 102): 102 - 98 = 4

 Root mean square difference from the mean

▪ Data: $x_1, x_2, \ldots, x_n$

▪ Mean: $m = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$

▪ $SD = \sqrt{\dfrac{(x_1 - m)^2 + (x_2 - m)^2 + \cdots + (x_n - m)^2}{n}}$

 If the dataset represents a sample from a population, the dispersion of the sample (sample standard deviation) can be used to estimate the dispersion of the population (population standard deviation)

▪ The usual estimator of the population SD divides by $n - 1$; the corresponding sample variance $s^2$ is an unbiased estimator of the population variance

$s = \sqrt{\dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

▪ If the dataset is the entire population, the population SD is

$\sigma = \sqrt{\dfrac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$

 Simple example

▪ Data set 1: 2, 50, 100, 150, 198 (m = 100)

$SD = \sqrt{\dfrac{(2 - 100)^2 + (50 - 100)^2 + (100 - 100)^2 + (150 - 100)^2 + (198 - 100)^2}{5}} \approx 69.6$

▪ Data set 2: 98, 99, 100, 101, 102 (m = 100)

$SD \approx 1.4$
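
The range and SD values for the two example data sets can be verified with Python's standard library. This is a sketch: `pstdev` divides by n (matching the worked example), while `stdev` divides by n - 1 and is shown only for comparison.

```python
from statistics import mean, pstdev, stdev

datasets = {
    "data set 1": [2, 50, 100, 150, 198],
    "data set 2": [98, 99, 100, 101, 102],
}

for name, data in datasets.items():
    data_range = max(data) - min(data)
    print(name,
          "| mean:", mean(data),
          "| range:", data_range,
          "| SD (divide by n):", round(pstdev(data), 1),
          "| SD (divide by n-1):", round(stdev(data), 1))
```

Both data sets share the same mean, so only the dispersion measures reveal how different they are.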

 While the form of standard deviation is not particularly intuitive, many data sets can be characterized using just the mean and SD

▪ If the values of the data set are approximately normally distributed, then ~68% of the data will be within 1 SD unit of the mean, ~95% will be within 2 SD units and nearly all will be within 3 SD units
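
The 68% / 95% / nearly-all figures follow from the normal distribution itself. As a sketch, the probability that a normal value lies within k SD units of the mean equals erf(k / sqrt(2)); the helper name `fraction_within` is hypothetical.

```python
import math


def fraction_within(k):
    """Probability that a normally distributed value lies within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {fraction_within(k):.4f}")
```

These values apply only to the extent that the data is close to normal; skewed or multi-modal data can deviate substantially.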

 Important to note that both range and standard deviation are sensitive to outliers

[Figure: bell-shaped histogram in standard units from -3.00 to 3.00; vertical axis: Count]

 A robust measure of dispersion is the interquartile range (IQR)

 The IQR specifies the range over which the middle 50% of the data is spread

▪ Q1 or 25th percentile: value such that 25% of data is less than it and 75% is greater

▪ Q3 or 75th percentile: value such that 75% of data is less than it and 25% is greater

▪ IQR = Q3 - Q1

 Example

▪ Data set A: 1 98 99 100 100 100 102 102 104

▪ Data set B: 95 98 99 100 100 100 102 102 104

▪ For both: Q1 = 98.5, Q3 = 102, IQR = 102 – 98.5 = 3.5

▪ Like the median, the IQR is less sensitive to outliers since it is based on relative ranking of data points as opposed to their actual values
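
The IQR example can be reproduced with `statistics.quantiles`. This is a sketch under one assumption: the `"exclusive"` quantile method is used because it matches the Q1 = 98.5 and Q3 = 102 shown above; other quartile conventions (for example `"inclusive"`) give slightly different values.

```python
from statistics import quantiles


def quartiles_and_iqr(data):
    """Return (Q1, Q3, IQR) using the 'exclusive' quantile convention."""
    q1, _median, q3 = quantiles(data, n=4, method="exclusive")
    return q1, q3, q3 - q1


with_outlier = [1, 98, 99, 100, 100, 100, 102, 102, 104]
without_outlier = [95, 98, 99, 100, 100, 100, 102, 102, 104]

for name, data in (("with outlier", with_outlier),
                   ("without outlier", without_outlier)):
    q1, q3, iqr = quartiles_and_iqr(data)
    print(f"{name}: range = {max(data) - min(data)}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

Replacing 95 with 1 changes the range from 9 to 103 but leaves the IQR unchanged.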

 Suppose a constant, c, is added to each data value

▪ What is the new mean and standard deviation?

 Suppose each data value is multiplied by a constant c

▪ What is the new mean and standard deviation?
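
One way to check an answer to these two questions is numerically. The sketch below uses a small data set from earlier in the deck and a hypothetical constant c = 10; it illustrates the behavior rather than proving it.

```python
import math
from statistics import mean, pstdev

data = [98, 99, 100, 101, 102]
c = 10  # hypothetical constant

shifted = [x + c for x in data]   # constant added to each value
scaled = [x * c for x in data]    # each value multiplied by the constant

print("original:", mean(data), round(pstdev(data), 3))
print("shifted: ", mean(shifted), round(pstdev(shifted), 3))
print("scaled:  ", mean(scaled), round(pstdev(scaled), 3))

# Adding c shifts the mean by c and leaves the SD unchanged.
shift_ok = (mean(shifted) == mean(data) + c
            and math.isclose(pstdev(shifted), pstdev(data)))
# Multiplying by c multiplies the mean by c and the SD by |c|.
scale_ok = (mean(scaled) == c * mean(data)
            and math.isclose(pstdev(scaled), abs(c) * pstdev(data)))
print(shift_ok, scale_ok)
```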

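 Often summary measures are given for groups of data

▪ Means: $m_1, m_2, \ldots, m_n$

▪ SD's: $SD_1, SD_2, \ldots, SD_n$

▪ Frequencies: $f_1, f_2, \ldots, f_n$

 Then statistics are needed for the data aggregated together

▪ Total frequency: $F = f_1 + f_2 + \cdots + f_n$

 Aggregate mean

$M = \dfrac{f_1}{F} m_1 + \dfrac{f_2}{F} m_2 + \cdots + \dfrac{f_n}{F} m_n$

 Aggregate SD

$SD = \sqrt{\dfrac{f_1}{F}\left(SD_1^2 + m_1^2\right) + \dfrac{f_2}{F}\left(SD_2^2 + m_2^2\right) + \cdots + \dfrac{f_n}{F}\left(SD_n^2 + m_n^2\right) - M^2}$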
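
The aggregate formulas can be checked against statistics computed from the pooled raw data. This is a sketch that assumes each group SD is a population SD (dividing by the group size); the function name `aggregate` and the choice of the two groups are illustrative.

```python
import math
from statistics import mean, pstdev


def aggregate(means, sds, freqs):
    """Combine per-group means, population SDs and frequencies into overall (M, SD)."""
    total = sum(freqs)
    overall_mean = sum(f * m for f, m in zip(freqs, means)) / total
    # Frequency-weighted average of each group's mean square (SD^2 + m^2).
    mean_square = sum(f * (sd ** 2 + m ** 2)
                      for f, m, sd in zip(freqs, means, sds)) / total
    return overall_mean, math.sqrt(mean_square - overall_mean ** 2)


group_a = [5, 9, 14, 16]
group_b = [98, 99, 100, 101, 102]
groups = [group_a, group_b]

M, SD = aggregate(means=[mean(g) for g in groups],
                  sds=[pstdev(g) for g in groups],
                  freqs=[len(g) for g in groups])

pooled = group_a + group_b
print("aggregate formulas:", round(M, 4), round(SD, 4))
print("pooled raw data:   ", round(mean(pooled), 4), round(pstdev(pooled), 4))
```

If the group SDs were computed with n - 1 in the denominator, they would need to be converted before the formula applies exactly.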

 The Wisconsin Breast Cancer data consists of attribute information on fine-needle aspirated breast tissue from 569 women. Information on each sample includes

 Diagnosis (M = malignant, B = benign) and, in columns 3-32, summary statistics on ten features computed from each cell nucleus in the sample

▪ Radius, texture, perimeter, area, smoothness, compactness (perimeter^2 / area - 1.0), concavity , concave points, symmetry, and fractal dimension

▪ For each feature, the mean, standard deviation, and "worst" or largest (mean of the three largest values) were computed for each sample, resulting in 30 measurements.

▪ For example, in the dataset, column 3 is Mean Radius, column 13 is Radius SD, column 23 is Worst Radius

 Find summary statistics (mean, median, SD, IQR) for each attribute feature – overall and with respect to diagnosis

 Construct parallel boxplots for each attribute feature, comparing differences between diagnoses

 From the summary statistics and boxplots, suggest (and justify) guidelines for automating the diagnosis using the attribute features

 The National Science Foundation’s Higher Education Research and Development Survey is the primary source of information on R&D expenditures at U.S. colleges and universities

▪ The data in HERD FY16 gives research expenditures (dollars in thousands) for FY’s 2007 – 2016

 Give summary statistics (mean, median, SD, IQR) and boxplots on R&D expenditures across higher education institutions for each fiscal year

 Analyze and describe the distribution of FY16 R&D expenditures among institutions. For example, are R&D expenditures uniformly distributed across universities, or do a few universities account for the majority of expenditures?

 Identify FY2015 – 2016 overall rankings and percentiles of Texas universities. Which Texas universities moved up in rankings and which moved down from FY2015 to FY2016

 Based on the last five fiscal years, project R&D expenditures and overall rankings for Texas universities in FY17.

 An important aspect of EDA is examining the relative position of select data points within the entire data set

▪ Standard units

▪ Percentile

 Translating a data point into standard units indicates the position of the data relative to the mean with respect to standard deviation units

▪ The z-score of a data point $x$ is given by

$z = \dfrac{x - m}{SD}$

 A z-score greater than 0 indicates the data point is greater than the mean

 A z-score less than 0 indicates the data point is less than the mean

 A z-score equal to 0 indicates the data point is equal to the mean

 A z-score between –1 and 1 indicates that the data point is a fairly typical value

 A z-score greater than ~2 or less than ~–2 indicates an atypical value
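
As an illustration, z-scores can be computed for the 26-value data set used earlier in the deck. This sketch assumes the population SD (dividing by n) and uses the ~2 cutoff from the guidelines above; the variable names are illustrative.

```python
from statistics import mean, pstdev

values = [162, 166, 158, 154, 147, 150, 141, 233, 278, 288, 148, 152, 149,
          265, 212, 154, 148, 158, 150, 137, 142, 149, 148, 145, 143, 152]

m = mean(values)
sd = pstdev(values)   # assumption: SD computed with n in the denominator

z_scores = [(x - m) / sd for x in values]

# Values whose z-score magnitude exceeds the ~2 guideline.
atypical = [x for x, z in zip(values, z_scores) if abs(z) > 2]

print("mean:", round(m, 2), "SD:", round(sd, 2))
print("atypical values (|z| > 2):", atypical)
```

Note that 212 and 233 are not flagged here: the large values inflate both the mean and the SD, which makes z-scores a less sensitive outlier screen than rank-based rules.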

[Figure: bell-shaped histogram in standard units from -3.00 to 3.00, with the central region labeled Typical; vertical axis: Count]

 Percentiles give a way to compare the relative position of multiple data points

▪ Removes potential confusion introduced by scale of data

 Percentile

▪ The pth percentile is that value such that p% of data are less than and (100-p)% are greater than

$\text{percentile of } x = \dfrac{\#\ \text{data points} < x}{\#\ \text{data points}} \times 100$
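
The percentile of a data point can be sketched directly from the definition. The function name `percentile_of` is hypothetical, and counting only values strictly less than x is an assumption that follows the "less than" wording above; other conventions count ties differently.

```python
def percentile_of(data, x):
    """Percent of data points strictly less than x."""
    below = sum(1 for value in data if value < x)
    return 100 * below / len(data)


measurements = [101.7, 100.6, 92.3, 91.6, 94.3, 93.7,
                108.8, 92.3, 110.6, 100.2, 104.3]

for x in (92.3, 100.2, 110.6):
    print(f"{x} is at percentile {percentile_of(measurements, x):.1f}")
```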

 Intuitively, outlier values are atypical data points very different from the central tendency of the data set

 No one absolute definition for data points to be classified as outliers

▪ Z-score greater than ~2.5 or 3; less than ~-2.5 or –3

▪ More than 1.5 times the IQR above the 75th percentile, or 1.5 times the IQR below the 25th percentile

▪ Extreme points are more than 3 times the IQR above the 75th percentile, or 3 times the IQR below the 25th percentile
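
The IQR-based rules can be sketched as a small classifier. The function name `classify_outliers` and the labels are hypothetical, and the quartiles depend on the quantile convention; Python's default `"exclusive"` method is assumed here. The data is the ten-value set from the boxplot example later in the deck.

```python
from statistics import quantiles


def classify_outliers(data):
    """Label each point 'typical', 'mild' (beyond 1.5 IQR) or 'extreme' (beyond 3 IQR)."""
    q1, _median, q3 = quantiles(data, n=4)   # default method is 'exclusive'
    iqr = q3 - q1
    labels = []
    for x in data:
        # Distance outside the Q1..Q3 box (0 if the point is inside it).
        distance = max(q1 - x, x - q3, 0)
        if distance > 3 * iqr:
            labels.append((x, "extreme"))
        elif distance > 1.5 * iqr:
            labels.append((x, "mild"))
        else:
            labels.append((x, "typical"))
    return labels


data = [1, 98, 99, 100, 100, 100, 102, 102, 104, 200]
for value, label in classify_outliers(data):
    print(value, label)
```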

 Symmetry

▪ Symmetrical distributions have means and medians that are approximately equal

▪ Skewed distributions have means and medians that are substantially different

[Figure: three frequency histograms]

▪ Symmetric: Mean = Median

▪ Positive or right-skewed: Median < Mean

▪ Negative or left-skewed: Mean < Median

 Kurtosis concerns how heavy the tails of the distribution are, typically using the normal distribution as the reference distribution (excess kurtosis)

▪ Leptokurtic (K > 0)

▪ Mesokurtic (K = 0)

▪ Platykurtic (K < 0)
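
Skewness and excess kurtosis can be computed from central moments. The slides do not give formulas, so the moment-based definitions below (third moment over SD cubed, and fourth moment over SD to the fourth minus 3) are an assumption; statistical packages often apply additional sample-size corrections. The function names and the small data sets are illustrative.

```python
def central_moment(data, k):
    """k-th central moment: average of (x - mean)^k."""
    m = sum(data) / len(data)
    return sum((x - m) ** k for x in data) / len(data)


def skewness(data):
    """Moment-based skewness: positive for right skew, negative for left skew."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


flat = [1, 2, 3, 4, 5]            # evenly spread, light tails
right_skewed = [1, 1, 1, 2, 10]   # one large value in the right tail

print("flat:        ", round(skewness(flat), 3), round(excess_kurtosis(flat), 3))
print("right-skewed:", round(skewness(right_skewed), 3),
      round(excess_kurtosis(right_skewed), 3))
```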

[Figure: three frequency histograms; vertical axis: Frequency]

 Visualization technique to represent relative proportion of data within distinct categories or intervals

▪ Frequency histogram (bar charts)

▪ Relative frequency histograms

▪ Density histograms

 Considerations

▪ Continuous or discrete data

▪ Endpoint convention

▪ Horizontal scale

▪ Number of intervals

 Example: Number of lawyers in Frisco law firms

Number of lawyers | Frequency | Relative frequency
1                 | 11        | 0.44
2                 | 7         | 0.28
3                 | 4         | 0.16
4                 | 2         | 0.08
5                 | 1         | 0.04

[Figure: frequency histogram; horizontal axis: # of Lawyers, vertical axis: Frequency]

▪ $\text{relative frequency} = \dfrac{\text{frequency}}{\text{number of observations in the data set}}$
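
The frequency table can be rebuilt from raw observations with `collections.Counter`. This is a sketch: the raw list below is reconstructed from the frequencies in the table (25 firms in total), not taken from an original data file.

```python
from collections import Counter

# One entry per law firm, reconstructed from the table's frequencies.
firm_sizes = [1] * 11 + [2] * 7 + [3] * 4 + [4] * 2 + [5] * 1

counts = Counter(firm_sizes)
n = len(firm_sizes)

# size -> (frequency, relative frequency)
table = {size: (counts[size], counts[size] / n) for size in sorted(counts)}

print("lawyers  freq  rel. freq")
for size, (freq, rel) in table.items():
    print(f"{size:>7}  {freq:>4}  {rel:>9.2f}")
```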

 Most visualization tools employ “best practice” rules for generating histograms

▪ Often useful to vary groups for partitioning data

▪ Typically keep the number of classes between 5 and 20 (rule of thumb: average of ~5 data points per group)

[Figure: histogram of weights of a sample of UNT students]

 Analyze and describe the distribution of FY16 R&D expenditures among institutions. For example, are R&D expenditures uniformly distributed across universities, or do a few universities account for the majority of expenditures?

 Five-number summary provides a useful snapshot of distribution shape

▪ Minimum, Q1, median, Q3 and maximum

▪ Can be used to detect asymmetries in distribution

▪ Useful in comparing distributions

▪ Visualized with boxplots
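
A five-number summary can be sketched with the standard library. The function name `five_number_summary` is hypothetical, and the quartiles use Python's default `"exclusive"` quantile method, which is an assumption; other tools may report slightly different Q1 and Q3. The data is the ten-value set from the boxplot example in this deck.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = quantiles(data, n=4)   # default method is 'exclusive'
    return min(data), q1, med, q3, max(data)


data = [1, 98, 99, 100, 100, 100, 102, 102, 104, 200]

labels = ("min", "Q1", "median", "Q3", "max")
for label, value in zip(labels, five_number_summary(data)):
    print(f"{label:>6}: {value}")
```

The large gaps from min to Q1 and from Q3 to max, compared with the narrow Q1 to Q3 box, point to extreme values at both ends.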

 Basic boxplots

▪ Vertical scale includes min and max values

▪ Box from Q1 to Q3

▪ Indication of median, usually a solid line at the median

▪ Whiskers from Q1 to min and from Q3 to max

1 98 99 100 100 100 102 102 104 200

37 44 55 69 100 100 125 152 157 161

[Figure: two basic boxplots; vertical scale labeled at 1, 100, and 200]

 Vertical scale includes min and max

 Box from Q1 to Q3

 Line at the median

 Whiskers from Q1 to Q1-(1.5)IQR and from Q3 to Q3+(1.5)IQR

 Each mild outlier between (1.5)IQR and 3IQR from Q1 or Q3 is marked by an open circle

 Each extreme outlier further than 3IQR from Q1 or Q3 is marked by a filled circle

 Symmetric distribution

▪ Distance from Q1 to median approximately same as median to Q3

▪ Distance from minimum to Q1 approximately same as from Q3 to maximum

[Figure: frequency histogram with min, Q1, median, Q3, and max marked]

 Positive (right) skewed

▪ Distance from Q1 to median smaller than distance from median to Q3

▪ Distance from minimum to Q1 smaller than distance from Q3 to maximum

[Figure: frequency histogram with min, Q1, median, Q3, and max marked]

 Negative (left) skewed

▪ Distance from Q1 to median larger than distance from median to Q3

▪ Distance from minimum to Q1 larger than distance from Q3 to maximum

[Figure: frequency histogram with min, Q1, median, Q3, and max marked]

[Figure: Area Mean]