quiz 2
Copyright © 2014 EMC Corporation. All rights reserved.
Review of Basic Data Analytic Methods Using R
Module 3: Basic Data Analytic Methods Using R
During this lesson the following topics are covered:
• Why visualize?
• Examining a single variable
• Examining pairs of variables
• Indications of dirty data.
• Data exploration vs. presentation
Analyzing and Exploring the Data
The topics for this lesson are listed.
Why Visualize?
Summary statistics give us some sense of the data:
Mean vs. Median.
Standard deviation.
Quartiles, Min/Max.
Correlations between variables.
summary(data)
       x                  y
 Min.   :-3.05439   Min.   :-3.50179
 1st Qu.:-0.61055   1st Qu.:-0.75968
 Median : 0.04666   Median : 0.07340
 Mean   :-0.01105   Mean   : 0.09383
 3rd Qu.: 0.56067   3rd Qu.: 0.88114
 Max.   : 2.60614   Max.   : 4.28693
Visualization gives us a more holistic sense
In the previous lesson, we saw how to examine data in R, including how to generate the descriptive statistics: averages, data ranges, and quartiles (which are included in the summary() report).
We also saw how to compute correlations between pairs of variables of interest. These statistics do give us a sense of the data: an idea of its magnitude and range, and of some obvious dirty data (missing values, values with obviously wrong magnitude or sign).
Visualization, however, gives us a succinct, more holistic view of the data that we may not be able to get from the numbers and summaries alone. It is an important facet of the initial data exploration. Visualization helps you assess data cleanliness, and also gives you an idea of potentially important relationships in the data before going on to build your models.
Anscombe’s Quartet
Property                                    Value
Mean of x in each case                      9 (exact)
Variance of x in each case                  11 (exact)
Mean of y in each case                      7.50 (to 2 d.p.)
Variance of y in each case                  4.13 (to 2 d.p.)
Correlation between x and y in each case    0.816
Linear regression line in each case         y = 3.00 + 0.500x (to 2 d.p. and 3 d.p. resp.)
4 data sets, characterized by the following. Are they the same, or are they different?
        i                ii               iii               iv
    x       y        x       y        x       y        x       y
  10.00    8.04    10.00    9.14    10.00    7.46     8.00    6.58
   8.00    6.95     8.00    8.14     8.00    6.77     8.00    5.76
  13.00    7.58    13.00    8.74    13.00   12.74     8.00    7.71
   9.00    8.81     9.00    8.77     9.00    7.11     8.00    8.84
  11.00    8.33    11.00    9.26    11.00    7.81     8.00    8.47
  14.00    9.96    14.00    8.10    14.00    8.84     8.00    7.04
   6.00    7.24     6.00    6.13     6.00    6.08     8.00    5.25
   4.00    4.26     4.00    3.10     4.00    5.39    19.00   12.50
  12.00   10.84    12.00    9.13    12.00    8.15     8.00    5.56
   7.00    4.82     7.00    7.26     7.00    6.42     8.00    7.91
   5.00    5.68     5.00    4.74     5.00    5.73     8.00    6.89
Anscombe’s Quartet is a synthesized example by the statistician F. J. Anscombe. Look at the properties and values of these four data sets. Based on standard statistical measures of mean, variance, and correlation (our descriptive statistics), these data sets are identical. Or are they?
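The summary properties listed for the quartet can be recomputed directly. The sketch below assumes that R's built-in anscombe data frame (columns x1..x4 and y1..y4) holds the same four data sets as the slide; the name quartet_stats is ours, chosen only for illustration.

```r
# Assumption: the built-in `anscombe` data frame matches the slide's data.
data(anscombe)

# One row of descriptive statistics per data set.
quartet_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)                      # simple linear regression of y on x
  data.frame(
    set       = i,
    mean_x    = mean(x),
    var_x     = var(x),
    mean_y    = mean(y),
    var_y     = var(y),
    cor_xy    = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2])
  )
}))

print(round(quartet_stats, 3))
```

Reading the rows side by side shows how closely the four sets agree on these measures, which is exactly why the numbers alone cannot tell them apart.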
Moral: Visualize Before Analyzing!
However, if we visualize each data set using a scatterplot with a regression line superimposed over each plot, the datasets appear quite different. Dataset 1 is the best candidate for a regression line, although there is a lot of variation. Dataset 2 is definitely non-linear. Dataset 3 is a close match, but the line overpredicts at higher values of x, and the data has an extreme outlier. And Dataset 4 isn’t captured at all by a simple regression line.
Assuming we have datasets represented by data frames s1, s2, s3, and s4, we can generate these plots in R by using the following code:
R-Code
plot(s1)
abline(lm(y ~ x, data = s1))   # superimpose the regression line on the scatterplot
…
(Yes, a loop is possible, but it requires more advanced data manipulation; if interested, consult the documentation for R's eval() function.) We must also take care not to overwrite the preceding graph in each instance.
Code to produce these graphs is included in the script AnscombePlot.R. Note that the dataset for these plots is included in the standard R distribution. Type data() for a list of the datasets included in the base distribution; data(name) will make that dataset available in your workspace.
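As a minimal sketch (not the contents of AnscombePlot.R, which we have not reproduced here), the four panels can be drawn in one loop by keeping the data sets in a list. It assumes the built-in anscombe data frame stands in for s1 through s4; the list name sets is hypothetical.

```r
# Assumption: the built-in `anscombe` data frame stands in for s1..s4.
data(anscombe)

# Collect the four data sets into a list of data frames with columns x and y.
sets <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})

old_par <- par(mfrow = c(2, 2))     # 2 x 2 grid, so no panel replaces another
for (i in seq_along(sets)) {
  s <- sets[[i]]
  plot(s$x, s$y, xlim = c(3, 19), ylim = c(3, 13),
       xlab = "x", ylab = "y", main = paste("Dataset", i))
  abline(lm(y ~ x, data = s), col = "red")   # regression line over the points
}
par(old_par)                        # restore the previous layout
```

Using the same axis limits in every panel makes the differences between the data sets easier to compare by eye.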
Visualizing Your Data
• Examining the distribution of a single variable
• Analyzing the relationship between two variables
• Establishing multiple pairwise relationships between variables
• Analyzing a single variable over time
• Data exploration versus data presentation
In a previous lesson, we’ve looked at how you can characterize your data by using traditional statistics. But we also showed how datasets could appear identical when using descriptive statistics, and yet look completely different when visualizing the data via a plot.
Using visual representations of data is the hallmark of exploratory data analysis: letting the data speak to us rather than necessarily imposing an interpretation on the data a priori. In the rest of this lesson, we are going to examine ways of displaying data so that we can better understand the underlying distributions of a single variable or the relationships between two or more variables.
Although data visualization is a powerful tool, the results we obtain may not be suitable when it comes time for us to “tell a story” about the data. Our last slide will discuss what kind of presentations are most effective.
Examining the Distribution of a Single Variable
Graphing a single variable
• plot(sort(.)) – for low volume data
• hist(.) – a histogram
• plot(density(.)) – density plot, a "continuous histogram"
• Example: frequency table of household income
R has multiple functions available to examine a single variable. Some of them are listed above. See the R documentation for each of these. Some other useful functions are barplot() and dotplot().
The example included is a frequency table of household income. We can certainly see a concentration of households in the leftmost portion of the graph.
rug() plot emphasizes distribution
R has multiple functions available to examine a single variable. Some of them are listed above. See the R documentation for each of these. Some other useful functions are barplot(), dotplot() and stem().
The example included is a frequency table of log10 of household income. We can certainly see a concentration of households in the rightmost portion of the graph. The rug() function creates a 1-dimensional density plot as well: notice how it emphasizes the area under the curve.
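The single-variable functions named on the slide can be tried side by side. This is only a sketch: the income vector is simulated from a lognormal distribution and is not the course's household income data.

```r
# Assumption: `income` is simulated; it is NOT the household income data
# shown on the slide.
set.seed(42)
income <- rlnorm(1000, meanlog = log(50000), sdlog = 0.8)

old_par <- par(mfrow = c(2, 2))
plot(sort(income), main = "Sorted values", ylab = "income")
hist(income, breaks = 50, main = "Histogram", xlab = "income")
plot(density(income), main = "Density")
plot(density(log10(income)), main = "Density of log10(income)")
rug(log10(income))                  # one tick per observation along the x axis
par(old_par)
```

The last panel pairs the density curve with rug(), so the tick marks show where the individual observations fall under the curve.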
A sense of the data range
• If it's very wide, or very skewed, try computing the log
Outliers, anomalies
• Possibly evidence of dirty data
Shape of the Distribution
• Unimodal? Bimodal?
• Skewed to left or right?
• Approximately normal? Approximately lognormal?
Example - Distribution of purchase size ($)
• Range from 0 to > $10K, right skewed
• Typical of monetary data
• Plotting log of data gives better sense of distribution
• Two purchasing distributions: ~ $55 and ~ $2900
What are we looking for?
When viewing the variables during the data exploration phase, you are looking for a sense of the data range, and whether the values are strongly concentrated in a certain range. If the data is very skewed, viewing the log of the data (if it's all positive) can help you detect structure that you might otherwise miss in a regularly scaled graph.
This is your chance to look for obvious signs of dirty data (outliers or unlikely looking values). See if the data is unimodal or multimodal: that gives you an idea of how many distinct populations (with distinct behavior patterns) might be mixed into your overall population. Knowing if the data is approximately normal (or can be transformed to approximately normal – for example, by taking the log) is important, since many modeling techniques assume that the data is approximately normal in distribution.
For our example, we can look at the densityplot of purchase sizes (in $ US) of customers at our online retail site. The range here is extremely wide – from around $1 US to over $10,000 US. Extreme ranges like this are typical of monetary data, like income, customer value, tax liabilities, bank account sizes, etc. (In fact, all of this kind of data is often assumed to be distributed lognormally – that is, its log is a normal distribution).
The data range makes it really hard for us to see much detail, so we take the log of it, and then density plot it. Now we can see that there are (at least) two distinct populations in our customer base: one population that makes small to medium size purchases (median purchase size about $55 US) and one that makes larger purchases (median purchase size about $2900 US). Can you see those two populations in the top graph?
The plots shown were made using the lattice package. If the data is in the vector purchase_size, then the lattice plots are:
library(lattice)
densityplot(purchase_size) # top plot
# bottom plot: log10 is actually easier to read,
# but this plot is in natural log
densityplot(log(purchase_size))
(The commands were actually more complicated than that, but these commands give the basic equivalent)
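To see why the log transform exposes the two populations, here is a small sketch with simulated data. The mixture below is an assumption made for illustration: two lognormal groups whose medians (about $55 and $2900) are borrowed from the notes, with made-up spreads and group sizes. Base graphics are used in place of lattice.

```r
# Assumption: simulated purchases, a mixture of two lognormal populations.
# Only the two medians come from the notes; sizes and spreads are invented.
set.seed(1)
purchase_size <- c(rlnorm(7000, meanlog = log(55),   sdlog = 0.6),
                   rlnorm(3000, meanlog = log(2900), sdlog = 0.4))

old_par <- par(mfrow = c(2, 1))
# Raw scale: the long right tail hides most of the structure.
plot(density(purchase_size), main = "Purchase size ($)")
# Log scale: the two populations show up as separate bumps.
plot(density(log10(purchase_size)), main = "log10(purchase size)")
par(old_par)
```

On the log10 axis, values near 1.7 correspond to about $55 and values near 3.5 to about $2900.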
Evidence of Dirty Data
Missing values?
Mis-entered data?
Inherited accounts?
Here's an example of how dirty data might manifest itself in your visualizations. We are looking at the age distribution of account holders at our bank. Mean age is about 40, approximately normally distributed with a standard deviation of about 15 years or so, which makes sense.
We see a few accounts with accountholder age < 10; unusual, but plausible. These could be custodial accounts, or college savings accounts set up by the parents of young children. We probably want to keep them for our analysis.
There is a huge spike of customers who are zero years old – evidence of missing data. We may need to eliminate these accounts from analysis (depending on how important we think age will be), or track down how to get the appropriate age data.
The customers with negative age are probably either missing data, or mis-entered data. The customers who are older than 100 are possibly also mis-entered data, or these are accounts that have been passed down to the heirs of the original accountholders (and not updated). We may want to exclude them as well, or at least threshold the age that we will consider in the analysis.
If this data is in a vector called age, then the plot is made by:
hist(age, breaks=100, main="Accountholder age distribution",
xlab="age", col="gray")
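The checks described above can also be done by counting. The sketch below assumes a simulated age vector with dirty values injected by hand; the category boundaries (negative, zero, over 100) follow the notes, while the names flags and age_clean are ours.

```r
# Assumption: simulated ages; the dirty values are injected deliberately.
set.seed(7)
bulk <- round(rnorm(5000, mean = 40, sd = 15))
bulk <- bulk[bulk >= 1 & bulk <= 100]        # keep the plausible bulk only
age  <- c(bulk,
          rep(0, 400),                       # missing ages stored as 0
          -1, -5, 130, 145)                  # mis-entered or stale records

# Label each record by the kind of problem it might indicate.
flags <- cut(age,
             breaks = c(-Inf, -1, 0, 100, Inf),
             labels = c("negative", "zero", "plausible", "over 100"))
print(table(flags))

# One possible cleaning rule: keep ages in (0, 100].
age_clean <- age[age > 0 & age <= 100]
hist(age_clean, breaks = 100, main = "Accountholder age, cleaned",
     xlab = "age", col = "gray")
```

Whether to drop, threshold, or repair the flagged records is a judgment call that depends on how important age is to the analysis.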
"Saturated" Data
Do we really have no mortgages older than 10 years?
Or does the year 2004 in the origination field mean "2004 or prior"?
Here's another example of dirty (or at least "incompletely documented") data. We are looking at the age of mortgages in our bank's home loan portfolio. The age is calculated by subtracting the origination date of the loan from "today" (2013).
The first thing we notice is that we don't seem to have loans older than 10 years old – and we also notice that we have a disproportionate number of ten year old loans, relative to the age distribution of the other loans.
One possible reason for this is that the date field for loan origination may have been "overloaded" so that "2004" is actually a beacon value that means "2004 or prior" rather than literally 2004. (This sometimes happens when data is ported from one system to another, or because someone, somewhere, decided that origination dates prior to 2004 are not relevant).
What would we do about this? If we are analyzing probability of default, it is probably safe to eliminate the data (or keep the assumption that the loans are 10 years old), since 10 year old mortgages default quite rarely (most defaults occur before about the 4th year). For different analyses, we may need to search for a source of valid origination dates (if that is possible).
If the data is in the vector mortgage, the plot is made by:
hist(mortgage, breaks=10, main="Portfolio Distribution, Years since origination",
     xlab="Mortgage Age", col="grey")
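A pile-up at the edge of the range can be checked numerically as well as visually. This sketch simulates the situation described above under an assumed capping rule: every loan older than 10 years is recorded as exactly 10. The age distribution and the comparison against the neighboring bin are illustrative choices, not the bank's data or an established test.

```r
# Assumption: simulated loan ages; ages above 10 are collapsed to 10 to mimic
# an origination year that really means "that year or prior".
set.seed(3)
true_age <- sample(0:25, 5000, replace = TRUE, prob = exp(-(0:25) / 6))
mortgage <- pmin(true_age, 10)

counts <- table(factor(mortgage, levels = 0:10))
print(counts)

# Crude check: how much larger is the last bin than the one before it?
ratio <- as.integer(counts["10"]) / as.integer(counts["9"])
print(ratio)

hist(mortgage, breaks = seq(-0.5, 10.5, by = 1),
     main = "Simulated portfolio, years since origination",
     xlab = "Mortgage Age", col = "grey")
```

A last bin several times larger than its neighbor, in a distribution that is otherwise declining, is a hint that the boundary value is absorbing older records.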
Analyzing the Relationship Between Two Variables
How?
• Two continuous variables (or two discrete variables)
  Scatterplots
  LOESS (fit smoothed line to the data)
  Linear models: graph the correlation
  Binplots, hexbin plots: more legible color-based plots for high volume data
• Continuous vs. discrete variable
  Jitter, box and whisker plots, dotplot or barchart
Example:
• Household income by region (ZIP1)
• Scatterplot with jitter, with box-and-whisker overlaid
• New England (0) and West Coast (9) have highest mean household income
Scatterplots are a good first visualization for the relationship between two variables, especially two continuous variables. Since you are looking for the relationship between the two variables, it can often be helpful to fit a smoothing curve through the data, for example loess or a linear regression. We'll see an example of that a little later on.
For very high volume data, scatterplots are problematic; with too much data on the page, the details can get lost. Sometimes the jitter() function can create enough (uniform) variation to see the associations more clearly. Hexbin plots are a good alternative: you can think of hexbin plots as two dimensional histograms that use color or grayscale to encode bin heights.
There are other alternatives for plotting continuous vs. discrete variables. Dotplots and barcharts plot the continuous value as a function of the discrete value when the relationship is one-to-one. Box and whisker plots show the distribution of the continuous variable for each value of the discrete variable.
The example here is of logged household incomes as a function of region (first digit of the zip). (Logged in this case means data that uses the logarithm of the value instead of the value itself.) In this example, we have also plotted the scatterplot beneath the box-and-whisker, with some jittering so each line of points widens into a strip. The "box" of the box and whisker shows the range that contains the central 50% of the data; the line inside the box is the location of the median. The "whiskers" give you an idea of the entire range of the data. Usually, box and whiskers also show "outliers" that lie beyond the whiskers, but they are turned off in this graph. This graph shows how household income varies by region. The highest median incomes are in New England (region 0) and on the West Coast (region 9). New England is slightly higher, but the boxes for the two regions overlap enough that the difference between the two regions probably is not significant. The lowest household incomes tend to be in region 7 (TX, OK, AR, LA).
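A jittered strip plot with box-and-whisker overlaid can be approximated in base graphics. The sketch below uses simulated data: the data frame d, its columns, and the income distribution are all hypothetical, and the slide's own plot may have been produced differently.

```r
# Assumption: simulated log10 incomes for ten regions labeled 0-9.
set.seed(11)
d <- data.frame(
  region     = factor(sample(0:9, 2000, replace = TRUE)),
  log_income = rnorm(2000, mean = 4.6, sd = 0.35)
)

# Jittered points first: each region's column of points widens into a strip.
stripchart(log_income ~ region, data = d, method = "jitter", jitter = 0.3,
           vertical = TRUE, pch = 16, cex = 0.3, col = "gray60",
           xlab = "Region (first ZIP digit)",
           ylab = "log10(household income)")

# Boxes on top, unfilled, with outlier points turned off as on the slide.
boxplot(log_income ~ region, data = d, outline = FALSE,
        add = TRUE, col = NA, axes = FALSE)

# Medians per region, for comparison with the lines inside the boxes.
region_median <- tapply(d$log_income, d$region, median)
print(round(region_median, 2))
```

Printing the medians alongside the plot makes it easier to judge whether an apparent difference between two boxes is large relative to their overlap.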
• Is there a relationship between the two variables?
  Linear? Quadratic? Exponential?
  Try semi-log or log-log plots
  Is it a cloud? Round? Concentrated? Multiple clusters?
• How?
  Scatterplots
• Example
  Red line: linear fit
  Blue line: LOESS
  Fairly linear relationship, but with wide variance
Two Variables: What are we looking for?
We are looking for a relationship between the two variables. If the functional relationship between the variables is somewhat pronounced, the data lies roughly along a curve: a straight line, a parabola, or an exponential curve. If y is related exponentially to x, then the plot of (x, log(y)) will be approximately linear. If the data is more like a cloud, the relationship is weaker.
In the example here, the relationship seems approximately linear; we've plotted the regression line in red. There are times when a standard regression line just doesn’t capture the relationship. In this case, the loess() function in R (also lowess()) will fit a smooth, non-linear curve to the data. Here we've drawn the loess curve in blue.
R-Code
Assume a dataset named ds with variables cesd and mcs. The R code to generate the above plot is as follows.
with(ds,
{
  plot(mcs ~ cesd)
  abline(lm(mcs ~ cesd), col="red")       # linear fit
  lines(lowess(mcs ~ cesd), col="blue")   # smoothed curve
} )
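The earlier remark that an exponential relationship becomes linear on a semi-log plot can be checked with simulated data. In this sketch the intercept (1) and slope (0.4) are values we chose, so we know what the fit on the log scale should recover.

```r
# Assumption: simulated data with a known exponential relationship.
set.seed(5)
x <- runif(200, 0, 10)
y <- exp(1 + 0.4 * x + rnorm(200, sd = 0.3))    # y grows exponentially in x

old_par <- par(mfrow = c(1, 2))
plot(x, y, main = "Raw scale")                  # curved
plot(x, log(y), main = "Semi-log scale")        # roughly a straight line
fit <- lm(log(y) ~ x)
abline(fit, col = "red")
par(old_par)

print(coef(fit))   # intercept and slope on the log scale
```

The same idea applies to log-log plots: a power-law relationship between x and y appears as a straight line when both axes are logged.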
Two Variables: High Volume Data - Plotting
Scatterplot: Overplotting makes it difficult to see structure
Hexbinplot: Now we see where the data is concentrated.
When we have too much data, the structure becomes difficult to see in a scatterplot. Here, we are plotting logged household income against years of education. The "blob" that we get on the scatterplot on the left suggests a somewhat linear relationship (this suggests, by the way, that an extra year of education multiplies your expected income by 10^M, where M is the slope of the regression line). However, we can't really see the structure of how the data is distributed.
On the right we have plotted the same data using a hexbinplot. Hexbinplots are a bit like 2-d histograms, where shading tells us how populated the bin is. Now we can see that the data is more densely clustered in a streak that runs through the center of the data cloud, roughly along the regression line. The biggest concentration is around 12 years of education, extending to about 15 years.
Notice also the outlier data at MeanEducation = 0. Missing data perhaps?
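The "2-d histogram" idea can be sketched without an add-on package by counting points in rectangular bins and shading the counts with image(). This is a stand-in for a hexbin plot, not the hexbin plot from the slide, and the education and income values are simulated under assumptions of our own.

```r
# Assumption: simulated data. Rectangular bins drawn with image() stand in
# for hexagonal bins, which need an add-on package.
set.seed(9)
n <- 20000
education  <- pmin(pmax(rnorm(n, mean = 13, sd = 2), 0), 20)
log_income <- 3.8 + 0.06 * education + rnorm(n, sd = 0.3)

# Bin edges chosen to cover the whole simulated range.
x_breaks <- seq(0, 20, length.out = 41)
y_breaks <- seq(2, 7,  length.out = 41)
counts <- table(cut(education,  x_breaks, include.lowest = TRUE),
                cut(log_income, y_breaks, include.lowest = TRUE))
z <- matrix(as.numeric(counts), nrow = nrow(counts))

old_par <- par(mfrow = c(1, 2))
plot(education, log_income, pch = ".", main = "Scatterplot")
image(x_breaks, y_breaks, z,
      col = gray(seq(1, 0.1, length.out = 32)),   # darker = more points
      xlab = "education", ylab = "log_income", main = "2-d histogram")
par(old_par)
```

With this many points the scatterplot turns into a solid blob, while the shaded bins still show where the data is concentrated.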
<Continued>
• Why?
  Examine many two-way relationships quickly
• How?
  pairs(ds) can generate a plot of each pair of variables
• Example: Iris characteristics
  Strong linear relationship between petal length and width
  Petal dimensions discriminate species more strongly than sepal dimensions
Establishing Multiple Pairwise Relationships Between Variables
There are times when it’s useful to see multiple values of a dataset in context in order to visually represent data relationships so as to magnify differences or to show patterns hidden within the data that summary statistics don’t reveal. In the graphic represented above, the variables sepal length, sepal width, petal length, and petal width are compared across three species of irises (the key is not listed in the graphic). Colors are used to represent the different species, allowing us to compare differences across species for a particular combination of variables.
Consider the values encoded in the second square from the top right, where sepal length is compared with petal length. Values for petal length are encoded across the bottom; values for sepal length are encoded on the right hand side of the graphic. We can observe that the green and blue species are well matched, although the blue species has longer petals in the main. The petal length for the red species, however, remain markedly the same, and vary only in the lower half of sepal length values. As an exercise, imagine fitting a regression line to each of these individual graphs. What would you make of the relationship between sepal length and sepal width?
The R code for generating the plot is:
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species",
pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)]
)
and uses the iris dataset included with the R standard distribution. Here the colors encode the species: the species factor is converted to integer codes and used to index the color vector, proving that the spirit of APL is alive and well.
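The pairs plot can be backed up with the corresponding correlations. This sketch uses the built-in iris data frame; the name by_species is ours. It computes the overall correlation matrix and then the sepal length versus sepal width correlation within each species, which is worth comparing against what the colored panels suggest.

```r
data(iris)

# Correlation matrix of the four measurements, all species pooled.
overall <- cor(iris[1:4])
print(round(overall, 2))

# Sepal length vs. sepal width, computed separately for each species.
by_species <- sapply(split(iris, iris$Species), function(d) {
  cor(d$Sepal.Length, d$Sepal.Width)
})
print(round(by_species, 2))
```

Comparing the pooled value with the per-species values is one way to approach the exercise about sepal length and sepal width.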
What?
• Looking for …
  Data range
  Trends
  Seasonality
How?
• Use time series plot
Example
• International air travel (1949-1960)
• Upward trend: growth appears superlinear
• Seasonality: peak air travel around Nov. with smaller peaks near Mar. and June
Analyzing a Single Variable over Time
Visualizing a variable over time is the same as visualizing any pair of variables, but in this case we are looking for some specific patterns.
Data range, of course, tells us how much our y variable has increased or decreased over the period of time we are considering. We want to get a feeling for the growth rate, and whether or not we see any changes in that growth rate. We are also looking for seasonality: a regular pattern in the fluctuations over a fixed period of time. We can think of those patterns as marking "seasons".
In the air travel data example that we show, we can see that air travel peaks regularly around Nov/Dec (the holiday season), with a smaller peak around the middle of the year (summer travel) and an even smaller one near the beginning of the year (spring break?).
We can also see that the number of air passengers increased steadily from 1949 to 1960, and that the growth appears to be faster than linear, at least during peak travel season.
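Trend and seasonality can be summarized numerically as well as read off the plot. The sketch below assumes that R's built-in AirPassengers series (monthly totals for 1949 to 1960) corresponds to the data on the slide; that correspondence is our assumption. Averaging by calendar month lets you check which months peak rather than judging by eye.

```r
# Assumption: the built-in AirPassengers series matches the slide's data.
data(AirPassengers)

plot(AirPassengers, ylab = "Passengers (thousands)",
     main = "International air travel, 1949-1960")

# Seasonality: average over all years for each calendar month.
monthly_mean <- tapply(AirPassengers, cycle(AirPassengers), mean)
names(monthly_mean) <- month.abb
print(round(monthly_mean, 1))

# Trend: yearly totals, and the change from one year to the next.
yearly_total <- aggregate(AirPassengers, nfrequency = 1, FUN = sum)
print(yearly_total)
print(diff(yearly_total))
```

If the year-over-year differences themselves tend to grow, that is consistent with growth that is faster than linear.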
Data Exploration vs. Presentation
Data Exploration:
This tells you what you need to know.
Presentation:
This tells the stakeholders what they need to know.
Finally, we want to touch on the difference between using visualization for data exploration, and for presenting results to stakeholders. The plots and tips that we've discussed try to make the details of the data as clear as possible for the data scientist to see structure and relationships. These technical graphs don't always effectively convey the information that needs to be conveyed to non-technical stakeholders. For them, we want crisp graphics that focus on the message we want to convey.
We will touch more on this topic in Module 6, but for right now we'll share a small example. The top graph shows the density plot of logged account values for our bank. This graph gives us, as data scientists, information that can be relevant to downstream analysis. The account values are distributed approximately lognormally, in the range from 100 to 10M dollars. The median account value is in the area of $30,000 (10^4.5), with the bulk of the accounts between $1000 US and $1M US dollars.
It would be hard to explain this graph to stakeholders. For one thing, densityplots are fairly technical, and for another, it is awkward to explain why you are logging the data before showing it. You can convey essentially the same information by partitioning the data into "log-like" bins, and presenting the histogram of those bins, as we do in the bottom plot. Here, we can see that the bulk of the accounts are in the 1000-1M range, with the peak concentration in the 10-50K range, extending out to about 500K. This gives the stakeholders a better sense of the customer base than the top graphic would.
[Note – the reason that the lower graph isn't symmetric like the upper graph is because the bins are only "log-like". They aren't truly log10 scaled. Log10 scaled bins would be closer to: 1-3K, 3K-10K, 10K- 30K..... As an exercise, we could try splitting the bins that way, and we would see that the resulting bar chart would be symmetric. The bins we chose, however, might seem more "natural" to the stakeholders.]
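The exercise suggested in the note can be sketched as follows. The account values are simulated, and both sets of bin edges are our own illustrative choices; the bins actually used on the slide are not given in the notes.

```r
# Assumption: simulated lognormal account values; all bin edges below are
# illustrative and are not the bins used on the slide.
set.seed(21)
account_value <- 10^rnorm(10000, mean = 4.5, sd = 0.8)

# "Log-like" bins with round-number edges that read naturally to stakeholders.
log_like_breaks <- c(0, 1e3, 1e4, 5e4, 1e5, 5e5, 1e6, Inf)
log_like_labels <- c("<1K", "1-10K", "10-50K", "50-100K",
                     "100-500K", "500K-1M", ">1M")
presented <- table(cut(account_value, log_like_breaks,
                       labels = log_like_labels))

# Truly log10-scaled bins: every edge is half a decade from the next.
half_decade <- table(cut(account_value, c(0, 10^seq(3, 6, by = 0.5), Inf)))

old_par <- par(mfrow = c(2, 1), mar = c(7, 4, 2, 1))
barplot(presented,   las = 2, main = "Log-like bins")
barplot(half_decade, las = 2, main = "Half-decade log10 bins")
par(old_par)
```

Comparing the two bar charts shows the trade-off described in the note: evenly log-spaced bins track the shape of the density plot more closely, while round-number bins are easier to explain.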
Check Your Knowledge
• Do you think the regression line sufficiently captures the relationship between the two variables? What might you do differently?
• In the Iris slide example, how would you characterize the relationship between sepal width and sepal length?
• Did you notice the use of color in the Iris slide? Was it effective? Why or why not?
Please take a moment to answer these questions.
Review of Basic Data Analytic Methods Using R: Analysis
During this lesson the following topics were covered:
• Justifying why we visualize data
• Using plots and graphs to determine:
  • Shape of a single variable
  • “Dirty” data or “saturated” data
  • Relationship between two variables
  • Relationship between multiple variables
  • A single variable over time
• Data exploration versus presentation
Summary
This slide captures the key topics from this lesson.