Data Mining - Exploratory Data Analysis
© Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004
Dr. Oner Celepcikay
ITS 632
Data Mining
Summer 2019
Week 3: Data Exploration
Chapter 3 Exploring Data
1st Step of Machine Learning
What is data exploration?
A preliminary exploration of the data to better understand its characteristics.
● Key motivations of data exploration include
– Helping to select the right tool for preprocessing or analysis
– Making use of humans' abilities to recognize patterns
  • People can recognize patterns not captured by data analysis tools
● Related to the area of Exploratory Data Analysis (EDA)
– Created by statistician John Tukey
– Seminal book is Exploratory Data Analysis by Tukey
– A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook
http://www.itl.nist.gov/div898/handbook/index.htm
Techniques Used In Data Exploration
● In EDA, as originally defined by Tukey
– The focus was on visualization
– Clustering and anomaly detection were viewed as exploratory techniques
– In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory
● In our discussion of data exploration, we focus on
– Summary statistics
– Visualization
– Online Analytical Processing (OLAP)
Summary Statistics
● Summary statistics are numbers that summarize properties of the data
– Summarized properties include frequency, location, and spread
  • Examples: location - mean; spread - standard deviation
– Most summary statistics can be calculated in a single pass through the data
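To make the single-pass claim concrete, here is a minimal sketch that accumulates the count, minimum, maximum, mean, and sample standard deviation while reading each value once. It uses Welford's running-update method, which is one common choice and not something the slide prescribes; the function name and the input values are made up for illustration.

```python
import math

def single_pass_summary(values):
    """Compute count, min, max, mean and sample std in one pass over the data."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    lo = math.inf
    hi = -math.inf
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n            # update the running mean
        m2 += delta * (x - mean)     # Welford update for the squared deviations
        lo = min(lo, x)
        hi = max(hi, x)
    std = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
    return {"count": n, "min": lo, "max": hi, "mean": mean, "std": std}

# Hypothetical values, made up for illustration.
summary = single_pass_summary([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(summary)
```

Statistics based on order, such as the median, generally do not fit this pattern because they need the values sorted or stored.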
Frequency and Mode
● The frequency of an attribute value is the percentage of times the value occurs in the data set
– For example, given the attribute 'gender' and a representative population of people, the gender 'female' occurs about 50% of the time.
● The mode of an attribute is the most frequent attribute value
● The notions of frequency and mode are typically used with categorical data
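A short sketch of both definitions on a categorical attribute, using only the Python standard library. The helper names and the list of colors are hypothetical and chosen for illustration.

```python
from collections import Counter

def frequencies(values):
    """Return {value: fraction of records that have this value}."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def mode(values):
    """Return the most frequent value (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]

# Hypothetical categorical attribute, made up for illustration.
colors = ["red", "blue", "red", "green", "red", "blue"]
freq = frequencies(colors)
print(freq["red"])   # 0.5
print(mode(colors))  # red
```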
Measures of Location: Mean and Median
● The mean is the most common measure of the location of a set of points.
● However, the mean is very sensitive to outliers.
● Thus, the median or a trimmed mean is also commonly used.
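The effect of an outlier on the three measures can be seen in a small sketch. The data values are invented, and `trimmed_mean` is a hypothetical helper that drops a fixed proportion of values from each end before averaging (it assumes the proportion is below 0.5).

```python
import statistics

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

# Hypothetical values with one extreme outlier (1000).
data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 1000]
mean_value = statistics.mean(data)       # 110.2, pulled far up by the outlier
median_value = statistics.median(data)   # 11.5
trimmed_value = trimmed_mean(data, 0.10) # 11.5, after dropping 10 and 1000
print(mean_value, median_value, trimmed_value)
```

A single extreme value moves the mean by roughly a factor of ten here, while the median and the 10% trimmed mean stay close to the bulk of the data.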
Measures of Spread: Range and Variance
● Range is the difference between the max and min
● The variance or standard deviation is the most common measure of the spread of a set of points.
● However, the variance and standard deviation are also sensitive to outliers, so other measures are often used.
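The sketch below compares the range and standard deviation with two outlier-resistant alternatives. The slide does not say which other measures it has in mind; the interquartile range and the median absolute deviation are assumed here as common examples. The data values are invented, and `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics

def value_range(values):
    """Difference between the maximum and the minimum."""
    return max(values) - min(values)

def interquartile_range(values):
    """Difference between the 75th and 25th percentiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    center = statistics.median(values)
    return statistics.median(abs(x - center) for x in values)

# Hypothetical values; the second list swaps the last value for an outlier.
clean = [10, 12, 11, 13, 12, 11, 10, 12, 11, 12]
with_outlier = clean[:-1] + [1000]

for name, data in (("clean", clean), ("with outlier", with_outlier)):
    print(name,
          value_range(data),
          round(statistics.stdev(data), 2),
          interquartile_range(data),
          median_absolute_deviation(data))
```

On these values the range and standard deviation grow by orders of magnitude once the outlier is present, while the two robust measures barely change.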
Visualization
Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
● Visualization of data is one of the most powerful and appealing techniques for data exploration.
– Humans have a well-developed ability to analyze large amounts of information that is presented visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns
Example: Sea Surface Temperature
● The following shows the Sea Surface Temperature (SST) for July 1982
– Tens of thousands of data points are summarized in a single figure
Representation
● Is the mapping of information to a visual format
● Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
● Example:
– Objects are often represented as points
– Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape
– If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, are easily perceived.
One Great Example
● The Power of Visualization by Hans Rosling
https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen?language=en
Arrangement
● Is the placement of visual elements within a display
● Can make a large difference in how easy it is to understand the data
● Example:
Selection
● Is the elimination or the de-emphasis of certain objects and attributes
● Selection may involve choosing a subset of attributes
– Dimensionality reduction is often used to reduce the number of dimensions to two or three
– Alternatively, pairs of attributes can be considered
● Selection may also involve choosing a subset of objects
– A region of the screen can only show so many points
– Can sample, but want to preserve points in sparse areas
Visualization Techniques: Histograms
● Histogram
– Usually shows the distribution of values of a single variable
– Divide the values into bins and show a bar plot of the number of objects in each bin.
– The height of each bar indicates the number of objects
– Shape of histogram depends on the number of bins
● Example: Petal Width (10 and 20 bins, respectively)
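The binning step can be sketched in a few lines. This is a minimal illustration that assumes equal-width bins spanning the minimum to the maximum value; the measurements are invented and are not the Iris petal widths. Printing the counts for 10 and 20 bins shows how the apparent shape changes with the bin count.

```python
def histogram_counts(values, num_bins):
    """Count values in `num_bins` equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: put everything in the first bin.
        return [len(values)] + [0] * (num_bins - 1)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in values:
        # The maximum lands on the right edge of the last bin; clamp it in.
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical measurements, made up for illustration (not the Iris data).
widths = [0.1, 0.2, 0.2, 0.2, 0.3, 0.4, 1.0, 1.2, 1.3,
          1.3, 1.4, 1.5, 1.5, 1.8, 2.0, 2.1, 2.3, 2.5]

for num_bins in (10, 20):
    counts = histogram_counts(widths, num_bins)
    print(f"{num_bins} bins:")
    for count in counts:
        print("#" * count or ".")  # one text bar per bin
```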
Two-Dimensional Histograms
● Show the joint distribution of the values of two attributes
● Example: petal width and petal length
– What does this tell us?
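A joint distribution of two attributes can be tabulated by binning each attribute separately and counting how many objects fall into each pair of bins. The sketch below assumes equal-width bins and uses invented paired values, not the Iris measurements.

```python
from collections import Counter

def bin_index(x, lo, hi, num_bins):
    """Index of the equal-width bin containing x; the maximum is clamped in."""
    if hi == lo:
        return 0
    return min(int((x - lo) / ((hi - lo) / num_bins)), num_bins - 1)

def histogram_2d(xs, ys, bins_x, bins_y):
    """Count objects per (x bin, y bin) cell."""
    lo_x, hi_x = min(xs), max(xs)
    lo_y, hi_y = min(ys), max(ys)
    cells = Counter()
    for x, y in zip(xs, ys):
        cell = (bin_index(x, lo_x, hi_x, bins_x), bin_index(y, lo_y, hi_y, bins_y))
        cells[cell] += 1
    return cells

# Hypothetical paired measurements, made up for illustration.
xs = [0.0, 0.1, 0.2, 0.9, 1.0, 1.0]
ys = [0.0, 0.2, 0.1, 2.0, 1.8, 2.0]
cells = histogram_2d(xs, ys, 2, 2)
print(cells[(0, 0)], cells[(0, 1)], cells[(1, 0)], cells[(1, 1)])  # 3 0 0 3
```

When the counts concentrate in the cells along the diagonal, as in this made-up data, the two attributes tend to be small together and large together.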
Visualization Techniques: Box Plots
● Box Plots
– Invented by J. Tukey
– Another way of displaying the distribution of data
– Following figure shows the basic parts of a box plot
[Figure: box plot with labeled parts, top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile]
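The numbers behind such a box plot can be computed directly. This sketch assumes the whiskers sit at the 10th and 90th percentiles and the box at the 25th, 50th, and 75th; other box plot conventions place the whiskers differently. The percentile rule (linear interpolation between closest ranks), the helper names, and the data are all assumptions for illustration.

```python
def percentile(sorted_values, p):
    """p-th percentile (0 to 100) by linear interpolation between closest ranks."""
    if not sorted_values:
        raise ValueError("percentile of empty data")
    rank = (len(sorted_values) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def box_plot_summary(values):
    """Percentiles that define the box and whiskers, plus points beyond the whiskers."""
    ordered = sorted(values)
    p10, p25, p50, p75, p90 = (percentile(ordered, p) for p in (10, 25, 50, 75, 90))
    beyond = [x for x in ordered if x < p10 or x > p90]
    return {"p10": p10, "p25": p25, "p50": p50, "p75": p75, "p90": p90,
            "beyond_whiskers": beyond}

# Hypothetical values with one large value, made up for illustration.
box = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30])
print(box)
```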
Example of Box Plots
● Box plots can be used to compare attributes
Visualization Techniques: Scatter Plots
● Scatter plots
– Attribute values determine the position
– Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible
– Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
– Arrays of scatter plots are useful because they can compactly summarize the relationships of several pairs of attributes
  • See example on the next slide
Iris Sample Data Set
● Many of the exploratory data techniques are illustrated with the Iris Plant data set.
– Can be obtained from the UCI Machine Learning Repository http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician R. A. Fisher
– Three flower types (classes):
  • Setosa
  • Virginica
  • Versicolour
– Four (non-class) attributes
  • Sepal width and length
  • Petal width and length
[Image: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]
Scatter Plot Array of Iris Attributes