Discussion - research and answer
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#›
Dr. Oner Celepcikay
ITS 632
Data Mining
Summer 2019
Week 2: Data & Data Exploration
Chapter 3 Exploring Data
1st Step of Machine Learning
What is data exploration?
A preliminary exploration of the data to better understand its characteristics.
• Key motivations of data exploration include
  – Helping to select the right tool for preprocessing or analysis
  – Making use of humans’ abilities to recognize patterns
    ▪ People can recognize patterns not captured by data analysis tools
• Related to the area of Exploratory Data Analysis (EDA)
  – Created by statistician John Tukey
  – Seminal book is Exploratory Data Analysis by Tukey
  – A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook
    http://www.itl.nist.gov/div898/handbook/index.htm
Techniques Used In Data Exploration
• In EDA, as originally defined by Tukey
  – The focus was on visualization
  – Clustering and anomaly detection were viewed as exploratory techniques
  – In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory
• In our discussion of data exploration, we focus on
  – Summary statistics
  – Visualization
  – Online Analytical Processing (OLAP)
Summary Statistics
• Summary statistics are numbers that summarize properties of the data
  – Summarized properties include frequency, location, and spread
    ▪ Examples: location - mean; spread - standard deviation
  – Most summary statistics can be calculated in a single pass through the data
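To make the single-pass point concrete, here is a minimal Python sketch that computes the count, mean, and sample standard deviation while reading each value exactly once. It uses Welford's running update, which is one common way to do this and is not taken from the slides; the function name `running_stats` and the sample values are invented for illustration.

```python
import math

def running_stats(values):
    """Return (count, mean, sample standard deviation) in one pass."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    std = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
    return n, mean, std

# Made-up sample values, used only to exercise the function.
n, mean, std = running_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(n, mean, round(std, 3))
```

The median is one of the exceptions to "most": computing it exactly generally requires the values to be ordered, not just streamed once.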
Frequency and Mode
• The frequency of an attribute value is the percentage of times the value occurs in the data set
  – For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.
• The mode of an attribute is the most frequent attribute value
• The notions of frequency and mode are typically used with categorical data
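As a small illustration of these two definitions, the Python sketch below computes relative frequencies and the mode of a categorical attribute. The helper names `frequencies` and `mode` and the `colors` list are assumptions made up for this example.

```python
from collections import Counter

def frequencies(values):
    """Map each distinct value to the fraction of records that have it."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def mode(values):
    """Return the most frequent value (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]

# Hypothetical categorical attribute, invented for illustration.
colors = ["red", "blue", "red", "green", "red", "blue"]
freq = frequencies(colors)
print(freq)
print(mode(colors))
```

Multiplying each fraction by 100 gives the percentage form used in the definition above.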
Measures of Location: Mean and Median
• The mean is the most common measure of the location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used.
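The effect of an outlier on these three measures can be seen in a short Python sketch. The `trimmed_mean` helper is an assumption written for this example (it drops the same fraction of points from each end before averaging), and the data values are invented.

```python
import statistics

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the points from each end.

    Assumes 0 <= fraction < 0.5 so that some points remain.
    """
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of points cut from each tail
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

# Invented values; 90 plays the role of an outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 90]
print(statistics.mean(data))    # pulled upward by the outlier
print(statistics.median(data))  # barely affected by the outlier
print(trimmed_mean(data, 0.1))  # drops 1 and 90 before averaging
```

With this data the mean lands well above nine of the ten points, while the median and the trimmed mean stay in the middle of the bulk of the values.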
Measures of Spread: Range and Variance
• Range is the difference between the max and min
• The variance or standard deviation is the most common measure of the spread of a set of points.
• However, the variance is also sensitive to outliers, so other measures are often used.
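The slide does not say which other measures it has in mind. As an assumption, the Python sketch below compares the range and standard deviation with three commonly used alternatives: the average absolute deviation, the median absolute deviation, and the interquartile range. All function names and data values are made up for illustration.

```python
import statistics

def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

def average_absolute_deviation(values):
    """Mean distance of the points from their mean."""
    center = statistics.mean(values)
    return statistics.mean([abs(x - center) for x in values])

def median_absolute_deviation(values):
    """Median distance of the points from their median."""
    center = statistics.median(values)
    return statistics.median([abs(x - center) for x in values])

def interquartile_range(values):
    """Distance between the 75th and 25th percentiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Invented values; 90 plays the role of an outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 90]
print("range:", value_range(data))
print("std:  ", statistics.stdev(data))
print("AAD:  ", average_absolute_deviation(data))
print("MAD:  ", median_absolute_deviation(data))
print("IQR:  ", interquartile_range(data))
```

The range and standard deviation are dominated by the single outlier, while the median absolute deviation and interquartile range mostly reflect the nine ordinary points.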
Visualization
Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
• Visualization of data is one of the most powerful and appealing techniques for data exploration.
  – Humans have a well-developed ability to analyze large amounts of information that is presented visually
  – Can detect general patterns and trends
  – Can detect outliers and unusual patterns
Example: Sea Surface Temperature
• The following shows the Sea Surface Temperature (SST) for July 1982
  – Tens of thousands of data points are summarized in a single figure
Representation
• Is the mapping of information to a visual format
• Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
• Example:
  – Objects are often represented as points
  – Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape
  – If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, are easily perceived.
One Great Example
• The Power of Visualization by Hans Rosling
  https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen?language=en
Arrangement
• Is the placement of visual elements within a display
• Can make a large difference in how easy it is to understand the data
• Example:
Selection
• Is the elimination or the de-emphasis of certain objects and attributes
• Selection may involve choosing a subset of attributes
  – Dimensionality reduction is often used to reduce the number of dimensions to two or three
  – Alternatively, pairs of attributes can be considered
• Selection may also involve choosing a subset of objects
  – A region of the screen can only show so many points
  – Can sample, but want to preserve points in sparse areas
Visualization Techniques: Histograms
• Histogram
  – Usually shows the distribution of values of a single variable
  – Divide the values into bins and show a bar plot of the number of objects in each bin.
  – The height of each bar indicates the number of objects
  – Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
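The binning step can be sketched in a few lines of Python. This is an illustrative equal-width binning, not the code behind the slide's figure; the `histogram` function and the `petal_widths` values are invented (they are not the real Iris measurements), and the sketch assumes the values are not all identical.

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for x in values:
        # The maximum value belongs to the last bin, not a bin past the end.
        index = min(int((x - low) / width), bins - 1)
        counts[index] += 1
    return counts

# Invented widths in cm, used only to show how bin count changes the shape.
petal_widths = [0.2, 0.2, 0.4, 0.3, 1.3, 1.5, 1.4, 1.0, 2.1, 2.5, 1.8, 2.3]
for bins in (5, 10):
    print(f"{bins} bins:", histogram(petal_widths, bins))
```

Running the same data through two bin counts shows the point made above: the counts always sum to the number of objects, but the apparent shape changes with the number of bins.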
Two-Dimensional Histograms
• Show the joint distribution of the values of two attributes
• Example: petal width and petal length
  – What does this tell us?
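One way to approach the question is to count how many objects fall into each cell of a grid over the two attributes. The Python sketch below does this with equal-width bins per attribute; `histogram_2d` and the toy values are assumptions for illustration, not the real Iris data, and each attribute is assumed to take more than one distinct value.

```python
from collections import Counter

def histogram_2d(xs, ys, bins):
    """Count (x, y) pairs per cell of a bins-by-bins grid."""
    def bin_index(value, low, high):
        width = (high - low) / bins
        # Clamp so that the maximum value falls in the last bin.
        return min(int((value - low) / width), bins - 1)

    x_low, x_high = min(xs), max(xs)
    y_low, y_high = min(ys), max(ys)
    cells = Counter()
    for x, y in zip(xs, ys):
        cell = (bin_index(x, x_low, x_high), bin_index(y, y_low, y_high))
        cells[cell] += 1
    return cells

# Toy values in which the second attribute grows with the first.
widths = [0.2, 0.3, 0.4, 1.3, 1.5, 2.1, 2.4]
lengths = [1.4, 1.5, 1.3, 4.0, 4.5, 5.8, 6.0]
for cell, count in sorted(histogram_2d(widths, lengths, 3).items()):
    print(cell, count)
```

When most of the counts sit in cells along the diagonal, as in this toy data, the two attributes tend to increase together.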
Visualization Techniques: Box Plots
• Box Plots
  – Invented by J. Tukey
  – Another way of displaying the distribution of data
  – Following figure shows the basic part of a box plot
[Figure: box plot. Labels, top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile]
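The numbers behind such a plot can be computed directly. The Python sketch below assumes the convention in the figure, with the box spanning the 25th to 75th percentiles and the whiskers ending at the 10th and 90th percentiles, and it uses linear interpolation between sorted points as one of several possible percentile definitions. The function names and data are invented for illustration.

```python
def percentile(values, p):
    """Percentile p (0-100) by linear interpolation between sorted points."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

def box_plot_summary(values):
    """Return the five plotted percentiles and the points beyond the whiskers."""
    summary = {p: percentile(values, p) for p in (10, 25, 50, 75, 90)}
    beyond = [x for x in values if x < summary[10] or x > summary[90]]
    return summary, beyond

# Made-up data: the integers 1 through 11.
summary, beyond = box_plot_summary(list(range(1, 12)))
print(summary)
print(beyond)
```

Other box plot conventions place the whiskers differently, for example at a multiple of the interquartile range beyond the box, so the set of points drawn individually depends on the convention chosen.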
Example of Box Plots
• Box plots can be used to compare attributes
Visualization Techniques: Scatter Plots
• Scatter plots
  – Attribute values determine the position
  – Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible
  – Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
  – Arrays of scatter plots are useful because they can compactly summarize the relationships of several pairs of attributes
    ▪ See the scatter plot array of Iris attributes
Iris Sample Data Set
• Many of the exploratory data techniques are illustrated with the Iris Plant data set.
  – Can be obtained from the UCI Machine Learning Repository
    http://www.ics.uci.edu/~mlearn/MLRepository.html
  – From the statistician R. A. Fisher
  – Three flower types (classes):
    ▪ Setosa
    ▪ Virginica
    ▪ Versicolour
  – Four (non-class) attributes
    ▪ Sepal width and length
    ▪ Petal width and length
[Image: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]
Scatter Plot Array of Iris Attributes