Data Mining - Exploratory Data Analysis


Dr. Oner Celepcikay

ITS 632

Data Mining

Summer 2019

Week 3: Data Exploration

Chapter 3 Exploring Data

1st Step of Machine Learning

What is data exploration?

● Key motivations of data exploration include

– Helping to select the right tool for preprocessing or analysis

– Making use of humans' abilities to recognize patterns

• People can recognize patterns not captured by data analysis tools

● Related to the area of Exploratory Data Analysis (EDA)

– Created by statistician John Tukey

– Seminal book is Exploratory Data Analysis by Tukey

– A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook

http://www.itl.nist.gov/div898/handbook/index.htm

A preliminary exploration of the data to better understand its characteristics.


Techniques Used In Data Exploration

● In EDA, as originally defined by Tukey

– The focus was on visualization

– Clustering and anomaly detection were viewed as exploratory techniques

– In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory

● In our discussion of data exploration, we focus on

– Summary statistics

– Visualization

– Online Analytical Processing (OLAP)

Summary Statistics

● Summary statistics are numbers that summarize properties of the data

– Summarized properties include frequency, location, and spread

• Examples: location - mean, spread - standard deviation

– Most summary statistics can be calculated in a single pass through the data
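The single-pass claim can be illustrated with a minimal sketch. It assumes Welford's running-update method for the variance, which the slides do not name; the function name `single_pass_summary` and the sample values are hypothetical.

```python
import math


def single_pass_summary(values):
    """Return count, min, max, mean, and sample standard deviation in one pass."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    lo = math.inf
    hi = -math.inf
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n          # update the running mean
        m2 += delta * (x - mean)   # Welford's update (assumed technique)
        lo = min(lo, x)
        hi = max(hi, x)
    if n == 0:
        raise ValueError("no data")
    variance = m2 / (n - 1) if n > 1 else 0.0
    return {"count": n, "min": lo, "max": hi,
            "mean": mean, "std": math.sqrt(variance)}


# Illustrative values only, not taken from the slides.
summary = single_pass_summary([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(summary)
```

Each value is read once, so the same loop would also work on a stream that is too large to hold in memory.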

Frequency and Mode

● The frequency of an attribute value is the percentage of time the value occurs in the data set

– For example, given the attribute 'gender' and a representative population of people, the gender 'female' occurs about 50% of the time.

● The mode of an attribute is the most frequent attribute value

● The notions of frequency and mode are typically used with categorical data
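As a small sketch of these two definitions, the following computes relative frequencies and the mode of a categorical attribute. The helper names and the color values are hypothetical and chosen only for illustration.

```python
from collections import Counter


def frequencies(values):
    """Map each attribute value to the fraction of records in which it occurs."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def mode(values):
    """Return the most frequent value (one of them if several tie)."""
    return Counter(values).most_common(1)[0][0]


# Illustrative categorical attribute, not taken from the slides.
colors = ["red", "blue", "red", "green", "red", "blue"]
freq = frequencies(colors)
most_frequent = mode(colors)
print(freq, most_frequent)
```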

Measures of Location: Mean and Median

● The mean is the most common measure of the location of a set of points.

● However, the mean is very sensitive to outliers.

● Thus, the median or a trimmed mean is also commonly used.
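The effect of an outlier on these measures can be seen in a short sketch. The `trimmed_mean` helper is hypothetical and assumes one common definition (drop a fixed proportion of the sorted values from each end, then average); the data are illustrative.

```python
import statistics


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the sorted values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


# Nine ordinary values and one extreme outlier (illustrative only).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

mean_value = statistics.mean(data)      # pulled far upward by the outlier
median_value = statistics.median(data)  # unaffected by the size of the outlier
trimmed = trimmed_mean(data, 0.1)       # drops the smallest and largest value
print(mean_value, median_value, trimmed)
```

With this data the mean is 104.5, while the median and the 10% trimmed mean are both 5.5.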

Measures of Spread: Range and Variance

● Range is the difference between the max and min

● The variance or standard deviation is the most common measure of the spread of a set of points.

● However, these are also sensitive to outliers, so other measures are often used.
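The sketch below compares these measures on data with one outlier. The text here does not say which other measures are meant; the median absolute deviation and the interquartile range are used as two common robust choices, and that selection is an assumption. The helper names and data are hypothetical.

```python
import statistics


def value_range(values):
    """Difference between the maximum and the minimum."""
    return max(values) - min(values)


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])


def interquartile_range(values):
    """75th percentile minus 25th percentile (statistics.quantiles default method)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Nine ordinary values and one extreme outlier (illustrative only).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

spread_range = value_range(data)
spread_std = statistics.stdev(data)          # sample standard deviation
spread_mad = median_absolute_deviation(data)
spread_iqr = interquartile_range(data)
print(spread_range, spread_std, spread_mad, spread_iqr)
```

The range and standard deviation are dominated by the single value 1000, while the two robust measures stay close to the spread of the other nine points.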

Visualization

Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

● Visualization of data is one of the most powerful and appealing techniques for data exploration.

– Humans have a well-developed ability to analyze large amounts of information that is presented visually

– Can detect general patterns and trends

– Can detect outliers and unusual patterns

Example: Sea Surface Temperature

● The following shows the Sea Surface Temperature (SST) for July 1982

– Tens of thousands of data points are summarized in a single figure

Representation

● Is the mapping of information to a visual format

● Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.

● Example:

– Objects are often represented as points

– Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape

– If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, are easily perceived.

One Great Example

● The Power of Visualization by Hans Rosling

https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen?language=en

Arrangement

● Is the placement of visual elements within a display

● Can make a large difference in how easy it is to understand the data

● Example:

Selection

● Is the elimination or the de-emphasis of certain objects and attributes

● Selection may involve choosing a subset of attributes

– Dimensionality reduction is often used to reduce the number of dimensions to two or three

– Alternatively, pairs of attributes can be considered

● Selection may also involve choosing a subset of objects

– A region of the screen can only show so many points

– Can sample, but want to preserve points in sparse areas
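One possible way to sample objects while preserving points in sparse areas is to cap the number of points kept per grid cell. This is only a sketch of one approach, not a method named in the slides; the function name, parameters, and data are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_cell(points, cell_size, max_per_cell, seed=0):
    """Keep at most `max_per_cell` points from each square grid cell.

    Dense cells are thinned by random sampling; cells with few points
    are kept whole, so isolated points survive.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    kept = []
    for members in cells.values():
        if len(members) > max_per_cell:
            members = rng.sample(members, max_per_cell)
        kept.extend(members)
    return kept


# A dense cluster near the origin plus two isolated points (illustrative only).
dense = [(i * 0.001, i * 0.001) for i in range(100)]
isolated = [(5.5, 5.5), (9.2, 1.1)]
kept = sample_per_cell(dense + isolated, cell_size=1.0, max_per_cell=5)
print(len(kept))
```

Plain uniform sampling would most likely drop the two isolated points; the per-cell cap keeps them while reducing the cluster from 100 points to 5.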

Visualization Techniques: Histograms

● Histogram

– Usually shows the distribution of values of a single variable

– Divide the values into bins and show a bar plot of the number of objects in each bin.

– The height of each bar indicates the number of objects

– Shape of histogram depends on the number of bins

● Example: Petal Width (10 and 20 bins, respectively)
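The binning step can be sketched without a plotting library. The code assumes equal-width bins spanning the minimum to the maximum value; the `histogram` helper is hypothetical and the widths are made-up illustrative values, not the Iris measurements.

```python
def histogram(values, num_bins):
    """Count values in `num_bins` equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in values:
        if width == 0:
            index = 0  # all values identical: everything goes in the first bin
        else:
            # The maximum lands on the right edge; place it in the last bin.
            index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Illustrative values only.
widths = [0.2, 0.2, 0.4, 1.3, 1.5, 1.4, 2.1, 2.5, 1.8, 0.3]

for bins in (10, 20):
    counts = histogram(widths, bins)
    print(f"{bins} bins:", " ".join(str(c) for c in counts))
```

Running it with 10 and then 20 bins shows how the same data produce a differently shaped histogram as the bin count changes.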

Two-Dimensional Histograms

● Show the joint distribution of the values of two attributes

● Example: petal width and petal length

– What does this tell us?
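A two-dimensional histogram extends the same binning idea to pairs of values. The sketch below assumes equal-width bins on each axis; the helper names and the tiny data set are hypothetical.

```python
from collections import Counter


def bin_index(x, lo, hi, num_bins):
    """Index of the equal-width bin on [lo, hi] that contains x."""
    if hi == lo:
        return 0
    width = (hi - lo) / num_bins
    return min(int((x - lo) / width), num_bins - 1)


def histogram_2d(xs, ys, num_bins):
    """Count (x, y) pairs per cell of a num_bins x num_bins grid."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    counts = Counter()
    for x, y in zip(xs, ys):
        cell = (bin_index(x, x_lo, x_hi, num_bins),
                bin_index(y, y_lo, y_hi, num_bins))
        counts[cell] += 1
    return counts


# Two attributes that increase together (illustrative values only).
xs = [0, 1, 2, 3]
ys = [0, 1, 2, 3]
joint = histogram_2d(xs, ys, num_bins=2)
print(dict(joint))
```

When the counts concentrate in the diagonal cells, as here, the two attributes tend to be small together and large together.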

Visualization Techniques: Box Plots

● Box Plots

– Invented by J. Tukey

– Another way of displaying the distribution of data

– Following figure shows the basic part of a box plot

[Figure: box plot; labels from top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile]
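The numbers behind a box plot can be computed with a short sketch. It assumes the whiskers sit at the 10th and 90th percentiles, with points beyond them drawn as outliers, and it uses linear interpolation between sorted values; other percentile conventions give slightly different results. The function names and data are hypothetical.

```python
def percentile(values, p):
    """p-th percentile (0-100) using linear interpolation between sorted values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no data")
    rank = (len(ordered) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def box_plot_summary(values):
    """Percentiles that define the box and whiskers, plus points outside the whiskers."""
    summary = {f"p{p}": percentile(values, p) for p in (10, 25, 50, 75, 90)}
    summary["outliers"] = [x for x in values
                           if x < summary["p10"] or x > summary["p90"]]
    return summary


# Illustrative values only.
data = list(range(1, 12))  # 1, 2, ..., 11
box = box_plot_summary(data)
print(box)
```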

Example of Box Plots

● Box plots can be used to compare attributes

Visualization Techniques: Scatter Plots

● Scatter plots

– Attribute values determine the position

– Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible

– Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects

– Arrays of scatter plots can compactly summarize the relationships of several pairs of attributes

• See example on the next slide

Iris Sample Data Set

● Many of the exploratory data techniques are illustrated with the Iris Plant data set.

– Can be obtained from the UCI Machine Learning Repository http://www.ics.uci.edu/~mlearn/MLRepository.html

– From the statistician R. A. Fisher

– Three flower types (classes):

• Setosa

• Virginica

• Versicolour

– Four (non-class) attributes

• Sepal width and length

• Petal width and length

Image: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.


Scatter Plot Array of Iris Attributes
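Each panel in such an array plots one pair of attributes. As a minimal sketch, the distinct pairs for the four Iris attributes can be enumerated as follows; the attribute strings are written out here for illustration and drawing the plots is left to a plotting library.

```python
from itertools import combinations

# The four non-class Iris attributes named in the slides.
attributes = ["sepal length", "sepal width", "petal length", "petal width"]

# Every unordered pair of distinct attributes gives one scatter plot panel.
panels = list(combinations(attributes, 2))

for x_attr, y_attr in panels:
    print(f"{x_attr} vs. {y_attr}")
```

Four attributes give six distinct pairs; a full square array shows each pair twice, once with the axes swapped.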