Dr. Oner Celepcikay

ITS 632

Data Mining

Summer 2019

Week 2: Data & Data Exploration

Chapter 3 Exploring Data

1st Step of Machine Learning

What is data exploration?

A preliminary exploration of the data to better understand its characteristics.

• Key motivations of data exploration include
  – Helping to select the right tool for preprocessing or analysis
  – Making use of humans’ abilities to recognize patterns
    ◦ People can recognize patterns not captured by data analysis tools

• Related to the area of Exploratory Data Analysis (EDA)
  – Created by statistician John Tukey
  – Seminal book is Exploratory Data Analysis by Tukey
  – A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook
    http://www.itl.nist.gov/div898/handbook/index.htm

Techniques Used In Data Exploration

• In EDA, as originally defined by Tukey
  – The focus was on visualization
  – Clustering and anomaly detection were viewed as exploratory techniques
  – In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory

• In our discussion of data exploration, we focus on
  – Summary statistics
  – Visualization
  – Online Analytical Processing (OLAP)

Summary Statistics

• Summary statistics are numbers that summarize properties of the data
  – Summarized properties include frequency, location, and spread
    ◦ Examples: location - mean; spread - standard deviation
  – Most summary statistics can be calculated in a single pass through the data
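To illustrate the single-pass point, here is a minimal Python sketch that computes count, min, max, mean, and sample standard deviation while reading each value once. It uses Welford's update for the variance, which is one common choice and not something the slides prescribe; the function name `single_pass_summary` and the sample values are made up for illustration.

```python
import math

def single_pass_summary(values):
    """Return count, min, max, mean, and sample standard deviation in one pass."""
    n = 0
    mean = 0.0
    m2 = 0.0              # running sum of squared deviations from the current mean
    lo = math.inf
    hi = -math.inf
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # Welford's update
        lo = min(lo, x)
        hi = max(hi, x)
    if n == 0:
        raise ValueError("no data")
    std = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
    return {"n": n, "min": lo, "max": hi, "mean": mean, "std": std}

# Made-up values for illustration.
summary = single_pass_summary([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(summary)
```

The exact median is a typical exception to the "single pass" rule, since it generally needs the values to be sorted or stored.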

Frequency and Mode

• The frequency of an attribute value is the percentage of time the value occurs in the data set
  – For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.

• The mode of an attribute is the most frequent attribute value

• The notions of frequency and mode are typically used with categorical data
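As a small sketch of these two definitions, the Python below computes relative frequencies and the mode of a categorical attribute with `collections.Counter`. The helper names and the `colors` sample are hypothetical and chosen only for illustration.

```python
from collections import Counter

def frequencies(values):
    """Map each distinct value to the fraction of the data set it accounts for."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def mode(values):
    """Return the most frequent value (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]

# Made-up categorical sample.
colors = ["red", "blue", "red", "green", "red", "blue"]
freq = frequencies(colors)
print(freq)
print(mode(colors))
```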

Measures of Location: Mean and Median

• The mean is the most common measure of the location of a set of points.

• However, the mean is very sensitive to outliers.

• Thus, the median or a trimmed mean is also commonly used.
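The effect of a single outlier can be seen in a short Python sketch. The data are made up, and `trimmed_mean` is a hypothetical helper that assumes the trimming proportion is applied to each end of the sorted values; other sources split a single percentage across both ends.

```python
import statistics

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)    # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

# Made-up values; 1000 plays the role of an outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

print(statistics.mean(data))      # 104.5, pulled far upward by the outlier
print(statistics.median(data))    # 5.5
print(trimmed_mean(data, 0.1))    # 5.5, after dropping the smallest and largest value
```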

Measures of Spread: Range and Variance

• Range is the difference between the max and min

• The variance or standard deviation is the most common measure of the spread of a set of points.

• However, these measures are also sensitive to outliers, so other measures are often used.
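The slides do not name the alternative measures. As an assumption, the sketch below compares range and standard deviation with two commonly used robust measures, the interquartile range and the median absolute deviation, on made-up data with and without an outlier. The helper names are hypothetical.

```python
import statistics

def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

def interquartile_range(values):
    """Distance between the 25th and 75th percentiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])

# Made-up values; the second list adds one outlier.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = clean + [1000]

for name, data in (("clean", clean), ("with outlier", with_outlier)):
    print(name)
    print("  range:", value_range(data))
    print("  std:  ", statistics.stdev(data))
    print("  IQR:  ", interquartile_range(data))
    print("  MAD:  ", median_absolute_deviation(data))
```

Under these assumptions, the range and standard deviation grow by orders of magnitude when the outlier is added, while the two robust measures barely move.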

Visualization

Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

• Visualization of data is one of the most powerful and appealing techniques for data exploration.
  – Humans have a well developed ability to analyze large amounts of information that is presented visually
  – Can detect general patterns and trends
  – Can detect outliers and unusual patterns

Example: Sea Surface Temperature

• The following shows the Sea Surface Temperature (SST) for July 1982
  – Tens of thousands of data points are summarized in a single figure

Representation

• Is the mapping of information to a visual format

• Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.

• Example:
  – Objects are often represented as points
  – Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape
  – If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, are easily perceived.

One Great Example

• The Power of Visualization by Hans Rosling

https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen?language=en

Arrangement

• Is the placement of visual elements within a display

• Can make a large difference in how easy it is to understand the data

• Example:

Selection

• Is the elimination or the de-emphasis of certain objects and attributes

• Selection may involve choosing a subset of attributes
  – Dimensionality reduction is often used to reduce the number of dimensions to two or three
  – Alternatively, pairs of attributes can be considered

• Selection may also involve choosing a subset of objects
  – A region of the screen can only show so many points
  – Can sample, but want to preserve points in sparse areas

Visualization Techniques: Histograms

• Histogram
  – Usually shows the distribution of values of a single variable
  – Divide the values into bins and show a bar plot of the number of objects in each bin.
  – The height of each bar indicates the number of objects
  – Shape of histogram depends on the number of bins

• Example: Petal Width (10 and 20 bins, respectively)
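The binning step can be sketched in plain Python. This assumes equal-width bins spanning the minimum to the maximum value, which is one common convention; `histogram_counts` is a hypothetical helper, and the measurements are made up rather than taken from the Iris data.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    lo, hi = min(values), max(values)
    counts = [0] * num_bins
    if hi == lo:
        counts[0] = len(values)   # all values identical: everything goes in the first bin
        return counts
    width = (hi - lo) / num_bins
    for x in values:
        # The maximum value would index one past the end, so clamp it into the last bin.
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

# Made-up measurements for illustration; not the actual Iris petal widths.
widths = [0.2, 0.2, 0.3, 0.4, 1.0, 1.3, 1.4, 1.5, 1.8, 2.0, 2.3, 2.5]

# Print a text bar chart for two different bin counts.
for bins in (3, 6):
    print(f"{bins} bins:")
    for count in histogram_counts(widths, bins):
        print("  " + "#" * count)
```

Running it with different bin counts on the same data shows how the apparent shape of the distribution changes with the number of bins.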

Two-Dimensional Histograms

• Show the joint distribution of the values of two attributes

• Example: petal width and petal length
  – What does this tell us?

Visualization Techniques: Box Plots

• Box Plots
  – Invented by J. Tukey
  – Another way of displaying the distribution of data
  – Following figure shows the basic part of a box plot

[Figure: box plot with labels, top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile]
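The percentiles marked on the box plot figure can be computed with Python's `statistics.quantiles`. This is a sketch under assumptions: it places the whiskers at the 10th and 90th percentiles as in the figure (many plotting tools use 1.5 times the interquartile range instead), it uses the "inclusive" interpolation method, and `box_plot_summary` and the sample values are hypothetical.

```python
import statistics

def box_plot_summary(values):
    """Return the 10th/25th/50th/75th/90th percentiles and the points beyond the whiskers."""
    # quantiles(..., n=100) returns 99 cut points; cut point q-1 is the q-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    percentiles = {q: cuts[q - 1] for q in (10, 25, 50, 75, 90)}
    beyond_whiskers = [
        x for x in values if x < percentiles[10] or x > percentiles[90]
    ]
    return percentiles, beyond_whiskers

# Made-up values for illustration.
scores = [12, 15, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 27, 30, 58]
percentiles, beyond_whiskers = box_plot_summary(scores)
print(percentiles)
print(beyond_whiskers)
```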

Example of Box Plots

• Box plots can be used to compare attributes

Visualization Techniques: Scatter Plots

• Scatter plots
  – Attribute values determine the position
  – Two-dimensional scatter plots most common, but can have three-dimensional scatter plots
  – Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
  – Arrays of scatter plots can compactly summarize the relationships of several pairs of attributes
    ◦ See example on the next slide

Iris Sample Data Set

• Many of the exploratory data techniques are illustrated with the Iris Plant data set.
  – Can be obtained from the UCI Machine Learning Repository
    http://www.ics.uci.edu/~mlearn/MLRepository.html
  – From the statistician R. A. Fisher
  – Three flower types (classes):
    ◦ Setosa
    ◦ Virginica
    ◦ Versicolour
  – Four (non-class) attributes
    ◦ Sepal width and length
    ◦ Petal width and length

[Image: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]

Scatter Plot Array of Iris Attributes