data mining exam
Data Preprocessing
CSC 535/635
1
Data Representation
2
Features (aka attributes)
. . . . Samples
.
.
.
Types of Variables/Attributes/Data
• Two most common types are: • Numeric
– real-value variables or integers variables – Ex: age, speed, or length – values of a numeric attribute have two important properties:
• an order relation (2 < 5 and 5 < 7) • a distance relation (d [2.3,4.2] = 1.9)
• Categorical – have neither order nor distance relation – Ex: eye color, gender, or country of citizenship – only support an equality relation (Blue = Blue, or Brown ≠ Black)
3
Types of Variables - Another classification
• Continuous variables – aka quantitative or metric variables – values are measured using either an interval scale or a
ratio scale
• Discrete variables – aka qualitative variables – values are measured using one of two kinds of nonmetric
scales — nominal or ordinal
4
Interval Scaled vs. Ratio Scaled Variables
• The ratio relation does not hold true
• The 0 point is placed arbitrarily, and thus it does not indicate the absence of value
• Ex: temperature F/C – 0 F does not mean
absence of temperature. – 80 F ≠ twice 40 F
5
• The ratio relation holds true
• Ratio scale has an absolute 0 point
• Ex: height, length, salary – 6 feet = twice 3 feet
Interval Scale Ratio Scale
Nominal Scaled vs. Ordinal Scaled Variables
• Order-less scale • May use different
symbols, characters, or numbers to represent the different states/values of the variable
• Ex: color, gender, id, customer_type
6
• Values have an order relation but not a distance relation
• Ex: student_class, military_rank, grade
Nominal Scale Ordinal Scale
Encoding Numerical Data as Ordinal
• An ordinal variable can be used to encode a numeric variable with a smaller set of values
• Ex: age (with values young, middle aged, and old) • Ex: income (with values low, middle-class, upper-
middle-class, and rich) • Ex: height (with values short, medium, tall)
7
Binary Variables
• Has one of two states: 0, 1 • Examples: smoker, owns-house • Can be considered a special case of nominal
variables
8
Why Data Preprocessing? • Data in the real world is dirty – incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data • e.g., occupation=“ ”
– noisy: containing errors or outliers • e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or names • e.g., Age=“42” Birthday=“03/07/1997” • e.g., Was rating “1,2,3”, now rating “A, B, C” • e.g., discrepancy between duplicate records
9
Why is Data Preprocessing Important?
• No quality data, no quality mining results! • Minimize GIGO (Garbage In Garbage Out) • Data preparation, cleaning, and transformation
comprises the majority of the work in a data mining application
10
Measures for Data Quality
• Accuracy • Completeness: not recorded, unavailable • Consistency: some modified but some not • Timeliness: timely updates • Believability • Interpretability
11
Major Tasks in Data Preprocessing
• Data cleaning – Fill in missing values, smooth noisy data, identify or remove outliers
and noisy data, and resolve inconsistencies
• Data integration – Integration of multiple databases, or files
• Data transformation – Normalization and aggregation
• Data reduction – Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
12
Data Cleaning
• Importance – “Data cleaning is the number one problem in data warehousing”
• Data cleaning tasks – Fill in missing values – Identify outliers and smooth out noisy data – Correct inconsistent data – Resolve redundancy caused by data integration
13
Missing Data Name Age Sex Income Class Mike 40 Male 150k Big spender Jenny 20 Female ? Regular …
14
• Missing data may be due to – equipment malfunction – inconsistent with other recorded data and thus deleted – data not entered due to misunderstanding – certain data may not be considered important at the time of
entry • Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple • Fill in missing values manually: tedious & infeasible • Fill in it automatically with
– a global constant : e.g., “unknown” • may confuse DM algorithm
– the attribute mean or median – the attribute mean for all samples belonging to the same class – the most probable value: using inference-based tools such as
Bayesian formula or decision tree induction
15
Measuring the Central Tendency
• Mean:
– Weighted arithmetic mean:
– Trimmed mean: chopping extreme values
• Median:
– Middle value if odd number of values, or average of the middle two values otherwise
• Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal, multimodal
– Empirical formula:
16
Symmetric vs. Skewed Data
17
frequency curve
positively skewed negatively skewed
• Median, mean and mode of symmetric, positively and negatively skewed data
Also referred to as skewed to the right
Noisy Data
• Noise: random error or variance in a measured variable • Incorrect attribute values may due to
– faulty data collection instruments – data entry problems – data transmission problems – etc
• Other data problems which requires data cleaning – duplicate records, incomplete data, inconsistent data
18
How to Handle Noisy Data?
• Binning method – first sort data and partition into (equal-depth) bins – then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Clustering – detect and remove outliers
19
How to Handle Noisy Data? (cont.)
• Regression – smooth by fitting the data into regression functions
• Combined computer and human inspection – detect suspicious values and check by human (e.g., deal with
possible outliers)
20
Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equal-depth) bins: – Bin 1: 4, 8, 9, 15 – Bin 2: 21, 21, 24, 25 – Bin 3: 26, 28, 29, 34
• Smoothing by bin means: – Bin 1: 9, 9, 9, 9 – Bin 2: 23, 23, 23, 23 – Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries: – Bin 1: 4, 4, 4, 15 – Bin 2: 21, 21, 25, 25 – Bin 3: 26, 26, 26, 34
21
Outlier Removal
• Data points inconsistent with the majority of data • Different outliers
– Valid: CEO’s salary, – Noisy: One’s age = 200, widely deviated points
• Removal methods – Clustering – Curve-fitting
22
Data Integration
• Data integration: – combines data from multiple sources
• Schema integration – integrate metadata from different sources – Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id B.cust-# • Detecting and resolving data value conflicts
– for the same real world entity, attribute values from different sources can be different, e.g., different scales, metric vs. British units
• Removing duplicates and redundant data
23
Data Transformation
• Transforming the data into a form appropriate for mining.
• Methods: – Smoothing: remove noise from data – Aggregation: summarization – Normalization: scaled to fall within a small, specified range • min-max normalization • z-score normalization • normalization by decimal scaling
– Attribute/feature construction • New attributes constructed from the given ones
24
Data Transformation: Normalization
25
• Given numeric attribute A with values v1, v2, …, vN • min-max normalization: to [new_minA, new_maxA]
• z-score normalization
• decimal scaling normalization Where j is the number of digits in the data value with the largest absolute value
Min-Max Normalization – Example
26
Z-Score Normalization – Example
27
Decimal Scaling Normalization – Example
• Suppose values of A range from -4986 to 7845 • To normalize by decimal scaling, divide by _____ • -4986 normalizes to -0.4986 • 7845 normalizes to 0.7845
28
Where j is the number of digits in the data value with the largest absolute value
Variation of Z-Score Normalization
29
Data Reduction
• Data may be too big to work with • Data reduction – Obtain a reduced representation of the data set that is
smaller in volume but yet produce the same (or almost the same) analytical results
• Data reduction strategies – Dimensionality reduction: e.g., remove unimportant
attributes – Reducing the number of attribute values – Reducing the number of tuples
30
Dimensionality Reduction
• Feature selection (i.e., attribute subset selection) – Select a minimum set of attributes (features) that is
sufficient for the data mining task
• Heuristic methods (due to exponential # of choices) – step-wise forward selection – step-wise backward elimination – combining forward selection and backward elimination
31
Dimensionality Reduction
• Other techniques include • Principle component analysis • Wavelet transforms
32
Reducing the Number of Attribute Values
• Some algorithms such as decision trees compare different attribute’s values
• Approaches – Binning: replace with bin means, medians, boundaries – Discretization: discretize a numeric attribute (income) into
a small number of intervals, then map each interval to a discrete symbol (low, middle-income, high)
33
Reducing The Number of Tuples – Sampling
• Sampling: obtaining a small sample s to represent the whole dataset N
• Simple random sampling may have poor performance if the dataset is skewed
• Choose a representative subset of the data – Stratified sampling: approximate the percentage of each
class (or subpopulation of interest) in the overall database • Used in conjunction with skewed data
34
Stratified Sampling
35
Raw Data Stratified Sample
36
Sampling: with or without Replacement
SRSW OR
(simpl e rand
om
samp le with
out
replac ement
)
SRSWR
Raw Data
Discretization
• Three types of attributes: – Nominal — values from an unordered set – Ordinal — values from an ordered set – Continuous — real numbers
• Discretization: – discretize the range of a continuous attribute
• Some techniques: – Binning methods – Clustering-based methods – Entropy-based methods (Decision Trees) – Histogram analysis
37
38
Histogram Analysis
• Graph displays of basic statistical class descriptions – Frequency histograms – A graphical method – Consists of a set of rectangles that reflect the counts or
frequencies of the classes present in the given data
Discretization and Concept Hierarchies
• Discretization – Reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals. – Interval labels can then be used to replace actual data values
• Concept hierarchies – Reduce the data by collecting and replacing low level concepts
(numeric values for age) by higher level concepts (such as young, middle-aged, or senior)
– Things can be done at different levels – Concept hierarchies for nominal attributes, where attributes
such as street can be generalized to higher-level concepts, like city, state, or country
39
Concept Hierarchy
• Concept hierarchy for attribute price
40
Concept Hierarchy
41
country
province_or_ state
city
street
15 distinct values
365 distinct values
3567 distinct values
674,339 distinct values
• Concept hierarchy for attribute street
Summary
• Data preparation is a big issue for data mining • Data preparation includes – Data cleaning and data integration – Data reduction and feature selection – Discretization
• Many methods have been proposed but still an active area of research
42
Measuring the Dispersion of Data
44
Properties of Normal Distribution Curve
• The normal (distribution) curve – From μ–σ to μ+σ: contains about 68% of the measurements
(μ: mean, σ: standard deviation) – From μ–2σ to μ+2σ: contains about 95% of it – From μ–3σ to μ+3σ: contains about 99.7% of it
Measuring the Dispersion of Data (cont.) • Let x1,x2, . . . ,xN be the values of a
numeric attribute, X. • range(X) = max – min • Assume that the values of X are
sorted in increasing order • The kth q-quantile is the value x s.t.
at most k/q of the data values are less than x and at most (q-k)/q of the data values are more than x
• Quantiles are values in X that allow us to break the data distribution into equal-size consecutive sets
• When we break X into 4 equal parts, the quantiles are called quartiles
• When we break X into 100 equal parts, the quantiles are called percentiles
Measuring the Dispersion of Data (cont.)
• Quartiles: – Q1 (25th percentile) – Q2 (50th percentile) =
median – Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, M, Q3, max
• Outlier: usually, a value higher/lower than 1.5 x IQR above/below Q3/Q1
• A way of visualizing a distribution • Shows five-number summary:
min, Q1, M, Q3, max • Boxplot:
• Data is represented with a box
• The ends of the box are at the first and third quartiles -- the height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extend to Minimum and Maximum
Boxplots
Boxplots (cont.)
• Outlier: usually, a value higher/lower than 1.5 x IQR
• Whiskers extend to extreme lows only if less/higher than 1.5 x IQR Q1/Q3
• Otherwise, terminate at terminate at 1.5 x IQR beyond Q1/Q3
• Remaining points are plotted individually
Boxplots of Symmetric & Skewed Data
49
Symmetric Skewed right (positive)
Skewed left (negative)
50
Scatter plots • Provides a first look at bivariate data to see clusters of points,
outliers, … etc. • Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Data Preprocessing
CSC 535/635
1
Data Representation
2
Features (aka attributes)
. . . . Samples
.
.
.
Types of Variables/Attributes/Data
• Two most common types are: • Numeric
– real-value variables or integers variables – Ex: age, speed, or length – values of a numeric attribute have two important properties:
• an order relation (2 < 5 and 5 < 7) • a distance relation (d [2.3,4.2] = 1.9)
• Categorical – have neither order nor distance relation – Ex: eye color, gender, or country of citizenship – only support an equality relation (Blue = Blue, or Brown ≠ Black)
3
Types of Variables - Another classification
• Continuous variables – aka quantitative or metric variables – values are measured using either an interval scale or a
ratio scale
• Discrete variables – aka qualitative variables – values are measured using one of two kinds of nonmetric
scales — nominal or ordinal
4
Interval Scaled vs. Ratio Scaled Variables
• The ratio relation does not hold true
• The 0 point is placed arbitrarily, and thus it does not indicate the absence of value
• Ex: temperature F/C – 0 F does not mean
absence of temperature. – 80 F ≠ twice 40 F
5
• The ratio relation holds true
• Ratio scale has an absolute 0 point
• Ex: height, length, salary – 6 feet = twice 3 feet
Interval Scale Ratio Scale
Nominal Scaled vs. Ordinal Scaled Variables
• Order-less scale • May use different
symbols, characters, or numbers to represent the different states/values of the variable
• Ex: color, gender, id, customer_type
6
• Values have an order relation but not a distance relation
• Ex: student_class, military_rank, grade
Nominal Scale Ordinal Scale
Encoding Numerical Data as Ordinal
• An ordinal variable can be used to encode a numeric variable with a smaller set of values
• Ex: age (with values young, middle aged, and old) • Ex: income (with values low, middle-class, upper-
middle-class, and rich) • Ex: height (with values short, medium, tall)
7
Binary Variables
• Has one of two states: 0, 1 • Examples: smoker, owns-house • Can be considered a special case of nominal
variables
8
Why Data Preprocessing? • Data in the real world is dirty – incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data • e.g., occupation=“ ”
– noisy: containing errors or outliers • e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or names • e.g., Age=“42” Birthday=“03/07/1997” • e.g., Was rating “1,2,3”, now rating “A, B, C” • e.g., discrepancy between duplicate records
9
Why is Data Preprocessing Important?
• No quality data, no quality mining results! • Minimize GIGO (Garbage In Garbage Out) • Data preparation, cleaning, and transformation
comprises the majority of the work in a data mining application
10
Measures for Data Quality
• Accuracy • Completeness: not recorded, unavailable • Consistency: some modified but some not • Timeliness: timely updates • Believability • Interpretability
11
Major Tasks in Data Preprocessing
• Data cleaning – Fill in missing values, smooth noisy data, identify or remove outliers
and noisy data, and resolve inconsistencies
• Data integration – Integration of multiple databases, or files
• Data transformation – Normalization and aggregation
• Data reduction – Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
12
Data Cleaning
• Importance – “Data cleaning is the number one problem in data warehousing”
• Data cleaning tasks – Fill in missing values – Identify outliers and smooth out noisy data – Correct inconsistent data – Resolve redundancy caused by data integration
13
Missing Data Name Age Sex Income Class Mike 40 Male 150k Big spender Jenny 20 Female ? Regular …
14
• Missing data may be due to – equipment malfunction – inconsistent with other recorded data and thus deleted – data not entered due to misunderstanding – certain data may not be considered important at the time of
entry • Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple • Fill in missing values manually: tedious & infeasible • Fill in it automatically with
– a global constant : e.g., “unknown” • may confuse DM algorithm
– the attribute mean or median – the attribute mean for all samples belonging to the same class – the most probable value: using inference-based tools such as
Bayesian formula or decision tree induction
15
Measuring the Central Tendency
• Mean:
– Weighted arithmetic mean:
– Trimmed mean: chopping extreme values
• Median:
– Middle value if odd number of values, or average of the middle two values otherwise
• Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal, multimodal
– Empirical formula:
16
Symmetric vs. Skewed Data
17
frequency curve
positively skewed negatively skewed
• Median, mean and mode of symmetric, positively and negatively skewed data
Also referred to as skewed to the right
Noisy Data
• Noise: random error or variance in a measured variable • Incorrect attribute values may due to
– faulty data collection instruments – data entry problems – data transmission problems – etc
• Other data problems which requires data cleaning – duplicate records, incomplete data, inconsistent data
18
How to Handle Noisy Data?
• Binning method – first sort data and partition into (equal-depth) bins – then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Clustering – detect and remove outliers
19
How to Handle Noisy Data? (cont.)
• Regression – smooth by fitting the data into regression functions
• Combined computer and human inspection – detect suspicious values and check by human (e.g., deal with
possible outliers)
20
Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equal-depth) bins: – Bin 1: 4, 8, 9, 15 – Bin 2: 21, 21, 24, 25 – Bin 3: 26, 28, 29, 34
• Smoothing by bin means: – Bin 1: 9, 9, 9, 9 – Bin 2: 23, 23, 23, 23 – Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries: – Bin 1: 4, 4, 4, 15 – Bin 2: 21, 21, 25, 25 – Bin 3: 26, 26, 26, 34
21
Outlier Removal
• Data points inconsistent with the majority of data • Different outliers
– Valid: CEO’s salary, – Noisy: One’s age = 200, widely deviated points
• Removal methods – Clustering – Curve-fitting
22
Data Integration
• Data integration: – combines data from multiple sources
• Schema integration – integrate metadata from different sources – Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id B.cust-# • Detecting and resolving data value conflicts
– for the same real world entity, attribute values from different sources can be different, e.g., different scales, metric vs. British units
• Removing duplicates and redundant data
23
Data Transformation
• Transforming the data into a form appropriate for mining.
• Methods: – Smoothing: remove noise from data – Aggregation: summarization – Normalization: scaled to fall within a small, specified range • min-max normalization • z-score normalization • normalization by decimal scaling
– Attribute/feature construction • New attributes constructed from the given ones
24
Data Transformation: Normalization
25
• Given numeric attribute A with values v1, v2, …, vN • min-max normalization: to [new_minA, new_maxA]
• z-score normalization
• decimal scaling normalization Where j is the number of digits in the data value with the largest absolute value
Min-Max Normalization – Example
26
Z-Score Normalization – Example
27
Decimal Scaling Normalization – Example
• Suppose values of A range from -4986 to 7845 • To normalize by decimal scaling, divide by _____ • -4986 normalizes to -0.4986 • 7845 normalizes to 0.7845
28
Where j is the number of digits in the data value with the largest absolute value
Variation of Z-Score Normalization
29
Data Reduction
• Data may be too big to work with • Data reduction – Obtain a reduced representation of the data set that is
smaller in volume but yet produce the same (or almost the same) analytical results
• Data reduction strategies – Dimensionality reduction: e.g., remove unimportant
attributes – Reducing the number of attribute values – Reducing the number of tuples
30
Dimensionality Reduction
• Feature selection (i.e., attribute subset selection) – Select a minimum set of attributes (features) that is
sufficient for the data mining task
• Heuristic methods (due to exponential # of choices) – step-wise forward selection – step-wise backward elimination – combining forward selection and backward elimination
31
Dimensionality Reduction
• Other techniques include • Principle component analysis • Wavelet transforms
32
Reducing the Number of Attribute Values
• Some algorithms such as decision trees compare different attribute’s values
• Approaches – Binning: replace with bin means, medians, boundaries – Discretization: discretize a numeric attribute (income) into
a small number of intervals, then map each interval to a discrete symbol (low, middle-income, high)
33
Reducing The Number of Tuples – Sampling
• Sampling: obtaining a small sample s to represent the whole dataset N
• Simple random sampling may have poor performance if the dataset is skewed
• Choose a representative subset of the data – Stratified sampling: approximate the percentage of each
class (or subpopulation of interest) in the overall database • Used in conjunction with skewed data
34
Stratified Sampling
35
Raw Data Stratified Sample
36
Sampling: with or without Replacement
SRSW OR
(simpl e rand
om
samp le with
out
replac ement
)
SRSWR
Raw Data
Discretization
• Three types of attributes: – Nominal — values from an unordered set – Ordinal — values from an ordered set – Continuous — real numbers
• Discretization: – discretize the range of a continuous attribute
• Some techniques: – Binning methods – Clustering-based methods – Entropy-based methods (Decision Trees) – Histogram analysis
37
38
Histogram Analysis
• Graph displays of basic statistical class descriptions – Frequency histograms – A graphical method – Consists of a set of rectangles that reflect the counts or
frequencies of the classes present in the given data
Discretization and Concept Hierarchies
• Discretization – Reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals. – Interval labels can then be used to replace actual data values
• Concept hierarchies – Reduce the data by collecting and replacing low level concepts
(numeric values for age) by higher level concepts (such as young, middle-aged, or senior)
– Things can be done at different levels – Concept hierarchies for nominal attributes, where attributes
such as street can be generalized to higher-level concepts, like city, state, or country
39
Concept Hierarchy
• Concept hierarchy for attribute price
40
Concept Hierarchy
41
country
province_or_ state
city
street
15 distinct values
365 distinct values
3567 distinct values
674,339 distinct values
• Concept hierarchy for attribute street
Summary
• Data preparation is a big issue for data mining • Data preparation includes – Data cleaning and data integration – Data reduction and feature selection – Discretization
• Many methods have been proposed but still an active area of research
42
Measuring the Dispersion of Data
44
Properties of Normal Distribution Curve
• The normal (distribution) curve – From μ–σ to μ+σ: contains about 68% of the measurements
(μ: mean, σ: standard deviation) – From μ–2σ to μ+2σ: contains about 95% of it – From μ–3σ to μ+3σ: contains about 99.7% of it
Measuring the Dispersion of Data (cont.) • Let x1,x2, . . . ,xN be the values of a
numeric attribute, X. • range(X) = max – min • Assume that the values of X are
sorted in increasing order • The kth q-quantile is the value x s.t.
at most k/q of the data values are less than x and at most (q-k)/q of the data values are more than x
• Quantiles are values in X that allow us to break the data distribution into equal-size consecutive sets
• When we break X into 4 equal parts, the quantiles are called quartiles
• When we break X into 100 equal parts, the quantiles are called percentiles
Measuring the Dispersion of Data (cont.)
• Quartiles: – Q1 (25th percentile) – Q2 (50th percentile) =
median – Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, M, Q3, max
• Outlier: usually, a value higher/lower than 1.5 x IQR above/below Q3/Q1
• A way of visualizing a distribution • Shows five-number summary:
min, Q1, M, Q3, max • Boxplot:
• Data is represented with a box
• The ends of the box are at the first and third quartiles -- the height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extend to Minimum and Maximum
Boxplots
Boxplots (cont.)
• Outlier: usually, a value higher/lower than 1.5 x IQR
• Whiskers extend to extreme lows only if less/higher than 1.5 x IQR Q1/Q3
• Otherwise, terminate at terminate at 1.5 x IQR beyond Q1/Q3
• Remaining points are plotted individually
Boxplots of Symmetric & Skewed Data
49
Symmetric Skewed right (positive)
Skewed left (negative)
50
Scatter plots • Provides a first look at bivariate data to see clusters of points,
outliers, … etc. • Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Introduction to IPython
Reference: • Learning IPython for Interactive
Computing and Data Visualization, Second Edition, by Cyrille Rossant
• Python for Data Analysis, by Wes McKinney
• Python Data Science Handbook: Essential Tools for Working with Data, by Jake VanderPlas, 2016
What is it?
• IPython – Interactive Python • Provides a convenient command-line
interface to the scientific Python platform
2
Installing IPython
• Anaconda is a pre-packaged Python distribution that comes with many libraries for ML and data science
• Download from https://www.anaconda.com/download/
• We use Python 3
3
Starting Python • On Windows, open Anaconda from the
Start Menu and select Anaconda Prompt • Type python • Notice how prompt changes to >>> • Type exit() to exit Python
4
Starting the IPython Console • At the command prompt, type ipython • Notice the prompt • Type exit to exit ipython
5
Jupyter Notebook
• It comes with Anaconda ipython • To start, at the command prompt type
jupyter notebook • This starts the Jupyter server and opens a
new window in the browser at address localhost:8888
6
7
Closing the Notebook Server
• To close the notebook server, at the terminal window press Ctrl + C
8
Starting a new notebook
• In the Files tab, click New, and click Notebooks Python 3
9
Starting a new notebook (cont)
• This opens a new browser tab with the new notebook as follows
10
Types of Cells • A notebook consists of a linear list of cells • There are two main types of cells: • Markdown cells: contain rich text that you
can format and can contain images, HTML code, Latex math equations, and more – Ex: next slide
• Code cells: contain code to be executed by ipython – Ex: slide after next
11
Markdown Cell
12
Code Cell
13
Magic commands • Magic commands allow us to interact with
the file system • Magic commands start with %
14
In [1]:%pwd Out[1]: 'c:\\TEACHING\\Summer\\17\\minibook-2nd\\chapter1‘ In [2]: %cd chapter1\facebook Out[2]: c:\TEACHING\Summer\17\minibook-2nd\chapter1\facebook In [3]: %ls
Running Python Scripts from IPython
• Suppose we have a program saved in the file factorial.py that is located in the current directory. You can execute the file using any of these commands
15
In [3]: !ipython factorial.py 120 3628800
In [4]: !python factorial.py 120 3628800 In [5]:%run factorial.py 120 3628800
NumPy • NumPy stands for Numerical Python • NumPy provides a multidimensional array
data structure (ndarray) • NumPy arrays are more efficient than
Python’s lists • Many operations are performed on an
element-wise basis – called vector (or vectorized) operations
16
Creating Arrays
17
18
19
20
np.random.randint(low, high = None, size = None) generates random integers from [low, high) If high is None (the default), then results are from [0, low)
NumPy Standard Data Types Data type Description
bool_ Boolean (True or False) stored as a byte
int_ Default integer type (same as C long; normally either int64 or int32)
intc Identical to C int (normally int32 or int64)
intp Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8 Byte (-128 to 127)
int16 Integer (-32768 to 32767)
int32 Integer (-2147483648 to 2147483647)
int64 Integer (-9223372036854775808 to 9223372036854775807)
uint8 Unsigned integer (0 to 255)
uint16 Unsigned integer (0 to 65535)
uint32 Unsigned integer (0 to 4294967295)
uint64 Unsigned integer (0 to 18446744073709551615)
float_ Shorthand for float64.
float16 Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
float32 Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
float64 Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
complex_ Shorthand for complex128.
complex64 Complex number, represented by two 32-bit floats
complex128 Complex number, represented by two 64-bit floats
21
NumPy Array Attributes • ndim: number of dimensions • shape: the number of elements in each dimension • size: the total number of elements in the array
22
23
Array Indexing
24
25
Array Slicing: Accessing Subarrays
26
Multidimensional Arrays
27
Subarrays as no-copy views: Unlike in Standard Python, array slices return views rather than copies of the array data
28
Creating copies of arrays: use the copy() method
29
Reshaping of Arrays
Does reshape create a view or a copy of the original array?
30
31
• We can use np.newaxis to increase the dimension of an array by one dimension
32
Vector Operations on Arrays
Using User-Defined Functions with ndarray Objects
33
Boolean Operations on Arrays
• Comparison and logical operations in ndarray objects work on an element-wise basis
• Evaluating conditions yield by default a Boolean ndarray object
• Use the bitwise logical operators (&, |, ^, ~) to perform element-wise logical operations
34
35
Boolean Arrays as Masks
• Boolean arrays can be used as masks, to select particular subsets of the data
36
37
38
Counting Entries
39
40
np.any() and np.all()
np.where()
• The np.where() function, allows the definition of actions depending on whether a condition is True or False
• The result of applying np.where() is a new ndarray object of the same shape as the original one
41
`
42
Aggregate Functions
• They are useful to compute summary statistics for large data
• NumPy provides many functions for this including: sum, max, min, mean, std, var, median
• These functions are faster than their Python counterparts
43
NumPy functions are faster than their Python counterparts
44
Applying Aggregate Functions to Rows/Columns of 2D Array
45
More on Axis/Axes • In multidimensional arrays, there is an axis per dimension (ex: a[3,1, 2]) • First axis runs across the rows, second axis runs across the columns • When an aggregate function is be applied along an axis, think about
the axis as the dimension of the array that will be collapsed, rather than the dimension that will be returned
– In case of axis=0, 1st dimension is collapsed; i.e., NumPy performs the action on elements of each column
– If axis=1, NumPy performs the action on rows
• Another way to think about axis is as the direction along which the operation is performed
46
47
Broadcasting
• For arrays of the same shape, binary operations are performed on an element- by-element basis
• Broadcasting allows binary operations to be performed on arrays of different sizes
48
In [1]: a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
a + b
Out[1]: array([ 5, 7 , 9])
In [2]: a + 2
Out[2]: array([ 3, 4 , 5]) a + 2 z [1, 2, 3] + [2, 2, 2]
Broadcasting (cont.)
49
a [[1, 2, 3],
[1, 2, 3],
[1, 2, 3]]
Rules of Broadcasting • A set of rules that control what happens when a
binary operation is applied to two arrays of different shapes
50
Rules of Broadcasting - Rule 1 • Rule 1: If the two arrays differ in their number of
dimensions, the shape of the array with fewer dimensions is padded with ones on its leading (left) side
• Rule 1 applies:
51
a = np.arange(12).reshape(3, 4)
b = np.array([1 , 2 , 3, 4])
a + b
a.shape = (3, 4)
b.shape = (4,)
b.shape -> (1, 4)
Rules of Broadcasting – Rule 2 • Rule 2: If the shape of the two arrays does not
match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape
• Rule 2 applies to the array b
52
a.shape = (3, 4)
b.shape = (1, 4)
a = np.arange(12).reshape(3, 4)
b = np.array([1 , 2 , 3, 4]).reshape(1,4)
a + b
b.shape -> (3, 4) b [[1, 2, 3, 4],
[1, 2, 3, 4],
[1, 2, 3, 4]]
b [[1, 2, 3, 4]]
Rules of Broadcasting (cont.) • Rule 3: If in any dimension the sizes disagree
and neither is equal to 1, an error is raised
• By rule 3, the arrays a and b are not compatible
53
a.shape = (3, 4)
b.shape = (2, 2)
a = np.arange(12).reshape(3, 4)
b = np.array([1 , 2 , 3, 4]).reshape(2,2)
a + b
Broadcasting Example 1 • Consider adding the following two arrays
• The shapes of the arrays are
• By rule 1, array a has fewer dimensions, so we pad its shape on the left by ones:
• By rule 2, the shape of the two arrays do not match on the first dimension, so we stretch this dimension in a to match:
54
In [3]: M = np.ones((2, 3))
a = np.arange(3)
M.shape = (2, 3)
a.shape = (3,)
M.shape -> (2, 3)
a.shape -> (1, 3)
M is [[1, 1, 1],
[1, 1, 1]]
a is [0, 1, 2]
M is [[1, 1, 1],
[1, 1, 1]]
a is [[0, 1, 2]]
M is [[1, 1, 1],
[1, 1, 1]]
a is [[0, 1, 2],
[0, 1, 2]]
M.shape -> (2, 3)
a.shape -> (2, 3)
M + a is
[[1, 2, 3],
[1, 2, 3]]
Broadcasting Example 2 • Consider adding the following two arrays
• The shapes of the arrays are
• By rule 1, we pad the shape of a on the left by 1s:
• By rule 2, we upgrade each of these ones to match the corresponding size of the other array:
55
a = np.arange(3)
b = np.arange(3).reshape(3, 1)
a.shape = (3,)
b.shape = (3,1)
a.shape -> (1, 3)
b.shape -> (3, 1)
a is [0, 1, 2]
b is [[0],
[1],
[2]]
b [[0],
[1],
[2]]
a is [[0, 1, 2],
[0, 1, 2],
[0, 1, 2]]
a.shape -> (3, 3)
b.shape -> (3, 3)
Now the two arrays have
the same shape and can
be added
b is [[0, 0, 0],
[1, 1, 1],
[2, 2, 2]]
a + b is [[0, 1, 2],
[1, 2, 3],
[2, 3, 4]]
a [[0, 1, 2]]
Broadcasting Example 3 • Consider adding the following two arrays
• The shapes of the arrays are
• By rule 1, we pad array a by 1s on the left:
• By rule 2, the first dimension of a is stretched to match that of M:
• Now rule 3 applies, the two arrays do not match on the second dimension and neither is equal to 1, so the two arrays are not compatible 56
In [3]: M = np.ones((3, 2))
a = np.arange(3)
M.shape = (3, 2)
a.shape = (3,)
M.shape -> (3, 2)
a.shape -> (1, 3)
M is [[1, 1],
[1, 1],
[1, 1]]
a is [0, 1, 2]
M [[1, 1],
[1, 1],
[1, 1]]
a [[0, 1, 2],
[0, 1, 2],
[0, 1, 2]]
M.shape -> (3, 2)
a.shape -> (3, 3)
a is [[0, 1, 2]]
M [[1, 1],
[1, 1],
[1, 1]]
Pandas
• It is an IPython library for data manipulation • Has functions to work with any kind of data such
as tabular data
1
Data Structures of Pandas
• All data representation in Pandas is done using one of two data structures: – Series – DataFrame
2
Series • A Series is a one-dimensional array-like object
containing a sequence of values and an associated array of data labels, called its index
• A Series object wraps both its values and indices
3
Series
4
Series • You can create a Series with an index identifying
each data point with a label
5
6
Series • Another way to think about a Series is as a dict,
as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict
7
The index/label values in a Series do not have to be unique
8
DataFrame • Dataframe (DF) is the most important and useful data
structure in pandas • It is used for almost all kind of data representation and
manipulation in pandas • A DF is a table containing rows (representing
objects/observations) and columns (representing features/variables/attributes)
• Unlike numpy arrays, it can contain heterogeneous data • DataFrames are useful for representing raw datasets
and processed datasets in ML and data science
9
Relationship between DataFrames and Series
• We can think of a DataFrame as a sequence of aligned Series objects – aligned means that they share the same
index
10
Constructing Data Frame Objects
• A DataFrame object can be constructed in different ways, including: – from a list of dicts – from a dict of dicts – from a list of Series objects – from a dict of Series objects – from a 2-D NumPy array – from a 2-D Python list
11
List of Dictionaries to a Dataframe
12
Notice how dictionary keys become column names in the DF and each dictionary becomes a row in the DF
A Dictionary of Series Objects to a Dataframe
13
A 2-D List to a DataFrame
14
You can change the index and labels of a DF, using df.index = [‘a’, ‘b’, ‘c’] df.columns = [‘x’, ‘y’, ‘z’]
Data Retrieval
• Pandas provides different ways to retrieve data into a DataFrame. For example, we can read flat files, csv files, database tables, …etc.
15
CSV Files to a Dataframe
• We can read a CSV, or any delimited file, using pandas and get the result as a DF
• This is done using panda’s read_csv() function
• Usually, the only parameter needed is the name of the input file but sometimes you may need to add more parameters
16
Examples
• We are using the files simplemaps- worldcities-basic.csv and nyc_taxi data
17
Examples
18
19
DF.describe() gives basic statistics of the columns w/o param include=‘all’, function shows only columns with numeric types
20
Keyword argument parse_dates is used to specify the columns that contain dates so that pandas can parse them correctly
21
22
Descriptive Statistics with Pandas
23
• The following functions do as their names indicate
Selecting Data
• Pandas allows us to select rows, columns, and combinations of rows and columns, … etc.
24
Selecting Columns
25
Selecting Rows • Rows can be retrieved by position or name
– Use loc attribute to select row(s) from their labels – Use iloc attribute to select row(s) from their positions
• The following example show how to select the first row:
26
27
Remark: The row selection syntax data[:2] is provided as a convenience. Passing a single element or a list to the [] operator selects columns
28
Using DF.loc[rows_labels_names] or DF.iloc[rows_indexes] is the proper way to select rows. The slice notation, DF[:], is provided for convenience.
Here, you needed to give the column labels, because writing DF[] is the notation to select columns.
Aside: More about loc and iloc
29
Filtering with Boolean Indexing
30
Operating on Data • Vector arithmetic can be used with DataFrames and
Series objects • Example: adding a new column to the DF with the trip
duration in minutes
31
More on Vector Operations
32
The result is a new Series object containing the aligned sum of a and b
More on Vector Operations
33
[[0, 1, 2, 3],
[4, 5, 6, 7],
[8, 9, 10, 11]]
[[0, 1, 2, 3],
[0, 1, 2, 3],
[0, 1, 2, 3]]
Handling Missing Data
• In pandas, missing data is represented by NaN (Not a Number) or None
• Pandas provides several Series and DataFrame methods to deal with missing data, including: – isnull() indicates whether values are null or not – notnull() the opposite of isnull() – dropna() removes missing data – fillna(some_default_value) replaces missing data with
a default value
34
35
Working with a Series
•
36
37
Working with a DataFrame
38
Passing how='all' will only drop rows that are all NA
39
The thresh argument is used to keep rows/columns containing a certain number of observations.
To drop columns, use the argument axis=1 or axis = ‘columns’
The groupby Operation • Similar to the group by operation in SQL
where: – data is split into groups that share common
attributes – a function is applied to each group – function results are combined into a result
object
40
Illustration of a group aggregation
41
Grouping Key
• Data is split into groups based on the vales of a key or several keys
• Each grouping key can be – a list, array, dict, or series object – a column name in a DataFrame
• The keys do not have to be all of the same type
42
43
44
45
Using arrays as the grouping keys
46
The GroupBy method size() returns group sizes
Notice: there is no key2 column in the result as df[‘key2’] is not numeric data
Joins
• pd.merge • The idea is similar to joins in databases. That is
to join data from more than one table • Consider the NY city taxi dataset • The DFs data and fare have same number of
rows (one row for each ride) • We want to get the average tip from the fare DF • We will then join the average tip into the data DF
47
48
49
More about the merge Operation • Idea is similar to the join operation in SQL where
pandas will combine two DFs where values under common columns are the same
50
51
52
The on Keyword Argument
• The attribute on allows us to explicitly specify the name of the key column(s)
• It takes either a column name or a list of column names
53
54
If the columns to join on have different names, we can use the left_on and right_on parameters to specify the column names
We can use the DataFrame’s drop() method to remove redundant columns
55 Removing a row
56
DataFrame.set_index
57
58
Joining on Indexes
59
The join() method performs a merge that defaults to joining on indices
We can mix merging on indices and columns
60
Attributes lsuffix and rsuffix
61
Attributes lsuffix and rsuffix
The how Attribute
• Notice, by default pandas performs inner join (intersect)
• For full outer join (union), use the attribute how = 'outer'
• Inner join may be specified explicitly using how = 'inner'
• Other possible values for how are left and right for left/right outer joins
62
63
64
65
66
Renaming Columns in a DF – Function rename
67
DF.rename
68
DF.rename
Accessing an Element in a DF
69
Writing df1[0,1] or df1[[0,1]] is wrong
Data Visualization
• Data visualization makes it easier to sea patterns in the data than just looking at tables of numbers
• The Anscombe data set contains four sets of data, each contains two continuous variables
• Each set has the same mean, variance, correlation, and regression line
• Only when the data are visualized does it become obvious that each set does not follow the same pattern
1
The Anscombe Data Set
2
The Anscombe Data Set
3
The Anscombe Data Set
4
Plot of the Anscombe Data Set
5
Matplotlib
• matplotlib is Python’s fundamental plotting library
import matplotlib.pyplot as plt
• Calling plt.plot with two iterable objects of the same length (e.g., lists of numbers or NumPy arrays) creates a line plot
6
Launching the Qtconsole
ひ Open anaconda command prompt ひ Type
jupyter qtconsole
ひ To be able to se the charts in the console, type %matplotlib inline
ひ Same for jupyter notebook ひ In jupyter notebook, you may need plt.show()
after writing the code for a plot
7
Launching the Qtconsole
8
9
Creating a Line Plot
Remarks
• A plot represents value pairs (x, y) – need two arrays – x contains values on the x-axis – y contains values on the y-axis
• If you pass only one array to the plt.plot() function, matplotlib assumes it is the sequence of y values – it associates the y values with the sequence
x: 0,1,2,3, ... 10
Passing One Set of Values
11
12
Creating a Scatter Plot
13
Creating a Scatter Plot
Creating a Histogram
• A histograms shows the distribution of a variable • pylot.hist(array, bins = n) draws a histogram
14
plt.hist()
15
Creating a Bar Chart • A bar chart can be used to compare
variables • Very example, to compare the favorite
programming language for students in a CS class
• plt.bar(x, y) • plt.barh(x, y) draws a horizontal bar chart
16
plt.bar() & plt.barh()
17
Multiple Bar Charts
18
Pie Chart • A pie chart represents the data as if to fit into a circle • The individual data points are expressed as sectors of a circle
that add up to 360 degrees. • A pie chart chart is good for displaying categorical data and
summaries • data = [500, 200, 250] ; labels = ["Agriculture", "Aide", "News"] • plt.pie(data, labels=labels, autopct=‘%0.1f')
19
Pie Chart – colors & axis
20
Box Plot • A box plot is used to visualize the median value and
low and high ranges of a distribution • data = np.random.randn(50) #Z(0,1) • plt.boxplot(data)
21
Box Plot
22
Displaying Multiple Box Plots
23
Architecture of matplotlib ひ The architecture of matplotlib is
logically structured into three
layers as in the figure
24
ひ The communication is unidirectional ひ The three layers are:
1. Scripting: as data analysts, you work at this layer
2. Artist: see following slides
3. Backend: classes that implement the graphic
elements at a low level
Artist Layer
ひ All the elements that make up a chart, such as the title, ;┝キゲ ノ;HWノゲが ぐ WデIくが ;ヴW キミゲデ;ミIWゲ ラa デエW Aヴデキゲデ ラHテWIデ ┘キデエ ; hierarchical structure representing the different
components on a chart as in the Figure below
ひ There are two types of Artist classes: primitive and composite
25
The Main Artist objects in the Hierarchy
of the Artist Layer
26
1. Figure corresponds to the entire graphical representation and can contain many Axes 2. Axes corresponds to plot or chart. Each Axes object belongs to only one Figure, and is characterized by two Artist Axis. Objects such as title, x label, and the y label, belong to this composite artist.
3. Axis objects that take into account the numerical values to be represented on the x axis and y axis, define the limits and manage the ticks and tick labels
Axes vs Axis
27
Parts of a Figure
28
Scripting Layer (pyplot)
ひ The scripting layer is best suited for data analysts to manipulate and visualize the data
ひ This layer consists of an interface called pyplot
29
Adding More Features to Plots
ひ plt.plot([1,2,3,4]) ひ Default configuration: ひ Blue line ひ Size of axis matches range
of input values
30
ひ There is neither a title nor axis labels ひ There is no legend
Setting Dimensions & Title
ひ Specify the dimensions of the plot by calling plt.axis([xmin, xmax, ymin, ymax])
ひ A title can be specified using plt.title()
31
Before After
Adding Axis Labels
ひ np.linspace creates an array of regularly
spaced values
inclusive of the end
points
32
In [15]: x = np.linspace(2, 3, 11) x
Out[15]: array([ 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3. ])
ひ plt.xlabel("string") ひ plt.ylabel("string")
Adding Legend & Line Labels
ひ Each line in a plot can be given a label by using the label argument of the plot function
ひ TエW ノ;HWノ ┘ラミげデ ;ヮヮW;ヴ ラミ デエW ヮノラデ ┌ミノWゲゲ ┞ラ┌ also call plt.legend() to add a legend
ひ Legend location can be customized by using a loc argument of the legend method
ひ loc takes a string or integer value as in the table in the next slide
33
Values of loc argument
Location String Location Code
けHWゲデげ 0 け┌ヮヮWヴ ヴキェエデげ 1 け┌ヮヮWヴ ノWaデげ 2 けノラ┘Wヴ ノWaデげ 3 けノラ┘Wヴ ヴキェエデげ 4 けヴキェエデげ 5 けIWミデWヴ ノWaデげ 6 けIWミデWヴ ヴキェエデげ 7 けノラ┘Wヴ IWミデWヴげ 8 け┌ヮヮWヴ IWミデWヴげ 9 けIWミデWヴげ 10
34
Labels and Legends - Example
35
Labels and Legends - Example
36
Customizing Plots
ひ The plot attributes に marker is used to add a marker に color is used to add a color
37
Some Matplotlib marker Styles
38
Some Matplotlib color Code Letters
ひ Can use several formats to specify colors に ;ゲ キミ エデマノ ふWくェくが Iラノラヴ Эけセrrggbbげぶ に shades of gray can be specified as a string representing a float in
the range '0.0' to '1.0' ('0.0' black to '1.0' white)
ひ For basic built-in colors, you can use a single letter に b: blue に g: green に r: red に c: cyan に m: magenta に y: yellow に k: black に w: white
39
Markers and Colors - Example
40
import pylab pylab.plot([1, 2, 3], [1, 2, 3], label = 'line 1', color ="g", marker ="D" ) pylab.plot([1, 2, 3], [3, 1.5, 0.75], label = "line 2", marker = 's', color ='y') pylab.legend(loc='lower center')
Markers and Colors - Example
41
Linewidth & linstyle
42
ひ plot attributes linewidth and
linestyle
43
Linewidth & linstyle (cont)
ひ We can pass a third argument to the plot() function to specify some codes that correspond to the colors, styles,
ぐ WデIく
44
Linewidth & linstyle (cont)
Working with Multiple Subplots
ひ Within a figure, you can have multiple subplots ひ Each subplot specifies a different drawing region ひ Each subplot corresponds to a separate Axes
object
ひ One way of accomplishing this is by using pyplot.subplot()
45
The Function pyplot.subplot()
ひ The subplot() function, takes either a 3-digit integer or 3- separate integers as parameters
ひ They specify how the figure is divided into different subplot regions, and which subplot is active and will
receive the next pyplot command
に The first digit or number specifies the number of rows に The second digit or number specifies the number of columns に The third digit or number is the index/location of the current
subplot that will receive pyplot ommands
46
The Function pyplot.subplot() (cont.)
ひ The subplot location/index is sequentially numbered, and plots are placed first in a left-to-
right direction, then from top to bottom
47
Multiple Subplots Example
48
The figure has been divided into two vertical subplots
The figure has been divided into two horizontal subplots
pyplot.subplot() に Specifying Parameters As Separate Integers
49
Subplots に Alternative Approach ひ Create a figure object
fig = plt.figure(figsize=(6,4))
に figsize: specifies figure dimension (width, height) in inches; default is [6.4, 4.8]
ひ Add a subplot
ax = fig.add_subplot(nrows, ncols, index)
50
Example に Anscombe Data
51
Loading the Data
52
53Matplotlib figure with four empty axes or subplot areas
Specifying the Subplots
54
Plotting The Data
55
Adding Titles
56
Adjusting Things
An alternative approach is to use: fig.subplots_adjust(hspace=0.5)
Saving a Plot to a File
• To save the plot as an image, use the code plt.savefig(filename)
• The file format can be deduced from the file name extension
• pylab.savefig('plot.png') # save as a png image
• pylab.savefig('plot.pdf') # save as a pdf
• pylab.savefig('plot.eps') # save in Encapsulated Postscript format
57
58
Saving the Code • In [171]: import matplotlib.pyplot as plt
... • The magic command %save followed by filename followed by
number of input prompts can be used save code in .py file .py file.
• In [185]: %save my_first_chart 175
• In [186]: %save my_first_chart 171-175 #no space and consecutive line numbers
• After you launch the command, you will find the my_first_chart.py file in your working directory
• %load my_first_chart.py #loads the file • %run my_first_chart.py #runs the file
59
Saving the Code
60
Saving the Code
61
Loading and Running the File
62
Good resources
https://matplotlib.org/
https://www.w3schools.com/python/
https://www.geeksforgeeks.org/python-programming- language/
Jupyter Notebook: https://www.dataquest.io/blog/jupyter-notebook- tutorial/
NumPy: https://www.machinelearningplus.com/python/numpy- tutorial-part1-array-python-examples/
Data Science: https://www.datacamp.com/courses/tech:python
63
Chapter 4 に Classification
These slides are based on material and slides from:
1) Dunham, Data Mining – Introductory and Advanced Topics 2) Han, Kamber and Pei, Data Mining – Concepts and Techniques 3) Tan, Stienback and Kumar, Introduction to Data Mining
4) Zaki and Meira JR., Data Mining and Analysis – Fundamental Concepts and Algorithms 5) Liu, Web Data Mining – Exploring Hyperlinks, Contents, and Usage Data
1
Introduction
• Given a predefined set of classes and a training dataset, build a model to help classify objects into the different classes
• Applications – Classifying an image as a cat or a dog – Identify mushrooms as poisonous or edible – Identifying fake news – Predict when a river will flood – Pattern recognition に recognize hand written digits – Medical diagnosis – Fraud and spam detection – Identify individuals with credit risks – Speech recognition
2
Classification vs. Prediction
• Estimation and prediction can be considered classification
– Eゲデキマ;デキミェ ; ヮWヴゲラミげゲ ;ェW – Predicting stock market behavior
• Prediction models a continuous function while classification models a discrete-valued function
• Classification – predicts categorical class labels (discrete or nominal)
• Prediction – predicts unknown or missing values (models continuous-
valued functions)
3
Formal Definition
• Given a database D={t1,t2がぐがtn} and a set of classes C={C1がぐがCm}, the Classification Problem is to define a mapping f:DgC where each ti is assigned to one class.
• Classes are disjoint and form a partition of D.
4
Supervised vs. Unsupervised Learning
• Supervised learning (classification) – Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
– New data is classified based on the training set • Unsupervised learning (clustering)
– The class labels of training data is unknown – Given a set of data, the task is to establish the existence of
classes or clusters in the data
5
What do we mean by learning?
• Given – a data set D, – a task T, and – a performance measure M,
a computer system is said to learn from D to perform the task T キa ;aデWヴ ノW;ヴミキミェが デエW ゲ┞ゲデWマげゲ ヮWヴaラヴマ;ミIW on T improves as measured by M
• In other words, the learned model helps the system to perform T better as compared to no learning
6
An example
• Data: Loan application data • Task: Predict whether a loan should be approved or
not.
• Performance measure: accuracy.
No learning: classify all future applications (test data) to
the majority class (i.e., Yes):
Accuracy = 9/15 = 60%.
• We can do better than 60% with learning.
7
Fundamental assumption of learning
Assumption: The distribution of training examples is identical to the distribution of test examples (including future unseen examples).
• In practice, this assumption is often violated to certain degree.
• Strong violations will clearly result in poor classification accuracy.
• To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.
8
ClassificationねA Two-Step Process • Model construction: describing a set of
predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The model can be represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model • The known labels of test samples are compared with
the classified results from the model
• Test set is independent of training set • If the accuracy is acceptable, use the model to
classify new tuples 9
Illustrating Classification Task
10
Apply Model
Induction
Deduction
Learn Model
Model
Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No
8 No Small 85K Yes
9 No Medium 75K No
10 No Small 90K Yes 10
Tid Attrib1 Attrib2 Attrib3 Class
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ? 10
Test Set
Learning algorithm
Training Set
Accuracy
• Other name: accuracy rate, classification accuracy, prediction accuracy: percentage of test samples that
are correctly classified
11
畦潔潔憲堅欠潔検 噺 Number of correct classificationsTotal number of test cases
Comparing Classification Methods
• Prediction accuracy • Speed
– time to construct the model – time to use the model
• Robustness – handling noise and missing values
• Scalability – ability to construct model efficiently given large data
• Interpretability – understanding and insight provided by the model
• Other measures – e.g., goodness of rules; decision tree size; compactness of
classification rules
12
Categorization of Classification Algorithms
• Classification Techniques – Decision Trees – Statistical – Distance – Rules – Neural Networks
13
Decision Tree-Based Algorithms
• Decision trees are among the most widely used techniques for classification
– classification accuracy is competitive with other methods, and
– very efficient • The classification model is a tree, called decision tree
(DT)
• C4.5 by Ross Quinlan is perhaps the best known system
14
Decision Tree-Based Algorithms
• A tree structure where: – each internal node
represents a test on an
attribute,
– each branch represents an outcome of the test,
– and leaf nodes reflect decision outcomes
• different classes and class distributions
15
DT- Loan Dataset
16
Approved or not
A Decision Tree from the Loan Dataset
• Decision nodes and leaf nodes (classes)
17
Using the Decision Tree
18
No
Is the DT Unique?
19
◼ No. Here is a simpler tree
◼ We want smaller tree and accurate tree ◼ Easy to understand and perform better
◼ Finding the best tree is
NP-hard
◼ All current tree building
algorithms are heuristic
algorithms
From a DT to Classification Rules
20
◼ A decision tree can be
converted to a set of
rules
◼ Each path from the
root to a leaf is a rule
Constructing a Decision Tree
• Constructing a DT is aka decision tree induction • Many Algorithms:
– ID3 – C4.5 – C5.0 – CART – SLIQ,SPRINT
21
The Basic Algorithm に ID3
• Assume attributes are categorical – Continuous attributes can be handled too
• A greedy divide-and-conquer algorithm • Tree is constructed in a top-down recursive manner • At start, all the training examples are at the root • Examples are partitioned recursively based on selected
attributes
• Attributes are selected on the basis of a measure called information gain
22
ID3
23
• A greedy divide-and- conquer algorithm
• Tree is constructed in a top-down recursive manner
• At start, all the training examples are at the root
• Examples are partitioned recursively based on selected attributes
age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no
24
Tid Refund Marital Status
Taxable Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes 10
Refund
Don’t Cheat
Yes No
Marital Status
Don’t Cheat
Cheat
Single,
Divorced Widowed
Don’t Cheat
Married
• Let Dt be the set of training samples that reach a node t
• General Procedure: – If all the samples in Dt belong to the same
class C, then t is a leaf node labeled as C
– If Dt is an empty set, then t is a leaf node labeled by the most common class at the parent
– If attribute set is empty, then t is a leaf node labeled by the majority class
– If Dt contains samples that belong to more than one class,
– use an attribute test to split the data into smaller subsets
– recursively apply the procedure to each subset
Decision Tree Induction
Decision Tree Induction
25
• Let Dt be the set of training samples that reach a node t
• General Procedure: – If all the samples in Dt belong to the same
class C, then t is a leaf node labeled as C
– If Dt is an empty set, then t is a leaf node labeled by the most common class at the parent
– If attribute set is empty, then t is a leaf node labeled by the majority class
– If Dt contains samples that belong to more than one class,
– use an attribute test to split the data into smaller subsets
– recursively apply the procedure to each subset
Refund
Don’t Cheat
Yes No
Marital Status
Don’t Cheat
Cheat
Single,
Divorced Widowed
Don’t Cheat
Married
Tid Refund Marital Status
Taxable Income Cheat
1 Yes Single 125K No
2 Yes Married 120K No
3 Yes Divorced 220K No
4 No Married 100K No
5 No Married 75K No
6 No Married 60K No
7 No Single 70K No
8 No Single 85K Yes
9 No Single 90K Yes
10 No Divorced 95K Yes 10
The Basic Algorithm に ID3
• Conditions for stopping partitioning – All examples for a given node belong to the same class – There are no remaining attributes for further partitioning に
majority class is the leaf
– There are no examples left
26
ID3 Algorithm in More Details
• Algorithm Generate_DT(D, attribute_list) • Input: D, attribute_list • Output: a DT 1. Create a node N
2. If samples are all of the same class C, then
return N as a leaf node labeled with class C
3. If attribute_list is empty, then
return N as a leaf node labeled with the majority class
4. Select test_attribute, the attribute among attribute_list
with the most information gain
27
ID3 Algorithm in More Details (cont.)
5. Label node N with test_attribute
6. For each known value ai of test_attribute
i. grow a branch from node N for the condition test_attribute =
ai
ii. let Si be the set of samples in D for which test_attribute = ai
iii. if Si is empty, then
attach a leaf node labeled with the majority class in D
iv. else attach the node returned by Generate_DT(Si, attribute_list
- test_attribute) Marital Status
Don’t Cheat
Cheat Don’t Cheat
MarriedWidowed Single,
Divorced
When does Algorithm Stop?
• Algorithm stops at steps 2, 3 and 6-III – 2) all samples for a given node belong to the same class – 3) attribute_list is empty – 6-III) Si is empty, i.e., there are no samples left
29
Choosing an Attribute to Partition Data
• The key to building a decision tree - which attribute to choose in order to branch
• The objective is to reduce impurity or uncertainty in data as much as possible
– A subset of data is pure if all instances belong to the same class.
• The heuristic in ID3 is to choose the attribute with the maximum Information Gain based on entropy
reduction
30
The Loan Dataset
31
Approved or not
Two possible roots, which is better?
32
• Fig. (B) seems to be better
Entropy Measure
• Entropy is a measure of impurity, randomness, uncertainty
• We use entropy as a measure of impurity or disorder of dataset D. (Or, a measure of information in a tree)
• The entropy formula,
• Pr(cj) is the probability of class cj in data set D
33
,1)Pr(
)Pr(log)Pr()(
||
1
||
1 2
=
=
=
−=
C
j j
j
C
j j
c
ccDentropy
Entropy measure: let us get a feeling
• As the data become purer and purer, the entropy value becomes smaller and smaller. This is useful to us!
34
Entropy
• Entropy is a measure of impurity, randomness, uncertainty "in a set of data"
• Entropy of node T is minimum when all samples are of the same class
• Entropy of node T is maximum when there is equal representation of all classes
35
Information Gain
• Given a set of samples D, we first compute its entropy:
• If we select attribute Ai, with v values, as the test attribute at the current node, this will partition D into
v subsets D1, D2 ぐが Dv • The expected entropy if Ai is used as the current test
attribute:
36
=
= v
j j
j A DentropyD
D Dentropy
i
1
)( ||
|| )(
D
Ai
D1
av
D2 Dv . . . . . .
a1 a2
D
Information Gain (cont.)
• Information gained by selecting attribute Ai to branch or to partition the data is
• We choose the attribute with the highest gain to branch/split the current tree.
37
)()(),( DentropyDentropyADgain iAi
−=
Ai
D1
av
D2 Dv
a1 a2 . . . . . .
D
Example
38
Age Yes No entropy(Di) young 2 3 0.971 middle 3 2 0.971 old 4 1 0.722
Own_house is the best
choice for the root.
結券建堅剣喧検岫経岻 噺 伐 はなの抜 log態 はなの伐 ひなの抜 log態 ひなの 噺 ど┻ひばな 結券建堅剣喧検潮栂津 e 朕墜通鎚勅岫経岻 噺 はなの 抜 結券建堅剣喧検岫経怠岻 髪 ひなの 抜 結券建堅剣喧検岫経態岻噺 はなの 抜 ど 髪 ひなの 抜 ど┻ひなぱ噺 ど┻ののな 結券建堅剣喧検凋直勅岫経岻 噺 のなの 抜 結券建堅剣喧検岫経怠岻 髪 のなの 抜 結券建堅剣喧検岫経態岻 髪 のなの 抜 結券建堅剣喧検岫経戴岻噺 のなの抜 ど┻ひばな 髪 のなの抜 ど┻ひばな 髪 のなの抜 ど┻ばにに噺 ど┻ぱぱぱ
Calculating entropy(D2)
39
結券建堅剣喧検潮栂津 e 朕墜通鎚勅岫経岻 噺 はなの 抜 結券建堅剣喧検岫経怠岻 髪 ひなの 抜 結券建堅剣喧検岫経態岻噺 はなの 抜 ど 髪 ひなの 抜 ど┻ひなぱ噺 ど┻ののな entropy(D2) = -Pr(Class=Yes)*log(Pr(Class = Yes))
-Pr(Class=No)*log(Pr(Class = No))
= - 滞苔log(滞苔) - 戴苔log(戴苔) = 0.918
The Final Tree
40
C4.5 and C5.0
• Extensions of ID3 • Details of C5.0 are not public • C4.5 improves on ID3 as follows:
– allows for continuous-valued attributes – handles missing attribute values – uses tree pruning to improve the efficiency of the DT – uses a better heuristic (GainRatio vs Information Gain) – allows classification via either DT or classification rules – simplifies complex rules with long LHS – a default rule assigning to the majority class is used
41
Gain Ratio
• Information Gain is biased towards selecting attributes with many values
• An attribute with many values results in purer partitions, but may be useless for classification
42
Gain Ratio
• GainRatio: adjusts Information Gain by normalizing by SplitInfo, which takes into consideration the number and sizes of the different partitions of the attribute under consideration
• C4.5 select the attribute with the maximum GainRatio as the test attribute
43
罫欠件券迎欠建件剣 畦 噺 罫欠件券迎欠建件剣 経┸ 畦 噺 罫欠件券岫経┸ 畦岻SplitInfo岫畦岻 SplitInfo 畦 噺 伐布沈退怠賃 】経沈】】経】 log 】経沈】】経】 噺 伐布沈退怠賃 券沈券 log 券沈券
SplitInfo
• The split information represents the potential information generated by splitting the dataset into k
partitions:
• SplitInfo(A) = entropy(D) based on the values of A • High split info: many partitions that have about the
same size
– the larger k, the larger splitInfo, which = log2K • Low split info: few partitions hold most of the tuples
– e., g., [2, 2, 96] is purer than [2, 2, 2, 94] is purer than [5, 5, 5, 85]
44
SplitInfo 畦 噺 伐布沈退怠賃 】経沈】】経】 log 】経沈】】経】 噺 伐布沈退怠賃 券沈券 log 券沈券
Overfitting
45
• Overfitting: a constructed tree may overfit the training data
– too many branches, due to noise or outliers – poor accuracy for unseen samples
• Definition: A decision tree T overfits the training data if there exists another tree T' such that T performs
better than T' over the training samples while T ' has
a higher accuracy than T over the entire distribution
of the data
Tree Pruning
46
• Tree pruning removes the least reliable branches in a DT, which usually results in better classification
• A pruned DT is more efficient and easier to understand
Approaches for Tree Pruning
47
• Two approaches to avoid overfitting – Prepruning: halt tree construction earlyねdo not split a
node if this would result in the goodness measure falling
below a threshold
• Difficult to choose an appropriate threshold – Postpruning: remove branches from a 曨fully grown杇 tree
• A node is pruned by removing its branches and is labeled by the most common class
Cost Complexity Postpruning Algorithm
• The "cost complexity" of a tree is defined as a function of the number of leaves in the tree and the error rate of the tree に e.g., err(T)/|leaves(T)| or err(T) + a|leaves(T)|
• Construct a full decision tree, T • Compute the "cost complexity", c, of T • Internal nodes are removed from T in a bottom-up fashion • Let T' be the tree after removing an internal node, N, from T • Compute the "cost complexity", c', of T' • If c' < c, prune N, otherwise keep N • The algorithm generates a set of progressively pruned trees • The smallest decision tree that minimizes the cost complexity is
used
• A separate validation set is used when calculating "cost complexity"
48
CART
• CART 氾 Classification And Regression Trees • It generates a binary decision tree • At each internal node, CART makes an exhaustive search
to select the best splitting attribute, X, and value, v, that produces the greatest separation of the data
• If X < v デエWミ ゲWミS デエW S;デ; デラ デエW さノWaデざき ラデエWヴ┘キゲWが ゲWミS S;デ; ヮラキミデ デラ デエW さヴキェエデざ – X < v is called a criterion
• RWヮW;デ ゲ;マW ヮヴラIWゲゲ ラミ デエW デ┘ラ IエキノSヴWミ さミラSWゲざ – Yラ┌ ェWデ ; さデヴWWざ
• After the tree is built, pruning is used to improve the efficiency of the tree
1
CART (cont)
• CART makes an exhaustive search of attributes and split points to find the best splitting criterion for a node
• The best splitting criterion is found as follows:
• s 氾 IヴキデWヴキラミが s' 氾 HWゲデ IヴキデWヴキラミが t 氾 ミラSWが も 氾 ェララSミWゲゲが m = number of classification classes
• も(s / tぶ 氾 ェララSミWゲゲ ラa デエW ゲヮノキデ ;デ ミラSW t using criterion s • も 嫌嫗【 建 氾 HWゲデ ┗;ノ┌W ラa デエW ゲヮノキデ ;デ ミラSW デ
2
も 嫌嫗【 建 噺 警欠捲件 も岫嫌件【 建 岻
CART (cont)
• tL 氾 ノWaデ ゲ┌HデヴWW ラa デエW I┌ヴヴWミデ ミラSW t • tR 氾 ヴキェエデ ゲ┌HデヴWW ラa デエW I┌ヴヴWミデ ミラSW t
3
t
X < v
tL tR
X д┗
CART (cont)
• PL 氾 the probability that a tuple in the training set will be in the left subtree
= 】担探丹狸奪坦 辿樽 デL】】担探丹狸奪坦 辿樽 担竪奪 担嘆叩辿樽辿樽巽 坦奪担】
• PR 氾 the probability that a tuple in the training set will be in the right subtree
= 】担探丹狸奪坦 辿樽 デR】】担探丹狸奪坦 辿樽 担竪奪 担嘆叩辿樽辿樽巽 坦奪担】
4
CART (cont)
• P(Cj|tLぶ 氾 ヮヴラH;Hキノキデ┞ デエ;デ ; デヴ;キミキミェ デ┌ヮノW キゲ キミ Iノ;ゲゲキaキI;デキラミ class Cj and is at the same time in the left subtree
= 樽探鱈但奪嘆 誰脱 達狸叩坦坦 Cj 担探丹狸奪坦 担竪叩担 叩嘆奪 叩狸坦誰 辿樽 担竪奪 狸奪脱担 坦探但担嘆奪奪樽探鱈但奪嘆 誰脱 担探丹狸奪坦 辿樽 担竪奪 担嘆叩辿樽辿樽巽 坦奪担
• P(Cj|tRぶ 氾 ヮヴラH;Hキノキデ┞ デエ;デ ; デヴ;キミキミェ デ┌ヮノW キゲ キミ Iノ;ゲゲキaキI;デキラミ class Cj and is at the same time in the right subtree
= 樽探鱈但奪嘆 誰脱 達狸叩坦坦 Cj 担探丹狸奪坦 担竪叩担 叩嘆奪 叩狸坦誰 辿樽 担竪奪 嘆辿巽竪担 坦探但担嘆奪奪樽探鱈但奪嘆 誰脱 担探丹狸奪坦 辿樽 担竪奪 担嘆叩辿樽辿樽巽 坦奪担
5
Example
• Consider the training data given in following table. Find the best splitting criterion at the root considering the
classifications in Output1.
6
Name Gender Height Output1 Output2 Kristina F 1.6m Short Medium Jim M 2m Tall Medium Maggie F 1.9m Medium Tall Martha F 1.88m Medium Tall Stephanie F 1.7m Short Medium Bob M 1.85m Medium Medium Kathy F 1.6m Short Medium Dave M 1.7m Short Medium Worth M 2.2m Tall Tall Steven M 2.1m Tall Tall Debbie F 1.8m Medium Medium Todd M 1.95m Medium Medium Kim F 1.9m Medium Tall Amy F 1.8m Medium Medium Wynette F 1.75m Medium Medium Source: Dunham, Data Mining に Introductory and Advanced Topics, Table 4.1, page 78
Example (cont)
• Attribute name is not useful • Gender has two possible values: F, M • Height has many values. For the sake of example, we
will consider only the following split values: 1.6, 1.7,
1.8, 1.9, and 2.0
• Calculate も(s / t) 7
Example (cont)
• も(Gender) = ? • |tL|=|Male| = 6 • |tR|=|Female| = 9 • PL =
樽探鱈但奪嘆 誰脱 担探丹狸奪坦 辿樽 担竪奪 狸奪脱担 坦探但担嘆奪奪樽探鱈但奪嘆 誰脱 担探丹狸奪坦 辿樽 担竪奪 担嘆叩辿樽辿樽巽 坦奪担 = 6/15 • PR = 9/15 • Classes = {Short, Medium, Tall} • も(Gender) = 2(6/15)(9/15)デ珍樺岶聴朕墜追痛┸暢勅鳥通沈通陳┸脹銚鎮鎮岼 】鶏 系倹 建詣 伐 鶏 系倹 建迎 】
8
Gender
Male
tL tR
Female
Example (cont)
• Classes = {Short, Medium, Tall} • も(Gender) = 2(6/15)(9/15)デ珍樺岶聴朕墜追痛┸暢勅鳥通沈通陳┸脹銚鎮鎮岼 】鶏 系倹 建詣 伐 鶏 系倹 建迎 】 • P(Cj|tL) =
樽探鱈但奪嘆 誰脱 達狸叩坦坦 Cj 担探丹狸奪坦 担竪叩担 叩嘆奪 叩狸坦誰 辿樽 担竪奪 狸奪脱担 坦探但担嘆奪奪樽探鱈但奪嘆 誰脱 担探丹狸奪坦 辿樽 担竪奪 担嘆叩辿樽辿樽巽 坦奪担 • |P(Short|Male) - P(Short|Female) | = |1/15 に 3/15|= 2/15 • |P(Medium|Male) - P(Medium|Female) | = |2/15 に 6/15|= 4/15 • |P(Tall|Male) - P(Tall|Female) | = |3/15 に 0/15|= 3/15 • も(Gender)=2(6/15)(9/15)(2/15 + 4/15 + 3/15)=0.288
9
Gender
Male
tL tR
Female
Example (cont)
• For Height • も(1.6) = ? • |tL|=|Height < 1.6| = 0 • PL = 0 • も(1.6) = 0
10
Height < 1.6
yes
tL tR
no
Example (cont)
• も(1.7) = ? • |tL|=|Height < 1.7| = |{Kristina, Kathy}| • PL =
】HWキェエデ а ヱくΑづ樽探鱈但奪嘆 誰脱 担探丹狸奪坦 叩担 追墜墜痛 = 2/15 • PL =
】HWキェエデ 兆 ヱくΑづ樽探鱈但奪嘆 誰脱 担探丹狸奪坦 叩担 追墜墜痛 = 13/15 • |P(Short|Height < 1.7) - P(Short|HWキェエデ 半 ヱくΑ) |
= |2/15 に 2/15|= 0 • |P(Medium|Height < 1.7) - P(Medium|HWキェエデ 半 ヱくΑ) | = |0 に 8/15|=
8/15
• |P(Tall|Height < 1.7) - P(Tall|HWキェエデ 半 ヱくΑ) | = |0 に 3/15|= 3/15 • も(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
11
Height < 1.7
yes
tL tR
no
Example (cont)
• At the start, there are six choices for split point (right branch on equality):
– も(Gender)=2(6/15)(9/15)(2/15 + 4/15 + 3/15)=0.288 – も(1.6) = 0 – も(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169 – も(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385 – も(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256 – も(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
• Split at 1.8
12
Classification in Large Databases
• Classificationねa classical problem extensively studied by statisticians and machine learning researchers
• Scalability: Classifying large data sets with reasonable speed • Why decision tree induction in data mining?
– relatively fast learning speed – convertible to simple and easy to understand classification
rules
– comparable classification accuracy with other methods
13
Scalable DT Induction Methods
• Early approaches include • SLIQ • SPRINT • RainForest
14
SLIQ
• SLIQ builds an index for each attribute • It maintains disk-resident attribute lists sorted by
attribute values
• It maintains a single memory-resident class list
15
SLIQ (cont)
• A tuple is represented by a linkage of one entry from each attribute list to an entry in the class list
• A デ┌ヮノWゲげ Wミデヴ┞ キミ デエW Iノ;ゲゲ ノキゲデ Iラミデ;キミゲ – its class label, and – ; ノキミニ デラ ; ノW;a ミラSW キミ デエW DTが デエ;デ Iラミデ;キミゲ デエW デ┌ヮノWげゲ
classification
16
SLIQ に Disadvantage • Size of class list is proportional to number of tuples in
the training set.
• So class list can grow too big to fit in main memory
17
SPRINT
• “PRINT 氾 Scalable PaRallelizable INduction of decision Trees
• Based on CART – builds a binary DT
• Uses a data structure that holds for each attribute the class and RID
• When a node is split, the attribute lists are partitioned and distributed among children nodes
18
SPRINT (cont.)
• SPRINT improves on SLIQ by removing all memory restrictions
• SPRINT uses an additional data structures, a hash tree, with size proportional to the training dataset
– performance may deteriorate • Can be easily parallelized • Uses a gini index to find the best split
19
Gini Index
20
• If a data set D contains examples from n classes, gini index, gini(D) is defined as
where pj is the relative frequency (proportion) of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini index giniA(D) is defined as
• Reduction in Impurity:
• The attribute provides the largest reduction in impurity is chosen to split the node (need to consider all the possible splitting points for each attribute)
=
−= n
j p jDgini
1
21)(
)( || ||)(
|| ||)( 2
2 1
1 Dgini D D
Dgini D DDginiA +=
)()()( DginiDginiAgini A
−=
Gini Index に Example
21
• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}
and gini{medium,high} is 0.45.
Therefore, 訣件券件沈津頂墜陳勅樺岶鎮墜栂┸陳勅鳥沈通陳岼岫経岻 is better since it is the lowest
訣件券件岫経岻 噺 な 伐 ひなね 態 伐 のなね 態 噺 ど┻ねのひ 訣件券件沈津頂墜陳勅樺岶鎮墜栂┸陳勅鳥沈通陳岼岫経岻 噺 などなね 罫件券件岫経怠岻 髪 ねなね 罫件券件岫経態岻
age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no
Ex. D has 9 tuples in
buys_computer = 曨yes杇
and 5 in 曨no杇
= 怠待怠替(な 伐 胎怠待 態 伐 戴怠待 態岻 髪 替怠替 ふな 伐 態替 態 伐 態替 態岻
= ど┻ねねぬ = 訣件券件沈津頂墜陳勅樺岶朕沈直朕岼岫経岻
income buys_computer
yes no
low 3 1
medium 4 2
high 2 2
How to Specify Split Condition?
• Depends on attribute types – Continuous – Nominal – Ordinal
22
Binary Split of Continuous-Value Attributes
• Let attribute A be a continuous-valued attribute • Must determine the best split point for A
– Sort the value A in increasing order: a1, a2が ぐくが ai, ai+1が ぐ – Choose a threshold value between each pair of adjacent values
ai and ai+1 to use as a possible split point
– Typically, the midpoint is used • (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
– The point yielding maximum impurity reduction に つ訣件券件 畦 伐 is selected as the split-point for A
• Split: – Dヱ キゲ デエW ゲWデ ラa デ┌ヮノWゲ キミ D ゲ;デキゲa┞キミェ A г ゲヮノキデ-point, and D2 is
the set of tuples in D satisfying A > split-point
23
Binary Split of Nominal Attributes
• CarType: Family, Sports, Luxury • Binary split: Divides values into two subsets.
• What about this split?
• Need to find optimal partitioning • Need to consider all possibilities
– [{Family}, {Sport, Luxury}] vs [{Sports}, {Family, Luxury}] vs [{Luxury}, {Family, Sports}]
24
CarType {Sports,
Luxury} {Family}
CarType {Family,
Luxury} {Sports}
• ShirtSize: Small, Medium, Large, Extra Large
• Values can be grouped as long as the grouping does not violate the ;デデヴキH┌デWげゲ ラヴSWヴ property
25
• The groupings shown in Figures (a) and (b) preserve the order among the attribute values
• The grouping shown in Figure (c) violates the ordering property • One may handle an ordinal attribute as nominal if that is more
appropriate to the DM task
Binary Split of Ordinal Attributes
RainForest
• Uses aggregate metadata • At each node, it builds an AVC-set for each attribute
– AVC (Attribute, Value, Class_label) • AVC-set of attribute A at node N stores the class label
counts for each value of A
• The set of all AVC-sets at a given node is called the AVC- group of N
• Split attributes are found using AVC-groups • When splitting, the training samples are split and AVC-
groups are constructed for children nodes
• RainForest uses special techniques to handle the case when AVC-groups do not fit in memory
26
Rainforest: Training Set and Its AVC Sets
27
student Buy_Computer
yes no
yes 6 1
no 3 4
Age Buy_Computer
yes no
<=30 2 3
31..40 4 0
>40 3 2
Credit rating
Buy_Computer
yes no
fair 6 2
excellent 3 3
AVC-set on incomeAVC-set on Age
AVC-set on Student
Training Examples income Buy_Computer
yes no
high 2 2
medium 4 2
low 3 1
AVC-set on credit_rating
age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no
Random Forests
• An ensemble method designed for decision trees • An ensemble for classification is a technique that combines
multiple classifiers to produce a better classifier
• If one tree is good, then many trees (a forest) should be better, provided that there is enough variety among them
• How can we make the DTs different? – train them on slightly different data に use bootstrap samples from the
dataset for each tree
– at each node, use a random subset of the features to find split attribute
• Once the set of trees are trained, the output of the forest is the majority vote for classification
28
Bootstrapping
• A bootstrap sample is a random sample of observations with replacement with the same
number of observations as the original dataset
– bootstrap sample is the same size as the original dataset – we may get some data several times and others not at all
29
Random Forests Algorithm
Training:
The original dataset D contains d samples
• for i Э ヱ ぐ N //N is number of DTs – Draw a bootstrap sample Di of D – Build a random-forest tree using Di
• For a node j, sample a random subset k (= sqrt(|Ai|)) of the attributes Ai remaining at the node
• Select the best attribute from k attributes to split the tree
• endfor Testing: majority voting is used to classify new samples
30
Scalable DT Induction Methods
• New approaches are based on using Hadoop and Spark に CSC 735 – Data is partitioned and disturbed – Parallel processing using many machines – Results are combined – Active research on how to do this for different ML
methods
31
Decision Trees in Scikit-Learn
• Scikit-learn is a ML library in IPython
32
The Iris Dataset
• Aka iris flower dataset • Introduced by Fisher in 1936 • The dataset consists of 50 samples from each of three species
of Iris (Iris setosa, Iris virginica and Iris versicolor)
33
• Four features were measured from each sample: the length and the
width of the sepals and petals, in
centimeters.
Loading the Iris Dataset
34
scikit-learn contains
many datasets in the
submodule
sklearn.datasets
Examining the Dataset
35
iris.data returns
a numpi.ndarray
with the tuples
in the dataset
Examining the Tuples
36
• Property shape gives the number of elements in
each dimension
• Property feature_names gives
the names of the columns
Examining the Labels
37
iris.target # a 1D array with 150 values
iris.target.shape
iris.target_names
Examining the Labels
38
Dividing the Dataset for Training and Testing
• By convention, X is used to name the features dataset and y is used to name the labels dataset
X = iris.data y = iris.target
• Import a function to split the dataset: from sklearn.model_selection import train_test_split
• Apply the function to both the observation and target data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
random_state=1)
39
Dividing the Dataset for Training and Testing
40
Creating a Decision Tree Model
• Iマヮラヴデ デエW マラSWノゲげ API aヴラマ Scikit-Learn from sklearn.tree import DecisionTreeClassifier
• Instantiate an instance of the model dt = DecisionTreeClassifier()
• You can specify parameters dt = DecisionTreeClassifier(criterion='entropy',
random_state=0)
• Get more information about a model help(DecisionTreeClassifier)
41
Supported values for
criterion are "gini"
for the gini index
(the default) and
"entropy" for the
information gain
42
Train/Fit the Model
• Fit the model to the training data by calling the fit() method of the model instance
trained_model = dt.fit(X_train, y_train)
43
Test the Model
• You can test the trained model by measuring its accuracy on the test set
y_pred = trained_model.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred) #0.97
new_sample = [[ 5, 4, 3, 2]] # Create new sample
trained_model.predict(new_sample) # Classify it #array([1])
44
Testing and Using the Model
45
Steps for Using a Scikit-Learn Classifier
1. Choose a class of model by importing the
appropriate class from Scikit-Learn
2. Specify model parameters by instantiating the class
with desired values
3. Arrange data into a features matrix and target
4. Split the data for training and testing
5. Fit the model to the training data
6. Test the model
7. Apply the model to new data
46
Using Random Forests in Scikit-Learn
47
First, load and prepare the data
Using Random Forests in Scikit-Learn (cont)
48
Second, create a model and train it
Using Random Forests in Scikit-Learn (cont)
49
Third, test the model
Using the model
Bayesian Classification
1
Bayesian Classification
• A statistical classifier: predicts class membership probabilities
• Fラ┌ミS;デキラミぎ B;ゲWS ラミ B;┞Wゲげ TエWラヴWマ • Performance: A simple Bayesian classifier, naïve
Bayesian classifier, has comparable performance with
decision tree and selected neural network classifiers
2
B;┞Wゲげ TエWラヴWマぎ B;ゲキIゲ
• Let X be a data sample (曨evidence杇), its class label is unknown • Let H be a hypothesis that X belongs to class C • Classification is to determine P(C|X), the probability that the
hypothesis holds given the observed data sample X
• P(H|X) or P(C|X) is a posterior probability • P(C) is a prior probability
– the initial probability of C – e.g., X ┘キノノ H┌┞ Iラマヮ┌デWヴが ヴWェ;ヴSノWゲゲ ラa ;ェWが キミIラマWが ぐ
• P(X): probability that sample data is observed • P(X|C) (likelihood), the probability of observing the sample X,
given that the hypothesis H holds
3
B;┞Wゲげ TエWラヴWマ
• Given training data X, posteriori probability of a hypothesis H, P(C|X), follows the Bayes theorem
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes
4
鶏岫系】隙岻 噺 鶏岫隙】系岻鶏岫系岻鶏岫隙岻
Towards Naïve Bayesian Classifier • Let D be a training dataset • Each tuple is represented as vector X = (x1, x2が ぐが xn) • Suppose there are m classes C1, C2が ぐが Cm. • Classification is to find the maximum P(Ci|X) • This can be derived from Bayes暍 theorem (1<= i <= m)
• Since P(X) is constant for all classes, only
needs to be maximized
5
鶏岫系沈】薫岻 噺 鶏岫薫】系沈岻鶏岫系沈岻鶏岫薫岻 鶏岫系沈】薫岻 蛤 鶏岫薫】系沈岻鶏岫系沈岻
DWヴキ┗;デキラミ ラa N;シ┗W B;┞Wゲげ Cノ;ゲゲキaキWヴ
• A simplified assumption: attributes are conditionally independent
• This greatly reduces the computation cost • If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value
xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on Gaussian distribution with a mean ´ and standard deviation ゝ
and P(xk|Ci) is
6
)|(...)|()|( 1
)|()|( 21
CixPCixPCixP n
k CixPCiP nk =
= =X
2
2
2
)(
2
1 ),,(
− −
= x
exg
P(Xk | iC ) = g(xk,mCi ,s Ci )
How to Estimate Probabilities from Data?
7
• Normal distribution:
• For (Income, Class=No): – If Class=No
• sample mean = 110K • sample variance = 2975K
Tid Refund Marital Status
Taxable Income Evade
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes 10
c c c c
鶏 荊券潔剣兼結 噺 なにど 軽剣 噺 なに講 のね┻のね 結貸岫怠態待貸怠怠待岻鉄態 態苔胎泰 噺 ど┻どどばに
鶏岫畦沈】潔珍岻 噺 なに講購珍態 結貸 岫凋日貸禎乳岻鉄態蹄乳鉄
P(Xk | iC ) = g(xk,mCi ,s Ci )
N;シ┗W B;┞Wゲげ に Example
8
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’
Data sample
X = (age <=30,
income = medium,
student = yes
credit_rating = Fair)
age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no鶏岫系沈】薫岻 蛤 鶏岫薫】系沈岻鶏岫系沈岻
)|(...)|()|( 1
)|()|( 21
CixPCixPCixP n
k CixPCiP nk =
= =X
AVC Sets for the Training Dataset
9
student Buy_Computer
yes no
yes 6 1
no 3 4
Age Buy_Computer
yes no
<=30 2 3
31..40 4 0
>40 3 2
Credit rating
Buy_Computer
yes no
fair 6 2
excellent 3 3
AVC-set on incomeAVC-set on Age
AVC-set on Student
Training Examples income Buy_Computer
yes no
high 2 2
medium 4 2
low 3 1
AVC-set on credit_rating
age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no
N;シ┗W B;┞Wゲげ に Example
10
• P(Ci): P(buys_computer = 曨yes杇) = 9/14 = 0.643 P(buys_computer = 曨no杇) = 5/14= 0.357
• Compute P(X|Ci) for each class P(age = 曨<=30杇 | buys_computer = 曨yes杇) = 2/9 = 0.222 P(age = 曨<= 30杇 | buys_computer = 曨no杇) = 3/5 = 0.6 P(income = 曨medium杇 | buys_computer = 曨yes杇) = 4/9 = 0.444 P(income = 曨medium杇 | buys_computer = 曨no杇) = 2/5 = 0.4
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
Age Buy_Computer
yes no
<=30 2 3
31..40 4 0
>40 3 2
income Buy_Computer
yes no
high 2 2
medium 4 2
low 3 1
鶏岫系沈】薫岻 蛤 鶏岫薫】系沈岻鶏岫系沈岻
)|(...)|()|( 1
)|()|( 21
CixPCixPCixP n
k CixPCiP nk =
= =X
N;シ┗W B;┞Wゲげ に Example
11
• P(Ci): P(buys_computer = 曨yes杇) = 9/14 = 0.643 P(buys_computer = 曨no杇) = 5/14= 0.357
• Compute P(X|Ci) for each class P(age = 曨<=30杇 | buys_computer = 曨yes杇) = 2/9 = 0.222 P(age = 曨<= 30杇 | buys_computer = 曨no杇) = 3/5 = 0.6 P(income = 曨medium杇 | buys_computer = 曨yes杇) = 4/9 = 0.444 P(income = 曨medium杇 | buys_computer = 曨no杇) = 2/5 = 0.4 P(student = 曨yes杇 | buys_computer = 曨yes) = 6/9 = 0.667 P(student = 曨yes杇 | buys_computer = 曨no杇) = 1/5 = 0.2 P(credit_rating = 曨fair杇 | buys_computer = 曨yes杇) = 6/9 = 0.667 P(credit_rating = 曨fair杇 | buys_computer = 曨no杇) = 2/5 = 0.4
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
student Buy_Computer
yes no
yes 6 1
no 3 4
Credit rating
Buy_Computer
yes no
fair 6 2
excellent 3 3
鶏岫系沈】薫岻 蛤 鶏岫薫】系沈岻鶏岫系沈岻
)|(...)|()|( 1
)|()|( 21
CixPCixPCixP n
k CixPCiP nk =
= =X
N;シ┗W B;┞Wゲげ に Example
12
• P(Ci): P(buys_computer = 曨yes杇) = 9/14 = 0.643 P(buys_computer = 曨no杇) = 5/14= 0.357
• Compute P(X|Ci) for each class P(age = 曨<=30杇 | buys_computer = 曨yes杇) = 2/9 = 0.222 P(age = 曨<= 30杇 | buys_computer = 曨no杇) = 3/5 = 0.6 P(income = 曨medium杇 | buys_computer = 曨yes杇) = 4/9 = 0.444 P(income = 曨medium杇 | buys_computer = 曨no杇) = 2/5 = 0.4 P(student = 曨yes杇 | buys_computer = 曨yes) = 6/9 = 0.667 P(student = 曨yes杇 | buys_computer = 曨no杇) = 1/5 = 0.2 P(credit_rating = 曨fair杇 | buys_computer = 曨yes杇) = 6/9 = 0.667 P(credit_rating = 曨fair杇 | buys_computer = 曨no杇) = 2/5 = 0.4
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = 曨yes杇) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 P(X|buys_computer = 曨no杇) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = 曨yes杇) * P(buys_computer = 曨yes杇) = 0.028 P(X|buys_computer = 曨no杇) * P(buys_computer = 曨no杇) = 0.007
Therefore, X belongs to class (曨buys_computer = yes杇)
鶏岫系沈】薫岻 蛤 鶏岫薫】系沈岻鶏岫系沈岻
)|(...)|()|( 1
)|()|( 21
CixPCixPCixP n
k CixPCiP nk =
= =X
Exercise
• Consider the training dataset given in the table below, where A1, A2 and A3 are the input attributes
;ミS C キゲ デエW Iノ;ゲゲ ノ;HWノく UゲW N;シ┗W B;┞Wゲげ デラ aキミS デエW class for the object X = {1, 2, 2}.
13
Sample Attribute
A1
Attribute
A2
Attribute
A3
Class
C
1 1 2 1 1
2 0 0 1 1
3 0 1 2 1
4 1 0 1 1
5 2 1 2 2
6 1 2 1 2
7 2 2 2 2
Avoiding the 0-Probability Problem
• Naïve Bayesian requires each conditional prob. be non-zero. Otherwise, the predicted prob. will be zero
• Ex. Suppose that a dataset with 1000 tuples. We have income=low (0), income= medium (990), and income = high (10),
• Use Laplacian correction (or Laplacian estimator) – Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
– The 曨corrected杇 prob. estimates are close to their 曨uncorrected杇 counterparts
14
=
= n
k CixkPCiXP
1 )|()|(
15
N;シ┗W B;┞Wゲげ に Advantages and Disadvantages
• Advantages: – Easy to implement – Very efficient – Good results obtained in many applications – Can easily handle missing data by omitting that
probability in the calculations
• Disadvantages – Assumption: class conditional independence, therefore
loss of accuracy when the assumption is seriously
violated
Naïve Bayes in Scikit-Learn
• There are 3 classes that implement Naïve Bayes in the submodule sklearn.naive_bayes
– GaussianNB – MultinomialNB – BernoulliNB
• Each assumes a different likelihood distribution of デエW ;デデヴキH┌デWゲげ ┗;ノ┌Wゲ
• WエキIエ ラミW デラ ┌ゲW SWヮWミSゲ ラミ aW;デ┌ヴWゲげ デ┞ヮWゲぎ continuous, categorical, binary
16
Preparing the Dataset
# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Split the dataset: from sklearn.model_selection import
train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
random_state=1)
17
Create a GaussianNB Model and Train It
• from sklearn.naive_bayes import GaussianNB
# Create Gaussian Naive Bayes object
classifer = GaussianNB()
# Train model
model = classifer.fit(X_train, y_train)
18
Test the Model
# Test the model
y_pred = model.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
#0.97
new_sample = [[ 5, 4, 3, 2]] # Create new sample
model.predict(new_sample) # Classify it
#array([1])
19
Using NB to Work with Categorical Data
• MultinomialNB is commonly used when working with categorical attributes
– Ex: movie ratings ranging from 1 to 5 – Ex: working with text data (e.g., documents of text)
• Common approach: create a bag of words with a vector for each document containing counts of the
appearance of the words in the vocabulary
20
Prepare the Dataset
# Load libraries
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
# Create text
text_data = np.arrayふぷろI ノラ┗W Bヴ;┣キノく Bヴ;┣キノぁろが ろBヴ;┣キノ キゲ HWゲデげが ろGWヴマ;ミ┞ HW;デゲ Hラデエげへぶ
# Create bag of words
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)
21
CountVectorizor and BOW
# Show feature matrix
bag_of_words
O┌デヮ┌デぎ аン┝Α ゲヮ;ヴゲW マ;デヴキ┝ ラa デ┞ヮW ろаIノ;ゲゲ ろミ┌マヮ┞くキミデヶヴろбげ with 8 stored elements in Compressed Sparse Row format>
bag_of_words.toarray()
array([[0, 0, 0, 2, 0, 0, 1,],
[0, 1, 0, 1, 0, 1, 0],
[1, 0, 1, 0, 1, 0, 0]], dtype=int64)
# Show feature names
count.get_feature_names()
['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love']
22
CountVectorizor and BOW (cont.)
• The feature matrix with the words as column names and each row is one observation
• The tuples: (['I love Brazil. Brazil! ', 'Brazil is best', ろGWヴマ;ミ┞ HW;デゲ Hラデエげへぶ
• count.get_feature_names() ['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love']
23
beats best both brazil germany is love
0 0 0 2 0 0 1
0 1 0 1 0 1 0
1 0 1 0 1 0 0
Preparing the Dataset and Creating the Model
# Create text text_data = np.array(['I love Brazil. Brazil!', 'Brazil is best', 'Germany beats both'])
# Create bag of words count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)
24
# Create feature matrix
features = bag_of_words.toarray()
# Create target vector
target = np.array([0,0,1])
# Create multinomial naive Bayes object with prior probabilities of each class
classifier = MultinomialNB(class_prior=[0.25, 0.5])
# Train model
model = classifier.fit(features, target)
Using the Model
# Create new observation
new_observation = [[0, 0, 0, 1, 0, 1, 0]]
# Predict new observation's class
model.predict(new_observation)
# array([0])
25
BernoulliNB
• Works similar to MultinomialNB but use with binary data
• See example in the notebook
26
Distance-Based Classification Algorithms
27
Distance-Based Classification Algorithms
• Tuples in the same class are more similar to each other
• A similarity or distance measure is used to determine how similar tuples are
• The problem is how to define similarity measures
28
Distance Measures
• xi = (xi1, xi2が ぐが xip) and xj = (xj1, xj2が ぐが xjp) are two p- dimensional data objects
• Minkowski distance:
where q is a positive integer
• If q = 2, d is Euclidean distance
• If q = 1, d is Manhattan distance
29
穴岫捲件┸ 捲倹岻 噺 忍 岫】捲沈怠 伐 捲珍怠】槌 髪 】捲沈態 伐 捲珍態】槌髪┻ ┻ ┻ 髪】捲沈椎 伐 捲珍椎】槌岻
穴岫捲件┸ 捲倹岻 噺 】捲沈怠 伐 捲珍怠】 髪 】捲沈態 伐 捲珍態】髪┻ ┻ ┻ 髪】捲沈椎 伐 捲珍椎】 穴岫捲件┸ 捲倹岻 噺 布賃退怠椎 岫捲沈賃 伐 捲珍賃岻態
Distance-Based Classification Algorithms
• Pノ;IW キデWマゲ キミ Iノ;ゲゲ デラ ┘エキIエ デエW┞ ;ヴW さIノラゲWゲデざ • Must determine distance between an item and a
class
• Classes represented by – Centroid: central value – Medoid: actual point near the centroid – Individual points
• Example: C = {1, 4, 9} – Centroid is ________ – Medoid is _________
30
Simple Distance Based Classification Algorithm
• Decide on the representative for each class • DWIキSW ラミ デエW ゲキマキノ;ヴキデ┞ さラヴ Sキゲデ;ミIWざ マW;ゲ┌ヴW デラ ┌ゲW • To classify a new sample X, it will be compared to the
representative of each class
– Place X in the same class with the representative that is マラゲデ ゲキマキノ;ヴ さラヴ IノラゲWゲデざ デラ キデ
31
K Nearest Neighbors Algorithm
• Requires three things – Set of stored objects – Distance measure to compute
distance between objects
– The value of k, the number of nearest neighbors to retrieve
• To classify an unknown object: – Compute distance to the training
objects
– Identify k nearest neighbors – Use class labels of nearest
neighbors to determine the class label of unknown object (e.g., by taking majority vote)
32
Unknown record
33
Definition of Nearest Neighbors
X X X
(a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor
K-nearest neighbors of an object x are data points that have
the k smallest distance to x
K Nearest Neighbors Algorithm
• Compute distance between two points: – Euclidean distance
• Determine the class from nearest neighbors list – take the majority vote of class labels among the k-nearest
neighbors (simple voting)
– Weigh the vote according to distance (weighted voting ) • weight factor, w = 1/d2
34
−= i ii
qpqpd 2)(),(
Choosing the Value of K
• If k is too small, sensitive to noise points • If k is too large, neighborhood may include points
from other classes
35
X
• k is usually chosen empirically by trying a
range of values
Scaling Values
• Attributes may have to be scaled to prevent distance measures from being dominated by one of the
attributes
• Example: – height of a person may vary from 1.5m to 1.8m – weight of a person may vary from 90lb to 300lb – income of a person may vary from $10K to $1M
36
Eager vs Lazy Learners
• KNN is a lazy learner – It does not build a model from the training data – Unlike eager learners such as decision tree induction and
Naïve Bayes
– Classifying unknown objects is relatively expensive
37
KNN Algorithm Input:
D //Training data
k //Number of neighbors
t //Input tuple to classify
Output:
c //Class to which t is assigned
KNN Algorithm: //Algorithm to classify tuple using KNN
N = Ø ;
//Find set of neighbors, N, for t
foreach d in D do
if |N| 隼 k then N = N d;
else if u in N such that distance(t, d) 隼 distance(t, u) replace d by the neighbor in N with the largest distance to t
c = class to which most u in N are classified
Source: Dunham, Data Mining に Introductory and Advanced Topics, Algorithm 4.2 page 92
Remarks
• If K = 1, nearest neighbor algorithm • Choice of K is important for KNN • [Dunham] As a rule of thumb,
ニ г ミ┌マHWヴ ラa デヴ;キミキミェ ゲ;マヮノWゲ ; commercial algorithms often use a default value of 10.
• Researchers have shown that the classification accuracy of KNN can be as accurate as more elaborated methods
• KNN is slow at the classification time • KNN does not produce an understandable model
39
KNN in Scikit-Learn
# Load libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target
40
KNN in Scikit-Learn (cont)
# Create standardizer
standardizer = StandardScaler()
# Standardize features
X_std = standardizer.fit_transform(X)
# Train a KNN classifier with 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)
# Create two observations
new_observations = [[ 0.75, 0.75, 0.75, 0.75], [ 1, 1, 1, 1]]
# Predict the class of two observations
knn.predict(new_observations)
array([1, 2])
41
KNeighborsClassifier Parameters
• Parameters we can set include: – n_neighbors: value of K (default is 5) – metric: 'minkowski' (the default)| 'euclidean'| 'manhattan' – p: integer, (default = 2) Power parameter for the
Minkowski metric
– weights: 'uniform' (the default)| 'distance'
• Notebook name: KNN.ipynb
42