data mining exam

profilejaffar11
merge_from_ofoct9.pdf

Data Preprocessing

CSC 535/635

1

Data Representation

2

Features (aka attributes)

. . . . Samples

.

.

.

Types of Variables/Attributes/Data

• Two most common types are: • Numeric

– real-value variables or integers variables – Ex: age, speed, or length – values of a numeric attribute have two important properties:

• an order relation (2 < 5 and 5 < 7) • a distance relation (d [2.3,4.2] = 1.9)

• Categorical – have neither order nor distance relation – Ex: eye color, gender, or country of citizenship – only support an equality relation (Blue = Blue, or Brown ≠ Black)

3

Types of Variables - Another classification

• Continuous variables – aka quantitative or metric variables – values are measured using either an interval scale or a

ratio scale

• Discrete variables – aka qualitative variables – values are measured using one of two kinds of nonmetric

scales — nominal or ordinal

4

Interval Scaled vs. Ratio Scaled Variables

• The ratio relation does not hold true

• The 0 point is placed arbitrarily, and thus it does not indicate the absence of value

• Ex: temperature F/C – 0 F does not mean

absence of temperature. – 80 F ≠ twice 40 F

5

• The ratio relation holds true

• Ratio scale has an absolute 0 point

• Ex: height, length, salary – 6 feet = twice 3 feet

Interval Scale Ratio Scale

Nominal Scaled vs. Ordinal Scaled Variables

• Order-less scale • May use different

symbols, characters, or numbers to represent the different states/values of the variable

• Ex: color, gender, id, customer_type

6

• Values have an order relation but not a distance relation

• Ex: student_class, military_rank, grade

Nominal Scale Ordinal Scale

Encoding Numerical Data as Ordinal

• An ordinal variable can be used to encode a numeric variable with a smaller set of values

• Ex: age (with values young, middle aged, and old) • Ex: income (with values low, middle-class, upper-

middle-class, and rich) • Ex: height (with values short, medium, tall)

7

Binary Variables

• Has one of two states: 0, 1 • Examples: smoker, owns-house • Can be considered a special case of nominal

variables

8

Why Data Preprocessing? • Data in the real world is dirty – incomplete: lacking attribute values, lacking certain

attributes of interest, or containing only aggregate data • e.g., occupation=“ ”

– noisy: containing errors or outliers • e.g., Salary=“-10”

– inconsistent: containing discrepancies in codes or names • e.g., Age=“42” Birthday=“03/07/1997” • e.g., Was rating “1,2,3”, now rating “A, B, C” • e.g., discrepancy between duplicate records

9

Why is Data Preprocessing Important?

• No quality data, no quality mining results! • Minimize GIGO (Garbage In Garbage Out) • Data preparation, cleaning, and transformation

comprises the majority of the work in a data mining application

10

Measures for Data Quality

• Accuracy • Completeness: not recorded, unavailable • Consistency: some modified but some not • Timeliness: timely updates • Believability • Interpretability

11

Major Tasks in Data Preprocessing

• Data cleaning – Fill in missing values, smooth noisy data, identify or remove outliers

and noisy data, and resolve inconsistencies

• Data integration – Integration of multiple databases, or files

• Data transformation – Normalization and aggregation

• Data reduction – Obtains reduced representation in volume but produces the same or

similar analytical results

• Data discretization

12

Data Cleaning

• Importance – “Data cleaning is the number one problem in data warehousing”

• Data cleaning tasks – Fill in missing values – Identify outliers and smooth out noisy data – Correct inconsistent data – Resolve redundancy caused by data integration

13

Missing Data Name Age Sex Income Class Mike 40 Male 150k Big spender Jenny 20 Female ? Regular …

14

• Missing data may be due to – equipment malfunction – inconsistent with other recorded data and thus deleted – data not entered due to misunderstanding – certain data may not be considered important at the time of

entry • Missing data may need to be inferred

How to Handle Missing Data?

• Ignore the tuple • Fill in missing values manually: tedious & infeasible • Fill in it automatically with

– a global constant : e.g., “unknown” • may confuse DM algorithm

– the attribute mean or median – the attribute mean for all samples belonging to the same class – the most probable value: using inference-based tools such as

Bayesian formula or decision tree induction

15

Measuring the Central Tendency

• Mean:

– Weighted arithmetic mean:

– Trimmed mean: chopping extreme values

• Median:

– Middle value if odd number of values, or average of the middle two values otherwise

• Mode

– Value that occurs most frequently in the data

– Unimodal, bimodal, trimodal, multimodal

– Empirical formula:

16

Symmetric vs. Skewed Data

17

frequency curve

positively skewed negatively skewed

• Median, mean and mode of symmetric, positively and negatively skewed data

Also referred to as skewed to the right

Noisy Data

• Noise: random error or variance in a measured variable • Incorrect attribute values may due to

– faulty data collection instruments – data entry problems – data transmission problems – etc

• Other data problems which requires data cleaning – duplicate records, incomplete data, inconsistent data

18

How to Handle Noisy Data?

• Binning method – first sort data and partition into (equal-depth) bins – then one can smooth by bin means, smooth by bin median,

smooth by bin boundaries, etc.

• Clustering – detect and remove outliers

19

How to Handle Noisy Data? (cont.)

• Regression – smooth by fitting the data into regression functions

• Combined computer and human inspection – detect suspicious values and check by human (e.g., deal with

possible outliers)

20

Binning Methods for Data Smoothing

• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

• Partition into (equal-depth) bins: – Bin 1: 4, 8, 9, 15 – Bin 2: 21, 21, 24, 25 – Bin 3: 26, 28, 29, 34

• Smoothing by bin means: – Bin 1: 9, 9, 9, 9 – Bin 2: 23, 23, 23, 23 – Bin 3: 29, 29, 29, 29

• Smoothing by bin boundaries: – Bin 1: 4, 4, 4, 15 – Bin 2: 21, 21, 25, 25 – Bin 3: 26, 26, 26, 34

21

Outlier Removal

• Data points inconsistent with the majority of data • Different outliers

– Valid: CEO’s salary, – Noisy: One’s age = 200, widely deviated points

• Removal methods – Clustering – Curve-fitting

22

Data Integration

• Data integration: – combines data from multiple sources

• Schema integration – integrate metadata from different sources – Entity identification problem: identify real world entities from

multiple data sources, e.g., A.cust-id  B.cust-# • Detecting and resolving data value conflicts

– for the same real world entity, attribute values from different sources can be different, e.g., different scales, metric vs. British units

• Removing duplicates and redundant data

23

Data Transformation

• Transforming the data into a form appropriate for mining.

• Methods: – Smoothing: remove noise from data – Aggregation: summarization – Normalization: scaled to fall within a small, specified range • min-max normalization • z-score normalization • normalization by decimal scaling

– Attribute/feature construction • New attributes constructed from the given ones

24

Data Transformation: Normalization

25

• Given numeric attribute A with values v1, v2, …, vN • min-max normalization: to [new_minA, new_maxA]

• z-score normalization

• decimal scaling normalization Where j is the number of digits in the data value with the largest absolute value

Min-Max Normalization – Example

26

Z-Score Normalization – Example

27

Decimal Scaling Normalization – Example

• Suppose values of A range from -4986 to 7845 • To normalize by decimal scaling, divide by _____ • -4986 normalizes to -0.4986 • 7845 normalizes to 0.7845

28

Where j is the number of digits in the data value with the largest absolute value

Variation of Z-Score Normalization

29

Data Reduction

• Data may be too big to work with • Data reduction – Obtain a reduced representation of the data set that is

smaller in volume but yet produce the same (or almost the same) analytical results

• Data reduction strategies – Dimensionality reduction: e.g., remove unimportant

attributes – Reducing the number of attribute values – Reducing the number of tuples

30

Dimensionality Reduction

• Feature selection (i.e., attribute subset selection) – Select a minimum set of attributes (features) that is

sufficient for the data mining task

• Heuristic methods (due to exponential # of choices) – step-wise forward selection – step-wise backward elimination – combining forward selection and backward elimination

31

Dimensionality Reduction

• Other techniques include • Principle component analysis • Wavelet transforms

32

Reducing the Number of Attribute Values

• Some algorithms such as decision trees compare different attribute’s values

• Approaches – Binning: replace with bin means, medians, boundaries – Discretization: discretize a numeric attribute (income) into

a small number of intervals, then map each interval to a discrete symbol (low, middle-income, high)

33

Reducing The Number of Tuples – Sampling

• Sampling: obtaining a small sample s to represent the whole dataset N

• Simple random sampling may have poor performance if the dataset is skewed

• Choose a representative subset of the data – Stratified sampling: approximate the percentage of each

class (or subpopulation of interest) in the overall database • Used in conjunction with skewed data

34

Stratified Sampling

35

Raw Data Stratified Sample

36

Sampling: with or without Replacement

SRSW OR

(simpl e rand

om

samp le with

out

replac ement

)

SRSWR

Raw Data

Discretization

• Three types of attributes: – Nominal — values from an unordered set – Ordinal — values from an ordered set – Continuous — real numbers

• Discretization: – discretize the range of a continuous attribute

• Some techniques: – Binning methods – Clustering-based methods – Entropy-based methods (Decision Trees) – Histogram analysis

37

38

Histogram Analysis

• Graph displays of basic statistical class descriptions – Frequency histograms – A graphical method – Consists of a set of rectangles that reflect the counts or

frequencies of the classes present in the given data

Discretization and Concept Hierarchies

• Discretization – Reduce the number of values for a given continuous attribute by

dividing the range of the attribute into intervals. – Interval labels can then be used to replace actual data values

• Concept hierarchies – Reduce the data by collecting and replacing low level concepts

(numeric values for age) by higher level concepts (such as young, middle-aged, or senior)

– Things can be done at different levels – Concept hierarchies for nominal attributes, where attributes

such as street can be generalized to higher-level concepts, like city, state, or country

39

Concept Hierarchy

• Concept hierarchy for attribute price

40

Concept Hierarchy

41

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674,339 distinct values

• Concept hierarchy for attribute street

Summary

• Data preparation is a big issue for data mining • Data preparation includes – Data cleaning and data integration – Data reduction and feature selection – Discretization

• Many methods have been proposed but still an active area of research

42

Measuring the Dispersion of Data

44

Properties of Normal Distribution Curve

• The normal (distribution) curve – From μ–σ to μ+σ: contains about 68% of the measurements

(μ: mean, σ: standard deviation) – From μ–2σ to μ+2σ: contains about 95% of it – From μ–3σ to μ+3σ: contains about 99.7% of it

Measuring the Dispersion of Data (cont.) • Let x1,x2, . . . ,xN be the values of a

numeric attribute, X. • range(X) = max – min • Assume that the values of X are

sorted in increasing order • The kth q-quantile is the value x s.t.

at most k/q of the data values are less than x and at most (q-k)/q of the data values are more than x

• Quantiles are values in X that allow us to break the data distribution into equal-size consecutive sets

• When we break X into 4 equal parts, the quantiles are called quartiles

• When we break X into 100 equal parts, the quantiles are called percentiles

Measuring the Dispersion of Data (cont.)

• Quartiles: – Q1 (25th percentile) – Q2 (50th percentile) =

median – Q3 (75th percentile)

• Inter-quartile range: IQR = Q3 – Q1

• Five number summary: min, Q1, M, Q3, max

• Outlier: usually, a value higher/lower than 1.5 x IQR above/below Q3/Q1

• A way of visualizing a distribution • Shows five-number summary:

min, Q1, M, Q3, max • Boxplot:

• Data is represented with a box

• The ends of the box are at the first and third quartiles -- the height of the box is IQR

• The median is marked by a line within the box

• Whiskers: two lines outside the box extend to Minimum and Maximum

Boxplots

Boxplots (cont.)

• Outlier: usually, a value higher/lower than 1.5 x IQR

• Whiskers extend to extreme lows only if less/higher than 1.5 x IQR Q1/Q3

• Otherwise, terminate at terminate at 1.5 x IQR beyond Q1/Q3

• Remaining points are plotted individually

Boxplots of Symmetric & Skewed Data

49

Symmetric Skewed right (positive)

Skewed left (negative)

50

Scatter plots • Provides a first look at bivariate data to see clusters of points,

outliers, … etc. • Each pair of values is treated as a pair of coordinates and

plotted as points in the plane

Data Preprocessing

CSC 535/635

1

Data Representation

2

Features (aka attributes)

. . . . Samples

.

.

.

Types of Variables/Attributes/Data

• Two most common types are: • Numeric

– real-value variables or integers variables – Ex: age, speed, or length – values of a numeric attribute have two important properties:

• an order relation (2 < 5 and 5 < 7) • a distance relation (d [2.3,4.2] = 1.9)

• Categorical – have neither order nor distance relation – Ex: eye color, gender, or country of citizenship – only support an equality relation (Blue = Blue, or Brown ≠ Black)

3

Types of Variables - Another classification

• Continuous variables – aka quantitative or metric variables – values are measured using either an interval scale or a

ratio scale

• Discrete variables – aka qualitative variables – values are measured using one of two kinds of nonmetric

scales — nominal or ordinal

4

Interval Scaled vs. Ratio Scaled Variables

• The ratio relation does not hold true

• The 0 point is placed arbitrarily, and thus it does not indicate the absence of value

• Ex: temperature F/C – 0 F does not mean

absence of temperature. – 80 F ≠ twice 40 F

5

• The ratio relation holds true

• Ratio scale has an absolute 0 point

• Ex: height, length, salary – 6 feet = twice 3 feet

Interval Scale Ratio Scale

Nominal Scaled vs. Ordinal Scaled Variables

• Order-less scale • May use different

symbols, characters, or numbers to represent the different states/values of the variable

• Ex: color, gender, id, customer_type

6

• Values have an order relation but not a distance relation

• Ex: student_class, military_rank, grade

Nominal Scale Ordinal Scale

Encoding Numerical Data as Ordinal

• An ordinal variable can be used to encode a numeric variable with a smaller set of values

• Ex: age (with values young, middle aged, and old) • Ex: income (with values low, middle-class, upper-

middle-class, and rich) • Ex: height (with values short, medium, tall)

7

Binary Variables

• Has one of two states: 0, 1 • Examples: smoker, owns-house • Can be considered a special case of nominal

variables

8

Why Data Preprocessing? • Data in the real world is dirty – incomplete: lacking attribute values, lacking certain

attributes of interest, or containing only aggregate data • e.g., occupation=“ ”

– noisy: containing errors or outliers • e.g., Salary=“-10”

– inconsistent: containing discrepancies in codes or names • e.g., Age=“42” Birthday=“03/07/1997” • e.g., Was rating “1,2,3”, now rating “A, B, C” • e.g., discrepancy between duplicate records

9

Why is Data Preprocessing Important?

• No quality data, no quality mining results! • Minimize GIGO (Garbage In Garbage Out) • Data preparation, cleaning, and transformation

comprises the majority of the work in a data mining application

10

Measures for Data Quality

• Accuracy • Completeness: not recorded, unavailable • Consistency: some modified but some not • Timeliness: timely updates • Believability • Interpretability

11

Major Tasks in Data Preprocessing

• Data cleaning – Fill in missing values, smooth noisy data, identify or remove outliers

and noisy data, and resolve inconsistencies

• Data integration – Integration of multiple databases, or files

• Data transformation – Normalization and aggregation

• Data reduction – Obtains reduced representation in volume but produces the same or

similar analytical results

• Data discretization

12

Data Cleaning

• Importance – “Data cleaning is the number one problem in data warehousing”

• Data cleaning tasks – Fill in missing values – Identify outliers and smooth out noisy data – Correct inconsistent data – Resolve redundancy caused by data integration

13

Missing Data Name Age Sex Income Class Mike 40 Male 150k Big spender Jenny 20 Female ? Regular …

14

• Missing data may be due to – equipment malfunction – inconsistent with other recorded data and thus deleted – data not entered due to misunderstanding – certain data may not be considered important at the time of

entry • Missing data may need to be inferred

How to Handle Missing Data?

• Ignore the tuple • Fill in missing values manually: tedious & infeasible • Fill in it automatically with

– a global constant : e.g., “unknown” • may confuse DM algorithm

– the attribute mean or median – the attribute mean for all samples belonging to the same class – the most probable value: using inference-based tools such as

Bayesian formula or decision tree induction

15

Measuring the Central Tendency

• Mean:

– Weighted arithmetic mean:

– Trimmed mean: chopping extreme values

• Median:

– Middle value if odd number of values, or average of the middle two values otherwise

• Mode

– Value that occurs most frequently in the data

– Unimodal, bimodal, trimodal, multimodal

– Empirical formula:

16

Symmetric vs. Skewed Data

17

frequency curve

positively skewed negatively skewed

• Median, mean and mode of symmetric, positively and negatively skewed data

Also referred to as skewed to the right

Noisy Data

• Noise: random error or variance in a measured variable • Incorrect attribute values may due to

– faulty data collection instruments – data entry problems – data transmission problems – etc

• Other data problems which requires data cleaning – duplicate records, incomplete data, inconsistent data

18

How to Handle Noisy Data?

• Binning method – first sort data and partition into (equal-depth) bins – then one can smooth by bin means, smooth by bin median,

smooth by bin boundaries, etc.

• Clustering – detect and remove outliers

19

How to Handle Noisy Data? (cont.)

• Regression – smooth by fitting the data into regression functions

• Combined computer and human inspection – detect suspicious values and check by human (e.g., deal with

possible outliers)

20

Binning Methods for Data Smoothing

• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

• Partition into (equal-depth) bins: – Bin 1: 4, 8, 9, 15 – Bin 2: 21, 21, 24, 25 – Bin 3: 26, 28, 29, 34

• Smoothing by bin means: – Bin 1: 9, 9, 9, 9 – Bin 2: 23, 23, 23, 23 – Bin 3: 29, 29, 29, 29

• Smoothing by bin boundaries: – Bin 1: 4, 4, 4, 15 – Bin 2: 21, 21, 25, 25 – Bin 3: 26, 26, 26, 34

21

Outlier Removal

• Data points inconsistent with the majority of data • Different outliers

– Valid: CEO’s salary, – Noisy: One’s age = 200, widely deviated points

• Removal methods – Clustering – Curve-fitting

22

Data Integration

• Data integration: – combines data from multiple sources

• Schema integration – integrate metadata from different sources – Entity identification problem: identify real world entities from

multiple data sources, e.g., A.cust-id  B.cust-# • Detecting and resolving data value conflicts

– for the same real world entity, attribute values from different sources can be different, e.g., different scales, metric vs. British units

• Removing duplicates and redundant data

23

Data Transformation

• Transforming the data into a form appropriate for mining.

• Methods: – Smoothing: remove noise from data – Aggregation: summarization – Normalization: scaled to fall within a small, specified range • min-max normalization • z-score normalization • normalization by decimal scaling

– Attribute/feature construction • New attributes constructed from the given ones

24

Data Transformation: Normalization

25

• Given numeric attribute A with values v1, v2, …, vN • min-max normalization: to [new_minA, new_maxA]

• z-score normalization

• decimal scaling normalization Where j is the number of digits in the data value with the largest absolute value

Min-Max Normalization – Example

26

Z-Score Normalization – Example

27

Decimal Scaling Normalization – Example

• Suppose values of A range from -4986 to 7845 • To normalize by decimal scaling, divide by _____ • -4986 normalizes to -0.4986 • 7845 normalizes to 0.7845

28

Where j is the number of digits in the data value with the largest absolute value

Variation of Z-Score Normalization

29

Data Reduction

• Data may be too big to work with • Data reduction – Obtain a reduced representation of the data set that is

smaller in volume but yet produce the same (or almost the same) analytical results

• Data reduction strategies – Dimensionality reduction: e.g., remove unimportant

attributes – Reducing the number of attribute values – Reducing the number of tuples

30

Dimensionality Reduction

• Feature selection (i.e., attribute subset selection) – Select a minimum set of attributes (features) that is

sufficient for the data mining task

• Heuristic methods (due to exponential # of choices) – step-wise forward selection – step-wise backward elimination – combining forward selection and backward elimination

31

Dimensionality Reduction

• Other techniques include • Principle component analysis • Wavelet transforms

32

Reducing the Number of Attribute Values

• Some algorithms such as decision trees compare different attribute’s values

• Approaches – Binning: replace with bin means, medians, boundaries – Discretization: discretize a numeric attribute (income) into

a small number of intervals, then map each interval to a discrete symbol (low, middle-income, high)

33

Reducing The Number of Tuples – Sampling

• Sampling: obtaining a small sample s to represent the whole dataset N

• Simple random sampling may have poor performance if the dataset is skewed

• Choose a representative subset of the data – Stratified sampling: approximate the percentage of each

class (or subpopulation of interest) in the overall database • Used in conjunction with skewed data

34

Stratified Sampling

35

Raw Data Stratified Sample

36

Sampling: with or without Replacement

SRSW OR

(simpl e rand

om

samp le with

out

replac ement

)

SRSWR

Raw Data

Discretization

• Three types of attributes: – Nominal — values from an unordered set – Ordinal — values from an ordered set – Continuous — real numbers

• Discretization: – discretize the range of a continuous attribute

• Some techniques: – Binning methods – Clustering-based methods – Entropy-based methods (Decision Trees) – Histogram analysis

37

38

Histogram Analysis

• Graph displays of basic statistical class descriptions – Frequency histograms – A graphical method – Consists of a set of rectangles that reflect the counts or

frequencies of the classes present in the given data

Discretization and Concept Hierarchies

• Discretization – Reduce the number of values for a given continuous attribute by

dividing the range of the attribute into intervals. – Interval labels can then be used to replace actual data values

• Concept hierarchies – Reduce the data by collecting and replacing low level concepts

(numeric values for age) by higher level concepts (such as young, middle-aged, or senior)

– Things can be done at different levels – Concept hierarchies for nominal attributes, where attributes

such as street can be generalized to higher-level concepts, like city, state, or country

39

Concept Hierarchy

• Concept hierarchy for attribute price

40

Concept Hierarchy

41

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674,339 distinct values

• Concept hierarchy for attribute street

Summary

• Data preparation is a big issue for data mining • Data preparation includes – Data cleaning and data integration – Data reduction and feature selection – Discretization

• Many methods have been proposed but still an active area of research

42

Measuring the Dispersion of Data

44

Properties of Normal Distribution Curve

• The normal (distribution) curve – From μ–σ to μ+σ: contains about 68% of the measurements

(μ: mean, σ: standard deviation) – From μ–2σ to μ+2σ: contains about 95% of it – From μ–3σ to μ+3σ: contains about 99.7% of it

Measuring the Dispersion of Data (cont.) • Let x1,x2, . . . ,xN be the values of a

numeric attribute, X. • range(X) = max – min • Assume that the values of X are

sorted in increasing order • The kth q-quantile is the value x s.t.

at most k/q of the data values are less than x and at most (q-k)/q of the data values are more than x

• Quantiles are values in X that allow us to break the data distribution into equal-size consecutive sets

• When we break X into 4 equal parts, the quantiles are called quartiles

• When we break X into 100 equal parts, the quantiles are called percentiles

Measuring the Dispersion of Data (cont.)

• Quartiles: – Q1 (25th percentile) – Q2 (50th percentile) =

median – Q3 (75th percentile)

• Inter-quartile range: IQR = Q3 – Q1

• Five number summary: min, Q1, M, Q3, max

• Outlier: usually, a value higher/lower than 1.5 x IQR above/below Q3/Q1

• A way of visualizing a distribution • Shows five-number summary:

min, Q1, M, Q3, max • Boxplot:

• Data is represented with a box

• The ends of the box are at the first and third quartiles -- the height of the box is IQR

• The median is marked by a line within the box

• Whiskers: two lines outside the box extend to Minimum and Maximum

Boxplots

Boxplots (cont.)

• Outlier: usually, a value higher/lower than 1.5 x IQR

• Whiskers extend to extreme lows only if less/higher than 1.5 x IQR Q1/Q3

• Otherwise, terminate at terminate at 1.5 x IQR beyond Q1/Q3

• Remaining points are plotted individually

Boxplots of Symmetric & Skewed Data

49

Symmetric Skewed right (positive)

Skewed left (negative)

50

Scatter plots • Provides a first look at bivariate data to see clusters of points,

outliers, … etc. • Each pair of values is treated as a pair of coordinates and

plotted as points in the plane

Introduction to IPython

Reference: • Learning IPython for Interactive

Computing and Data Visualization, Second Edition, by Cyrille Rossant

• Python for Data Analysis, by Wes McKinney

• Python Data Science Handbook: Essential Tools for Working with Data, by Jake VanderPlas, 2016

What is it?

• IPython – Interactive Python • Provides a convenient command-line

interface to the scientific Python platform

2

Installing IPython

• Anaconda is a pre-packaged Python distribution that comes with many libraries for ML and data science

• Download from https://www.anaconda.com/download/

• We use Python 3

3

Starting Python • On Windows, open Anaconda from the

Start Menu and select Anaconda Prompt • Type python • Notice how prompt changes to >>> • Type exit() to exit Python

4

Starting the IPython Console • At the command prompt, type ipython • Notice the prompt • Type exit to exit ipython

5

Jupyter Notebook

• It comes with Anaconda ipython • To start, at the command prompt type

jupyter notebook • This starts the Jupyter server and opens a

new window in the browser at address localhost:8888

6

7

Closing the Notebook Server

• To close the notebook server, at the terminal window press Ctrl + C

8

Starting a new notebook

• In the Files tab, click New, and click Notebooks Python 3

9

Starting a new notebook (cont)

• This opens a new browser tab with the new notebook as follows

10

Types of Cells • A notebook consists of a linear list of cells • There are two main types of cells: • Markdown cells: contain rich text that you

can format and can contain images, HTML code, Latex math equations, and more – Ex: next slide

• Code cells: contain code to be executed by ipython – Ex: slide after next

11

Markdown Cell

12

Code Cell

13

Magic commands • Magic commands allow us to interact with

the file system • Magic commands start with %

14

In [1]:%pwd Out[1]: 'c:\\TEACHING\\Summer\\17\\minibook-2nd\\chapter1‘ In [2]: %cd chapter1\facebook Out[2]: c:\TEACHING\Summer\17\minibook-2nd\chapter1\facebook In [3]: %ls

Running Python Scripts from IPython

• Suppose we have a program saved in the file factorial.py that is located in the current directory. You can execute the file using any of these commands

15

In [3]: !ipython factorial.py 120 3628800

In [4]: !python factorial.py 120 3628800 In [5]:%run factorial.py 120 3628800

NumPy • NumPy stands for Numerical Python • NumPy provides a multidimensional array

data structure (ndarray) • NumPy arrays are more efficient than

Python’s lists • Many operations are performed on an

element-wise basis – called vector (or vectorized) operations

16

Creating Arrays

17

18

19

20

np.random.randint(low, high = None, size = None) generates random integers from [low, high) If high is None (the default), then results are from [0, low)

NumPy Standard Data Types Data type Description

bool_ Boolean (True or False) stored as a byte

int_ Default integer type (same as C long; normally either int64 or int32)

intc Identical to C int (normally int32 or int64)

intp Integer used for indexing (same as C ssize_t; normally either int32 or int64)

int8 Byte (-128 to 127)

int16 Integer (-32768 to 32767)

int32 Integer (-2147483648 to 2147483647)

int64 Integer (-9223372036854775808 to 9223372036854775807)

uint8 Unsigned integer (0 to 255)

uint16 Unsigned integer (0 to 65535)

uint32 Unsigned integer (0 to 4294967295)

uint64 Unsigned integer (0 to 18446744073709551615)

float_ Shorthand for float64.

float16 Half precision float: sign bit, 5 bits exponent, 10 bits mantissa

float32 Single precision float: sign bit, 8 bits exponent, 23 bits mantissa

float64 Double precision float: sign bit, 11 bits exponent, 52 bits mantissa

complex_ Shorthand for complex128.

complex64 Complex number, represented by two 32-bit floats

complex128 Complex number, represented by two 64-bit floats

21

NumPy Array Attributes • ndim: number of dimensions • shape: the number of elements in each dimension • size: the total number of elements in the array

22

23

Array Indexing

24

25

Array Slicing: Accessing Subarrays

26

Multidimensional Arrays

27

Subarrays as no-copy views: Unlike in Standard Python, array slices return views rather than copies of the array data

28

Creating copies of arrays: use the copy() method

29

Reshaping of Arrays

Does reshape create a view or a copy of the original array?

30

31

• We can use np.newaxis to increase the dimension of an array by one dimension

32

Vector Operations on Arrays

Using User-Defined Functions with ndarray Objects

33

Boolean Operations on Arrays

• Comparison and logical operations in ndarray objects work on an element-wise basis

• Evaluating conditions yield by default a Boolean ndarray object

• Use the bitwise logical operators (&, |, ^, ~) to perform element-wise logical operations

34

35

Boolean Arrays as Masks

• Boolean arrays can be used as masks, to select particular subsets of the data

36

37

38

Counting Entries

39

40

np.any() and np.all()

np.where()

• The np.where() function, allows the definition of actions depending on whether a condition is True or False

• The result of applying np.where() is a new ndarray object of the same shape as the original one

41

`

42

Aggregate Functions

• They are useful to compute summary statistics for large data

• NumPy provides many functions for this including: sum, max, min, mean, std, var, median

• These functions are faster than their Python counterparts

43

NumPy functions are faster than their Python counterparts

44

Applying Aggregate Functions to Rows/Columns of 2D Array

45

More on Axis/Axes • In multidimensional arrays, there is an axis per dimension (ex: a[3,1, 2]) • First axis runs across the rows, second axis runs across the columns • When an aggregate function is be applied along an axis, think about

the axis as the dimension of the array that will be collapsed, rather than the dimension that will be returned

– In case of axis=0, 1st dimension is collapsed; i.e., NumPy performs the action on elements of each column

– If axis=1, NumPy performs the action on rows

• Another way to think about axis is as the direction along which the operation is performed

46

47

Broadcasting

• For arrays of the same shape, binary operations are performed on an element- by-element basis

• Broadcasting allows binary operations to be performed on arrays of different sizes

48

In [1]: a = np.array([1, 2, 3])

b = np.array([4, 5, 6])

a + b

Out[1]: array([ 5, 7 , 9])

In [2]: a + 2

Out[2]: array([ 3, 4 , 5]) a + 2 z [1, 2, 3] + [2, 2, 2]

Broadcasting (cont.)

49

a [[1, 2, 3],

[1, 2, 3],

[1, 2, 3]]

Rules of Broadcasting • A set of rules that control what happens when a

binary operation is applied to two arrays of different shapes

50

Rules of Broadcasting - Rule 1 • Rule 1: If the two arrays differ in their number of

dimensions, the shape of the array with fewer dimensions is padded with ones on its leading (left) side

• Rule 1 applies:

51

a = np.arange(12).reshape(3, 4)

b = np.array([1 , 2 , 3, 4])

a + b

a.shape = (3, 4)

b.shape = (4,)

b.shape -> (1, 4)

Rules of Broadcasting – Rule 2 • Rule 2: If the shape of the two arrays does not

match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape

• Rule 2 applies to the array b

52

a.shape = (3, 4)

b.shape = (1, 4)

a = np.arange(12).reshape(3, 4)

b = np.array([1 , 2 , 3, 4]).reshape(1,4)

a + b

b.shape -> (3, 4) b [[1, 2, 3, 4],

[1, 2, 3, 4],

[1, 2, 3, 4]]

b [[1, 2, 3, 4]]

Rules of Broadcasting (cont.) • Rule 3: If in any dimension the sizes disagree

and neither is equal to 1, an error is raised

• By rule 3, the arrays a and b are not compatible

53

a.shape = (3, 4)

b.shape = (2, 2)

a = np.arange(12).reshape(3, 4)

b = np.array([1 , 2 , 3, 4]).reshape(2,2)

a + b

Broadcasting Example 1 • Consider adding the following two arrays

• The shapes of the arrays are

• By rule 1, array a has fewer dimensions, so we pad its shape on the left by ones:

• By rule 2, the shape of the two arrays do not match on the first dimension, so we stretch this dimension in a to match:

54

In [3]: M = np.ones((2, 3))

a = np.arange(3)

M.shape = (2, 3)

a.shape = (3,)

M.shape -> (2, 3)

a.shape -> (1, 3)

M is [[1, 1, 1],

[1, 1, 1]]

a is [0, 1, 2]

M is [[1, 1, 1],

[1, 1, 1]]

a is [[0, 1, 2]]

M is [[1, 1, 1],

[1, 1, 1]]

a is [[0, 1, 2],

[0, 1, 2]]

M.shape -> (2, 3)

a.shape -> (2, 3)

M + a is

[[1, 2, 3],

[1, 2, 3]]

Broadcasting Example 2 • Consider adding the following two arrays

• The shapes of the arrays are

• By rule 1, we pad the shape of a on the left by 1s:

• By rule 2, we upgrade each of these ones to match the corresponding size of the other array:

55

a = np.arange(3)

b = np.arange(3).reshape(3, 1)

a.shape = (3,)

b.shape = (3,1)

a.shape -> (1, 3)

b.shape -> (3, 1)

a is [0, 1, 2]

b is [[0],

[1],

[2]]

b [[0],

[1],

[2]]

a is [[0, 1, 2],

[0, 1, 2],

[0, 1, 2]]

a.shape -> (3, 3)

b.shape -> (3, 3)

Now the two arrays have

the same shape and can

be added

b is [[0, 0, 0],

[1, 1, 1],

[2, 2, 2]]

a + b is [[0, 1, 2],

[1, 2, 3],

[2, 3, 4]]

a [[0, 1, 2]]

Broadcasting Example 3 • Consider adding the following two arrays

• The shapes of the arrays are

• By rule 1, we pad array a by 1s on the left:

• By rule 2, the first dimension of a is stretched to match that of M:

• Now rule 3 applies, the two arrays do not match on the second dimension and neither is equal to 1, so the two arrays are not compatible 56

In [3]: M = np.ones((3, 2))

a = np.arange(3)

M.shape = (3, 2)

a.shape = (3,)

M.shape -> (3, 2)

a.shape -> (1, 3)

M is [[1, 1],

[1, 1],

[1, 1]]

a is [0, 1, 2]

M [[1, 1],

[1, 1],

[1, 1]]

a [[0, 1, 2],

[0, 1, 2],

[0, 1, 2]]

M.shape -> (3, 2)

a.shape -> (3, 3)

a is [[0, 1, 2]]

M [[1, 1],

[1, 1],

[1, 1]]

Pandas

• It is an IPython library for data manipulation • Has functions to work with any kind of data such

as tabular data

1

Data Structures of Pandas

• All data representation in Pandas is done using one of two data structures: – Series – DataFrame

2

Series • A Series is a one-dimensional array-like object

containing a sequence of values and an associated array of data labels, called its index

• A Series object wraps both its values and indices

3

Series

4

Series • You can create a Series with an index identifying

each data point with a label

5

6

Series • Another way to think about a Series is as a dict,

as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict

7

The index/label values in a Series do not have to be unique

8

DataFrame • Dataframe (DF) is the most important and useful data

structure in pandas • It is used for almost all kind of data representation and

manipulation in pandas • A DF is a table containing rows (representing

objects/observations) and columns (representing features/variables/attributes)

• Unlike numpy arrays, it can contain heterogeneous data • DataFrames are useful for representing raw datasets

and processed datasets in ML and data science

9

Relationship between DataFrames and Series

• We can think of a DataFrame as a sequence of aligned Series objects – aligned means that they share the same

index

10

Constructing Data Frame Objects

• A DataFrame object can be constructed in different ways, including: – from a list of dicts – from a dict of dicts – from a list of Series objects – from a dict of Series objects – from a 2-D NumPy array – from a 2-D Python list

11

List of Dictionaries to a Dataframe

12

Notice how dictionary keys become column names in the DF and each dictionary becomes a row in the DF

A Dictionary of Series Objects to a Dataframe

13

A 2-D List to a DataFrame

14

You can change the index and labels of a DF, using df.index = [‘a’, ‘b’, ‘c’] df.columns = [‘x’, ‘y’, ‘z’]

Data Retrieval

• Pandas provides different ways to retrieve data into a DataFrame. For example, we can read flat files, csv files, database tables, …etc.

15

CSV Files to a Dataframe

• We can read a CSV, or any delimited file, using pandas and get the result as a DF

• This is done using panda’s read_csv() function

• Usually, the only parameter needed is the name of the input file but sometimes you may need to add more parameters

16

Examples

• We are using the files simplemaps- worldcities-basic.csv and nyc_taxi data

17

Examples

18

19

DF.describe() gives basic statistics of the columns w/o param include=‘all’, function shows only columns with numeric types

20

Keyword argument parse_dates is used to specify the columns that contain dates so that pandas can parse them correctly

21

22

Descriptive Statistics with Pandas

23

• The following functions do as their names indicate

Selecting Data

• Pandas allows us to select rows, columns, and combinations of rows and columns, … etc.

24

Selecting Columns

25

Selecting Rows • Rows can be retrieved by position or name

– Use loc attribute to select row(s) from their labels – Use iloc attribute to select row(s) from their positions

• The following example show how to select the first row:

26

27

Remark: The row selection syntax data[:2] is provided as a convenience. Passing a single element or a list to the [] operator selects columns

28

Using DF.loc[rows_labels_names] or DF.iloc[rows_indexes] is the proper way to select rows. The slice notation, DF[:], is provided for convenience.

Here, you needed to give the column labels, because writing DF[] is the notation to select columns.

Aside: More about loc and iloc

29

Filtering with Boolean Indexing

30

Operating on Data • Vector arithmetic can be used with DataFrames and

Series objects • Example: adding a new column to the DF with the trip

duration in minutes

31

More on Vector Operations

32

The result is a new Series object containing the aligned sum of a and b

More on Vector Operations

33

[[0, 1, 2, 3],

[4, 5, 6, 7],

[8, 9, 10, 11]]

[[0, 1, 2, 3],

[0, 1, 2, 3],

[0, 1, 2, 3]]

Handling Missing Data

• In pandas, missing data is represented by NaN (Not a Number) or None

• Pandas provides several Series and DataFrame methods to deal with missing data, including: – isnull() indicates whether values are null or not – notnull() the opposite of isnull() – dropna() removes missing data – fillna(some_default_value) replaces missing data with

a default value

34

35

Working with a Series

36

37

Working with a DataFrame

38

Passing how='all' will only drop rows that are all NA

39

The thresh argument is used to keep rows/columns containing a certain number of observations.

To drop columns, use the argument axis=1 or axis = ‘columns’

The groupby Operation • Similar to the group by operation in SQL

where: – data is split into groups that share common

attributes – a function is applied to each group – function results are combined into a result

object

40

Illustration of a group aggregation

41

Grouping Key

• Data is split into groups based on the vales of a key or several keys

• Each grouping key can be – a list, array, dict, or series object – a column name in a DataFrame

• The keys do not have to be all of the same type

42

43

44

45

Using arrays as the grouping keys

46

The GroupBy method size() returns group sizes

Notice: there is no key2 column in the result as df[‘key2’] is not numeric data

Joins

• pd.merge • The idea is similar to joins in databases. That is

to join data from more than one table • Consider the NY city taxi dataset • The DFs data and fare have same number of

rows (one row for each ride) • We want to get the average tip from the fare DF • We will then join the average tip into the data DF

47

48

49

More about the merge Operation • Idea is similar to the join operation in SQL where

pandas will combine two DFs where values under common columns are the same

50

51

52

The on Keyword Argument

• The attribute on allows us to explicitly specify the name of the key column(s)

• It takes either a column name or a list of column names

53

54

If the columns to join on have different names, we can use the left_on and right_on parameters to specify the column names

We can use the DataFrame’s drop() method to remove redundant columns

55 Removing a row

56

DataFrame.set_index

57

58

Joining on Indexes

59

The join() method performs a merge that defaults to joining on indices

We can mix merging on indices and columns

60

Attributes lsuffix and rsuffix

61

Attributes lsuffix and rsuffix

The how Attribute

• Notice, by default pandas performs inner join (intersect)

• For full outer join (union), use the attribute how = 'outer'

• Inner join may be specified explicitly using how = 'inner'

• Other possible values for how are left and right for left/right outer joins

62

63

64

65

66

Renaming Columns in a DF – Function rename

67

DF.rename

68

DF.rename

Accessing an Element in a DF

69

Writing df1[0,1] or df1[[0,1]] is wrong

Data Visualization

• Data visualization makes it easier to sea patterns in the data than just looking at tables of numbers

• The Anscombe data set contains four sets of data, each contains two continuous variables

• Each set has the same mean, variance, correlation, and regression line

• Only when the data are visualized does it become obvious that each set does not follow the same pattern

1

The Anscombe Data Set

2

The Anscombe Data Set

3

The Anscombe Data Set

4

Plot of the Anscombe Data Set

5

Matplotlib

• matplotlib is Python’s fundamental plotting library

import matplotlib.pyplot as plt

• Calling plt.plot with two iterable objects of the same length (e.g., lists of numbers or NumPy arrays) creates a line plot

6

Launching the Qtconsole

ひ Open anaconda command prompt ひ Type

jupyter qtconsole

ひ To be able to se the charts in the console, type %matplotlib inline

ひ Same for jupyter notebook ひ In jupyter notebook, you may need plt.show()

after writing the code for a plot

7

Launching the Qtconsole

8

9

Creating a Line Plot

Remarks

• A plot represents value pairs (x, y) – need two arrays – x contains values on the x-axis – y contains values on the y-axis

• If you pass only one array to the plt.plot() function, matplotlib assumes it is the sequence of y values – it associates the y values with the sequence

x: 0,1,2,3, ... 10

Passing One Set of Values

11

12

Creating a Scatter Plot

13

Creating a Scatter Plot

Creating a Histogram

• A histograms shows the distribution of a variable • pylot.hist(array, bins = n) draws a histogram

14

plt.hist()

15

Creating a Bar Chart • A bar chart can be used to compare

variables • Very example, to compare the favorite

programming language for students in a CS class

• plt.bar(x, y) • plt.barh(x, y) draws a horizontal bar chart

16

plt.bar() & plt.barh()

17

Multiple Bar Charts

18

Pie Chart • A pie chart represents the data as if to fit into a circle • The individual data points are expressed as sectors of a circle

that add up to 360 degrees. • A pie chart chart is good for displaying categorical data and

summaries • data = [500, 200, 250] ; labels = ["Agriculture", "Aide", "News"] • plt.pie(data, labels=labels, autopct=‘%0.1f')

19

Pie Chart – colors & axis

20

Box Plot • A box plot is used to visualize the median value and

low and high ranges of a distribution • data = np.random.randn(50) #Z(0,1) • plt.boxplot(data)

21

Box Plot

22

Displaying Multiple Box Plots

23

Architecture of matplotlib ひ The architecture of matplotlib is

logically structured into three

layers as in the figure

24

ひ The communication is unidirectional ひ The three layers are:

1. Scripting: as data analysts, you work at this layer

2. Artist: see following slides

3. Backend: classes that implement the graphic

elements at a low level

Artist Layer

ひ All the elements that make up a chart, such as the title, ;┝キゲ ノ;HWノゲが ぐ WデIくが ;ヴW キミゲデ;ミIWゲ ラa デエW Aヴデキゲデ ラHテWIデ ┘キデエ ; hierarchical structure representing the different

components on a chart as in the Figure below

ひ There are two types of Artist classes: primitive and composite

25

The Main Artist objects in the Hierarchy

of the Artist Layer

26

1. Figure corresponds to the entire graphical representation and can contain many Axes 2. Axes corresponds to plot or chart. Each Axes object belongs to only one Figure, and is characterized by two Artist Axis. Objects such as title, x label, and the y label, belong to this composite artist.

3. Axis objects that take into account the numerical values to be represented on the x axis and y axis, define the limits and manage the ticks and tick labels

Axes vs Axis

27

Parts of a Figure

28

Scripting Layer (pyplot)

ひ The scripting layer is best suited for data analysts to manipulate and visualize the data

ひ This layer consists of an interface called pyplot

29

Adding More Features to Plots

ひ plt.plot([1,2,3,4]) ひ Default configuration: ひ Blue line ひ Size of axis matches range

of input values

30

ひ There is neither a title nor axis labels ひ There is no legend

Setting Dimensions & Title

ひ Specify the dimensions of the plot by calling plt.axis([xmin, xmax, ymin, ymax])

ひ A title can be specified using plt.title()

31

Before After

Adding Axis Labels

ひ np.linspace creates an array of regularly

spaced values

inclusive of the end

points

32

In [15]: x = np.linspace(2, 3, 11) x

Out[15]: array([ 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3. ])

ひ plt.xlabel("string") ひ plt.ylabel("string")

Adding Legend & Line Labels

ひ Each line in a plot can be given a label by using the label argument of the plot function

ひ TエW ノ;HWノ ┘ラミげデ ;ヮヮW;ヴ ラミ デエW ヮノラデ ┌ミノWゲゲ ┞ラ┌ also call plt.legend() to add a legend

ひ Legend location can be customized by using a loc argument of the legend method

ひ loc takes a string or integer value as in the table in the next slide

33

Values of loc argument

Location String Location Code

けHWゲデげ 0 け┌ヮヮWヴ ヴキェエデげ 1 け┌ヮヮWヴ ノWaデげ 2 けノラ┘Wヴ ノWaデげ 3 けノラ┘Wヴ ヴキェエデげ 4 けヴキェエデげ 5 けIWミデWヴ ノWaデげ 6 けIWミデWヴ ヴキェエデげ 7 けノラ┘Wヴ IWミデWヴげ 8 け┌ヮヮWヴ IWミデWヴげ 9 けIWミデWヴげ 10

34

Labels and Legends - Example

35

Labels and Legends - Example

36

Customizing Plots

ひ The plot attributes に marker is used to add a marker に color is used to add a color

37

Some Matplotlib marker Styles

38

Some Matplotlib color Code Letters

ひ Can use several formats to specify colors に ;ゲ キミ エデマノ ふWくェくが Iラノラヴ Эけセrrggbbげぶ に shades of gray can be specified as a string representing a float in

the range '0.0' to '1.0' ('0.0' black to '1.0' white)

ひ For basic built-in colors, you can use a single letter に b: blue に g: green に r: red に c: cyan に m: magenta に y: yellow に k: black に w: white

39

Markers and Colors - Example

40

import pylab pylab.plot([1, 2, 3], [1, 2, 3], label = 'line 1', color ="g", marker ="D" ) pylab.plot([1, 2, 3], [3, 1.5, 0.75], label = "line 2", marker = 's', color ='y') pylab.legend(loc='lower center')

Markers and Colors - Example

41

Linewidth & linstyle

42

ひ plot attributes linewidth and

linestyle

43

Linewidth & linstyle (cont)

ひ We can pass a third argument to the plot() function to specify some codes that correspond to the colors, styles,

ぐ WデIく

44

Linewidth & linstyle (cont)

Working with Multiple Subplots

ひ Within a figure, you can have multiple subplots ひ Each subplot specifies a different drawing region ひ Each subplot corresponds to a separate Axes

object

ひ One way of accomplishing this is by using pyplot.subplot()

45

The Function pyplot.subplot()

ひ The subplot() function, takes either a 3-digit integer or 3- separate integers as parameters

ひ They specify how the figure is divided into different subplot regions, and which subplot is active and will

receive the next pyplot command

に The first digit or number specifies the number of rows に The second digit or number specifies the number of columns に The third digit or number is the index/location of the current

subplot that will receive pyplot ommands

46

The Function pyplot.subplot() (cont.)

ひ The subplot location/index is sequentially numbered, and plots are placed first in a left-to-

right direction, then from top to bottom

47

Multiple Subplots Example

48

The figure has been divided into two vertical subplots

The figure has been divided into two horizontal subplots

pyplot.subplot() に Specifying Parameters As Separate Integers

49

Subplots に Alternative Approach ひ Create a figure object

fig = plt.figure(figsize=(6,4))

に figsize: specifies figure dimension (width, height) in inches; default is [6.4, 4.8]

ひ Add a subplot

ax = fig.add_subplot(nrows, ncols, index)

50

Example に Anscombe Data

51

Loading the Data

52

53Matplotlib figure with four empty axes or subplot areas

Specifying the Subplots

54

Plotting The Data

55

Adding Titles

56

Adjusting Things

An alternative approach is to use: fig.subplots_adjust(hspace=0.5)

Saving a Plot to a File

• To save the plot as an image, use the code plt.savefig(filename)

• The file format can be deduced from the file name extension

• pylab.savefig('plot.png') # save as a png image

• pylab.savefig('plot.pdf') # save as a pdf

• pylab.savefig('plot.eps') # save in Encapsulated Postscript format

57

58

Saving the Code • In [171]: import matplotlib.pyplot as plt

... • The magic command %save followed by filename followed by

number of input prompts can be used save code in .py file .py file.

• In [185]: %save my_first_chart 175

• In [186]: %save my_first_chart 171-175 #no space and consecutive line numbers

• After you launch the command, you will find the my_first_chart.py file in your working directory

• %load my_first_chart.py #loads the file • %run my_first_chart.py #runs the file

59

Saving the Code

60

Saving the Code

61

Loading and Running the File

62

Good resources

 https://matplotlib.org/

 https://www.w3schools.com/python/

 https://www.geeksforgeeks.org/python-programming- language/

 Jupyter Notebook: https://www.dataquest.io/blog/jupyter-notebook- tutorial/

 NumPy: https://www.machinelearningplus.com/python/numpy- tutorial-part1-array-python-examples/

 Data Science: https://www.datacamp.com/courses/tech:python

63

Chapter 4 に Classification

These slides are based on material and slides from:

1) Dunham, Data Mining – Introductory and Advanced Topics 2) Han, Kamber and Pei, Data Mining – Concepts and Techniques 3) Tan, Stienback and Kumar, Introduction to Data Mining

4) Zaki and Meira JR., Data Mining and Analysis – Fundamental Concepts and Algorithms 5) Liu, Web Data Mining – Exploring Hyperlinks, Contents, and Usage Data

1

Introduction

• Given a predefined set of classes and a training dataset, build a model to help classify objects into the different classes

• Applications – Classifying an image as a cat or a dog – Identify mushrooms as poisonous or edible – Identifying fake news – Predict when a river will flood – Pattern recognition に recognize hand written digits – Medical diagnosis – Fraud and spam detection – Identify individuals with credit risks – Speech recognition

2

Classification vs. Prediction

• Estimation and prediction can be considered classification

– Eゲデキマ;デキミェ ; ヮWヴゲラミげゲ ;ェW – Predicting stock market behavior

• Prediction models a continuous function while classification models a discrete-valued function

• Classification – predicts categorical class labels (discrete or nominal)

• Prediction – predicts unknown or missing values (models continuous-

valued functions)

3

Formal Definition

• Given a database D={t1,t2がぐがtn} and a set of classes C={C1がぐがCm}, the Classification Problem is to define a mapping f:DgC where each ti is assigned to one class.

• Classes are disjoint and form a partition of D.

4

Supervised vs. Unsupervised Learning

• Supervised learning (classification) – Supervision: The training data (observations,

measurements, etc.) are accompanied by labels indicating

the class of the observations

– New data is classified based on the training set • Unsupervised learning (clustering)

– The class labels of training data is unknown – Given a set of data, the task is to establish the existence of

classes or clusters in the data

5

What do we mean by learning?

• Given – a data set D, – a task T, and – a performance measure M,

a computer system is said to learn from D to perform the task T キa ;aデWヴ ノW;ヴミキミェが デエW ゲ┞ゲデWマげゲ ヮWヴaラヴマ;ミIW on T improves as measured by M

• In other words, the learned model helps the system to perform T better as compared to no learning

6

An example

• Data: Loan application data • Task: Predict whether a loan should be approved or

not.

• Performance measure: accuracy.

No learning: classify all future applications (test data) to

the majority class (i.e., Yes):

Accuracy = 9/15 = 60%.

• We can do better than 60% with learning.

7

Fundamental assumption of learning

Assumption: The distribution of training examples is identical to the distribution of test examples (including future unseen examples).

• In practice, this assumption is often violated to certain degree.

• Strong violations will clearly result in poor classification accuracy.

• To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.

8

ClassificationねA Two-Step Process • Model construction: describing a set of

predetermined classes

– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

– The model can be represented as classification rules, decision trees, or mathematical formulae

• Model usage: for classifying future or unknown objects

– Estimate accuracy of the model • The known labels of test samples are compared with

the classified results from the model

• Test set is independent of training set • If the accuracy is acceptable, use the model to

classify new tuples 9

Illustrating Classification Task

10

Apply Model

Induction

Deduction

Learn Model

Model

Tid Attrib1 Attrib2 Attrib3 Class

1 Yes Large 125K No

2 No Medium 100K No

3 No Small 70K No

4 Yes Medium 120K No

5 No Large 95K Yes

6 No Medium 60K No

7 Yes Large 220K No

8 No Small 85K Yes

9 No Medium 75K No

10 No Small 90K Yes 10

Tid Attrib1 Attrib2 Attrib3 Class

11 No Small 55K ?

12 Yes Medium 80K ?

13 Yes Large 110K ?

14 No Small 95K ?

15 No Large 67K ? 10

Test Set

Learning algorithm

Training Set

Accuracy

• Other name: accuracy rate, classification accuracy, prediction accuracy: percentage of test samples that

are correctly classified

11

畦潔潔憲堅欠潔検 噺 Number of correct classificationsTotal number of test cases

Comparing Classification Methods

• Prediction accuracy • Speed

– time to construct the model – time to use the model

• Robustness – handling noise and missing values

• Scalability – ability to construct model efficiently given large data

• Interpretability – understanding and insight provided by the model

• Other measures – e.g., goodness of rules; decision tree size; compactness of

classification rules

12

Categorization of Classification Algorithms

• Classification Techniques – Decision Trees – Statistical – Distance – Rules – Neural Networks

13

Decision Tree-Based Algorithms

• Decision trees are among the most widely used techniques for classification

– classification accuracy is competitive with other methods, and

– very efficient • The classification model is a tree, called decision tree

(DT)

• C4.5 by Ross Quinlan is perhaps the best known system

14

Decision Tree-Based Algorithms

• A tree structure where: – each internal node

represents a test on an

attribute,

– each branch represents an outcome of the test,

– and leaf nodes reflect decision outcomes

• different classes and class distributions

15

DT- Loan Dataset

16

Approved or not

A Decision Tree from the Loan Dataset

• Decision nodes and leaf nodes (classes)

17

Using the Decision Tree

18

No

Is the DT Unique?

19

◼ No. Here is a simpler tree

◼ We want smaller tree and accurate tree ◼ Easy to understand and perform better

◼ Finding the best tree is

NP-hard

◼ All current tree building

algorithms are heuristic

algorithms

From a DT to Classification Rules

20

◼ A decision tree can be

converted to a set of

rules

◼ Each path from the

root to a leaf is a rule

Constructing a Decision Tree

• Constructing a DT is aka decision tree induction • Many Algorithms:

– ID3 – C4.5 – C5.0 – CART – SLIQ,SPRINT

21

The Basic Algorithm に ID3

• Assume attributes are categorical – Continuous attributes can be handled too

• A greedy divide-and-conquer algorithm • Tree is constructed in a top-down recursive manner • At start, all the training examples are at the root • Examples are partitioned recursively based on selected

attributes

• Attributes are selected on the basis of a measure called information gain

22

ID3

23

• A greedy divide-and- conquer algorithm

• Tree is constructed in a top-down recursive manner

• At start, all the training examples are at the root

• Examples are partitioned recursively based on selected attributes

age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no

24

Tid Refund Marital Status

Taxable Income Cheat

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes 10

Refund

Don’t Cheat

Yes No

Marital Status

Don’t Cheat

Cheat

Single,

Divorced Widowed

Don’t Cheat

Married

• Let Dt be the set of training samples that reach a node t

• General Procedure: – If all the samples in Dt belong to the same

class C, then t is a leaf node labeled as C

– If Dt is an empty set, then t is a leaf node labeled by the most common class at the parent

– If attribute set is empty, then t is a leaf node labeled by the majority class

– If Dt contains samples that belong to more than one class,

– use an attribute test to split the data into smaller subsets

– recursively apply the procedure to each subset

Decision Tree Induction

Decision Tree Induction

25

• Let Dt be the set of training samples that reach a node t

• General Procedure: – If all the samples in Dt belong to the same

class C, then t is a leaf node labeled as C

– If Dt is an empty set, then t is a leaf node labeled by the most common class at the parent

– If attribute set is empty, then t is a leaf node labeled by the majority class

– If Dt contains samples that belong to more than one class,

– use an attribute test to split the data into smaller subsets

– recursively apply the procedure to each subset

Refund

Don’t Cheat

Yes No

Marital Status

Don’t Cheat

Cheat

Single,

Divorced Widowed

Don’t Cheat

Married

Tid Refund Marital Status

Taxable Income Cheat

1 Yes Single 125K No

2 Yes Married 120K No

3 Yes Divorced 220K No

4 No Married 100K No

5 No Married 75K No

6 No Married 60K No

7 No Single 70K No

8 No Single 85K Yes

9 No Single 90K Yes

10 No Divorced 95K Yes 10

The Basic Algorithm に ID3

• Conditions for stopping partitioning – All examples for a given node belong to the same class – There are no remaining attributes for further partitioning に

majority class is the leaf

– There are no examples left

26

ID3 Algorithm in More Details

• Algorithm Generate_DT(D, attribute_list) • Input: D, attribute_list • Output: a DT 1. Create a node N

2. If samples are all of the same class C, then

return N as a leaf node labeled with class C

3. If attribute_list is empty, then

return N as a leaf node labeled with the majority class

4. Select test_attribute, the attribute among attribute_list

with the most information gain

27

ID3 Algorithm in More Details (cont.)

5. Label node N with test_attribute

6. For each known value ai of test_attribute

i. grow a branch from node N for the condition test_attribute =

ai

ii. let Si be the set of samples in D for which test_attribute = ai

iii. if Si is empty, then

attach a leaf node labeled with the majority class in D

iv. else attach the node returned by Generate_DT(Si, attribute_list

- test_attribute) Marital Status

Don’t Cheat

Cheat Don’t Cheat

MarriedWidowed Single,

Divorced

When does Algorithm Stop?

• Algorithm stops at steps 2, 3 and 6-III – 2) all samples for a given node belong to the same class – 3) attribute_list is empty – 6-III) Si is empty, i.e., there are no samples left

29

Choosing an Attribute to Partition Data

• The key to building a decision tree - which attribute to choose in order to branch

• The objective is to reduce impurity or uncertainty in data as much as possible

– A subset of data is pure if all instances belong to the same class.

• The heuristic in ID3 is to choose the attribute with the maximum Information Gain based on entropy

reduction

30

The Loan Dataset

31

Approved or not

Two possible roots, which is better?

32

• Fig. (B) seems to be better

Entropy Measure

• Entropy is a measure of impurity, randomness, uncertainty

• We use entropy as a measure of impurity or disorder of dataset D. (Or, a measure of information in a tree)

• The entropy formula,

• Pr(cj) is the probability of class cj in data set D

33

,1)Pr(

)Pr(log)Pr()(

||

1

||

1 2

=

=

=

−=

C

j j

j

C

j j

c

ccDentropy

Entropy measure: let us get a feeling

• As the data become purer and purer, the entropy value becomes smaller and smaller. This is useful to us!

34

Entropy

• Entropy is a measure of impurity, randomness, uncertainty "in a set of data"

• Entropy of node T is minimum when all samples are of the same class

• Entropy of node T is maximum when there is equal representation of all classes

35

Information Gain

• Given a set of samples D, we first compute its entropy:

• If we select attribute Ai, with v values, as the test attribute at the current node, this will partition D into

v subsets D1, D2 ぐが Dv • The expected entropy if Ai is used as the current test

attribute:

36

 =

= v

j j

j A DentropyD

D Dentropy

i

1

)( ||

|| )(

D

Ai

D1

av

D2 Dv . . . . . .

a1 a2

D

Information Gain (cont.)

• Information gained by selecting attribute Ai to branch or to partition the data is

• We choose the attribute with the highest gain to branch/split the current tree.

37

)()(),( DentropyDentropyADgain iAi

−=

Ai

D1

av

D2 Dv

a1 a2 . . . . . .

D

Example

38

Age Yes No entropy(Di) young 2 3 0.971 middle 3 2 0.971 old 4 1 0.722

Own_house is the best

choice for the root.

結券建堅剣喧検岫経岻 噺 伐 はなの抜 log態 はなの伐 ひなの抜 log態 ひなの 噺 ど┻ひばな 結券建堅剣喧検潮栂津 e 朕墜通鎚勅岫経岻 噺 はなの 抜 結券建堅剣喧検岫経怠岻 髪 ひなの 抜 結券建堅剣喧検岫経態岻噺 はなの 抜 ど 髪 ひなの 抜 ど┻ひなぱ噺 ど┻ののな 結券建堅剣喧検凋直勅岫経岻 噺 のなの 抜 結券建堅剣喧検岫経怠岻 髪 のなの 抜 結券建堅剣喧検岫経態岻 髪 のなの 抜 結券建堅剣喧検岫経戴岻噺 のなの抜 ど┻ひばな 髪 のなの抜 ど┻ひばな 髪 のなの抜 ど┻ばにに噺 ど┻ぱぱぱ

Calculating entropy(D2)

39

結券建堅剣喧検潮栂津 e 朕墜通鎚勅岫経岻 噺 はなの 抜 結券建堅剣喧検岫経怠岻 髪 ひなの 抜 結券建堅剣喧検岫経態岻噺 はなの 抜 ど 髪 ひなの 抜 ど┻ひなぱ噺 ど┻ののな entropy(D2) = -Pr(Class=Yes)*log(Pr(Class = Yes))

-Pr(Class=No)*log(Pr(Class = No))

= - 滞苔log(滞苔) - 戴苔log(戴苔) = 0.918

The Final Tree

40

C4.5 and C5.0

• Extensions of ID3 • Details of C5.0 are not public • C4.5 improves on ID3 as follows:

– allows for continuous-valued attributes – handles missing attribute values – uses tree pruning to improve the efficiency of the DT – uses a better heuristic (GainRatio vs Information Gain) – allows classification via either DT or classification rules – simplifies complex rules with long LHS – a default rule assigning to the majority class is used

41

Gain Ratio

• Information Gain is biased towards selecting attributes with many values

• An attribute with many values results in purer partitions, but may be useless for classification

42

Gain Ratio

• GainRatio: adjusts Information Gain by normalizing by SplitInfo, which takes into consideration the number and sizes of the different partitions of the attribute under consideration

• C4.5 select the attribute with the maximum GainRatio as the test attribute

43

罫欠件券迎欠建件剣 畦 噺 罫欠件券迎欠建件剣 経┸ 畦 噺 罫欠件券岫経┸ 畦岻SplitInfo岫畦岻 SplitInfo 畦 噺 伐布沈退怠賃 】経沈】】経】 log 】経沈】】経】 噺 伐布沈退怠賃 券沈券 log 券沈券

SplitInfo

• The split information represents the potential information generated by splitting the dataset into k

partitions:

• SplitInfo(A) = entropy(D) based on the values of A • High split info: many partitions that have about the

same size

– the larger k, the larger splitInfo, which = log2K • Low split info: few partitions hold most of the tuples

– e., g., [2, 2, 96] is purer than [2, 2, 2, 94] is purer than [5, 5, 5, 85]

44

SplitInfo 畦 噺 伐布沈退怠賃 】経沈】】経】 log 】経沈】】経】 噺 伐布沈退怠賃 券沈券 log 券沈券

Overfitting

45

• Overfitting: a constructed tree may overfit the training data

– too many branches, due to noise or outliers – poor accuracy for unseen samples

• Definition: A decision tree T overfits the training data if there exists another tree T' such that T performs

better than T' over the training samples while T ' has

a higher accuracy than T over the entire distribution

of the data

Tree Pruning

46

• Tree pruning removes the least reliable branches in a DT, which usually results in better classification

• A pruned DT is more efficient and easier to understand

Approaches for Tree Pruning

47

• Two approaches to avoid overfitting – Prepruning: halt tree construction earlyねdo not split a

node if this would result in the goodness measure falling

below a threshold

• Difficult to choose an appropriate threshold – Postpruning: remove branches from a 曨fully grown杇 tree

• A node is pruned by removing its branches and is labeled by the most common class

Cost Complexity Postpruning Algorithm

• The "cost complexity" of a tree is defined as a function of the number of leaves in the tree and the error rate of the tree に e.g., err(T)/|leaves(T)| or err(T) + a|leaves(T)|

• Construct a full decision tree, T • Compute the "cost complexity", c, of T • Internal nodes are removed from T in a bottom-up fashion • Let T' be the tree after removing an internal node, N, from T • Compute the "cost complexity", c', of T' • If c' < c, prune N, otherwise keep N • The algorithm generates a set of progressively pruned trees • The smallest decision tree that minimizes the cost complexity is

used

• A separate validation set is used when calculating "cost complexity"

48

CART

• CART 氾 Classification And Regression Trees • It generates a binary decision tree • At each internal node, CART makes an exhaustive search

to select the best splitting attribute, X, and value, v, that produces the greatest separation of the data

• If X < v デエWミ ゲWミS デエW S;デ; デラ デエW さノWaデざき ラデエWヴ┘キゲWが ゲWミS S;デ; ヮラキミデ デラ デエW さヴキェエデざ – X < v is called a criterion

• RWヮW;デ ゲ;マW ヮヴラIWゲゲ ラミ デエW デ┘ラ IエキノSヴWミ さミラSWゲざ – Yラ┌ ェWデ ; さデヴWWざ

• After the tree is built, pruning is used to improve the efficiency of the tree

1

CART (cont)

• CART makes an exhaustive search of attributes and split points to find the best splitting criterion for a node

• The best splitting criterion is found as follows:

• s 氾 IヴキデWヴキラミが s' 氾 HWゲデ IヴキデWヴキラミが t 氾 ミラSWが も 氾 ェララSミWゲゲが m = number of classification classes

• も(s / tぶ 氾 ェララSミWゲゲ ラa デエW ゲヮノキデ ;デ ミラSW t using criterion s • も 嫌嫗【 建 氾 HWゲデ ┗;ノ┌W ラa デエW ゲヮノキデ ;デ ミラSW デ

2

も 嫌嫗【 建 噺 警欠捲件 も岫嫌件【 建 岻

CART (cont)

• tL 氾 ノWaデ ゲ┌HデヴWW ラa デエW I┌ヴヴWミデ ミラSW t • tR 氾 ヴキェエデ ゲ┌HデヴWW ラa デエW I┌ヴヴWミデ ミラSW t

3

t

X < v

tL tR

X д┗

CART (cont)

• PL 氾 the probability that a tuple in the training set will be in the left subtree

= 】担探丹狸奪坦 辿樽 デL】】担探丹狸奪坦 辿樽 担竪奪 担嘆叩辿樽辿樽巽 坦奪担】

• PR 氾 the probability that a tuple in the training set will be in the right subtree

= 】担探丹狸奪坦 辿樽 デR】】担探丹狸奪坦 辿樽 担竪奪 担嘆叩辿樽辿樽巽 坦奪担】

4

CART (cont)

• P(Cj|tLぶ 氾 ヮヴラH;Hキノキデ┞ デエ;デ ; デヴ;キミキミェ デ┌ヮノW キゲ キミ Iノ;ゲゲキaキI;デキラミ class Cj and is at the same time in the left subtree

= 樽探鱈但奪嘆 誰脱 達狸叩坦坦 Cj 担探丹狸奪坦 担竪叩担 叩嘆奪 叩狸坦誰 辿樽 担竪奪 狸奪脱担 坦探但担嘆奪奪樽探鱈但奪嘆 誰脱 担探丹狸奪坦 辿樽 担竪奪 担嘆叩辿樽辿樽巽 坦奪担

• P(Cj|tRぶ 氾 ヮヴラH;Hキノキデ┞ デエ;デ ; デヴ;キミキミェ デ┌ヮノW キゲ キミ Iノ;ゲゲキaキI;デキラミ class Cj and is at the same time in the right subtree

= 樽探鱈但奪嘆 誰脱 達狸叩坦坦 Cj 担探丹狸奪坦 担竪叩担 叩嘆奪 叩狸坦誰 辿樽 担竪奪 嘆辿巽竪担 坦探但担嘆奪奪樽探鱈但奪嘆 誰脱 担探丹狸奪坦 辿樽 担竪奪 担嘆叩辿樽辿樽巽 坦奪担

5

Example

• Consider the training data given in following table. Find the best splitting criterion at the root considering the

classifications in Output1.

6

Name Gender Height Output1 Output2 Kristina F 1.6m Short Medium Jim M 2m Tall Medium Maggie F 1.9m Medium Tall Martha F 1.88m Medium Tall Stephanie F 1.7m Short Medium Bob M 1.85m Medium Medium Kathy F 1.6m Short Medium Dave M 1.7m Short Medium Worth M 2.2m Tall Tall Steven M 2.1m Tall Tall Debbie F 1.8m Medium Medium Todd M 1.95m Medium Medium Kim F 1.9m Medium Tall Amy F 1.8m Medium Medium Wynette F 1.75m Medium Medium Source: Dunham, Data Mining に Introductory and Advanced Topics, Table 4.1, page 78

Example (cont)

• Attribute name is not useful • Gender has two possible values: F, M • Height has many values. For the sake of example, we

will consider only the following split values: 1.6, 1.7,

1.8, 1.9, and 2.0

• Calculate も(s / t) 7

Example (cont)

• も(Gender) = ? • |tL|=|Male| = 6 • |tR|=|Female| = 9 • PL =

樽探鱈但奪嘆 誰脱 担探丹狸奪坦 辿樽 担竪奪 狸奪脱担 坦探但担嘆奪奪樽探鱈但奪嘆 誰脱 担探丹狸奪坦 辿樽 担竪奪 担嘆叩辿樽辿樽巽 坦奪担 = 6/15 • PR = 9/15 • Classes = {Short, Medium, Tall} • も(Gender) = 2(6/15)(9/15)デ珍樺岶聴朕墜追痛┸暢勅鳥通沈通陳┸脹銚鎮鎮岼 】鶏 系倹 建詣 伐 鶏 系倹 建迎 】

8

Gender

Male

tL tR

Female

Example (cont)

• Classes = {Short, Medium, Tall} • も(Gender) = 2(6/15)(9/15)デ珍樺岶聴朕墜追痛┸暢勅鳥通沈通陳┸脹銚鎮鎮岼 】鶏 系倹 建詣 伐 鶏 系倹 建迎 】 • P(Cj|tL) =

樽探鱈但奪嘆 誰脱 達狸叩坦坦 Cj 担探丹狸奪坦 担竪叩担 叩嘆奪 叩狸坦誰 辿樽 担竪奪 狸奪脱担 坦探但担嘆奪奪樽探鱈但奪嘆 誰脱 担探丹狸奪坦 辿樽 担竪奪 担嘆叩辿樽辿樽巽 坦奪担 • |P(Short|Male) - P(Short|Female) | = |1/15 に 3/15|= 2/15 • |P(Medium|Male) - P(Medium|Female) | = |2/15 に 6/15|= 4/15 • |P(Tall|Male) - P(Tall|Female) | = |3/15 に 0/15|= 3/15 • も(Gender)=2(6/15)(9/15)(2/15 + 4/15 + 3/15)=0.288

9

Gender

Male

tL tR

Female

Example (cont)

• For Height • も(1.6) = ? • |tL|=|Height < 1.6| = 0 • PL = 0 • も(1.6) = 0

10

Height < 1.6

yes

tL tR

no

Example (cont)

• も(1.7) = ? • |tL|=|Height < 1.7| = |{Kristina, Kathy}| • PL =

】HWキェエデ а ヱくΑづ樽探鱈但奪嘆 誰脱 担探丹狸奪坦 叩担 追墜墜痛 = 2/15 • PL =

】HWキェエデ 兆 ヱくΑづ樽探鱈但奪嘆 誰脱 担探丹狸奪坦 叩担 追墜墜痛 = 13/15 • |P(Short|Height < 1.7) - P(Short|HWキェエデ 半 ヱくΑ) |

= |2/15 に 2/15|= 0 • |P(Medium|Height < 1.7) - P(Medium|HWキェエデ 半 ヱくΑ) | = |0 に 8/15|=

8/15

• |P(Tall|Height < 1.7) - P(Tall|HWキェエデ 半 ヱくΑ) | = |0 に 3/15|= 3/15 • も(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169

11

Height < 1.7

yes

tL tR

no

Example (cont)

• At the start, there are six choices for split point (right branch on equality):

– も(Gender)=2(6/15)(9/15)(2/15 + 4/15 + 3/15)=0.288 – も(1.6) = 0 – も(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169 – も(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385 – も(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256 – も(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32

• Split at 1.8

12

Classification in Large Databases

• Classificationねa classical problem extensively studied by statisticians and machine learning researchers

• Scalability: Classifying large data sets with reasonable speed • Why decision tree induction in data mining?

– relatively fast learning speed – convertible to simple and easy to understand classification

rules

– comparable classification accuracy with other methods

13

Scalable DT Induction Methods

• Early approaches include • SLIQ • SPRINT • RainForest

14

SLIQ

• SLIQ builds an index for each attribute • It maintains disk-resident attribute lists sorted by

attribute values

• It maintains a single memory-resident class list

15

SLIQ (cont)

• A tuple is represented by a linkage of one entry from each attribute list to an entry in the class list

• A デ┌ヮノWゲげ Wミデヴ┞ キミ デエW Iノ;ゲゲ ノキゲデ Iラミデ;キミゲ – its class label, and – ; ノキミニ デラ ; ノW;a ミラSW キミ デエW DTが デエ;デ Iラミデ;キミゲ デエW デ┌ヮノWげゲ

classification

16

SLIQ に Disadvantage • Size of class list is proportional to number of tuples in

the training set.

• So class list can grow too big to fit in main memory

17

SPRINT

• “PRINT 氾 Scalable PaRallelizable INduction of decision Trees

• Based on CART – builds a binary DT

• Uses a data structure that holds for each attribute the class and RID

• When a node is split, the attribute lists are partitioned and distributed among children nodes

18

SPRINT (cont.)

• SPRINT improves on SLIQ by removing all memory restrictions

• SPRINT uses an additional data structures, a hash tree, with size proportional to the training dataset

– performance may deteriorate • Can be easily parallelized • Uses a gini index to find the best split

19

Gini Index

20

• If a data set D contains examples from n classes, gini index, gini(D) is defined as

where pj is the relative frequency (proportion) of class j in D

• If a data set D is split on A into two subsets D1 and D2, the gini index giniA(D) is defined as

• Reduction in Impurity:

• The attribute provides the largest reduction in impurity is chosen to split the node (need to consider all the possible splitting points for each attribute)

 =

−= n

j p jDgini

1

21)(

)( || ||)(

|| ||)( 2

2 1

1 Dgini D D

Dgini D DDginiA +=

)()()( DginiDginiAgini A

−=

Gini Index に Example

21

• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}

and gini{medium,high} is 0.45.

Therefore, 訣件券件沈津頂墜陳勅樺岶鎮墜栂┸陳勅鳥沈通陳岼岫経岻 is better since it is the lowest

訣件券件岫経岻 噺 な 伐 ひなね 態 伐 のなね 態 噺 ど┻ねのひ 訣件券件沈津頂墜陳勅樺岶鎮墜栂┸陳勅鳥沈通陳岼岫経岻 噺 などなね 罫件券件岫経怠岻 髪 ねなね 罫件券件岫経態岻

age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no

Ex. D has 9 tuples in

buys_computer = 曨yes杇

and 5 in 曨no杇

= 怠待怠替(な 伐 胎怠待 態 伐 戴怠待 態岻 髪 替怠替 ふな 伐 態替 態 伐 態替 態岻

= ど┻ねねぬ = 訣件券件沈津頂墜陳勅樺岶朕沈直朕岼岫経岻

income buys_computer

yes no

low 3 1

medium 4 2

high 2 2

How to Specify Split Condition?

• Depends on attribute types – Continuous – Nominal – Ordinal

22

Binary Split of Continuous-Value Attributes

• Let attribute A be a continuous-valued attribute • Must determine the best split point for A

– Sort the value A in increasing order: a1, a2が ぐくが ai, ai+1が ぐ – Choose a threshold value between each pair of adjacent values

ai and ai+1 to use as a possible split point

– Typically, the midpoint is used • (ai+ai+1)/2 is the midpoint between the values of ai and ai+1

– The point yielding maximum impurity reduction に つ訣件券件 畦 伐 is selected as the split-point for A

• Split: – Dヱ キゲ デエW ゲWデ ラa デ┌ヮノWゲ キミ D ゲ;デキゲa┞キミェ A г ゲヮノキデ-point, and D2 is

the set of tuples in D satisfying A > split-point

23

Binary Split of Nominal Attributes

• CarType: Family, Sports, Luxury • Binary split: Divides values into two subsets.

• What about this split?

• Need to find optimal partitioning • Need to consider all possibilities

– [{Family}, {Sport, Luxury}] vs [{Sports}, {Family, Luxury}] vs [{Luxury}, {Family, Sports}]

24

CarType {Sports,

Luxury} {Family}

CarType {Family,

Luxury} {Sports}

• ShirtSize: Small, Medium, Large, Extra Large

• Values can be grouped as long as the grouping does not violate the ;デデヴキH┌デWげゲ ラヴSWヴ property

25

• The groupings shown in Figures (a) and (b) preserve the order among the attribute values

• The grouping shown in Figure (c) violates the ordering property • One may handle an ordinal attribute as nominal if that is more

appropriate to the DM task

Binary Split of Ordinal Attributes

RainForest

• Uses aggregate metadata • At each node, it builds an AVC-set for each attribute

– AVC (Attribute, Value, Class_label) • AVC-set of attribute A at node N stores the class label

counts for each value of A

• The set of all AVC-sets at a given node is called the AVC- group of N

• Split attributes are found using AVC-groups • When splitting, the training samples are split and AVC-

groups are constructed for children nodes

• RainForest uses special techniques to handle the case when AVC-groups do not fit in memory

26

Rainforest: Training Set and Its AVC Sets

27

student Buy_Computer

yes no

yes 6 1

no 3 4

Age Buy_Computer

yes no

<=30 2 3

31..40 4 0

>40 3 2

Credit rating

Buy_Computer

yes no

fair 6 2

excellent 3 3

AVC-set on incomeAVC-set on Age

AVC-set on Student

Training Examples income Buy_Computer

yes no

high 2 2

medium 4 2

low 3 1

AVC-set on credit_rating

age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no

Random Forests

• An ensemble method designed for decision trees • An ensemble for classification is a technique that combines

multiple classifiers to produce a better classifier

• If one tree is good, then many trees (a forest) should be better, provided that there is enough variety among them

• How can we make the DTs different? – train them on slightly different data に use bootstrap samples from the

dataset for each tree

– at each node, use a random subset of the features to find split attribute

• Once the set of trees are trained, the output of the forest is the majority vote for classification

28

Bootstrapping

• A bootstrap sample is a random sample of observations with replacement with the same

number of observations as the original dataset

– bootstrap sample is the same size as the original dataset – we may get some data several times and others not at all

29

Random Forests Algorithm

Training:

The original dataset D contains d samples

• for i Э ヱ ぐ N //N is number of DTs – Draw a bootstrap sample Di of D – Build a random-forest tree using Di

• For a node j, sample a random subset k (= sqrt(|Ai|)) of the attributes Ai remaining at the node

• Select the best attribute from k attributes to split the tree

• endfor Testing: majority voting is used to classify new samples

30

Scalable DT Induction Methods

• New approaches are based on using Hadoop and Spark に CSC 735 – Data is partitioned and disturbed – Parallel processing using many machines – Results are combined – Active research on how to do this for different ML

methods

31

Decision Trees in Scikit-Learn

• Scikit-learn is a ML library in IPython

32

The Iris Dataset

• Aka iris flower dataset • Introduced by Fisher in 1936 • The dataset consists of 50 samples from each of three species

of Iris (Iris setosa, Iris virginica and Iris versicolor)

33

• Four features were measured from each sample: the length and the

width of the sepals and petals, in

centimeters.

Loading the Iris Dataset

34

scikit-learn contains

many datasets in the

submodule

sklearn.datasets

Examining the Dataset

35

iris.data returns

a numpi.ndarray

with the tuples

in the dataset

Examining the Tuples

36

• Property shape gives the number of elements in

each dimension

• Property feature_names gives

the names of the columns

Examining the Labels

37

iris.target # a 1D array with 150 values

iris.target.shape

iris.target_names

Examining the Labels

38

Dividing the Dataset for Training and Testing

• By convention, X is used to name the features dataset and y is used to name the labels dataset

X = iris.data y = iris.target

• Import a function to split the dataset: from sklearn.model_selection import train_test_split

• Apply the function to both the observation and target data:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,

random_state=1)

39

Dividing the Dataset for Training and Testing

40

Creating a Decision Tree Model

• Iマヮラヴデ デエW マラSWノゲげ API aヴラマ Scikit-Learn from sklearn.tree import DecisionTreeClassifier

• Instantiate an instance of the model dt = DecisionTreeClassifier()

• You can specify parameters dt = DecisionTreeClassifier(criterion='entropy',

random_state=0)

• Get more information about a model help(DecisionTreeClassifier)

41

Supported values for

criterion are "gini"

for the gini index

(the default) and

"entropy" for the

information gain

42

Train/Fit the Model

• Fit the model to the training data by calling the fit() method of the model instance

trained_model = dt.fit(X_train, y_train)

43

Test the Model

• You can test the trained model by measuring its accuracy on the test set

y_pred = trained_model.predict(X_test)

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred) #0.97

new_sample = [[ 5, 4, 3, 2]] # Create new sample

trained_model.predict(new_sample) # Classify it #array([1])

44

Testing and Using the Model

45

Steps for Using a Scikit-Learn Classifier

1. Choose a class of model by importing the

appropriate class from Scikit-Learn

2. Specify model parameters by instantiating the class

with desired values

3. Arrange data into a features matrix and target

4. Split the data for training and testing

5. Fit the model to the training data

6. Test the model

7. Apply the model to new data

46

Using Random Forests in Scikit-Learn

47

First, load and prepare the data

Using Random Forests in Scikit-Learn (cont)

48

Second, create a model and train it

Using Random Forests in Scikit-Learn (cont)

49

Third, test the model

Using the model

Bayesian Classification

1

Bayesian Classification

• A statistical classifier: predicts class membership probabilities

• Fラ┌ミS;デキラミぎ B;ゲWS ラミ B;┞Wゲげ TエWラヴWマ • Performance: A simple Bayesian classifier, naïve

Bayesian classifier, has comparable performance with

decision tree and selected neural network classifiers

2

B;┞Wゲげ TエWラヴWマぎ B;ゲキIゲ

• Let X be a data sample (曨evidence杇), its class label is unknown • Let H be a hypothesis that X belongs to class C • Classification is to determine P(C|X), the probability that the

hypothesis holds given the observed data sample X

• P(H|X) or P(C|X) is a posterior probability • P(C) is a prior probability

– the initial probability of C – e.g., X ┘キノノ H┌┞ Iラマヮ┌デWヴが ヴWェ;ヴSノWゲゲ ラa ;ェWが キミIラマWが ぐ

• P(X): probability that sample data is observed • P(X|C) (likelihood), the probability of observing the sample X,

given that the hypothesis H holds

3

B;┞Wゲげ TエWラヴWマ

• Given training data X, posteriori probability of a hypothesis H, P(C|X), follows the Bayes theorem

• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes

4

鶏岫系】隙岻 噺 鶏岫隙】系岻鶏岫系岻鶏岫隙岻

Towards Naïve Bayesian Classifier • Let D be a training dataset • Each tuple is represented as vector X = (x1, x2が ぐが xn) • Suppose there are m classes C1, C2が ぐが Cm. • Classification is to find the maximum P(Ci|X) • This can be derived from Bayes暍 theorem (1<= i <= m)

• Since P(X) is constant for all classes, only

needs to be maximized

5

鶏岫系沈】薫岻 噺 鶏岫薫】系沈岻鶏岫系沈岻鶏岫薫岻 鶏岫系沈】薫岻 蛤 鶏岫薫】系沈岻鶏岫系沈岻

DWヴキ┗;デキラミ ラa N;シ┗W B;┞Wゲげ Cノ;ゲゲキaキWヴ

• A simplified assumption: attributes are conditionally independent

• This greatly reduces the computation cost • If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value

xk for Ak divided by |Ci, D| (# of tuples of Ci in D)

• If Ak is continuous-valued, P(xk|Ci) is usually computed based on Gaussian distribution with a mean ´ and standard deviation ゝ

and P(xk|Ci) is

6

)|(...)|()|( 1

)|()|( 21

CixPCixPCixP n

k CixPCiP nk =

= =X

2

2

2

)(

2

1 ),,( 

 

− −

= x

exg

P(Xk | iC ) = g(xk,mCi ,s Ci )

How to Estimate Probabilities from Data?

7

• Normal distribution:

• For (Income, Class=No): – If Class=No

• sample mean = 110K • sample variance = 2975K

Tid Refund Marital Status

Taxable Income Evade

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes 10

c c c c

鶏 荊券潔剣兼結 噺 なにど 軽剣 噺 なに講 のね┻のね 結貸岫怠態待貸怠怠待岻鉄態 態苔胎泰 噺 ど┻どどばに

鶏岫畦沈】潔珍岻 噺 なに講購珍態 結貸 岫凋日貸禎乳岻鉄態蹄乳鉄

P(Xk | iC ) = g(xk,mCi ,s Ci )

N;シ┗W B;┞Wゲげ に Example

8

Class:

C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’

Data sample

X = (age <=30,

income = medium,

student = yes

credit_rating = Fair)

age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no鶏岫系沈】薫岻 蛤 鶏岫薫】系沈岻鶏岫系沈岻

)|(...)|()|( 1

)|()|( 21

CixPCixPCixP n

k CixPCiP nk =

= =X

AVC Sets for the Training Dataset

9

student Buy_Computer

yes no

yes 6 1

no 3 4

Age Buy_Computer

yes no

<=30 2 3

31..40 4 0

>40 3 2

Credit rating

Buy_Computer

yes no

fair 6 2

excellent 3 3

AVC-set on incomeAVC-set on Age

AVC-set on Student

Training Examples income Buy_Computer

yes no

high 2 2

medium 4 2

low 3 1

AVC-set on credit_rating

age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no

N;シ┗W B;┞Wゲげ に Example

10

• P(Ci): P(buys_computer = 曨yes杇) = 9/14 = 0.643 P(buys_computer = 曨no杇) = 5/14= 0.357

• Compute P(X|Ci) for each class P(age = 曨<=30杇 | buys_computer = 曨yes杇) = 2/9 = 0.222 P(age = 曨<= 30杇 | buys_computer = 曨no杇) = 3/5 = 0.6 P(income = 曨medium杇 | buys_computer = 曨yes杇) = 4/9 = 0.444 P(income = 曨medium杇 | buys_computer = 曨no杇) = 2/5 = 0.4

• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

Age Buy_Computer

yes no

<=30 2 3

31..40 4 0

>40 3 2

income Buy_Computer

yes no

high 2 2

medium 4 2

low 3 1

鶏岫系沈】薫岻 蛤 鶏岫薫】系沈岻鶏岫系沈岻

)|(...)|()|( 1

)|()|( 21

CixPCixPCixP n

k CixPCiP nk =

= =X

N;シ┗W B;┞Wゲげ に Example

11

• P(Ci): P(buys_computer = 曨yes杇) = 9/14 = 0.643 P(buys_computer = 曨no杇) = 5/14= 0.357

• Compute P(X|Ci) for each class P(age = 曨<=30杇 | buys_computer = 曨yes杇) = 2/9 = 0.222 P(age = 曨<= 30杇 | buys_computer = 曨no杇) = 3/5 = 0.6 P(income = 曨medium杇 | buys_computer = 曨yes杇) = 4/9 = 0.444 P(income = 曨medium杇 | buys_computer = 曨no杇) = 2/5 = 0.4 P(student = 曨yes杇 | buys_computer = 曨yes) = 6/9 = 0.667 P(student = 曨yes杇 | buys_computer = 曨no杇) = 1/5 = 0.2 P(credit_rating = 曨fair杇 | buys_computer = 曨yes杇) = 6/9 = 0.667 P(credit_rating = 曨fair杇 | buys_computer = 曨no杇) = 2/5 = 0.4

• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

student Buy_Computer

yes no

yes 6 1

no 3 4

Credit rating

Buy_Computer

yes no

fair 6 2

excellent 3 3

鶏岫系沈】薫岻 蛤 鶏岫薫】系沈岻鶏岫系沈岻

)|(...)|()|( 1

)|()|( 21

CixPCixPCixP n

k CixPCiP nk =

= =X

N;シ┗W B;┞Wゲげ に Example

12

• P(Ci): P(buys_computer = 曨yes杇) = 9/14 = 0.643 P(buys_computer = 曨no杇) = 5/14= 0.357

• Compute P(X|Ci) for each class P(age = 曨<=30杇 | buys_computer = 曨yes杇) = 2/9 = 0.222 P(age = 曨<= 30杇 | buys_computer = 曨no杇) = 3/5 = 0.6 P(income = 曨medium杇 | buys_computer = 曨yes杇) = 4/9 = 0.444 P(income = 曨medium杇 | buys_computer = 曨no杇) = 2/5 = 0.4 P(student = 曨yes杇 | buys_computer = 曨yes) = 6/9 = 0.667 P(student = 曨yes杇 | buys_computer = 曨no杇) = 1/5 = 0.2 P(credit_rating = 曨fair杇 | buys_computer = 曨yes杇) = 6/9 = 0.667 P(credit_rating = 曨fair杇 | buys_computer = 曨no杇) = 2/5 = 0.4

• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

P(X|Ci) : P(X|buys_computer = 曨yes杇) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 P(X|buys_computer = 曨no杇) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci)*P(Ci) : P(X|buys_computer = 曨yes杇) * P(buys_computer = 曨yes杇) = 0.028 P(X|buys_computer = 曨no杇) * P(buys_computer = 曨no杇) = 0.007

Therefore, X belongs to class (曨buys_computer = yes杇)

鶏岫系沈】薫岻 蛤 鶏岫薫】系沈岻鶏岫系沈岻

)|(...)|()|( 1

)|()|( 21

CixPCixPCixP n

k CixPCiP nk =

= =X

Exercise

• Consider the training dataset given in the table below, where A1, A2 and A3 are the input attributes

;ミS C キゲ デエW Iノ;ゲゲ ノ;HWノく UゲW N;シ┗W B;┞Wゲげ デラ aキミS デエW class for the object X = {1, 2, 2}.

13

Sample Attribute

A1

Attribute

A2

Attribute

A3

Class

C

1 1 2 1 1

2 0 0 1 1

3 0 1 2 1

4 1 0 1 1

5 2 1 2 2

6 1 2 1 2

7 2 2 2 2

Avoiding the 0-Probability Problem

• Naïve Bayesian requires each conditional prob. be non-zero. Otherwise, the predicted prob. will be zero

• Ex. Suppose that a dataset with 1000 tuples. We have income=low (0), income= medium (990), and income = high (10),

• Use Laplacian correction (or Laplacian estimator) – Adding 1 to each case

Prob(income = low) = 1/1003

Prob(income = medium) = 991/1003

Prob(income = high) = 11/1003

– The 曨corrected杇 prob. estimates are close to their 曨uncorrected杇 counterparts

14

 =

= n

k CixkPCiXP

1 )|()|(

15

N;シ┗W B;┞Wゲげ に Advantages and Disadvantages

• Advantages: – Easy to implement – Very efficient – Good results obtained in many applications – Can easily handle missing data by omitting that

probability in the calculations

• Disadvantages – Assumption: class conditional independence, therefore

loss of accuracy when the assumption is seriously

violated

Naïve Bayes in Scikit-Learn

• There are 3 classes that implement Naïve Bayes in the submodule sklearn.naive_bayes

– GaussianNB – MultinomialNB – BernoulliNB

• Each assumes a different likelihood distribution of デエW ;デデヴキH┌デWゲげ ┗;ノ┌Wゲ

• WエキIエ ラミW デラ ┌ゲW SWヮWミSゲ ラミ aW;デ┌ヴWゲげ デ┞ヮWゲぎ continuous, categorical, binary

16

Preparing the Dataset

# Load data

iris = datasets.load_iris()

X = iris.data

y = iris.target

# Split the dataset: from sklearn.model_selection import

train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,

random_state=1)

17

Create a GaussianNB Model and Train It

• from sklearn.naive_bayes import GaussianNB

# Create Gaussian Naive Bayes object

classifer = GaussianNB()

# Train model

model = classifer.fit(X_train, y_train)

18

Test the Model

# Test the model

y_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

#0.97

new_sample = [[ 5, 4, 3, 2]] # Create new sample

model.predict(new_sample) # Classify it

#array([1])

19

Using NB to Work with Categorical Data

• MultinomialNB is commonly used when working with categorical attributes

– Ex: movie ratings ranging from 1 to 5 – Ex: working with text data (e.g., documents of text)

• Common approach: create a bag of words with a vector for each document containing counts of the

appearance of the words in the vocabulary

20

Prepare the Dataset

# Load libraries

import numpy as np

from sklearn.naive_bayes import MultinomialNB

from sklearn.feature_extraction.text import CountVectorizer

# Create text

text_data = np.arrayふぷろI ノラ┗W Bヴ;┣キノく Bヴ;┣キノぁろが ろBヴ;┣キノ キゲ HWゲデげが ろGWヴマ;ミ┞ HW;デゲ Hラデエげへぶ

# Create bag of words

count = CountVectorizer()

bag_of_words = count.fit_transform(text_data)

21

CountVectorizor and BOW

# Show feature matrix

bag_of_words

O┌デヮ┌デぎ аン┝Α ゲヮ;ヴゲW マ;デヴキ┝ ラa デ┞ヮW ろаIノ;ゲゲ ろミ┌マヮ┞くキミデヶヴろбげ with 8 stored elements in Compressed Sparse Row format>

bag_of_words.toarray()

array([[0, 0, 0, 2, 0, 0, 1,],

[0, 1, 0, 1, 0, 1, 0],

[1, 0, 1, 0, 1, 0, 0]], dtype=int64)

# Show feature names

count.get_feature_names()

['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love']

22

CountVectorizor and BOW (cont.)

• The feature matrix with the words as column names and each row is one observation

• The tuples: (['I love Brazil. Brazil! ', 'Brazil is best', ろGWヴマ;ミ┞ HW;デゲ Hラデエげへぶ

• count.get_feature_names() ['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love']

23

beats best both brazil germany is love

0 0 0 2 0 0 1

0 1 0 1 0 1 0

1 0 1 0 1 0 0

Preparing the Dataset and Creating the Model

# Create text text_data = np.array(['I love Brazil. Brazil!', 'Brazil is best', 'Germany beats both'])

# Create bag of words count = CountVectorizer()

bag_of_words = count.fit_transform(text_data)

24

# Create feature matrix

features = bag_of_words.toarray()

# Create target vector

target = np.array([0,0,1])

# Create multinomial naive Bayes object with prior probabilities of each class

classifier = MultinomialNB(class_prior=[0.25, 0.5])

# Train model

model = classifier.fit(features, target)

Using the Model

# Create new observation

new_observation = [[0, 0, 0, 1, 0, 1, 0]]

# Predict new observation's class

model.predict(new_observation)

# array([0])

25

BernoulliNB

• Works similar to MultinomialNB but use with binary data

• See example in the notebook

26

Distance-Based Classification Algorithms

27

Distance-Based Classification Algorithms

• Tuples in the same class are more similar to each other

• A similarity or distance measure is used to determine how similar tuples are

• The problem is how to define similarity measures

28

Distance Measures

• xi = (xi1, xi2が ぐが xip) and xj = (xj1, xj2が ぐが xjp) are two p- dimensional data objects

• Minkowski distance:

where q is a positive integer

• If q = 2, d is Euclidean distance

• If q = 1, d is Manhattan distance

29

穴岫捲件┸ 捲倹岻 噺 忍 岫】捲沈怠 伐 捲珍怠】槌 髪 】捲沈態 伐 捲珍態】槌髪┻ ┻ ┻ 髪】捲沈椎 伐 捲珍椎】槌岻

穴岫捲件┸ 捲倹岻 噺 】捲沈怠 伐 捲珍怠】 髪 】捲沈態 伐 捲珍態】髪┻ ┻ ┻ 髪】捲沈椎 伐 捲珍椎】 穴岫捲件┸ 捲倹岻 噺 布賃退怠椎 岫捲沈賃 伐 捲珍賃岻態

Distance-Based Classification Algorithms

• Pノ;IW キデWマゲ キミ Iノ;ゲゲ デラ ┘エキIエ デエW┞ ;ヴW さIノラゲWゲデざ • Must determine distance between an item and a

class

• Classes represented by – Centroid: central value – Medoid: actual point near the centroid – Individual points

• Example: C = {1, 4, 9} – Centroid is ________ – Medoid is _________

30

Simple Distance Based Classification Algorithm

• Decide on the representative for each class • DWIキSW ラミ デエW ゲキマキノ;ヴキデ┞ さラヴ Sキゲデ;ミIWざ マW;ゲ┌ヴW デラ ┌ゲW • To classify a new sample X, it will be compared to the

representative of each class

– Place X in the same class with the representative that is マラゲデ ゲキマキノ;ヴ さラヴ IノラゲWゲデざ デラ キデ

31

K Nearest Neighbors Algorithm

• Requires three things – Set of stored objects – Distance measure to compute

distance between objects

– The value of k, the number of nearest neighbors to retrieve

• To classify an unknown object: – Compute distance to the training

objects

– Identify k nearest neighbors – Use class labels of nearest

neighbors to determine the class label of unknown object (e.g., by taking majority vote)

32

Unknown record

33

Definition of Nearest Neighbors

X X X

(a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor

K-nearest neighbors of an object x are data points that have

the k smallest distance to x

K Nearest Neighbors Algorithm

• Compute distance between two points: – Euclidean distance

• Determine the class from nearest neighbors list – take the majority vote of class labels among the k-nearest

neighbors (simple voting)

– Weigh the vote according to distance (weighted voting ) • weight factor, w = 1/d2

34

 −= i ii

qpqpd 2)(),(

Choosing the Value of K

• If k is too small, sensitive to noise points • If k is too large, neighborhood may include points

from other classes

35

X

• k is usually chosen empirically by trying a

range of values

Scaling Values

• Attributes may have to be scaled to prevent distance measures from being dominated by one of the

attributes

• Example: – height of a person may vary from 1.5m to 1.8m – weight of a person may vary from 90lb to 300lb – income of a person may vary from $10K to $1M

36

Eager vs Lazy Learners

• KNN is a lazy learner – It does not build a model from the training data – Unlike eager learners such as decision tree induction and

Naïve Bayes

– Classifying unknown objects is relatively expensive

37

KNN Algorithm Input:

D //Training data

k //Number of neighbors

t //Input tuple to classify

Output:

c //Class to which t is assigned

KNN Algorithm: //Algorithm to classify tuple using KNN

N = Ø ;

//Find set of neighbors, N, for t

foreach d in D do

if |N| 隼 k then N = N  d;

else if  u in N such that distance(t, d) 隼 distance(t, u) replace d by the neighbor in N with the largest distance to t

c = class to which most u in N are classified

Source: Dunham, Data Mining に Introductory and Advanced Topics, Algorithm 4.2 page 92

Remarks

• If K = 1, nearest neighbor algorithm • Choice of K is important for KNN • [Dunham] As a rule of thumb,

ニ г ミ┌マHWヴ ラa デヴ;キミキミェ ゲ;マヮノWゲ ; commercial algorithms often use a default value of 10.

• Researchers have shown that the classification accuracy of KNN can be as accurate as more elaborated methods

• KNN is slow at the classification time • KNN does not produce an understandable model

39

KNN in Scikit-Learn

# Load libraries

from sklearn.neighbors import KNeighborsClassifier

from sklearn.preprocessing import StandardScaler

from sklearn import datasets

# Load data

iris = datasets.load_iris()

X = iris.data

y = iris.target

40

KNN in Scikit-Learn (cont)

# Create standardizer

standardizer = StandardScaler()

# Standardize features

X_std = standardizer.fit_transform(X)

# Train a KNN classifier with 5 neighbors

knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)

# Create two observations

new_observations = [[ 0.75, 0.75, 0.75, 0.75], [ 1, 1, 1, 1]]

# Predict the class of two observations

knn.predict(new_observations)

array([1, 2])

41

KNeighborsClassifier Parameters

• Parameters we can set include: – n_neighbors: value of K (default is 5) – metric: 'minkowski' (the default)| 'euclidean'| 'manhattan' – p: integer, (default = 2) Power parameter for the

Minkowski metric

– weights: 'uniform' (the default)| 'distance'

• Notebook name: KNN.ipynb

42