GEN101-AIPerformanceEvaluation1.pptx

GEN101 Introductory Artificial Intelligence

Artificial Intelligence Performance Evaluation

College of Engineering

Model Evaluation

Metrics for Performance Evaluation

How to evaluate the performance of a model?

Methods for Performance Evaluation

How to obtain reliable estimates?

Methods for Model Comparison

How to compare the relative performance among competing models?

Metrics for Performance Evaluation

We will focus on the predictive capability of a model

Rather than how long it takes to classify or build models, scalability, etc.

We have already seen: the Confusion Matrix

	PREDICTED CLASS (AI)
ACTUAL CLASS (Human)		Class=Yes	Class=No
	Class=Yes	a (TP)	b (FN)
	Class=No	c (FP)	d (TN)

TP: True Positive

FP: False Positive

TN: True Negative

FN: False Negative

Most widely-used metric:

$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$

Limitation of Accuracy

Consider a 2-class problem

Number of Class 0 (e.g., Benign) examples = 9990

Number of Class 1 (e.g., Malignant) examples = 10

If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%

Accuracy is misleading because the model does not detect any class 1 example
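The arithmetic above can be checked with a short sketch. The label lists and the always-predict-0 "model" are illustrative constructions built from the counts on this slide, not code from the course.

```python
# Illustrative check: 9990 class-0 (benign) and 10 class-1 (malignant) examples.
actual = [0] * 9990 + [1] * 10
predicted = [0] * len(actual)  # assumed "model" that always predicts class 0

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Number of class-1 examples the model actually detected
detected = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)

print(f"accuracy = {accuracy:.1%}")
print(f"class 1 examples detected = {detected} of {actual.count(1)}")
```

The accuracy is high even though none of the class 1 examples is found, which is why the measures that follow look at each class separately.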

Other Measures

	PREDICTED CLASS
ACTUAL CLASS		Class=Yes	Class=No
	Class=Yes	a (TP)	b (FN)
	Class=No	c (FP)	d (TN)

Precision-Recall

Two other metrics that are often used to quantify model performance are precision and recall.

Precision is defined as the number of true positives divided by the total number of positive predictions.

It quantifies what percentage of the positive predictions were correct

How correct your model’s positive predictions were.

Recall/Sensitivity is defined as the number of true positives divided by the total number of true positives and false negatives (all actual positives)

It quantifies what percentage of the actual positives you were able to identify

How sensitive your model was in identifying positives.

https://medium.com/@douglaspsteen/precision-recall-curves-d32e5b290248

Precision-Recall

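$\text{Precision} = \frac{TP}{TP + FP}$

$\text{Recall} = \frac{TP}{TP + FN}$

$\text{Sensitivity} = \frac{TP}{TP + FN}$

$\text{Specificity} = \frac{TN}{TN + FP}$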

Out of all the examples that were predicted as positive, how many are correct? How precise was the AI?

Out of all the positive examples, how many were caught (recalled)? Did the AI miss?

Out of all the people that do have the disease, how many got positive test results? Did the AI miss anyone?

Out of all the people that do NOT have the disease, how many got negative results?

https://towardsdatascience.com/should-i-look-at-precision-recall-or-specificity-sensitivity-3946158aace1
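As a minimal sketch, the four measures can be computed directly from the confusion-matrix counts. The function name `classification_metrics` and the counts in the usage line are hypothetical, chosen only for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return precision, recall (= sensitivity) and specificity from counts."""
    def ratio(num, den):
        # Avoid division by zero when a row or column of the matrix is empty
        return num / den if den else float("nan")

    return {
        "precision": ratio(tp, tp + fp),    # correct share of positive predictions
        "recall": ratio(tp, tp + fn),       # share of actual positives found
        "specificity": ratio(tn, tn + fp),  # share of actual negatives found
    }

# Hypothetical counts, not taken from the slides
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```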

Precision-Recall

Example

In this example:

If the classifier predicts negative, you can trust it: the example is negative. Our AI makes no mistakes when it predicts negative; it is sensitive. However, note that if the example is negative, you can’t be sure the classifier will predict it as negative (specificity=78%).

If the classifier predicts positive, you can’t trust it (precision=33%)

However, if the example is positive, you can trust the classifier will find it (will not miss it) (recall=100%).

Precision: Is the green box pure?

Recall/Sensitivity: Did I lose any of the green balls to the red box? Did I miss any greens? Am I so sensitive to greens that I do not miss them?

Specificity: Are all the red balls in the red box?
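The figure for this example is not reproduced here, so the counts below are an assumption: one small confusion matrix (TP=1, FP=2, TN=7, FN=0) that happens to reproduce the rounded percentages quoted above. It is a sketch for checking the reasoning, not the slide's actual data.

```python
# Assumed counts chosen to match precision=33%, recall=100%, specificity=78%
tp, fp, tn, fn = 1, 2, 7, 0

precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
# Share of negative predictions that are correct (negative predictive value)
npv = tn / (tn + fn)

print(f"precision   = {precision:.0%}")
print(f"recall      = {recall:.0%}")
print(f"specificity = {specificity:.0%}")
print(f"negative predictions that are correct = {npv:.0%}")
```

With these counts every negative prediction is correct, which matches the statement that a negative prediction can be trusted, while only one in three positive predictions is correct.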

Precision-Recall

Example

In this example:

Since the population is imbalanced:

The precision is relatively high

The recall is 100% because all the positive examples are predicted as positive.

The specificity is 0% because no negative example is predicted as negative.

Precision-Recall

Example

In this example:

If it predicts that an example is positive, you can trust it — it is positive.

However, if it predicts it is negative, you can’t trust it, the chances are that it is still positive.

Can be a useful classifier

Precision-Recall

Example

In this example:

The classifier detects all the positive examples as positive

It also detects all negative examples as negative.

All the measures are at 100%.

Why so many measures? Which is more important?

Because it depends

False positive is expensive. False alarm is a nightmare!

Precision is more important than recall

False negative is expensive. Missing one is a nightmare!

Recall is more important than precision

Precision-Recall Curves

The precision-recall curve is used for evaluating the performance of binary classification algorithms.

It provides a graphical representation of a classifier’s performance across many thresholds, rather than a single value.

It is constructed by calculating and plotting the precision against the recall for a single classifier at a variety of thresholds.

It helps to visualize how the choice of threshold affects classifier performance and can even help us select the best threshold for a specific problem.
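The construction described above can be sketched as a threshold sweep. The scores and labels are hypothetical, and `precision_recall_points` is an illustrative helper name rather than a library function.

```python
def precision_recall_points(scores, labels):
    """Return (threshold, precision, recall) for each unique score used as a threshold."""
    total_pos = sum(labels)
    points = []
    for t in sorted(set(scores), reverse=True):
        predicted = [s >= t for s in scores]  # positive if score >= threshold
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        precision = tp / (tp + fp)  # tp + fp >= 1 because some score equals t
        recall = tp / total_pos
        points.append((t, precision, recall))
    return points

# Hypothetical classifier scores and true labels (1 = positive, 0 = negative)
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]

for t, p, r in precision_recall_points(scores, labels):
    print(f"threshold >= {t:.1f}: precision = {p:.2f}, recall = {r:.2f}")
```

Plotting precision against recall for these points gives the curve; lowering the threshold raises recall and, in this toy data, lowers precision.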

Precision-Recall Curves Interpretation

https://medium.com/@douglaspsteen/precision-recall-curves-d32e5b290248

Precision-Recall Curves

A model that produces a precision-recall curve that is closer to the top-right corner is better than a model that produces a precision-recall curve that is skewed towards the bottom of the plot.

https://paulvanderlaken.com/2019/08/16/roc-auc-precision-and-recall-visually-explained/

Precision-Recall Curves

Example

Which of the following P-R curves represents a perfect classifier?

(a)

(b)

(c)

https://towardsdatascience.com/gaining-an-intuitive-understanding-of-precision-and-recall-3b9df37804a7

(b) -> perfect classifier

(a) -> very bad classifier (random classifier)

Methods for Performance Evaluation

Performance of a model may depend on other factors besides the learning algorithm:

Class distribution

Size of training and test sets

How to obtain a reliable estimate of performance?

Methods of Estimation:

Holdout

Reserve 2/3 for training and 1/3 for testing

Random subsampling

Repeated holdout

Cross validation

Partition data into k disjoint subsets

k-fold: train on k-1 partitions, test on the remaining one

Leave-one-out: k=n
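A minimal sketch of the holdout and k-fold schemes listed above, working on example indices only. The helper names `holdout_split` and `k_fold_splits`, the seed, and the data sizes are assumptions made for illustration.

```python
import random

def holdout_split(n, test_fraction=1 / 3, seed=0):
    """Shuffle indices 0..n-1 and reserve a fraction of them for testing."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train, test)

def k_fold_splits(n, k, seed=0):
    """Yield (train, test) index lists for k disjoint folds; k = n is leave-one-out."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint subsets
    for i, test in enumerate(folds):
        # Train on the other k-1 folds, test on the remaining one
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

train, test = holdout_split(12)
print(len(train), "training examples,", len(test), "test examples")

for fold_train, fold_test in k_fold_splits(6, k=3):
    print("train:", sorted(fold_train), "test:", sorted(fold_test))
```

Random subsampling corresponds to calling `holdout_split` several times with different seeds and averaging the resulting performance estimates.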

Model Evaluation

Metrics for Performance Evaluation

How to evaluate the performance of a model?

Methods for Performance Evaluation

How to obtain reliable estimates?

Methods for Model Comparison

How to compare the relative performance among competing models?

ROC (Receiver Operating Characteristic)

Developed in the 1950s for signal detection theory to analyze noisy signals

Characterizes the trade-off between positive hits and false alarms

ROC curve plots TP rate (y-axis) against FP rate (x-axis)

Performance of each classifier is represented as a point on the ROC curve

Changing the threshold of the algorithm or the sample distribution changes the location of the point

$\text{TP rate (TPR)} = \frac{TP}{TP + FN}$

$\text{FP rate (FPR)} = \frac{FP}{FP + TN}$

ROC (Receiver Operating Characteristic)

1-dimensional data set containing 2 classes (positive and negative)

- any point located at x > t is classified as positive

At threshold t:

TPR=0.5, FPR=0.12

ROC (Receiver Operating Characteristic)

(FPR, TPR):

(0,0): declare everything to be negative class

(1,1): declare everything to be positive class

(0,1): ideal

Diagonal line:

Random guessing

Below diagonal line:

prediction is opposite of the true class
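These special points can be reproduced with a small sketch. Points are written in (FPR, TPR) order, matching the ROC axes (FP rate on x, TP rate on y). The labels and the `roc_point` helper are hypothetical.

```python
def roc_point(actual, predicted):
    """Return (FPR, TPR) for boolean lists of actual and predicted labels."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fn = sum(a and not p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    return fp / (fp + tn), tp / (tp + fn)

# Hypothetical ground truth: 3 positives, 4 negatives
actual = [True, True, True, False, False, False, False]

all_negative = [False] * len(actual)   # declare everything negative
all_positive = [True] * len(actual)    # declare everything positive
perfect = list(actual)                 # every prediction correct
inverted = [not a for a in actual]     # every prediction opposite of the truth

print("all negative:", roc_point(actual, all_negative))
print("all positive:", roc_point(actual, all_positive))
print("perfect:     ", roc_point(actual, perfect))
print("inverted:    ", roc_point(actual, inverted))
```

The inverted classifier lands at FPR = 1, TPR = 0, which lies below the diagonal.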

Using ROC for Model Comparison

No model consistently outperforms the other

M1 is better for small FPR

M2 is better for large FPR

Area Under the ROC curve

Ideal:

Area = 1

Random guess:

Area = 0.5

How to construct an ROC curve

Instance	P(+|A)	True Class
1	0.85	+
2	0.53	+
3	0.87	-
4	0.85	-
5	0.85	-
6	0.95	+
7	0.76	-
8	0.93	+
9	0.43	-
10	0.25	+

Use a classifier that produces a probability P(+|A) for each test instance A

Sort the instances according to P(+|A) in decreasing order

Apply a threshold at each unique value of P(+|A) and count the number of TP, FP, TN, FN at each threshold

$\text{TP rate (TPR)} = \frac{TP}{TP + FN}$

$\text{FP rate (FPR)} = \frac{FP}{FP + TN}$

How to construct an ROC curve

Instance	P(+|A)	True Class
1	0.95	+
2	0.93	+
3	0.87	-
4	0.85	-
5	0.85	-
6	0.85	+
7	0.76	-
8	0.53	+
9	0.43	-
10	0.25	+

How to construct an ROC curve

#	Threshold >=	True Class (Human)	AI	TP	FP	TN	FN
1	0.95	+	+	1	0	5	4
2	0.93	+	-	2	0	5	3
3	0.87	-	-	2	1	4	3
4	0.85	-	-
5	0.85	-	-
6	0.85	+	-	3	3	2	2
7	0.76	-	-	3	4	1	2
8	0.53	+	-	4	4	1	1
9	0.43	-	-	4	5	0	1
10	0.25	+	-	5	5	0	0

How to construct an ROC curve

Instance	P(+|A)	True Class	TP	FP	TN	FN	FPR	TPR
1	0.95	+	1	0	5	4	0	1/5
2	0.93	+	2	0	5	3	0	2/5
3	0.87	-	2	1	4	3	1/5	2/5
4	0.85	-
5	0.85	-
6	0.85	+	3	3	2	2	3/5	3/5
7	0.76	-	3	4	1	2	4/5	3/5
8	0.53	+	4	4	1	1	4/5	4/5
9	0.43	-	4	5	0	1	1	4/5
10	0.25	+	5	5	0	0	1	1
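The table above can be reproduced with a short sketch that applies a threshold at each unique score. The scores and classes are the ten instances from the table; the function name `roc_table` is an illustrative choice, and `Fraction` is used only so the rates print as fifths.

```python
from fractions import Fraction

# P(+|A) and true class for the ten instances in the table, sorted by score
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]

def roc_table(scores, labels):
    """Return rows (threshold, TP, FP, TN, FN, FPR, TPR), one per unique score."""
    n_pos = labels.count("+")
    n_neg = labels.count("-")
    rows = []
    for t in sorted(set(scores), reverse=True):
        # An instance is predicted positive when its score is >= the threshold
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "+")
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "-")
        fn, tn = n_pos - tp, n_neg - fp
        rows.append((t, tp, fp, tn, fn, Fraction(fp, n_neg), Fraction(tp, n_pos)))
    return rows

for t, tp, fp, tn, fn, fpr, tpr in roc_table(scores, labels):
    print(f"{t:.2f}  TP={tp} FP={fp} TN={tn} FN={fn}  FPR={fpr}  TPR={tpr}")
```

The three instances tied at 0.85 produce a single row, which is why the table leaves two of those rows blank.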

Breakout Session

How to construct an ROC curve

Instance	P(+|A)	True Class	FPR	TPR
1	0.95	+
2	0.93	+
3	0.87	+
4	0.85	+
5	0.83	+
6	0.80	-
7	0.76	-
8	0.53	-
9	0.43	-
10	0.25	-

Example

How to construct an ROC curve

Instance	P(+|A)	True Class	FPR	TPR
1	0.95	+	0	1/5
2	0.93	+	0	2/5
3	0.87	+	0	3/5
4	0.85	+	0	4/5
5	0.83	+	0	1
6	0.80	-	1/5	1
7	0.76	-	2/5	1
8	0.53	-	3/5	1
9	0.43	-	4/5	1
10	0.25	-	1	1

Example

How to construct an ROC curve

Instance	P(+|A)	True Class	FPR	TPR
1	0.95	+	0	1/5
2	0.93	-	1/5	1/5
3	0.87	+	1/5	2/5
4	0.85	-	2/5	2/5
5	0.83	+	2/5	3/5
6	0.80	-	3/5	3/5
7	0.76	+	3/5	4/5
8	0.53	-	4/5	4/5
9	0.43	+	4/5	1
10	0.25	-	1	1

Example

ROC Interpretation

AUC (Area Under the Curve)

AUC stands for "Area under the ROC Curve."

It measures the entire two-dimensional area underneath the entire ROC curve from (0,0) to (1,1)

It provides an aggregate measure of performance across all possible classification thresholds.

AUC ranges in value from 0 to 1.

A model whose predictions are 100% wrong has an AUC of 0.0, which means it has the worst measure of separability

A model whose predictions are 100% correct has an AUC of 1.0, which means it has the best measure of separability

When AUC is 0.5, it means the model has no class separation capacity
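A minimal sketch of how these three AUC values arise, using the trapezoidal rule on (FPR, TPR) points. The three curves are idealized, hypothetical ROC curves, and `auc` is an illustrative helper, not a library call.

```python
from fractions import Fraction as F

def auc(points):
    """Area under a piecewise-linear ROC curve given (FPR, TPR) points sorted by FPR."""
    area = F(0)
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between consecutive points
    return area

# Idealized curves, each running from (0,0) to (1,1)
perfect = [(F(0), F(0)), (F(0), F(1)), (F(1), F(1))]    # 100% correct
diagonal = [(F(0), F(0)), (F(1), F(1))]                 # no class separation
inverted = [(F(0), F(0)), (F(1), F(0)), (F(1), F(1))]   # 100% wrong

print("perfect: ", auc(perfect))
print("diagonal:", auc(diagonal))
print("inverted:", auc(inverted))
```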

AUC (Area Under the Curve)

https://paulvanderlaken.com/2019/08/16/roc-auc-precision-and-recall-visually-explained/