GEN101 Introductory Artificial Intelligence
Artificial Intelligence Performance Evaluation
College of Engineering
Model Evaluation
Metrics for Performance Evaluation
How to evaluate the performance of a model?
Methods for Performance Evaluation
How to obtain reliable estimates?
Methods for Model Comparison
How to compare the relative performance among competing models?
Metrics for Performance Evaluation
We will focus on the predictive capability of a model
Rather than how long it takes to classify or build models, scalability, etc.
We have already seen: the Confusion Matrix
| | PREDICTED CLASS (AI): Class=Yes | PREDICTED CLASS (AI): Class=No |
| ACTUAL CLASS (Human): Class=Yes | a (TP) | b (FN) |
| ACTUAL CLASS (Human): Class=No | c (FP) | d (TN) |
TP: True Positive
FP: False Positive
TN: True Negative
FN: False Negative
Most widely-used metric: $\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$
Limitation of Accuracy
Consider a 2-class problem
Number of Class 0 (e.g., Benign) examples = 9990
Number of Class 1 (e.g., Malignant) examples = 10
If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 %
Accuracy is misleading because model does not detect any class 1 example
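The imbalance problem can be checked numerically. The sketch below uses the class counts from this slide and a trivial "model" that always predicts class 0; the variable names are illustrative only.

```python
# Class counts from the slide: 9990 benign (class 0) and 10 malignant (class 1).
y_true = [0] * 9990 + [1] * 10

# A trivial "model" that predicts class 0 for every example.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on class 1: how many malignant cases were actually detected?
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall_class1 = detected / y_true.count(1)

print(f"accuracy = {accuracy:.3f}")                # 0.999
print(f"recall (class 1) = {recall_class1:.3f}")   # 0.000
```

The accuracy looks excellent even though not a single class 1 example is found.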
Other Measures
| | PREDICTED CLASS: Class=Yes | PREDICTED CLASS: Class=No |
| ACTUAL CLASS: Class=Yes | a (TP) | b (FN) |
| ACTUAL CLASS: Class=No | c (FP) | d (TN) |
Precision-Recall
Two other metrics that are often used to quantify model performance are precision and recall.
Precision is defined as the number of true positives divided by the total number of positive predictions.
It quantifies what percentage of the positive predictions were correct
How correct your model’s positive predictions were.
Recall/Sensitivity is defined as the number of true positives divided by the total number of true positives and false negatives (all actual positives)
It quantifies what percentage of the actual positives you were able to identify
How sensitive your model was in identifying positives.
https://medium.com/@douglaspsteen/precision-recall-curves-d32e5b290248
Precision-Recall
Precision = $\frac{TP}{TP + FP}$
Out of all the examples that were predicted as positive, how many are correct? How precise was the AI?
Recall = $\frac{TP}{TP + FN}$
Out of all the positive examples, how many were caught (recalled)? Did the AI miss any?
Sensitivity = $\frac{TP}{TP + FN}$
Out of all the people that do have the disease, how many got positive test results? Did the AI miss anyone?
Specificity = $\frac{TN}{TN + FP}$
Out of all the people that do NOT have the disease, how many got negative results?
https://towardsdatascience.com/should-i-look-at-precision-recall-or-specificity-sensitivity-3946158aace1
Precision-Recall
Example
In this example:
If the classifier predicts negative, you can trust it: the example is negative. The AI makes no mistakes on its negative predictions because it is sensitive (it misses no positives). However, if the example is negative, you can’t be sure it will be predicted as negative (specificity=78%).
If the classifier predicts positive, you can’t trust it (precision=33%).
However, if the example is positive, you can trust that the classifier will find it (will not miss it) (recall=100%).
Precision: Is the green box pure?
Recall/Sensitivity: Did I lose any of the green balls to the red box? Did I miss any greens? Am I so sensitive to greens that I do not miss them?
Specificity: Are all the red balls in the red box?
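As a rough illustration, the three measures can be computed directly from confusion-matrix counts. The counts below are hypothetical (they are not read from the slide's figure); they are simply one combination that gives precision of about 33%, recall of 100% and specificity of about 78%. The function name `classification_metrics` is also an assumption made for this sketch.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return precision, recall (sensitivity) and specificity from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"precision": precision, "recall": recall, "specificity": specificity}

# Hypothetical counts: 1 true positive, 0 false negatives, 2 false positives, 7 true negatives.
metrics = classification_metrics(tp=1, fn=0, fp=2, tn=7)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

With these counts the "green box" holds one green ball and two red ones (low precision), no green ball is lost to the red box (full recall), and seven of nine red balls are in the red box.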
Precision-Recall
Example
In this example:
Since the population is imbalanced:
The precision is relatively high
The recall is 100% because all the positive examples are predicted as positive.
The specificity is 0% because no negative example is predicted as negative.
Precision-Recall
Example
In this example:
If it predicts that an example is positive, you can trust it — it is positive.
However, if it predicts it is negative, you can’t trust it, the chances are that it is still positive.
Can be a useful classifier
Precision-Recall
Example
In this example:
The classifier detects all the positive examples as positive
It also detects all negative examples as negative.
All the measures are at 100%.
Why so many measures? Which is more important?
Because it depends on which kind of error costs more
False positive is expensive (a false alarm is a nightmare!): precision is more important than recall
False negative is expensive (missing one is a nightmare!): recall is more important than precision
Precision-Recall Curves
Precision-recall curves are used for evaluating the performance of binary classification algorithms.
They provide a graphical representation of a classifier’s performance across many thresholds, rather than a single value.
A curve is constructed by calculating and plotting the precision against the recall for a single classifier at a variety of thresholds.
It helps to visualize how the choice of threshold affects classifier performance and can even help us select the best threshold for a specific problem.
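A minimal sketch of how the points of such a curve could be computed, assuming an instance is predicted positive when its score is at or above the threshold. The scores and labels are hypothetical, and `precision_recall_points` is a name introduced only for this example.

```python
def precision_recall_points(scores, labels):
    """Return (threshold, precision, recall) for each unique score used as a threshold.

    An instance is predicted positive when its score >= threshold.
    """
    total_pos = sum(labels)
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        precision = tp / (tp + fp)  # tp + fp >= 1 because t is one of the scores
        recall = tp / total_pos
        points.append((t, precision, recall))
    return points

# Hypothetical scores and labels (1 = positive, 0 = negative), for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

for t, p, r in precision_recall_points(scores, labels):
    print(f"threshold >= {t:.1f}: precision = {p:.2f}, recall = {r:.2f}")
```

Lowering the threshold can only keep or raise recall, while precision may move in either direction; plotting the (recall, precision) pairs gives the curve.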
Precision-Recall Curves Interpretation
https://medium.com/@douglaspsteen/precision-recall-curves-d32e5b290248
Precision-Recall Curves
A model that produces a precision-recall curve that is closer to the top-right corner is better than a model that produces a precision-recall curve that is skewed towards the bottom of the plot.
https://paulvanderlaken.com/2019/08/16/roc-auc-precision-and-recall-visually-explained/
Precision-Recall Curves
Example
Which of the following P-R curves represents a perfect classifier?
(a)
(b)
(c)
https://towardsdatascience.com/gaining-an-intuitive-understanding-of-precision-and-recall-3b9df37804a7
(b) -> perfect classifier
(a) -> very bad classifier (random classifier)
Model Evaluation
Metrics for Performance Evaluation
How to evaluate the performance of a model?
Methods for Performance Evaluation
How to obtain reliable estimates?
Methods for Model Comparison
How to compare the relative performance among competing models?
Methods for Performance Evaluation
Performance of a model may depend on other factors besides the learning algorithm:
Class distribution
Size of training and test sets
How to obtain a reliable estimate of performance?
Methods of Estimation:
Holdout: reserve 2/3 for training and 1/3 for testing
Random subsampling: repeated holdout
Cross validation: partition data into k disjoint subsets
k-fold: train on k-1 partitions, test on the remaining one
Leave-one-out: k=n
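The estimation methods can be sketched as index-splitting helpers. This is an illustrative sketch only: the function names, the shuffling with a fixed seed, and the way folds are formed are assumptions, not something prescribed by the slides.

```python
import random

def holdout_split(n, train_fraction=2 / 3, seed=0):
    """Shuffle indices 0..n-1 and split them into train and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]

def k_fold_splits(n, k, seed=0):
    """Yield (train, test) index lists for k disjoint folds; k == n gives leave-one-out."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint subsets covering all indices
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

train, test = holdout_split(12)
print(len(train), len(test))  # 8 4

for train, test in k_fold_splits(12, k=3):
    print(len(train), sorted(test))
```

Random subsampling corresponds to calling `holdout_split` several times with different seeds and averaging the resulting performance estimates.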
Model Evaluation
Metrics for Performance Evaluation
How to evaluate the performance of a model?
Methods for Performance Evaluation
How to obtain reliable estimates?
Methods for Model Comparison
How to compare the relative performance among competing models?
ROC (Receiver Operating Characteristic)
Developed in the 1950s for signal detection theory to analyze noisy signals
Characterizes the trade-off between positive hits and false alarms
The ROC curve plots TP rate (y-axis) against FP rate (x-axis)
The performance of each classifier is represented as a point on the ROC curve
Changing the threshold of the algorithm or the sample distribution changes the location of the point
TP rate (TPR) = $\frac{TP}{TP + FN}$
FP rate (FPR) = $\frac{FP}{FP + TN}$
ROC (Receiver Operating Characteristic)
1-dimensional data set containing 2 classes (positive and negative)
- any point located at x > t is classified as positive
At threshold t:
TPR=0.5, FPR=0.12
ROC (Receiver Operating Characteristic)
(FPR, TPR):
(0,0): declare everything to be negative class
(1,1): declare everything to be positive class
(0,1): ideal
Diagonal line:
Random guessing
Below diagonal line:
prediction is opposite of the true class
Using ROC for Model Comparison
Neither model consistently outperforms the other
M1 is better for small FPR
M2 is better for large FPR
Area Under the ROC curve
Ideal:
Area = 1
Random guess:
Area = 0.5
How to construct an ROC curve
| Instance | P(+|A) | True Class |
| 1 | 0.85 | + |
| 2 | 0.53 | + |
| 3 | 0.87 | - |
| 4 | 0.85 | - |
| 5 | 0.85 | - |
| 6 | 0.95 | + |
| 7 | 0.76 | - |
| 8 | 0.93 | + |
| 9 | 0.43 | - |
| 10 | 0.25 | + |
Use a classifier that produces a probability P(+|A) for each test instance A
Sort the instances according to P(+|A) in decreasing order
Apply a threshold at each unique value of P(+|A) and count the number of TP, FP, TN, FN at each threshold
TP rate (TPR) = $\frac{TP}{TP + FN}$
FP rate (FPR) = $\frac{FP}{FP + TN}$
How to construct an ROC curve
| Instance | P(+|A) | True Class |
| 1 | 0.95 | + |
| 2 | 0.93 | + |
| 3 | 0.87 | - |
| 4 | 0.85 | - |
| 5 | 0.85 | - |
| 6 | 0.85 | + |
| 7 | 0.76 | - |
| 8 | 0.53 | + |
| 9 | 0.43 | - |
| 10 | 0.25 | + |
How to construct an ROC curve
| # | Threshold >= | True Class (Human) | AI | TP | FP | TN | FN |
| 1 | 0.95 | + | + | 1 | 0 | 5 | 4 |
| 2 | 0.93 | + | - | 2 | 0 | 5 | 3 |
| 3 | 0.87 | - | - | 2 | 1 | 4 | 3 |
| 4 | 0.85 | - | - | | | | |
| 5 | 0.85 | - | - | | | | |
| 6 | 0.85 | + | - | 3 | 3 | 2 | 2 |
| 7 | 0.76 | - | - | 3 | 4 | 1 | 2 |
| 8 | 0.53 | + | - | 4 | 4 | 1 | 1 |
| 9 | 0.43 | - | - | 4 | 5 | 0 | 1 |
| 10 | 0.25 | + | - | 5 | 5 | 0 | 0 |
Instances 4, 5 and 6 share the score 0.85, so they are covered by a single threshold; their counts appear once, in row 6.
How to construct an ROC curve
| Instance | P(+|A) | True Class | TP | FP | TN | FN | FPR | TPR |
| 1 | 0.95 | + | 1 | 0 | 5 | 4 | 0 | 1/5 |
| 2 | 0.93 | + | 2 | 0 | 5 | 3 | 0 | 2/5 |
| 3 | 0.87 | - | 2 | 1 | 4 | 3 | 1/5 | 2/5 |
| 4 | 0.85 | - | | | | | | |
| 5 | 0.85 | - | | | | | | |
| 6 | 0.85 | + | 3 | 3 | 2 | 2 | 3/5 | 3/5 |
| 7 | 0.76 | - | 3 | 4 | 1 | 2 | 4/5 | 3/5 |
| 8 | 0.53 | + | 4 | 4 | 1 | 1 | 4/5 | 4/5 |
| 9 | 0.43 | - | 4 | 5 | 0 | 1 | 1 | 4/5 |
| 10 | 0.25 | + | 5 | 5 | 0 | 0 | 1 | 1 |
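The construction can be sketched in a few lines, using the ten (P(+|A), true class) pairs from the table and treating an instance as positive when its probability is at or above the threshold. The function name `roc_points` is introduced only for this sketch.

```python
from fractions import Fraction

# (P(+|A), true class) pairs copied from the table.
instances = [
    (0.95, "+"), (0.93, "+"), (0.87, "-"), (0.85, "-"), (0.85, "-"),
    (0.85, "+"), (0.76, "-"), (0.53, "+"), (0.43, "-"), (0.25, "+"),
]

def roc_points(instances):
    """Return (threshold, TP, FP, TN, FN, FPR, TPR) for each unique threshold."""
    n_pos = sum(1 for _, c in instances if c == "+")
    n_neg = len(instances) - n_pos
    rows = []
    for t in sorted({p for p, _ in instances}, reverse=True):
        tp = sum(1 for p, c in instances if p >= t and c == "+")
        fp = sum(1 for p, c in instances if p >= t and c == "-")
        fn, tn = n_pos - tp, n_neg - fp
        rows.append((t, tp, fp, tn, fn, Fraction(fp, n_neg), Fraction(tp, n_pos)))
    return rows

for t, tp, fp, tn, fn, fpr, tpr in roc_points(instances):
    print(f">= {t:.2f}  TP={tp} FP={fp} TN={tn} FN={fn}  FPR={fpr}  TPR={tpr}")
```

Because the three instances scored 0.85 share one threshold, the sketch produces eight rows, matching the eight filled rows of the table. Plotting (FPR, TPR) for each row, starting from (0, 0), traces the ROC curve.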
Breakout Session
How to construct an ROC curve
| Instance | P(+|A) | True Class | FPR | TPR |
| 1 | 0.95 | + | | |
| 2 | 0.93 | + | | |
| 3 | 0.87 | + | | |
| 4 | 0.85 | + | | |
| 5 | 0.83 | + | | |
| 6 | 0.80 | - | | |
| 7 | 0.76 | - | | |
| 8 | 0.53 | - | | |
| 9 | 0.43 | - | | |
| 10 | 0.25 | - | | |
Example
How to construct an ROC curve
| Instance | P(+|A) | True Class | FPR | TPR |
| 1 | 0.95 | + | 0 | 1/5 |
| 2 | 0.93 | + | 0 | 2/5 |
| 3 | 0.87 | + | 0 | 3/5 |
| 4 | 0.85 | + | 0 | 4/5 |
| 5 | 0.83 | + | 0 | 1 |
| 6 | 0.80 | - | 1/5 | 1 |
| 7 | 0.76 | - | 2/5 | 1 |
| 8 | 0.53 | - | 3/5 | 1 |
| 9 | 0.43 | - | 4/5 | 1 |
| 10 | 0.25 | - | 1 | 1 |
Example
How to construct an ROC curve
| Instance | P(+|A) | True Class | FPR | TPR |
| 1 | 0.95 | + | 0 | 1/5 |
| 2 | 0.93 | - | 1/5 | 1/5 |
| 3 | 0.87 | + | 1/5 | 2/5 |
| 4 | 0.85 | - | 2/5 | 2/5 |
| 5 | 0.83 | + | 2/5 | 3/5 |
| 6 | 0.80 | - | 3/5 | 3/5 |
| 7 | 0.76 | + | 3/5 | 4/5 |
| 8 | 0.53 | - | 4/5 | 4/5 |
| 9 | 0.43 | + | 4/5 | 1 |
| 10 | 0.25 | - | 1 | 1 |
Example
ROC Interpretation
AUC (Area Under the Curve)
AUC stands for "Area under the ROC Curve."
It measures the entire two-dimensional area underneath the entire ROC curve from (0,0) to (1,1)
It provides an aggregate measure of performance across all possible classification thresholds.
AUC ranges in value from 0 to 1.
A model whose predictions are 100% wrong has an AUC of 0.0, which means it ranks every negative above every positive (the classes are perfectly inverted)
A model whose predictions are 100% correct has an AUC of 1.0, which means it has an ideal measure of separability
When AUC is 0.5, it means the model has no class separation capacity
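One way to make these values concrete is to use the ranking view of AUC: the fraction of (positive, negative) pairs in which the positive receives the higher score, with ties counted as one half. This is equivalent to the area under the ROC curve. The sketch below reuses the ten scores from the earlier example tables; the "inverted" labelling is a hypothetical case added for illustration, and `auc_by_ranking` is a name introduced here.

```python
def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == "+"]
    neg = [s for s, y in zip(scores, labels) if y == "-"]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.95, 0.93, 0.87, 0.85, 0.83, 0.80, 0.76, 0.53, 0.43, 0.25]

separated = list("+++++-----")    # all positives score above all negatives
alternating = list("+-+-+-+-+-")  # classes interleaved
inverted = list("-----+++++")     # hypothetical: all negatives score above all positives

print(auc_by_ranking(scores, separated))    # 1.0
print(auc_by_ranking(scores, alternating))  # 0.6
print(auc_by_ranking(scores, inverted))     # 0.0
```

The interleaved case sits close to 0.5, the value expected from random guessing.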
AUC (Area Under the Curve)
https://paulvanderlaken.com/2019/08/16/roc-auc-precision-and-recall-visually-explained/
AUC (Area Under the Curve) Interpretation
Example
The red distribution curve represents the positive class (patients with disease) and the green distribution curve represents the negative class (patients with no disease).
This is an ideal situation. When the two curves don’t overlap at all, the model has an ideal measure of separability.
It is perfectly able to distinguish between positive class and negative class.
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
AUC (Area Under the Curve) Interpretation
Example
When AUC is 0.7, there is a 70% chance that the model will rank a randomly chosen positive example above a randomly chosen negative example.
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
AUC (Area Under the Curve) Interpretation
Example
This is the worst situation. When AUC is approximately 0.5, the model has no discrimination capacity to distinguish between positive class and negative class.
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
AUC (Area Under the Curve) Interpretation
Example
When AUC is approximately 0, the model is actually inverting the classes: it predicts the negative class as positive and vice versa.
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
AUC (Area Under the Curve) Interpretation
Example
Which of the following ROC curves produce AUC values greater than 0.5?
a
b
c
d
e
https://developers.google.com/machine-learning/crash-course/classification/check-your-understanding-roc-and-auc
A -> This ROC curve has an AUC between 0.5 and 1.0, meaning it ranks a random positive example higher than a random negative example more than 50% of the time. Real-world binary classification AUC values generally fall into this range.
E -> Best possible ROC curve, as it ranks all positives above all negatives. It has an AUC of 1.0.
In practice, if you have a "perfect" classifier with an AUC of 1.0, you should be suspicious, as it likely indicates a bug in your model. For example, you may have overfit to your training data, or the label data may be replicated in one of your features.
ROC vs. PRC
The main difference between ROC curves and precision-recall curves is that the number of true-negative results is not used for making a PRC
| Curve | x-axis: Concept | x-axis: Calculation | y-axis: Concept | y-axis: Calculation |
| Precision-recall (PRC) | Recall | TP / (TP + FN) | Precision | TP / (TP + FP) |
| Receiver Operating Characteristic (ROC) | False Positive Rate | FP / (FP + TN) | Recall / Sensitivity / True Positive Rate | TP / (TP + FN) |
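A small numerical check of this difference, using hypothetical counts: TP, FN and FP are held fixed while only the number of true negatives changes. The helper name `rates` is an assumption made for this sketch.

```python
def rates(tp, fn, fp, tn):
    """Return precision, recall (TPR) and false positive rate from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }

# Hypothetical counts: identical TP, FN, FP; only the number of true negatives differs.
few_negatives = rates(tp=8, fn=2, fp=4, tn=16)
many_negatives = rates(tp=8, fn=2, fp=4, tn=196)

print(few_negatives)   # fpr = 0.2
print(many_negatives)  # fpr = 0.02, precision and recall unchanged
```

The PRC point stays where it is, while the ROC point moves left as true negatives are added. This is why a PRC can be more informative when the negative class is very large.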
Intersection over Union (IoU) for object detection
http://arogozhnikov.github.io/2015/10/05/roc-curve.html
Is object detection classification or regression?
http://arogozhnikov.github.io/2015/10/05/roc-curve.html
What is Intersection over Union?
Definition
Intersection over union (IoU) is an evaluation metric used to measure the accuracy of an object detector on a particular dataset.
It is a number from 0 to 1 that specifies the amount of overlap between the predicted and ground-truth bounding boxes.
It is known to be the most popular evaluation metric for tasks such as segmentation, object detection and tracking.
https://towardsdatascience.com/intersection-over-union-iou-calculation-for-evaluating-an-image-segmentation-model-8b22e2e84686
What is Intersection over Union?
Which one is best?
https://towardsdatascience.com/intersection-over-union-iou-calculation-for-evaluating-an-image-segmentation-model-8b22e2e84686
Intersection over Union
An IoU of 0 means that there is no overlap between the boxes
An IoU of 1 means that the union of the boxes is the same as their overlap, indicating that they are completely overlapping
The lower the IoU, the worse the prediction result
https://towardsdatascience.com/iou-a-better-detection-evaluation-metric-45a511185be1
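A minimal sketch of the computation, IoU = area of intersection / area of union, assuming axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates. That box format and the example boxes are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0

ground_truth = (0, 0, 10, 10)               # hypothetical hand-labeled box
print(iou(ground_truth, (0, 0, 10, 10)))    # identical boxes: 1.0
print(iou(ground_truth, (5, 0, 15, 10)))    # half overlap: 50 / 150
print(iou(ground_truth, (20, 20, 30, 30)))  # no overlap: 0.0
```

Note that a prediction covering half of the ground-truth box scores only about 0.33, because the union grows as the boxes drift apart.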
Intersection over Union
In order to apply Intersection over Union to evaluate an (arbitrary) object detector we need:
The ground-truth bounding boxes (i.e., the hand-labeled bounding boxes that specify where in the image our object is).
The predicted bounding boxes from our model.
Intersection over Union
Your goal is to take the training images and bounding boxes, construct an object detector, and then evaluate its performance on the testing set.
An Intersection over Union score > 0.5 is normally considered a “good” prediction.
Confidence Intervals
http://arogozhnikov.github.io/2015/10/05/roc-curve.html
Confidence Interval for Accuracy
Definition
Confidence, in statistics, is a way to describe probability.
A confidence interval is the range of values you expect your estimate to fall between if you redo your test, within a certain level of confidence.
For example: if you construct a confidence interval with a 95% confidence level, you are confident that 95 out of 100 times the estimate will fall between the upper and lower values specified by the confidence interval.
Confidence level = 1 − α
https://www.scribbr.com/statistics/confidence-interval/
Confidence Interval for Accuracy
Prediction can be regarded as a Bernoulli trial
A Bernoulli trial has 2 possible outcomes
Possible outcomes for prediction: correct or wrong
A collection of Bernoulli trials has a Binomial distribution:
x ~ Bin(N, p), where x is the number of correct predictions
Example: Toss a fair coin 50 times, how many heads would turn up?
Expected number of heads = N × p = 50 × 0.5 = 25
Classification accuracy: acc = x / N
Given x (# of correct predictions), or equivalently acc = x / N, and N (# of test instances), can we predict p (the true accuracy of the model)?
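Confidence Interval for Accuracy
For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1-p) / N
Confidence Interval for p:
$p = \frac{2 N \cdot acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2} \sqrt{Z_{\alpha/2}^2 + 4 N \cdot acc - 4 N \cdot acc^2}}{2 (N + Z_{\alpha/2}^2)}$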
Confidence Interval for Accuracy
Example
Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
N = 100, acc = 0.8
Let 1 − α = 0.95 (95% confidence)
From the probability table, $Z_{\alpha/2} = 1.96$
Confidence interval for p at acc = 0.8 and 95% confidence, for different test set sizes N:
| N | 50 | 100 | 500 | 1000 | 5000 |
| p(lower) | 0.670 | 0.711 | 0.763 | 0.774 | 0.789 |
| p(upper) | 0.888 | 0.866 | 0.833 | 0.824 | 0.811 |
Standard Normal distribution:
| 1 − α | $Z_{\alpha/2}$ |
| 0.99 | 2.58 |
| 0.98 | 2.33 |
| 0.95 | 1.96 |
| 0.90 | 1.65 |
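The bounds can be computed with a short sketch. It assumes the interval obtained by solving the normal approximation (mean p, variance p(1-p)/N) for p, which gives the closed form (2·N·acc + Z² ± Z·sqrt(Z² + 4·N·acc − 4·N·acc²)) / (2(N + Z²)). That closed form is an assumption here, but it agrees with the tabulated values for acc = 0.8 to within rounding. The function name is hypothetical.

```python
from math import sqrt

def accuracy_confidence_interval(acc, n, z=1.96):
    """Return (lower, upper) bounds on the true accuracy p, using the normal approximation."""
    center = 2 * n * acc + z ** 2
    spread = z * sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (center - spread) / denom, (center + spread) / denom

# acc = 0.8 at 95% confidence (z = 1.96), for the test set sizes used in the table.
for n in (50, 100, 500, 1000, 5000):
    lower, upper = accuracy_confidence_interval(acc=0.8, n=n)
    print(f"N = {n:5d}: p in [{lower:.3f}, {upper:.3f}]")
```

The interval narrows as N grows: the same measured accuracy is a more reliable estimate of p when it comes from a larger test set.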
Performance Metrics for Multiclass AI
Turn it into Binary Classification
{red, blue, green, yellow} n = 4
One vs. All (Rest): n classes, n binary classifiers
Binary Classification Problem 1: red vs [blue, green, yellow]
Binary Classification Problem 2: blue vs [red, green, yellow]
Binary Classification Problem 3: green vs [red, blue, yellow]
Binary Classification Problem 4: yellow vs [red, blue, green]
One vs. One: n classes, n(n-1)/2 binary classifiers
Binary Classification Problem 1: red vs. blue
Binary Classification Problem 2: red vs. green
Binary Classification Problem 3: red vs. yellow
Binary Classification Problem 4: blue vs. green
Binary Classification Problem 5: blue vs. yellow
Binary Classification Problem 6: green vs. yellow
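Both decompositions can be enumerated with a short sketch. The variable names are illustrative; the class list is the four-colour example from the slide.

```python
from itertools import combinations

classes = ["red", "blue", "green", "yellow"]

# One vs. All (Rest): each class against all remaining classes -> n problems.
one_vs_rest = [(c, [other for other in classes if other != c]) for c in classes]

# One vs. One: every unordered pair of classes -> n(n-1)/2 problems.
one_vs_one = list(combinations(classes, 2))

for i, (positive, rest) in enumerate(one_vs_rest, start=1):
    print(f"OvR problem {i}: {positive} vs {rest}")
for i, (a, b) in enumerate(one_vs_one, start=1):
    print(f"OvO problem {i}: {a} vs. {b}")
```

For n = 4 this gives 4 one-vs-rest problems and 6 one-vs-one problems. The number of one-vs-one problems grows quadratically with n, which matters when there are many classes.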
Performance Metrics for Multiclass AI
Turn it into Binary Classification
One vs. All (Rest)
One vs. One
Breakout Session
One vs. All (Rest)
One vs. One
Accuracy: $\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$
Precision (p): $p = \frac{a}{a + c}$
Recall (r): $r = \frac{a}{a + b}$
F-measure (F): $F = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$
Weighted Accuracy: $\frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$
True positive rate: $TPR = \frac{a}{a + b}$
False positive rate: $FPR = \frac{c}{c + d}$
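The F-measure and weighted accuracy can be checked with a short sketch, using the confusion-matrix letters a (TP), b (FN), c (FP), d (TN). It assumes F = 2rp / (r + p) with p = a / (a + c) and r = a / (a + b), and weighted accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d). The counts and weights below are hypothetical.

```python
def f_measure(a, b, c):
    """F-measure from counts a = TP, b = FN, c = FP, computed as 2rp / (r + p)."""
    p = a / (a + c)
    r = a / (a + b)
    return 2 * r * p / (r + p)

def weighted_accuracy(a, b, c, d, w1=1, w2=1, w3=1, w4=1):
    """Weighted accuracy; with all weights equal to 1 it reduces to plain accuracy."""
    return (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)

# Hypothetical counts for illustration.
a, b, c, d = 40, 10, 20, 30
print(f_measure(a, b, c))                   # same value as 2a / (2a + b + c)
print(2 * a / (2 * a + b + c))
print(weighted_accuracy(a, b, c, d))        # plain accuracy: 0.7
print(weighted_accuracy(a, b, c, d, w2=5))  # false negatives weighted five times: 0.5
```

Raising the weight on b (false negatives) lowers the score of a model that misses positives, which is one way to encode that a missed case is more costly than a false alarm.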