Learning Outcome PPT for the Course Assignment

Introduction

In this paper, I will use the cardUpgrade dataset to run the K nearest neighbors (KNN) algorithm in JMP and predict whether a customer will upgrade to platinum status. This dataset has three attributes:

a) UpGrade – a nominal categorical column indicating whether the customer upgraded; this is the label the model predicts.

b) Purchases – a numeric column that describes the monetary value of purchases made by each customer.

c) PlatProfile – another nominal categorical column, recorded when customers signed up, indicating whether they fit the profile of a platinum member.

K Nearest Neighbors (KNN)

The K nearest neighbors algorithm is a supervised, non-parametric, lazy learning algorithm applicable to either classification or regression tasks (Okfalisa et al., 2017). First, a supervised machine learning algorithm relies on labeled training input to learn how to produce the desired output for unlabeled data. Secondly, non-parametric means that the algorithm makes no fixed assumptions about the form or distribution of the underlying data; any model built on it relies entirely on the training data it is fed. Thirdly, the algorithm is a lazy learner, meaning that it does no generalization up front: training is minimal, and almost all the work is deferred to prediction time. In essence, as opposed to most machine learning algorithms, a KNN model consults the training data directly each time it makes a prediction. KNN classifies a single data point by comparing it to the points it is closest and most similar to, on the assumption that similar items exist in close proximity.
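The nearest-neighbor voting described above can be sketched in a few lines of Python. This is a toy illustration with invented points and function names, not the JMP implementation:

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train_points` is a list of numeric tuples and `train_labels` the matching
    class labels; Euclidean distance measures similarity.
    """
    # Lazy learning: all work happens at prediction time. Compute the distance
    # from the query to every stored training point and sort by proximity.
    distances = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    # Majority vote among the k closest neighbors decides the class.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Invented example: low purchases tend to mean "no upgrade" (0),
# high purchases "upgrade" (1).
points = [(10,), (12,), (15,), (40,), (42,), (45,)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(points, labels, (41,), k=3))  # → 1
```

Because the model is just the stored training set, prediction cost grows with the number of observations — acceptable here, where the dataset has only forty rows.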

KNN suits various ML tasks including decision-making (as in this task), recommender systems, and image recognition. For this particular task, I will use JMP to apply KNN to a customer dataset. The model will compare each customer's initial profile attribute and purchase history against their current upgrade status in order to predict whether other customers are likely to upgrade. Because this is a classification task, the output is a discrete value showing whether a customer will upgrade or not; there is no middle ground, which is why the values are binary, that is, 0 or 1. The model will have two predictors and a label. The output from this model is also nominal, meaning it represents the upgrade status of an individual; while it takes the form of numerical values (0 or 1), these numbers are only representational and have no mathematical meaning (Ghattas et al., 2017).

KNN Scheme on JMP

First, I imported the Excel dataset into JMP. The application treats the two categorical columns as numeric, so they must be set back to nominal in JMP. While this step is small, it is a form of data cleaning, an essential part of creating a machine learning model. Other data cleaning tasks typically include identifying and removing duplicate records and deciding how to handle missing values.

The second step is to conduct an exploratory data analysis (EDA). According to Jebb et al. (2017), this stage helps to decide the algorithm and variables to be used. EDA also involves data visualization, which gives a cursory insight into the observations. For instance, below is a bubble plot of UpGrade status against Purchases. It shows that those who did not upgrade are concentrated at the lower end of purchases, while those who did upgrade made higher purchase volumes. It also shows some outliers that would normally be handled. I have chosen to ignore these outliers because there is only one per class, meaning they would have minimal impact on the model. Additionally, the dataset contains only forty observations, so there is very little room for dropping further observations.
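The same cleaning steps — recasting the categorical columns and checking for duplicates and missing values — can be sketched in pandas. The data values here are invented stand-ins for the real cardUpgrade sheet:

```python
import pandas as pd

# Toy stand-in for the cardUpgrade sheet; the column names come from the
# dataset, but these rows are invented for illustration.
df = pd.DataFrame({
    "UpGrade":     [1, 0, 1, 0, 0],
    "Purchases":   [39.9, 12.5, 44.2, 12.5, None],
    "PlatProfile": [1, 0, 1, 0, 1],
})

# Recast the two categorical columns that the import treated as numeric.
df["UpGrade"] = df["UpGrade"].astype("category")
df["PlatProfile"] = df["PlatProfile"].astype("category")

# Typical cleaning checks: exact duplicate rows and missing values.
print(df.duplicated().sum())         # → 1 (one repeated row)
print(df["Purchases"].isna().sum())  # → 1 (one missing purchase amount)
```

In JMP the recast is done through the column's modeling type; the pandas version above simply mirrors the idea in code.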

I also visualized the distribution of the UpGrade column to determine the class percentages. The graph shows that the classes are almost evenly split (55% non-upgrades and 45% upgrades).

Splitting the Data

For most machine learning modeling tasks, the data need only be split into training and testing sets. The train-test approach avoids evaluating the model on the training data, which would yield a biased score. Kuhn and Johnson (2013, p. 67) state that it is pragmatic for the model to be evaluated on data that was used neither to build nor to fine-tune it. Some workflows, KNN included, produce better results when the data is split three ways, adding a validation set. This third set resembles another test set, but it is used to tune the hyperparameters of the model. However, Touvron et al. (2020) elaborate that splitting the data into two or three sets only produces optimal results when the dataset is large enough that each class that could potentially be observed is represented in every set.

Since this dataset has only forty observations, it cannot be split three ways without compromising model performance. It is instead split into training and test sets at a ratio of 4:1, giving 32 observations for training and 8 for testing. The KNN algorithm also expects a k value, which is typically chosen near the square root of the total number of observations (√40 ≈ 6.3). When there are only two classes to predict, standard practice is to pick an odd number to avoid ties during majority voting. For this reason, I have picked 7 as the value of k.
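The 4:1 split and the rule of thumb for k can be illustrated with a short standard-library sketch (the shuffle seed is arbitrary; only the split sizes and the k rule are taken from the text):

```python
import math
import random

random.seed(42)  # arbitrary seed, for a reproducible shuffle

n = 40
indices = list(range(n))
random.shuffle(indices)

# 4:1 split — 80% of the rows for training, 20% held out for testing.
cut = int(0.8 * n)
train_idx, test_idx = indices[:cut], indices[cut:]
print(len(train_idx), len(test_idx))  # → 32 8

# Rule of thumb: k ≈ sqrt(n), rounded to the nearest odd number so a
# two-class majority vote cannot tie.
k = round(math.sqrt(n))
if k % 2 == 0:
    k += 1
print(k)  # → 7
```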

Confusion Matrix Interpretation

The confusion or error matrix enhances the understanding of a classification task: simply reporting the accuracy of the model does not fully capture the performance of the algorithm (Okfalisa et al., 2017). The precision and recall scores are calculated from the entries of the confusion matrix. In the matrix, each predicted value (positive or negative) is compared against the corresponding actual value. In the end, there are four possibilities:

a) True Positive – values the model predicts as positive that are actually positive.

b) True Negative – values the model predicts as negative that are actually negative.

c) False Positive – values predicted as positive that are actually negative.

d) False Negative – values predicted as negative that are actually positive.

In this case, these values would be interpreted as follows:

a) True Positive (TP) – customers who did upgrade to platinum and whom the model correctly predicted to upgrade.

b) True Negative (TN) – customers who did not upgrade and whom the model correctly predicted not to upgrade.

c) False Positive (FP) – customers the model predicted to upgrade who did not actually upgrade.

d) False Negative (FN) – customers the model predicted not to upgrade who actually did upgrade.

The matrix for this model is as in the image below:

This means that the model produces 2 TPs, 5 TNs, 1 FP, and 0 FNs. Thus, only one of the eight test predictions is incorrect, giving the model an accuracy score of 87.5%.
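As a sanity check, the metrics can be recomputed from counts consistent with the reported result (7 of 8 correct); the assumption that the single error is a false positive is mine:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Precision: of everyone predicted to upgrade, how many actually did.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everyone who actually upgraded, how many were caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Counts assumed consistent with the reported 7-of-8 result.
acc, prec, rec = confusion_metrics(tp=2, tn=5, fp=1, fn=0)
print(acc)  # → 0.875
```

This also shows why accuracy alone is not the whole story: the same matrix yields a precision of about 0.67 even though accuracy is 87.5%.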

PART 2

Using the Bayes theorem formula, P(B|A) = P(A|B) · P(B) / P(A), we can calculate the probability that a customer who has upgraded to platinum also makes purchases of 32.450 or more (Rouder & Morey, 2018).

Below are the notations for the probabilities:

P(A) – the probability that a customer upgrades to platinum (45%)

P(B) – the probability that a customer makes purchases equal to or more than 32.450 (52.5%)

P(A|B) – the probability that a customer upgrades given that he makes purchases equal to or above 32.450 (79.32%)

P(B|A) – the probability that a customer makes purchases equal to or above 32.450 given that he has upgraded.

Thus P(B|A) = 0.9254, that is, 92.54%, which is a significant improvement over the 87.5% accuracy. Therefore, applying the Bayes theorem formula improves on the model's result.
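The calculation is a one-line application of Bayes' theorem using the probabilities listed above, and can be verified directly:

```python
# Bayes' theorem: P(B|A) = P(A|B) * P(B) / P(A), with the paper's estimates.
p_a = 0.45          # P(A): customer upgrades to platinum
p_b = 0.525         # P(B): purchases >= 32.450
p_a_given_b = 0.7932  # P(A|B): upgrades, given high purchases

p_b_given_a = p_a_given_b * p_b / p_a
print(round(p_b_given_a, 4))  # → 0.9254
```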

References

Bowerman, B., Drougas, A. M., Duckworth, A. G., Hummel, R. M., Moniger, K. B., & Schur, P. J. (2019). Business statistics and analytics in practice (pp. 186–189). Mcgraw-Hill Education.

Ghattas, B., Michel, P., & Boyer, L. (2017). Clustering nominal data using unsupervised binary decision trees: Comparisons with the state of the art methods. Pattern Recognition, 67, 177–185. https://doi.org/10.1016/j.patcog.2017.01.031

Jebb, A. T., Parrigon, S., & Woo, S. E. (2017). Exploratory data analysis as a foundation of inductive research. Human Resource Management Review, 27(2), 265–276. https://doi.org/10.1016/j.hrmr.2016.08.003

Kuhn, M., & Johnson, K. (2013). Applied predictive modeling (p. 67). Springer Science & Business Media.

Okfalisa, Gazalba, I., Mustakim, & Reza, N. G. I. (2017). Comparative analysis of k-nearest neighbor and modified k-nearest neighbor algorithm for data classification. 2017 2nd International Conferences on Information Technology, Information Systems and Electrical Engineering (ICITISEE). https://doi.org/10.1109/icitisee.2017.8285514

Rouder, J. N., & Morey, R. D. (2018). Teaching Bayes’ Theorem: Strength of Evidence as Predictive Accuracy. The American Statistician, 73(2), 186–190. https://doi.org/10.1080/00031305.2017.1341334

Touvron, H., Vedaldi, A., Douze, M., & Jégou, H. (2020). Fixing the train-test resolution discrepancy. ArXiv:1906.06423 [Cs]. https://arxiv.org/abs/1906.06423