Rapid Miner
MIS772 Predictive Analytics Individual Task of Assignment A2-LP4
Assignment A2: Text Mining + Neural Nets + Optimisation in RapidMiner
Student Name (as per record): AJITHA MONISHA CHANDRAN
Student No: 217444606
Group No: 36
My other group members
Student Name (as per record): NIHAL FARHAN MOHAMMED
Student No: 217445596
Exceptional Meets expectations Issues noted Improve Unacceptable
Exec Report
Create Models
Evaluate & Improve
Provide Solution
Research & Extend
Brief Comments
Days Late
≤ 5
Penalty
- Days Late X 5
Total
Include: Report and RMP files, with clear comments supplied to (easily) reproduce reported results.
Executive summary (one page)
Business Problem
The legal firm Righteous Compensation Lawyers (RCL) faces a management problem: every misclassified claim adds cost. RCL pays 30% more than the standard fee when a non-fraud claim is misclassified as fraud, and 50% more when a fraud claim is misclassified as non-fraud. The main objective is therefore to put the best predictive model into practice, trained and tested on the company's data from previous years, so that claims are categorised as accurately as possible. A further requirement is to adjust the costs during model creation: using text reduces the cost incurred per claim by 10%, which leaves 20% extra cost for a misclassified non-fraud claim and 40% extra cost for a misclassified fraud claim.
Solution to Business Problem
After exploring several models (K-NN, Gradient Boosted Tree and Neural Network), we found the Neural Network to be the best performing predictive model in our case study for categorising fraudulent and non-fraudulent claims from text data. The Performance Vectors generated by the Neural Network model show an accuracy of 98.73% in classifying fraud claims using text data alone, and 98.98% using text and structured data combined. The misclassification rate of fraud claims is only 1.27% for text data analysis and 1.13% for text and structured data analysis. Our model detected 39 fraudulent claims using text data analysis and 41 fraudulent claims using text and structured data analysis.
The Area Under Curve (AUC) values are 0.86 and 0.92. Values close to 1, the best possible, indicate a strong predictive model. Hence, the Neural Network is the best model for recognising and classifying claims as fraud or non-fraud.
Extension
Before deploying the predictive model, RCL should consider training the Neural Network model with similar datasets received from claimants. This is expected to raise accuracy towards 99% and push the misclassification rate towards 0%. Deep learning models could be implemented for further enhancement and higher performance when classifying claims and seeking insight into the claims made by claimants. Additional tools such as Python and R can be used to extend the analytical capabilities of the model for RCL.
When we compare the two models, text only and mixed (text plus structured data), we can see that accuracy is high for both, as are the Kappa values.
Figure A: Neural Network’s Performance Vector of Text Data Analysis
Figure B: Neural Network’s Performance Vector of Structured and Text Data Analysis
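The accuracy, classification error and Kappa values reported in a Performance Vector all derive from the confusion matrix. The following is a minimal sketch of that arithmetic, not RapidMiner's own implementation. The function name `summarise` and the counts passed to it are hypothetical and chosen only for illustration; they are not the figures from our models.

```python
def summarise(tp, fn, fp, tn):
    """Return accuracy, classification error and Cohen's kappa for a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Agreement expected by chance: product of the actual and predicted
    # proportions of each class, summed over both classes.
    p_fraud = ((tp + fn) / total) * ((tp + fp) / total)
    p_non_fraud = ((fp + tn) / total) * ((fn + tn) / total)
    expected = p_fraud + p_non_fraud
    kappa = (accuracy - expected) / (1 - expected)
    return accuracy, 1 - accuracy, kappa


# Hypothetical counts (NOT taken from the report's figures).
acc, err, kappa = summarise(tp=40, fn=10, fp=5, tn=945)
print(f"accuracy={acc:.3f} error={err:.3f} kappa={kappa:.3f}")
```

Kappa is worth reading alongside accuracy on imbalanced data, because it discounts the agreement a model would reach by chance.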
Create a Model(s) in RapidMiner (two pages / page 1)
We first constructed our predictive model for text data analysis. For this, we used the Neural Network model three times, varying the operators each time, so that we could compare the performance vectors and identify the best performing model. After reading the input data file, we used the ‘Subprocess’ operator. Inside it, the ‘Set Role’ operator sets the role of Fraud Claims to label, and the other attributes are assigned roles according to their use in the analysis. Only Adjustor Notes and Fraud Claims were used as attributes: Adjustor Notes contains the text required for text analysis, and Fraud Claims is the label. The text data are then converted to numeric data, because a Neural Network can only read numeric data. ‘Split Data’ splits the data into two parts: 70% for training and 30% for testing. The Neural Network has five hidden layers, and this multi-layer perceptron is trained inside the operator. We set the cost of a correctly classified claim (fraud or non-fraud) to 0.9, the cost of a misclassified fraud claim to 1.4 and the cost of a misclassified non-fraud claim to 1.2. These values reflect the 10% cost reduction obtained by using only text data in the prediction of fraud claims.
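The two preparation steps described above, turning text into numbers and splitting 70/30, can be sketched in plain Python. This is a simplified illustration and not the RapidMiner process itself: it uses raw term counts, whereas the RapidMiner text operators may be configured with other weightings such as TF-IDF. The notes and labels below are made up for illustration, and `tokenize`, `vectorise` and `split_data` are hypothetical helper names.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it on non-letter characters."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def vectorise(documents):
    """Turn each document into a list of term counts over a shared vocabulary."""
    vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


def split_data(rows, labels, train_ratio=0.7, seed=42):
    """Shuffle the examples and split them into training and testing parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_ratio * len(indices)))
    train = [(rows[i], labels[i]) for i in indices[:cut]]
    test = [(rows[i], labels[i]) for i in indices[cut:]]
    return train, test


# Made-up adjustor notes and labels, for illustration only.
notes = [
    "Claimant slipped on wet floor at work",
    "Witness statement contradicts claimant",
    "Back injury confirmed by doctor",
    "Claim lodged two days after policy start",
    "Injury consistent with reported accident",
    "Claimant refused medical examination",
    "Minor burn treated on site",
    "Documents appear altered",
    "Fracture confirmed by x-ray",
    "Employer confirms the incident",
]
labels = ["No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No", "No"]

vocabulary, rows = vectorise(notes)
train, test = split_data(rows, labels)
print(len(vocabulary), "terms;", len(train), "training rows;", len(test), "testing rows")
```

Each row has one numeric column per vocabulary term, which is the kind of input a Neural Network can read.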
Figure C: Neural Network Model Design for Text Data Analysis
Figure D: Subprocess of Neural Network Model Design for Structured Data Analysis
We performed the same analysis with a Gradient Boosted Tree, using the same operators as in the Neural Network model.
Figure E: Gradient Boosted Tree Model Design for Text Data Analysis
Figure F: Subprocess of Gradient Boosted Tree Model Design for Structured Data Analysis
The performance vectors of the two models let us identify the best model for our case study. The accuracy of the Gradient Boosted Tree is 98.1%, whereas the Neural Network’s accuracy is 98.7%. The Neural Network also has the lowest misclassification rate of the models compared. We attribute this to how each model learns: the Neural Network builds primarily on logistic regression, K-NN considers only the Euclidean distance between points, and the Gradient Boosted Tree considers only a few trees, with the optimum found at one tree. The Neural Network also keeps adjusting its model with every change in the training dataset.
Figure G: Neural Network’s Performance Vector
Figure H: Gradient Boosted Tree’s Performance Vector
The same model is applied to the structured data to analyse the predictions. Because the class distribution is highly imbalanced, we used the SMOTE operator, which generates synthetic examples of the minority class to balance the training dataset.
The figure shows an accuracy of 99% and a classification error rate of 1.003%. On structured data this imbalanced, however, these figures alone are not a good indication of performance.
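The idea behind SMOTE can be sketched as follows: each synthetic example is placed at a random point on the line between a minority-class example and one of its nearest minority-class neighbours. This is a minimal sketch of that interpolation idea under simplifying assumptions, not the RapidMiner operator. The function name `smote_like`, its parameters and the sample points are hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical numeric attributes of four fraud (minority) examples.
fraud = [[1.0, 5.0], [1.5, 4.5], [2.0, 5.5], [1.2, 6.0]]
new_points = smote_like(fraud, n_new=6)
print(len(new_points), "synthetic fraud examples generated")
```

Synthetic examples should be added to the training data only, so that the testing data keeps the original class distribution.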
Extension
Here we analyse the prediction using a combination of text and structured data as the input file of the ‘Read CSV’ operator. The steps of both the text analysis and the structured data analysis were implemented in the model, as shown in the figure, so that the Neural Network model runs successfully.
Ensembles are used in the following model. The MetaCost ensemble can hold 2-3 operators and lets us fix the cost of true positives at 0.9 and the cost of false negatives, for both fraud and non-fraud claims, at 1.2. MetaCost performs far better than regression, bagging and other ensembles, as the Neural Network builds primarily on logistic regression. A major disadvantage of MetaCost is that it takes around 7-8 hours to calculate the results, and if the run is interrupted by an error, the process does not resume but starts again from step 1. Nevertheless, MetaCost is suggested for better results from the model.
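Cost-sensitive classification can be illustrated with the costs set earlier in this report for the text model: 0.9 for a correctly classified claim, 1.4 for a misclassified fraud claim and 1.2 for a misclassified non-fraud claim. The sketch below shows the minimum-expected-cost decision rule that cost-sensitive wrappers such as MetaCost build on. It is a simplified illustration, not the MetaCost algorithm or the RapidMiner operator. The function names, the probabilities and the confusion-matrix counts are hypothetical.

```python
# COST[actual][predicted], using the cost values stated in this report.
COST = {
    "fraud": {"fraud": 0.9, "non-fraud": 1.4},
    "non-fraud": {"fraud": 1.2, "non-fraud": 0.9},
}


def expected_cost(p_fraud, predicted):
    """Expected cost of predicting `predicted` when the claim is fraud with probability p_fraud."""
    return p_fraud * COST["fraud"][predicted] + (1 - p_fraud) * COST["non-fraud"][predicted]


def cheapest_label(p_fraud):
    """Pick the label with the lowest expected cost."""
    return min(("fraud", "non-fraud"), key=lambda label: expected_cost(p_fraud, label))


def total_cost(confusion):
    """Sum the cost over a confusion matrix given as {(actual, predicted): count}."""
    return sum(count * COST[actual][predicted] for (actual, predicted), count in confusion.items())


# Hypothetical confusion matrix, for illustration only.
confusion = {
    ("fraud", "fraud"): 40,
    ("fraud", "non-fraud"): 10,
    ("non-fraud", "fraud"): 5,
    ("non-fraud", "non-fraud"): 945,
}
print(cheapest_label(0.4), cheapest_label(0.3), total_cost(confusion))
```

With these costs, a claim is cheaper to flag as fraud once its predicted fraud probability exceeds 0.375, which is lower than the usual 0.5 threshold because a missed fraud claim costs more than a false alarm.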
Figure I: Gradient Boosted’s Design for Text Data
Figure J: Subprocess of Gradient Boosted’s Design for Text Data
Figure K: Performance Vector of Neural Network’s Structured Data Analysis
Figure L: Neural Network’s Structured & Text Data Analysis
Figure M: Subprocess of Neural Network’s Structured & Text Data Analysis
Figure N: Performance Vector of Neural Network’s Structured & Text Data Analysis
Figure O & P: Ensembles model
Evaluate and Improve the Model(s) in RapidMiner (two pages / page 1)
To reduce the claims processing cost, we optimised the performance of our model with the ‘Cross Validation’ operator and used the ‘Cost Performance’ operator to analyse the performance vector on the basis of cost. The optimised models can be compared in the figures below, where the input data file contains only text data and the number of folds in Cross Validation is 10. Cross validation is applied to the other models as well. We used the same process of changing the Subprocess to text analytics as described in the previous tasks. The same process is used for analysing structured data, with the ‘Nominal to Numerical’ operator added to the Subprocess. Below are the Neural Network and Gradient Boosted Tree models and their performance vectors for text data.
Figure Q & R: Model optimisation of Neural Networks (Cross Validation)
Below is the Gradient Boosted model and its performance vector for structured data analysis using Cross Validation.
Figure S: Performance Vector of Cross Validation model
Figure T: Subprocess of Cross Validation
The above model is then used for the combination of text and structured data.
Figure U: Subprocess of Cross Validation
Figure V: Performance Vector of Cross Validation
With optimisation through Cross Validation, the Neural Network produced the best results in comparison with the other two models. The misclassification rate fell from 0.35% to 0.21% and accuracy rose from 98.3% to 99.01%. Hence, we can conclude that the Neural Network is the best model so far, and that Cross Validation has enhanced its performance even further for structured and unstructured data.
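The 10-fold cross validation used here can be sketched as follows: the examples are shuffled and dealt into 10 folds, each fold is held out once for testing while the other nine are used for training, and the scores are averaged. This is a minimal sketch of the general technique, not the internals of the RapidMiner operator. The `evaluate` callback is a hypothetical placeholder for "train a model on the training indices and score it on the testing indices".

```python
import random


def k_fold_indices(n_examples, k=10, seed=1):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_examples, evaluate, k=10):
    """Average the score returned by evaluate(train, test) over all k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_examples, k)]
    return sum(scores) / len(scores)


# Every example is used for testing exactly once across the 10 folds.
splits = list(k_fold_indices(95))
print(len(splits), "folds;", sorted(len(test) for _, test in splits))
```

Because every claim is tested exactly once, the averaged score is less dependent on one lucky or unlucky split than a single 70/30 split.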
Extension
Hyperparameter optimisation, which acts as a loop around the Neural Network, is used to improve model performance. It is used to fix the number of training cycles that we need to give to the Neural Network model. The operator keeps re-running the model until we reach our desired accuracy and the lowest classification error rate. The loop function of the hyperparameter optimisation is shown in the figure.
The figure shows the result generated at each iteration of the Neural Network. The grid values of Optimize Parameters can be used to compare the results (accuracy, classification error and AUC) at various numbers of cycles. For example, at 700 cycles accuracy reached 97.8% with a classification error rate of 0.022.
The model and its Performance Vector when MetaCost is used as the ensemble are shown in the figure above. The accuracy of the Neural Network increased from 98% to 99.7%, while the misclassification rate decreased.
The ROC curve is also much higher for the Neural Network than for the other two models, i.e. Gradient Boosted Tree and K-NN.
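The grid search over training cycles described in this extension can be sketched as a loop that trains a model once per grid value and records the resulting accuracy. To keep the example short and self-contained, the sketch swaps in a single logistic unit trained by gradient descent in place of the multi-layer Neural Network. The toy data, the learning rate and the grid values are hypothetical and are not the settings or results from our RapidMiner process.

```python
import math


def train_logistic(data, cycles, rate=0.5):
    """Train a single logistic unit with batch gradient descent for `cycles` passes."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(cycles):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in data:
            p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            for i, xi in enumerate(x):
                grad_w[i] += (p - y) * xi
            grad_b += p - y
        w = [wi - rate * g / len(data) for wi, g in zip(w, grad_w)]
        b -= rate * grad_b / len(data)
    return w, b


def accuracy(model, data):
    """Share of examples whose predicted class matches the label."""
    w, b = model
    hits = sum((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == bool(y) for x, y in data)
    return hits / len(data)


def grid_search(train, test, cycle_grid):
    """Return one (cycles, accuracy) row per grid value, plus the best row."""
    rows = [(c, accuracy(train_logistic(train, c), test)) for c in cycle_grid]
    return rows, max(rows, key=lambda row: row[1])


# Hypothetical toy data: two numeric attributes, label 1 = fraud.
train = [([0.0, 0.2], 0), ([0.1, 0.0], 0), ([0.9, 1.0], 1), ([1.0, 0.8], 1)]
test = [([0.05, 0.1], 0), ([0.1, 0.05], 0), ([0.95, 0.9], 1), ([0.9, 0.95], 1)]
rows, best = grid_search(train, test, cycle_grid=[0, 10, 100, 1000])
for cycles, acc in rows:
    print(cycles, acc)
```

The table of rows plays the role of the grid values in Optimize Parameters: it shows how the score changes with the number of cycles, so the smallest adequate value can be chosen.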
Provide an Integrated Solution in RapidMiner (one page)
To classify fraud and non-fraud claims, RCL can use the predictive model built with Neural Networks, as it gave us better results than the other models. Before this model is applied to a new dataset, the already created model needs additional training. Since neural networks improve with training, RCL first has to train the model on the data already gathered from previous claims, and then make sure that the new dataset to which the model is applied has the same type and set of attributes as the training dataset.
The solution RCL can now use is the model shown below. This model performs better than all the other models. If RCL wants to improve the performance of the model, they can do so by giving the model more training cycles. The results of the model are as shown: accuracy is around 97%, with a misclassification rate of around 0.909%.
Extension
As discussed above, to apply the model to a new dataset we first have to train the predictive model on the training dataset collected from previous claims. As shown below, the model is trained using the training data and the new data is fed to the model directly.
Once the model is executed, we get the results and predictions of fraud and non-fraud claims for the new dataset. The table of predictions shows that the model is able to detect and classify fraud claims and non-fraud claims. The graph plotted from the model's results shows the attributes that are most involved in fraud claims. This helps RCL employees know which types and natures of injury to treat with more caution when classifying claims. In general the classification is done by the model, but since there is a misclassification rate of about 1%, it is always helpful for RCL to check the classification of a random sample of claims.
Further Research and Extensions in RM (one page)
Once the Neural Network model was identified as the best model for RCL, it was enhanced using the Cross Validation and Optimize Parameters operators. With that enhancement done, we can consider enhancing it in another way, using Python or other analytical tools. Here we implement a looping parameter using the ‘Execute Python’ operator, which is obtained from the Extensions tab in RapidMiner. The model is as shown below:
Once the model is executed, the Python script reads the source file and executes it if the value of i is equal to 5. Here, every value is initially assigned 1 and the loop repeats until i reaches 5. This model demonstrates the use of the ‘Execute Python’ operator. Further analysis is made in the extension part using RapidMiner.
Extension
I have tried to extract the words that are used most frequently in a resume. To do this, I used ‘Process Documents from Files’ to load the text file (.txt) containing the resume. Two operators, ‘Transform Cases’ and ‘Tokenize’, were added in its Subprocess. The ‘Transform Cases’ operator converts every letter in the resume to either lower case or upper case, so that the tool does not treat the same word written in different cases in the original file as two different words. The text is then passed to the ‘Tokenize’ operator, which splits the text of the resume into a sequence of tokens, which are words here. The splitting point is set to non-letter mode by default in the Parameters section. When the process is run, the result shows the word frequencies arranged alphabetically, as shown in the figure. It would have been beneficial to have a provision to search for a required keyword and to compare two or more resumes (text files), so that a recruiter could find the required keywords in the resumes of different candidates.
Figure W: The model diagram design
Figure X: Result of the model implementation
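The word-frequency process above (transform cases, tokenize on non-letters, count, list alphabetically) can also be sketched in Python, together with the keyword comparison across resumes that the RapidMiner process did not provide. This is an illustrative sketch outside RapidMiner, not part of the submitted process. The function names `word_frequencies` and `keyword_hits` and the two short resume texts are hypothetical.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Lower-case the text, split on non-letters and count each token (alphabetical order)."""
    tokens = [t for t in re.split(r"[^a-z]+", text.lower()) if t]
    return dict(sorted(Counter(tokens).items()))


def keyword_hits(resumes, keywords):
    """Return {resume name: {keyword: count}} for the requested keywords."""
    wanted = [k.lower() for k in keywords]
    return {
        name: {k: word_frequencies(text).get(k, 0) for k in wanted}
        for name, text in resumes.items()
    }


# Made-up resume snippets, for illustration only.
resumes = {
    "candidate_a": "Python developer. Built Python and SQL reporting tools.",
    "candidate_b": "Analyst experienced in SQL, Excel and RapidMiner.",
}
freq = word_frequencies(resumes["candidate_a"])
hits = keyword_hits(resumes, ["Python", "SQL"])
print(freq)
print(hits)
```

A recruiter could pass in the full text of each resume file and a list of required skills to see, per candidate, how often each keyword appears.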