Research paper writing help
Abstract:
Sentiment analysis, a vital aspect of natural language processing, applies machine learning models to discern the emotional tone conveyed in textual data. Typical use cases include businesses making informed decisions based on customer feedback, organizations gauging employee sentiment to inform hiring or retention decisions, and, more broadly, classifying a text by topic (for example, whether it pertains to physics or chemistry), which is useful in search engines. This study is based on a sequential model that uses an Embedding layer to transform words into dense vectors and two Long Short-Term Memory (LSTM) layers to capture sequential patterns. With a 50-dimensional embedding and 20% dropout layers, the model aims to classify sentiment in text data effectively. Rectified linear unit (ReLU) activations add non-linearity, while the Softmax activation in the output layer produces a probability for each sentiment class. Both training and test accuracy exceeded 80%.
Problem Statement:
The goal is to develop an effective sentiment analysis model for IMDb movie reviews: one that can discern nuanced sentiments, handle the intricacies of language, and classify sentiments with a high degree of accuracy. A well-trained model of this kind can be used to classify user feedback on a product, an experience, or even a software application in an agile environment where user feedback drives development. Traditional non-neural classifiers often treat each feature independently, without considering the sequential nature of language. LSTMs, designed for sequential data, can capture dependencies and relationships between words in a sequence, which is crucial for understanding nuance. A deep learning model with LSTM layers can capture relationships between words in context because its memory cells retain contextual information over time.
Some guiding research questions are: How might a model like this be used to enhance movie search results and help recommend other movies to viewers?
Once trained on the IMDb-specific dataset, can this model be adopted in other areas where similar sentiments are to be classified, e.g., product reviews on Amazon or restaurant reviews?
Description of the data set and problem setup:
The IMDb sentiment dataset is a widely used benchmark dataset in the field of natural language processing and sentiment analysis. This dataset consists of movie reviews collected from the IMDb website, a popular online movie database. The reviews are labeled with sentiment labels, indicating whether the sentiment expressed in each review is positive or negative.
It has 50,000 labeled reviews and is a balanced dataset, i.e., the numbers of positive and negative reviews are equal, as depicted in Figure 1 below.
Figure 1
The reviews vary in length, ranging from fewer than 100 words to well over 2,000 words. This can be discerned from the boxplot in Figure 2.
Figure 2
| Variable Type | Range | Encoding | Example |
|---|---|---|---|
| Input Variable (Text) | | | |
| Text (input sentence) | Variable length | Tokenization and Padding | "The movie was fantastic!" |
| Output Variable (Sentiment) | | | |
| Sentiment Label | negative or positive | One-Hot Encoding or Integer Encoding | positive: 1, negative: 0 |
Table 1: Describes the input and output characteristics of the classification models used in this study.
Data cleaning:
A check for null values found none, and 418 duplicate records were dropped from the dataset. The raw review text typically contains punctuation, hyperlinks, line-break tags (probably because the reviews were written on a web portal), stop words, proper nouns, numbers, and mixed letter casing (upper and lower case).
As part of the data-cleaning process in sentiment analysis, it is prudent to remove punctuation, hyperlinks, and numbers, normalize the case to either lower or upper, remove stop words, and lemmatize words to their root forms. These steps reduce noise and normalize the text. For example, words like “the” or “is” do not add to the sentiment, and words like “run” and “running” convey the same meaning in context. This reduces the feature space and improves the generalization of the model.
Figures 3A and 3B show the distribution of sentiment length before and after the removal of stop words, respectively. Naturally, we can discern a reduction in average text length after the removal of the stop words.
Figure 3A Figure 3B
Figure 4
Figure 4 above shows a word cloud of the positive sentiment text after the removal of stop words. A similar word cloud for negative sentiment is shown in Figure 5.
Figure 5
One thing to note in the word clouds is that many words are common to both sentiments, which naturally makes it harder for any classification model to generalize. There were also line-break tags “<br /><br />” embedded within the review text.
So, as part of the data cleaning, punctuation and “<br /><br />” tags were removed, in addition to any hyperlinks and numbers.
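The cleaning steps described above can be sketched with the Python standard library. This is a minimal illustration, not the exact pipeline used in the study: the function name `clean_review`, the regular expressions, and the tiny `STOP_WORDS` set are all assumptions made for the example, and lemmatization is left out because it needs an external library such as NLTK or spaCy.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption); a real run would use a
# fuller list such as the ones shipped with NLTK or spaCy.
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "to", "it", "this"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_review(text: str) -> str:
    """Return a lower-cased review without tags, links, digits, punctuation or stop words."""
    text = re.sub(r"<br\s*/?\s*>", " ", text, flags=re.IGNORECASE)  # line-break tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)              # hyperlinks
    text = text.lower()                                             # normalize case
    text = re.sub(r"\d+", " ", text)                                # numbers
    text = text.translate(PUNCT_TO_SPACE)                           # punctuation
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)
```

For example, `clean_review("The movie was FANTASTIC!<br /><br />Visit http://example.com in 2023.")` returns `"movie fantastic visit in"`. Tags and links are removed before punctuation so that their delimiters are still intact when the patterns run.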
Baseline models
After data cleaning, the review texts were passed through a tokenizer, which maps each word to an integer index based on its frequency of occurrence in the corpus (the collection of reviews). Words that occur more frequently are assigned a lower index, and less frequent words a higher one. A 10,000-word tokenizer was used, which indexes the 10,000 most common words in the corpus. Any word not in this list was assigned the reserved index 1, which represents out-of-vocabulary words. Once all the documents (review texts) have been processed by the tokenizer, each review is essentially represented as a 10,000-long vector.
After vectorizing all the input text into this 10,000-long feature space, a Gaussian Naïve Bayes, a Multinomial Naïve Bayes, and a Decision Tree classifier were fit on an 80/20 train-test split. The target output was the sentiment label. All 3 classifiers performed very poorly with this tokenizer scheme, with accuracy of around 50 percent. However, when the same models were fitted using a term-frequency inverse-document-frequency (TF-IDF) vectorizer, the performance of all 3 models improved dramatically. TF-IDF assigns each word a numeric weight based on its relative importance to the document. For example, a word that is present in only a few documents receives a higher weight than a word that is common to almost all documents. For brevity, only the results for the TF-IDF vectorizer are summarized below as classification reports on test data.
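A frequency-ranked tokenizer of the kind described above can be sketched in a few lines of plain Python. The names `fit_vocab`, `texts_to_sequences`, and `OOV_INDEX` are hypothetical, and starting the ranks at 2 (leaving 0 free for padding and 1 for out-of-vocabulary words) is an assumption that mirrors a common convention rather than something stated in this study.

```python
from collections import Counter

OOV_INDEX = 1  # reserved index for out-of-vocabulary words, as described in the text


def fit_vocab(texts, num_words=10_000):
    """Map the num_words most frequent words to integer ranks (most frequent = lowest)."""
    counts = Counter(word for text in texts for word in text.split())
    # Ranks start at 2: index 0 is left free for padding and 1 for OOV (assumption).
    return {
        word: rank
        for rank, (word, _) in enumerate(counts.most_common(num_words), start=2)
    }


def texts_to_sequences(texts, vocab):
    """Replace every word with its index, or OOV_INDEX when it is not in the vocabulary."""
    return [[vocab.get(word, OOV_INDEX) for word in text.split()] for text in texts]
```

With the toy corpus `["good movie", "good plot", "bad movie good"]` and `num_words=2`, the vocabulary is `{"good": 2, "movie": 3}`, so `"good bad movie"` becomes `[2, 1, 3]`.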
Gaussian NB
Accuracy: 0.78
Classification Report:
precision recall f1-score support
negative 0.76 0.81 0.79 4939
positive 0.80 0.75 0.77 4978
accuracy 0.78 9917
macro avg 0.78 0.78 0.78 9917
weighted avg 0.78 0.78 0.78 9917
Multinomial NB
Accuracy: 0.86
Classification Report:
precision recall f1-score support
negative 0.86 0.85 0.86 4939
positive 0.86 0.86 0.86 4978
accuracy 0.86 9917
macro avg 0.86 0.86 0.86 9917
weighted avg 0.86 0.86 0.86 9917
Decision Tree
Accuracy: 0.73
Classification Report:
precision recall f1-score support
negative 0.72 0.73 0.73 5000
positive 0.73 0.72 0.72 5000
accuracy 0.73 10000
macro avg 0.73 0.73 0.73 10000
weighted avg 0.73 0.73 0.73 10000
Multinomial NB had the best score by far compared to the other 2 baseline models.
However, as mentioned earlier, these models do not capture any context or relationship between the words; they treat each vectorized word as an independent feature.
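To make the TF-IDF weighting used for the baselines concrete, the sketch below computes the textbook form, term frequency multiplied by log(N / document frequency), in plain Python. The function name `tf_idf` is hypothetical, and this formula is an assumption for illustration: library vectorizers such as scikit-learn's `TfidfVectorizer` apply a smoothed and normalized variant, so their values will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights
```

For the documents `["great movie", "boring movie", "great acting movie"]`, the word "movie" appears everywhere and gets weight 0, while "boring", which appears in a single document, gets the largest weight in its review. Each word is still weighted on its own, so word order plays no role.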
LSTM based sequential recurrent neural network model
For the initial setup, the model used a single LSTM layer with 64 units, an embedding layer of 50 dimensions for word encoding, and a dropout of 0.2. A sequence length of 80 was used, which represents the number of tokens per review fed to the model; longer reviews were truncated, and shorter ones were zero-padded.
After 5 epochs, the training accuracy was around 70% and the validation accuracy was about 65%.
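The truncation and zero-padding to a fixed length can be sketched as follows. The helper name `pad_or_truncate` is hypothetical, and cutting and padding at the end of the sequence is an assumption: the text does not say which side was used, and Keras' `pad_sequences`, for instance, pads and truncates at the start by default.

```python
def pad_or_truncate(sequence, max_len=80, pad_value=0):
    """Return a list of exactly max_len token ids.

    Longer sequences are cut at the end; shorter ones are padded at the end
    with pad_value (both sides are assumptions for this sketch).
    """
    trimmed = list(sequence[:max_len])
    return trimmed + [pad_value] * (max_len - len(trimmed))
```

For example, `pad_or_truncate([5, 3, 9], max_len=5)` gives `[5, 3, 9, 0, 0]`, and a 7-token review with `max_len=5` keeps only its first five tokens.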
Epoch 5/5
496/496 [==============================] - 40s 81ms/step - loss: 0.6294 - accuracy: 0.6994 - val_loss: 0.6504 - val_accuracy: 0.6508
310/310 [==============================] - 3s 11ms/step - loss: 0.6471 - accuracy: 0.6530
Test Loss: 0.6471, Test Accuracy: 0.6530
To improve on this model, a second, stacked LSTM layer of 32 units was added, also with a dropout of 0.2. The model's performance improved dramatically, to a test accuracy of about 84%, close to that of the Multinomial NB baseline.
Epoch 5/5
496/496 [==============================] - 75s 151ms/step - loss: 0.3565 - accuracy: 0.8725 - val_loss: 0.3853 - val_accuracy: 0.8374
310/310 [==============================] - 6s 19ms/step - loss: 0.3918 - accuracy: 0.8377
Test Loss: 0.3918, Test Accuracy: 0.8377
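A two-layer LSTM model of this shape could be expressed in Keras roughly as follows. This is a sketch under stated assumptions, not the code used in the study: the function name `build_model`, the vocabulary size of 10,000, the Adam optimizer, the integer-label loss, and the placement of the Dropout layers are all assumed. The abstract mentions ReLU activations, but the text does not say where they were applied, so they are left out here.

```python
def build_model(vocab_size=10_000, embed_dim=50, seq_len=80):
    """Build a stacked-LSTM sentiment classifier (illustrative sketch only)."""
    # Imported lazily so that this function can be defined without TensorFlow installed.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(seq_len,)),            # padded sequences of token ids
        layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
        layers.LSTM(64, return_sequences=True),   # emit every time step for the next LSTM
        layers.Dropout(0.2),
        layers.LSTM(32),                          # emit only the final hidden state
        layers.Dropout(0.2),
        layers.Dense(2, activation="softmax"),    # one probability per sentiment class
    ])
    # Assumed training setup: Adam optimizer and integer labels (0 = negative, 1 = positive).
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

The first LSTM needs `return_sequences=True` because the second LSTM expects a sequence as input rather than a single vector. Calling `build_model().summary()` prints the layer shapes and parameter counts.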
Increasing the sequence length to 150 and doubling the number of units in each layer to increase the memory size was also tried, but training became excruciatingly slow. After manual experimentation with sequence lengths from 50 to 100, a length of 80 appeared to be optimal.
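A rough parameter count helps explain the slowdown. The sketch below uses the standard formula for an LSTM layer with biases and no peephole connections, 4 × (units × (input_dim + units) + units). The helper names are hypothetical, and the layer sizes 64/32 and 128/64 are taken from the configurations described above.

```python
def lstm_params(input_dim, units):
    """Trainable weights in one LSTM layer: 4 gates, each with input weights,
    recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def stacked_lstm_params(embed_dim, units_1, units_2):
    """Total LSTM weights for two stacked layers fed by an embedding of size embed_dim."""
    return lstm_params(embed_dim, units_1) + lstm_params(units_1, units_2)


baseline = stacked_lstm_params(50, 64, 32)    # 41,856 weights
doubled = stacked_lstm_params(50, 128, 64)    # 141,056 weights
```

Doubling the units gives roughly 3.4 times as many LSTM weights, and a sequence length of 150 instead of 80 adds almost twice as many time steps per review, which is consistent with the much longer training time observed.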
Limitations of the model:
The choice of hyperparameters in LSTM models, such as the number of layers, hidden units, and learning rate, can significantly impact performance. A grid search or randomized search over hyperparameters is often needed to find a good set for a given task and dataset, but this comes at the cost of computational time. As with non-neural models, LSTM models might need continuous adaptation, especially in dynamic environments where the characteristics of the data change over time. This is often referred to as online or incremental learning. Each dataset is unique, and what works well for one might not work as effectively for another, which underlines the importance of experimenting and tuning hyperparameters for the specific characteristics of the data. Training deep learning models, including LSTM networks, can also be time-consuming, depending on the complexity of the model architecture, the size of the dataset, and the chosen hyperparameters.
Sometimes, for real-time applications, there might be a trade-off between using a sophisticated LSTM model and a simpler, traditional model. While the neural network might offer better performance, traditional models like logistic regression or support vector machines can be faster to train and evaluate.