Sentiment Analysis of Movie Reviews

Project Abstract and Scope

The project aims to classify movie reviews as positive or negative using deep learning. The reviews are preprocessed and tokenized before being fed into a deep learning model for classification.

Dataset

The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee, who collected and released it as part of their research on natural language processing. It consists of 2,000 movie reviews, each tagged as positive or negative. The dataset is available from Cornell University.

Data Exploration and Preprocessing

Preprocessing

We use a standard preprocessing pipeline for text data:

Preprocessing Pipeline
  1. Clean Text: Remove special characters, numbers, and punctuation.
  2. Lowercase: Convert all text to lowercase.
  3. Stemming: Reduce words to their root form.
  4. Stopwords: Remove common words that do not contribute to the meaning of the text.
  5. Tokenization: Split the text into words.
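
A minimal sketch of one way to implement this pipeline, assuming NLTK's PorterStemmer and English stopword list (our choice; the report does not name a library) and a hypothetical list raw_reviews of raw review strings. Note that stemming and stopword removal operate on tokens, so the text is split before those steps are applied.

import re

from nltk.corpus import stopwords   # assumes the NLTK stopwords corpus has been downloaded
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # 1.-2. Clean and lowercase: keep letters and whitespace only, then lowercase.
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()
    # 3.-5. Split into tokens, drop stopwords, and stem what remains.
    tokens = [stemmer.stem(w) for w in text.split() if w not in stop_words]
    return " ".join(tokens)

x_raw = [preprocess(review) for review in raw_reviews]  # raw_reviews is hypothetical
Preprocessing pipeline (sketch)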

Dataset statistics

We calculate a few statistics to better understand the dataset:

---------------- Dataset Statistics ----------------
Feature                                  Value
-----------------------------------    ---------
Total Unique Words                     25297
Average Review Length                  2180.75
Median Review Length                   2024.5
Standard Deviation of Review Length    951.441
-----------------------------------------------------
Statistics of the dataset
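
The statistics above can be computed with a short script. The sketch below is our reconstruction, measuring review length in whitespace-separated tokens (the report does not state the unit) and assuming x_raw holds the preprocessed reviews:

import numpy as np

# Vocabulary size and per-review lengths (in tokens).
all_words = set()
review_lengths = []
for text in x_raw:
    tokens = text.split()
    all_words.update(tokens)
    review_lengths.append(len(tokens))
review_lengths = np.array(review_lengths)

print("Total Unique Words:", len(all_words))
print("Average Review Length:", review_lengths.mean())
print("Median Review Length:", np.median(review_lengths))
print("Standard Deviation of Review Length:", review_lengths.std())
Computing the dataset statistics (sketch)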


Review Length Histogram

Tokenization and Padding

There are many ways to represent a text (i.e., a data point). In NLP and deep learning terminology, this step is called tokenization. A common approach is to represent each word by its frequency rank in the corpus: the most common word is represented as 1, the second most common word as 2, and so on.

from collections import Counter

import torch

# Count word frequencies over the whole (preprocessed) corpus.
words = []
for text in x_raw:
    words += text.split()
word_counts = Counter(words)

# Rank words by frequency: the most common word gets index 1, the next most common 2, etc.
sorted_word_counts = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
word_to_index = {word: i for i, (word, _) in enumerate(sorted_word_counts, 1)}

def text_to_sequence(text):
    # Map a (preprocessed) text to a tensor of word-rank indices.
    words = text.split()
    sequence = [word_to_index[word] for word in words]
    return torch.tensor(sequence, dtype=torch.long)
Rank-based tokenization

Thresholding

Selecting a review length threshold below which 70% or 90% of the reviews fall keeps the padded sequences at a manageable length without discarding most of the data. We use the 70th percentile as the threshold here.

import numpy as np

# Review length threshold: 70% of the reviews are no longer than L.
L = int(np.percentile(review_lengths, 70))
Thresholding

Interpretation

  • 70th Percentile: 70% of the reviews are shorter than this threshold. It focuses on the majority of the reviews without letting a few very long outliers dictate the padded length.
  • 90th Percentile: 90% of the reviews are shorter than this threshold. This is a more inclusive choice that truncates fewer reviews but produces longer (and more heavily padded) sequences.
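
With the threshold L fixed, every integer sequence can be truncated or zero-padded to exactly L entries. The sketch below shows one straightforward way to do this; padding on the left with 0 is our assumption, as the report does not specify the padding side.

import numpy as np

def pad_features(sequences, seq_length):
    # Truncate or left-pad each integer sequence to exactly seq_length entries.
    features = np.zeros((len(sequences), seq_length), dtype=np.int64)
    for i, seq in enumerate(sequences):
        seq = seq[:seq_length]                     # truncate long reviews
        features[i, seq_length - len(seq):] = seq  # left-pad short reviews with zeros
    return features

# e.g. features = pad_features([text_to_sequence(t).tolist() for t in x_raw], L)
Padding to a fixed length (sketch)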

Word Embeddings

One can use tokenized text as inputs to a deep neural network. However, a (not so) recent breakthrough in NLP suggests that more sophisticated representations of text yield better results. These sophisticated representations are called word embeddings. “Word embedding is a term used for representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning.”

Word Embeddings in Vector Space

Most deep learning frameworks (including PyTorch) provide a convenient way to convert positive integer representations of words into word embeddings through an “Embedding layer.” The layer accepts arguments that define the mapping of words into embeddings, including the maximum number of expected words, also called the vocabulary size (i.e., one more than the largest integer index). The layer also lets you specify the dimension of each word vector, called the “embedding dimension.” We use a word embedding layer for this project.

Assume that we are interested in the top 5,000 words. This means that in each integer sequence representing a document, we set to zero the integers corresponding to words that are not among the top 5,000 words of the corpus.
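
A minimal sketch of this capping step, assuming sequences come from text_to_sequence above (the helper name cap_vocabulary is ours):

import torch

def cap_vocabulary(sequence, top_k=5000):
    # Zero out the indices of words that are not among the top_k most common words.
    return torch.tensor([idx if idx <= top_k else 0 for idx in sequence.tolist()],
                        dtype=torch.long)

# e.g. capped = [cap_vocabulary(text_to_sequence(t)) for t in x_raw]
Capping the vocabulary (sketch)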

vocab_size = 5000 + 1  # top 5,000 word indices plus index 0 for out-of-vocabulary words
embedding_dim = 16
self.embedding = nn.Embedding(vocab_size, embedding_dim)
Embedding Layer in PyTorch

Network Architectures

We experiment with different network architectures to classify the movie reviews.

Note: We use early stopping in the implementations below to prevent overfitting. Early stopping monitors the model’s performance on a validation set and stops training when that performance starts to degrade.
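
A minimal sketch of the kind of patience-based early stopping we have in mind, similar in spirit to the Stack Overflow reference listed at the end (the class name EarlyStopper and its defaults are ours):

class EarlyStopper:
    # Stop training when validation loss has not improved for `patience` consecutive epochs.

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = float("inf")

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss   # improvement: remember it and reset the counter
            self.counter = 0
        else:
            self.counter += 1           # no improvement this epoch
        return self.counter >= self.patience
Early stopping (sketch)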

Simple Neural Network

We start with a simple neural network with an embedding layer, followed by a fully connected layer and a sigmoid activation function.

import torch.nn as nn

vocab_size = 5000 + 1  # top 5,000 word indices plus index 0 for out-of-vocabulary words
embedding_dim = 16

class NeuralNetwork(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(NeuralNetwork, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc1 = nn.Linear(embedding_dim, 1)

    def forward(self, x):
        out = self.embedding(x)
        # Global average pooling over the sequence dimension.
        out = out.mean(dim=1)
        # Returns a logit; the sigmoid is applied by the loss
        # (e.g. nn.BCEWithLogitsLoss) or at inference time.
        out = self.fc1(out)
        return out
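
A training loop for this model might look roughly as follows. The loader names, learning rate, and epoch count are our assumptions, and nn.BCEWithLogitsLoss supplies the sigmoid that the model itself leaves out:

import torch
import torch.nn as nn

model = NeuralNetwork(vocab_size, embedding_dim)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
stopper = EarlyStopper(patience=3)

for epoch in range(50):
    model.train()
    for inputs, labels in train_loader:   # hypothetical DataLoader of (padded sequences, labels)
        optimizer.zero_grad()
        logits = model(inputs)
        loss = criterion(logits, labels.float().unsqueeze(1))
        loss.backward()
        optimizer.step()

    # Early stopping on the average validation loss.
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y.float().unsqueeze(1)).item()
                       for x, y in valid_loader) / len(valid_loader)
    if stopper.should_stop(val_loss):
        break
Training loop with early stopping (sketch)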

LSTM

We experiment with a Long Short-Term Memory (LSTM) network to classify the movie reviews. LSTM is a type of recurrent neural network that is capable of learning long-term dependencies. It is well-suited to learn from sequences of data.

import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        super().__init__()
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                            dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()

    def forward(self, x):
        batch_size = x.size(0)
        embeds = self.embedding(x)            # (batch, seq_len, embedding_dim)
        lstm_out, hidden = self.lstm(embeds)  # (batch, seq_len, hidden_dim)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        out = self.dropout(lstm_out)
        out = self.fc(out)
        sig_out = self.sig(out)
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1]  # keep only the prediction at the last time step
        return sig_out.unsqueeze(1)

vocab_size = 5000 + 1  # top 5,000 word indices plus index 0 for out-of-vocabulary words
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2
net = LSTM(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
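
Because this model applies the sigmoid itself, training would typically pair it with nn.BCELoss rather than nn.BCEWithLogitsLoss; a sketch of that setup, with the learning rate being our assumption:

criterion = nn.BCELoss()   # the model output is already a probability
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)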

Results

We use scikit-learn’s classification report to evaluate the performance of the models.
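
A sketch of how such a report is produced, assuming hypothetical arrays y_test (true 0/1 labels) and y_pred_prob (predicted probabilities) and a 0.5 decision threshold:

from sklearn.metrics import classification_report

# Threshold the predicted probabilities at 0.5 to obtain class labels.
y_pred = (y_pred_prob >= 0.5).astype(int)
print(classification_report(y_test, y_pred))
Classification report (sketch)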

Simple Neural Network

              precision    recall  f1-score   support

           0       0.87      0.88      0.87       300
           1       0.88      0.86      0.87       300

    accuracy                           0.87       600
   macro avg       0.87      0.87      0.87       600
weighted avg       0.87      0.87      0.87       600

The simple neural network performs well, reaching 87% accuracy on the test set. We can experiment with different hyperparameters and network architectures to improve the performance further.

LSTM

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       300
           1       0.50      1.00      0.67       300

    accuracy                           0.50       600
   macro avg       0.25      0.50      0.33       600
weighted avg       0.25      0.50      0.33       600

With the current implementation, the LSTM does not outperform the simple neural network: it collapses to predicting the positive class for every review, giving 50% accuracy (chance level on this balanced test set). We can experiment with different hyperparameters and network architectures to improve the performance.

References

  • https://en.wikipedia.org/wiki/Word_embedding
  • https://developers.google.com/machine-learning/crash-course/embeddings/translating-to-a-lower-dimensional-space
  • https://stackoverflow.com/questions/71998978/early-stopping-in-pytorch