Project : Sentiment Classifier using Python

Mahima Jain
Feb 11, 2021


In this post I’ll share the code and implementation of a sentiment classifier using Python. The dataset used is the TensorFlow dataset of IMDB reviews (tensorflow_datasets), i.e. reviews given by viewers about various movies. The dataset consists of reviews in English.

Algorithm in short :

  1. Preprocess the data.
  2. Convert the English text to numerical representations.
  3. Prepare it to be fed as input to our deep learning model with GRUs.

Importing the modules :

import numpy as np
import tensorflow as tf
from tensorflow import keras
#set tensorflow seed:
tf.random.set_seed(42)

(Setting a seed ensures that you can reproduce your results when using random number generators.)
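For instance, a quick sanity check (my own toy example, not part of the original post) shows that resetting the global seed makes random ops repeatable:

# Hypothetical check: resetting the global seed reproduces the same "random" values.
tf.random.set_seed(42)
a = tf.random.uniform([2])
tf.random.set_seed(42)
b = tf.random.uniform([2])
print(tf.reduce_all(tf.equal(a, b)).numpy())  # True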

Loading the dataset :

Importing IMDB tensorflow dataset named ‘imdb_reviews’.

import tensorflow_datasets as tfds
datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)
#print(datasets.keys())

The train and test sets contain 25,000 examples each.

train_size = info.splits["train"].num_examples
test_size = info.splits["test"].num_examples
#print(train_size, test_size)

Exploring the dataset :

for X_batch, y_batch in datasets["train"].batch(2).take(2):
    for review, label in zip(X_batch.numpy(), y_batch.numpy()):
        print("Review : ", review.decode("utf-8")[:200], "...")
        print("Label : ", label, " = Positive" if label else " = Negative")
        print()
  • datasets["train"] and datasets["test"] contain the train and test datasets respectively.
  • batch(m) batches ‘m’ data samples at a time and take(n) takes ‘n’ batches.
  • Each batch is of type EagerTensor. We can convert it to a NumPy array using X_batch.numpy().
  • While traversing the batches, the first 200 characters of each review and its label are printed.

Defining the preprocess function :

In preprocessing function :

  1. Truncate the reviews, keeping only the first 300 characters, as one can generally tell whether a review is positive or negative from the first 2–3 lines.
  2. Use regular expressions to replace <br /> tags (and any remaining non-letter characters) with spaces.
  3. Finally, the preprocess function splits the reviews on spaces. This returns a RaggedTensor, which is converted to a dense tensor by padding all reviews with the <pad> token so they all have the same length.
def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)
    X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z]", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch

preprocess(X_batch, y_batch)
  • tf.strings is used for working with string tensors.
  • tf.strings.substr(X_batch, 0, 300) returns substrings from a Tensor of strings.
  • .regex_replace() replaces matches of the pattern (param1) in X_batch with the rewrite string (param2).
  • .split() splits the elements of the input into a RaggedTensor.
  • X_batch.to_tensor(default_value=b"<pad>") converts the RaggedTensor into a dense tf.Tensor (a small example follows below).
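To make the padding behaviour concrete, here is a small illustration (a toy batch of my own, not taken from the dataset) of what preprocess() returns:

# Toy example (hypothetical reviews) showing the padding behaviour of to_tensor().
sample_reviews = tf.constant([b"Great movie!<br />Loved it", b"Bad"])
sample_labels = tf.constant([1, 0])
X, y = preprocess(sample_reviews, sample_labels)
print(X)
# The shorter review is right-padded with b'<pad>' so both rows have the same length.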

Constructing the Vocabulary :

Here we count how often each word occurs throughout the dataset using the Counter() class.

from collections import Counter

vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(2).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

vocabulary.most_common()[:5]
len(vocabulary)
  • Counter.update() counts the new items and updates the counter’s tallies (see the small example below).
  • map() applies the given function to every batch of the IMDB dataset.
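As a tiny illustration of how Counter.update() behaves (toy data of my own, not from the dataset):

from collections import Counter

c = Counter()
c.update([b"good", b"movie", b"good"])
c.update([b"movie"])
print(c.most_common(2))  # [(b'good', 2), (b'movie', 2)]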

Truncating the Vocabulary :

There are more than 50,000 words in the vocabulary, so let’s truncate it to the 10,000 most common words.

vocab_size = 10000
truncated_vocabulary = [words for words, count in vocabulary.most_common()[:vocab_size]]
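If you want to sanity-check the result (an optional step of my own, not in the original post), you can peek at the tokens that were kept:

# The most frequent tokens are kept; expect very common words (and likely the
# b'<pad>' token introduced by preprocess) near the top of the list.
print(len(truncated_vocabulary))   # 10000
print(truncated_vocabulary[:5])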

Creating a lookup table :

In this step we convert words into numbers, since the model can only process numbers. We create a lookup table (with 1,000 out-of-vocabulary (OOV) buckets) such that the most frequently occurring words have lower indices than the less frequent ones.

words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype = tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
#print(vocab_init) #table initializer given keys and values tensors.
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)
table.lookup(tf.constant([b"This movie was faaaaaantastic".split()]))
  • .StaticVocabularyTable() assigns out-of-vocabulary keys to hash buckets.
  • If a term is not in the vocabulary, it is mapped to bucket_id = hash(<term>) % num_oov_buckets + vocab_size, so bucket_id lies between vocab_size and vocab_size + num_oov_buckets - 1 (here, between 10000 and 10999).
  • table.lookup() looks up keys in the table and outputs the corresponding values (a quick check follows below).
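A small check (my own example, assuming the table built above) of in-vocabulary versus out-of-vocabulary behaviour:

# Known words map to their vocabulary index (< vocab_size); unknown words are
# hashed into one of the 1,000 OOV buckets, giving an id >= vocab_size.
ids = table.lookup(tf.constant([b"movie", b"faaaaaantastic"]))
print(ids.numpy())  # the first id is below 10000, the second is 10000 or above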

Creating the Final Train and Test sets :

For creating the final training set train_set,

  1. batch the reviews
  2. then convert them to short sequences of words using the preprocess() function
  3. then encode these words using a simple encode_words() function that uses the table we just built, and finally prefetch the next batch

For creating the final test set, test_set,

  1. create a batch of 1000 test samples
  2. then apply the 2nd and 3rd steps used for train_set
def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].repeat().batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)
test_set = datasets["test"].batch(1000).map(preprocess)
test_set = test_set.map(encode_words)

for X_batch, y_batch in train_set.take(1):
    print(X_batch)
    print(y_batch)
  • .repeat().batch(32).map(preprocess) repeatedly creates batches of 32 samples and applies preprocess() to every batch.
  • .map(encode_words).prefetch(1) applies encode_words() to the batches and fetches the next batch in parallel while the current one is being processed.

Building the Model :

The first layer is an Embedding layer, which converts word IDs into embeddings. The embedding matrix needs to have one row per word ID and one column per embedding dimension (a quick shape check is shown after the model code below).

embed_size = 128
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                           mask_zero=True,
                           input_shape=[None]),
    keras.layers.GRU(4, return_sequences=True),
    keras.layers.GRU(2),
    keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer="adam", metrics=["accuracy"])
  1. keras.layers.Embedding() turns positive integers (word IDs) into dense vectors of fixed size.
  2. keras.layers.GRU() is the Gated Recurrent Unit layer.
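As a quick shape check (a sketch of my own, assuming train_set from above), you can pass one batch through the embedding layer and print the model summary:

# Each word ID in a (batch, time) tensor is mapped to a 128-dimensional vector.
for X_batch, y_batch in train_set.take(1):
    embeddings = model.layers[0](X_batch)
    print(X_batch.shape, "->", embeddings.shape)  # (32, T) -> (32, T, 128)
model.summary()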

Training and Testing the Model :

import time

start = time.time()
model.fit(train_set, steps_per_epoch=train_size//32, epochs=10)
end = time.time()
print("Time of execution : ", end - start)
model.evaluate(test_set)
#output : [1.167264699935913, 0.7222800254821777]

The test-set accuracy in this case is 72.22%. You can improve it by tuning hyperparameters such as the number of epochs, the batch size, or the layer parameters, for example as sketched below.
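One possible variation (a sketch of my own, not a tuned or tested configuration) is to use wider GRU layers and train for a few more steps:

# Hypothetical tweak: larger recurrent layers; training time will increase.
model2 = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                           mask_zero=True, input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation='sigmoid')
])
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.fit(train_set, steps_per_epoch=train_size//32, epochs=5)
model2.evaluate(test_set)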

You can find the whole code in my GitHub repo “Sentiment Classifier”.
