How Machines Learn Emotions: Sentiment Analysis of Amazon Product Reviews

by nobody, June 9th, 2021

Hey Folks! In this article, I walk you through the sentiment analysis of Amazon Electronics Product Reviews.

The dataset

Before we move forward, let’s download the dataset that we'll use in this project.

You can download the dataset from here: http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Electronics_5.json.gz. The download size is 1.2GB. The file is gzip-compressed, so you first need to decompress it; the decompressed file is around 2.5GB. A file of this size may not open in Microsoft Excel.

If you still want to open it, you can use the Delimit software. Here is the download link: http://delimitware.com/download.html.
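
If you would rather not decompress the file by hand, the archive can also be read directly from Python. The following is a minimal sketch, assuming the file holds one JSON object per line (as the loading code later in this article expects); the helper name `iter_reviews` and its `limit` parameter are hypothetical and not part of the dataset's tooling.

```python
import gzip
import json
from itertools import islice


def iter_reviews(path, limit=None):
    """Yield one review dict per line from a gzip-compressed JSON Lines file."""
    # "rt" opens the archive in text mode, so each line arrives as a str
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # islice stops after `limit` lines; limit=None reads the whole file
        for line in islice(handle, limit):
            yield json.loads(line)
```

For a quick look at the data, something like `list(iter_reviews("Electronics_5.json.gz", limit=5))` would return the first five reviews without loading the whole file into memory.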

Let’s analyze the dataset

The dataset contains these columns/features:

  • reviewerID: ID of the reviewer, e.g. A2SUAM1J3GNN3B
  • asin: ID of the product, e.g. 0000013714
  • reviewerName: name of the reviewer
  • vote: helpful votes of the review
  • style: product metadata, e.g., “Format” is “Hardcover”
  • reviewText: text of the review
  • overall: rating of the product
  • summary: summary of the review
  • unixReviewTime: time of the review (unix time)
  • reviewTime: time of the review (raw)
  • image: images that users post after they have received the product

The dataset has lots of features, but for sentiment analysis we only need the review text (reviewText) and the rating (overall).

Importing the libraries and the data

import numpy as np
import pandas as pd
import random
import os
import json
import sys
import gzip
from collections import defaultdict
import csv
import time

#nltk libraries and packages
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import wordnet as wn

# ML-related libraries
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection, naive_bayes, svm
from sklearn.tree import DecisionTreeClassifier

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn import metrics

from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score as AUC

We first read the dataset, which stores one JSON object per line, and then keep only the id, review text, and rating of each product for sentiment analysis.

# read the JSON file into a list (one JSON object per line)
values = []
with open("Electronics_5.json", "r") as f:
    for line in f:
        values.append(json.loads(line))
print(values[:5])

We saved our filtered dataset in the Electronic_review.csv file.
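
The code that writes Electronic_review.csv is not shown above. A minimal sketch of that filtering step might look like the following, assuming `values` is the list of review dicts loaded earlier and that the `id` column is taken from `reviewerID` (the article does not say which ID it uses); `save_filtered` is a hypothetical helper name.

```python
import csv


def save_filtered(values, path="Electronic_review.csv"):
    """Write only the id, review text, and rating of each review to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for review in values:
            # .get() guards against records that lack one of the fields
            writer.writerow([
                review.get("reviewerID", ""),  # assumption: reviewerID is the id
                review.get("reviewText", ""),
                review.get("overall", ""),
            ])
```

No header row is written here, which matches reading the file back with `header=None` and explicit column names.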

Now we read our Electronic_review data into a data frame:

# read the dataset into a df
colnames = ["id", "text", "overall"]
df = pd.read_csv("Electronic_review.csv", names=colnames, header=None)

Populating the data with proper values of sentiments

The division of sentiment, based on the rating (the overall column), is as follows:

  • 0 < Rating < 3 => Negative sentiment (-1)
  • Rating = 3 => Neutral sentiment (0)
  • 3 < Rating <= 5 => Positive sentiment (1)
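
The labelling code is not shown, so here is a minimal sketch of the rule above. It assumes the rating sits in the `overall` column and that the label is stored in a `Sentiment` column (the name used later when splitting the data); `rating_to_sentiment` is a hypothetical helper.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to -1 (negative), 0 (neutral), or 1 (positive)."""
    if rating < 3:
        return -1
    if rating == 3:
        return 0
    return 1


# With the data frame from the previous step, the labels could be added like this:
# newdf = df.copy()
# newdf["Sentiment"] = newdf["overall"].apply(rating_to_sentiment)
print([rating_to_sentiment(r) for r in (1.0, 2.0, 3.0, 4.0, 5.0)])
# prints [-1, -1, 0, 1, 1]
```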

Let’s save this data frame as processedData.csv.

newdf.to_csv("processedData.csv",chunksize=100000)

Let’s see what our processed data looks like:

df = pd.read_csv("processedData.csv",nrows = 100000)
print(df.head(5))

Preprocess the text data samples

Let’s import some important libraries:

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import wordnet as wn

# download the NLTK resources used by the tokenizer, stop-word list, and lemmatizer
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")

Now read processedData.csv:

df = pd.read_csv("processedData.csv")

Stemming algorithms work by cutting off the end or the beginning of a word, using a list of common prefixes and suffixes found in inflected words. This indiscriminate cutting works in some cases but not in all, which is the main limitation of the approach.

Developing a stemmer is far simpler than building a lemmatizer. A lemmatizer requires deep linguistic knowledge to create the dictionaries that let the algorithm look up the proper form of a word. Once this is done, noise is reduced and the results of the information retrieval process are more accurate.
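
To make the contrast concrete, here is a toy sketch rather than a real stemmer or lemmatizer: `toy_stem` strips a few hard-coded suffixes, and `toy_lemmatize` looks words up in a tiny hand-written dictionary. Both functions and their word lists are made up for illustration; in practice you would use a library implementation such as NLTK's WordNetLemmatizer, which is imported above.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative list, far from complete
LEMMAS = {"studies": "study", "better": "good", "was": "be"}  # tiny hand-made dictionary


def toy_stem(word):
    """Cut off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMAS.get(word, word)


for word in ("playing", "studies", "string", "better"):
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The suffix stripper handles "playing" but turns "studies" into "studi" and "string" into "str", while the dictionary lookup gets "studies" and "better" right but only for words it already knows.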

lat_df = df[:100000]
lat_df.to_csv("CurrentUsedFile.csv")

We saved the first 100,000 rows of data as CurrentUsedFile.csv so that we can easily process the data.

Split the dataset into train and test set

#importing the new dataset
lat_df = pd.read_csv("CurrentUsedFile.csv")
print(lat_df.head(5))
#create x and y => x:textreview , y:sentiment
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(lat_df['reviewText_final'],lat_df['Sentiment'],test_size=0.2,random_state = 42)
print(Train_X.shape,Train_Y.shape)
print(Test_X.shape,Test_Y.shape)

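# one-hot encode the test labels for the ROC curves; `classes` must match the
# label values stored in the Sentiment column
from sklearn.preprocessing import label_binarize
Test_Y_binarise = label_binarize(Test_Y, classes=[0, 1, 2])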

Applying TF-IDF vectorizer to the tokens formed for each of the review samples

# Vectorize the words using the TF-IDF vectorizer - this measures how important a word is in a document relative to the whole dataset
from sklearn.feature_extraction.text import TfidfVectorizer
Tfidf_vect = TfidfVectorizer(max_features=500000)               # tweak max_features based on the dataset
Tfidf_vect.fit(lat_df['reviewText_final'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
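
To see what the vectorizer produces, here is a small sketch on a made-up corpus (the sentences are invented for illustration). It fits the vocabulary on the training texts only, which is a common way to keep test-set words from influencing the features, and then transforms both sets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["great sound quality", "battery died quickly", "great battery"]
test_texts = ["great screen"]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training texts only
train_matrix = vectorizer.fit_transform(train_texts)
# Words unseen during fit (here "screen") are ignored at transform time
test_matrix = vectorizer.transform(test_texts)

print(sorted(vectorizer.vocabulary_))  # the six distinct training words
print(train_matrix.shape, test_matrix.shape)
```

Each row is a sparse vector with one column per vocabulary word, so the test review ends up with a single non-zero entry, the one for "great".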

Applying the SVM, NB, and DT models

Before going ahead, let’s create a model evaluation function:

def modelEvaluation(predictions, y_test_set):
    # Print accuracy, per-class metrics, and the confusion matrix for the predictions
    print("\nAccuracy on validation set: {:.4f}".format(accuracy_score(y_test_set, predictions)))
    print("\nClassification report : \n", metrics.classification_report(y_test_set, predictions))
    print("\nConfusion Matrix : \n", metrics.confusion_matrix(y_test_set, predictions))

Naive Bayes Model:

# Classifier - Algorithm - Naive Bayes
import time
from sklearn.metrics import precision_recall_fscore_support

second = time.time()  # timestamp taken before training
Naive = naive_bayes.MultinomialNB()
# fit the training dataset on the classifier
historyNB = Naive.fit(Train_X_Tfidf, Train_Y)
# predict the labels on validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)
modelEvaluation(predictions_NB, Test_Y)
# macro-averaged precision, recall, and F1 (the fourth value, support, is None when averaging)
a, b, c, d = precision_recall_fscore_support(Test_Y, predictions_NB, average='macro')
# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ", accuracy_score(Test_Y, predictions_NB)*100)
print("Precision is: ", a)
print("Recall is: ", b)
print("F-1 Score is: ", c)

Support Vector Machine (SVM) Model:

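# NOTE: the fit/predict step is missing from the original listing; a linear-kernel SVC is assumed here
SVM = svm.SVC(kernel='linear')
SVM.fit(Train_X_Tfidf, Train_Y)
predictions_SVM = SVM.predict(Test_X_Tfidf)
asvm, bsvm, csvm, dsvm = precision_recall_fscore_support(Test_Y, predictions_SVM, average='macro')
# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ", accuracy_score(Test_Y, predictions_SVM)*100)
print("Precision is: ", asvm)
print("Recall is: ", bsvm)
print("F-1 Score is: ", csvm)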

Decision Tree Model:

third=time.time()
decTree = DecisionTreeClassifier()
decTree.fit(Train_X_Tfidf, Train_Y)
y_decTree_predicted = decTree.predict(Test_X_Tfidf)
modelEvaluation(y_decTree_predicted, Test_Y)

Plotting all 3 ROC together
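
The plotting code is not included here. One possible sketch, assuming each fitted model exposes per-class scores through `predict_proba` or `decision_function`, is to binarize the labels one-vs-rest and draw a micro-averaged ROC curve per model. The helper names `micro_roc` and `plot_rocs` are assumptions for illustration, not the article's original code.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize


def micro_roc(y_true, scores, classes):
    """Return (fpr, tpr, auc) for a micro-averaged one-vs-rest ROC curve.

    `scores` has one column per class, in the same order as `classes`.
    """
    y_bin = label_binarize(y_true, classes=classes)
    # Flatten so every (sample, class) pair counts as one binary decision
    fpr, tpr, _ = roc_curve(y_bin.ravel(), np.asarray(scores).ravel())
    return fpr, tpr, auc(fpr, tpr)


def plot_rocs(curves):
    """Plot several ROC curves; `curves` maps a model name to (fpr, tpr, auc)."""
    import matplotlib.pyplot as plt  # imported lazily so micro_roc works without it

    for name, (fpr, tpr, area) in curves.items():
        plt.plot(fpr, tpr, label=f"{name} (AUC = {area:.2f})")
    plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```

With the models trained above, one entry could be built as `micro_roc(Test_Y, Naive.predict_proba(Test_X_Tfidf), Naive.classes_)`, and the resulting dictionary of curves passed to `plot_rocs`.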

That’s all.

Also published on Medium's sameerbairwa