The Effect of Review Polarity on Usefulness
Diya Gangopadhyay
Abstract:
This paper uses machine learning to predict whether a book review will be perceived
as useful by users. It uses a dataset of book reviews from Amazon and focuses
primarily on the relationship between the positive or negative sentiment expressed
in a review and its usefulness rating. The algorithm used is SMO (sequential minimal
optimization, a support vector machine trainer) with an Attribute Selection wrapper.
User-defined unigram and bigram features that represent positivity and negativity in
reviews were added to the feature space to boost performance. The Ranking mechanism
of the Attribute Selector found the user-defined feature to be the most predictive,
and a small improvement in performance (kappa value) was observed after these
features were added. This indicates a possible correlation between the polarity and
the perceived usefulness of a book review, which can be verified in the future
through more advanced experimentation.
Introduction:
There has been considerable research on sentiment mining of product reviews using
text classification algorithms to classify reviews as positive or negative. Predicting
the usefulness of a review from its text has also been studied in several research
papers, for instance in [1], where subjectivity is used as a measure. Predicting
usefulness has several applications. One is a more efficient interface that displays
reviews in decreasing order of predicted usefulness. This would help users find
relevant information about a product they are interested in without having to browse
through too many reviews. From the perspective of the product manufacturer, studying
the reviews predicted to be most useful can help in projecting general public opinion
on the product at a relatively early stage.
The focus of this study is the relationship between the positive or negative
sentiment of a review and its helpfulness rating by users. If such a relationship can
be established, it can give an insight into patterns of user behavior. For instance,
if positive reviews are generally perceived to be more useful, this indicates that
users typically look for recommendations rather than warnings against specific
products (books in this case) that they have in mind. It is worth noting that positive
and negative reviews help users in different ways. While a positive review can be
useful by convincing the user to buy any book, a negative review is useful only if it
recommends against a book the user was considering buying. Thus a negative review
intuitively has a lower chance of being useful, since it has to be about a book that
users were already considering.
The dataset consists of around 200 book reviews from Amazon, each with a rating on
a scale of 1 to 5, the review text, and the number of people who found it helpful out
of all those who rated it. For the purpose of classification, only reviews that were
rated by more than 5 users and had a majority above 60% in favor of either “helpful”
or “not helpful” were considered. This filtering was done to exclude reviews that are
only marginally helpful or unhelpful, since such labels could arise merely by chance
and the reviews might not contain enough features to indicate “helpfulness”.
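The filtering rule above can be sketched in Python as follows. This is a minimal illustration, not the author's actual preprocessing: the `Review` class, its field names, and the function `helpfulness_label` are hypothetical, and integer arithmetic is used only to keep the 60% comparison exact.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Review:
    # Field names are assumptions made for this sketch.
    rating: int         # star rating, 1 to 5
    text: str
    helpful_votes: int  # users who marked the review as helpful
    total_votes: int    # all users who rated the review's helpfulness

MIN_VOTES = 5      # keep only reviews rated by MORE than this many users
MAJORITY_PCT = 60  # keep only reviews with a majority ABOVE this percentage

def helpfulness_label(review: Review) -> Optional[bool]:
    """Return True (helpful), False (not helpful), or None (filtered out)."""
    if review.total_votes <= MIN_VOTES:
        return None
    helpful = review.helpful_votes
    unhelpful = review.total_votes - helpful
    if helpful * 100 > MAJORITY_PCT * review.total_votes:
        return True
    if unhelpful * 100 > MAJORITY_PCT * review.total_votes:
        return False
    return None  # marginal case: no clear majority either way
```

A review with 9 helpful votes out of 10 is labeled helpful, one with 1 out of 10 is labeled not helpful, and a 5 out of 10 split is dropped.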
The idea of a possible correlation between review polarity and usefulness arose
from observing a high correlation between ratings of 4 and 5 and usefulness
(helpfulness). This led to the hypothesis that positivity could be a predictor of a
review being useful.
Polarity detection has been studied at length in several research articles. For
example, [2] uses Semantic Orientation - Pointwise Mutual Information (SO-PMI)
scores of words to measure their positive or negative orientation. To obtain this
value, the difference between the PMI scores of the given word with certain query
words (representing positive and negative polarity) is computed. The idea is thus to
check for similarity with standard representatives of positive or negative words.
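For reference, a commonly used formulation of this score is given below. It is a sketch under assumptions: the symbols P and N (the sets of positive and negative query words) are introduced here for illustration, and the exact variant used in [2] may differ in detail.

```latex
\mathrm{PMI}(w, q) = \log_2 \frac{p(w, q)}{p(w)\, p(q)}
\qquad
\text{SO-PMI}(w) = \sum_{q \in P} \mathrm{PMI}(w, q) \;-\; \sum_{q \in N} \mathrm{PMI}(w, q)
```

A word that co-occurs more often with the positive query words than with the negative ones receives a positive score, and vice versa.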
In [3], SVM is used to classify text by the sentiment expressed, using features such
as unigrams, selected words and parts of speech (POS). TF-IDF is applied to take
into account the rarity of these features.
In this study the “exclusiveness” of positive and negative words has been measured
statistically by counting their frequency of appearance in positive versus negative
reviews. The assumption is that a word with a positive meaning that appears
exclusively in positive reviews is more likely to indicate the positivity of a review
it appears in, and hence its helpfulness. This is similar to the idea of rarity that
TF-IDF measures, but rather than measuring how exclusive a word is to a particular
review, it measures how specific the word is to positive reviews in general.
The approach is thus to measure helpfulness indirectly through semantic
orientation.
SVM has been used because of its good performance in text classification in general
and in similar applications of semantic polarity detection in particular. A feature
selection wrapper has been used so that only a limited number of the most predictive
features are selected, to avoid generating an overly complex model that overfits
the training data.
Research Method:
To obtain a baseline, SVM with a feature selection wrapper, using a ranker to
select 50 features, was run in TagHelper Tools. All rare features were removed
with a threshold of 2, and stemming was turned off, since stemming could remove
meaningful information about the semantic orientation of words by reducing them to
their roots. The results were as follows:
Table 1: Baseline Performance for text only
Algorithm | Performance - training (kappa) | Performance - test (kappa)
SVM with Attribute Selection wrapper, ranker used to select top 50 features | 0.1893 | 0.3188
A separate study was performed to assess the predictive power of the review ratings
using the rule learners OneR and JRIP and pruned J48 decision trees. All the
algorithms gave comparable performance and found a higher rating to be strongly
associated with usefulness. The results are summarized below:
Table 2: Performance of review rating as a predictor of helpfulness
Algorithm | Model Generated | Performance (kappa)
J48 | rating 1: TRUE |
JRIP | rating helpful=FALSE; => helpful=TRUE |
OneR | FALSE; >= 1.5 -> TRUE | 0.5397
The inspiration to use polarity as a usefulness predictor was derived from the above
results. The initial approach was to add user-defined features in TagHelper Tools
comprising positive adjectives that occur more than 4 times in the training dataset.
It should be noted that while a high rating was a strong predictor of helpfulness,
a low rating was not as strongly correlated with non-helpfulness. Initially,
therefore, the focus was on using positive features only to predict helpfulness.
Though this approach gave a higher kappa on the training set, it did not generalize
well to the test data. The selected words were found to be highly predictive,
appearing at the top of the feature list. On the test set, however, they gave a result
biased towards “helpful”, as was evident from the confusion matrix.
No changes were made to the baseline algorithm in any experiment, so as to isolate
the effect of the added features. SVM was chosen for its applicability to text
classification problems and its proven effectiveness in problems dealing with
semantic orientation [1, 2].
The results are summarized below:
Table 3 - Iteration 1
Approach | Features added | Performance (kappa)
Adding positive adjectives as “helpfulness” indicators | U:TRUE = ANY(best, well, excellent, amazing, wonderful, enjoy, great, feel, interest, well_written) | Training - 0.2072; Test - 0.2643
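The U:TRUE = ANY(...) entry in Table 3 can be read as a single binary feature that fires when a review contains any of the listed unigrams or bigrams. The Python sketch below illustrates that reading. The tokenizer, the underscore convention for joining bigrams, and the names `ngram_tokens` and `any_feature` are assumptions made for illustration; they are not taken from TagHelper Tools.

```python
import re

# Word list copied from the Iteration 1 row of Table 3.
POSITIVE_WORDS = {
    "best", "well", "excellent", "amazing", "wonderful",
    "enjoy", "great", "feel", "interest", "well_written",
}

def ngram_tokens(text):
    """Return the set of unigrams and underscore-joined bigrams in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    bigrams = {f"{a}_{b}" for a, b in zip(words, words[1:])}
    return set(words) | bigrams

def any_feature(text, vocabulary):
    """True if the review contains at least one entry of the vocabulary."""
    return not ngram_tokens(text).isdisjoint(vocabulary)
```

No stemming is applied, in line with the baseline setup, so an inflected form such as “enjoyed” would not match “enjoy” in this sketch.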
Some negative features were also added to prevent the model from becoming too
biased towards the positive class, but no improvement was observed on the test set,
as the negative features had lower predictive power than the positive ones. The
improved performance on the training set can be attributed to over-fitting caused by
the additional features.
Table 3 (continued) - Positive and negative
Approach | Features added | Performance (kappa)
Adding positive adjectives as “helpfulness” indicators and negative ones as “non helpfulness” indicators | U:TRUE = ANY(best, well, excellent, amazing, wonderful, enjoy, great, feel, interest, well_written); U:FALSE = ANY(horrible, suffer, waste, smug, no_empathy, pander, distasteful, fluff, didn't_flow, wasn't_real, hackneyed, rant, dud, slam, disappoint, error) | Training - 0.338; Test - 0.2643
The next modification was to detect the “strength” of the positive adjectives.
For instance, the word “excellent” has a stronger positive connotation than the word
“good”. It was hypothesized that the presence of “stronger” positive words is a better
indicator of the review itself being positive. This experiment did not yield a good
result, first because there was no clear definition of a strong adjective and second
because there were too few examples of such “strong” adjectives.
A more concrete measure of exclusiveness was therefore needed. A different
approach was taken: the frequency of occurrence of the selected words in positive
and in negative reviews was measured, and the two were compared. The share of a
word's appearances that fall in positive reviews, out of all its appearances, was used
as the selection criterion. The rationale is that words which appear very
frequently in positive reviews and very rarely in negative ones are likely to be good
predictors of positivity and hence, according to the hypothesis of this study, of
helpfulness.
Words occurring very rarely (< 3 times) in the entire dataset were not used.
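The ratio described above can be sketched in Python as follows. This is an illustration under assumptions, not the procedure actually used: each review is assumed to arrive as a pair of tokens and a positive/negative flag, a word is counted once per review, and the function name `positive_ratios` is hypothetical. How a review is judged positive is left to the caller.

```python
from collections import Counter

MIN_TOTAL = 3  # words seen fewer than 3 times in the dataset are skipped

def positive_ratios(reviews, candidates):
    """Compute, per candidate word, the share of its appearances in positive reviews.

    reviews: iterable of (tokens, is_positive) pairs.
    candidates: set of words or underscore-joined bigrams to score.
    """
    positive = Counter()
    total = Counter()
    for tokens, is_positive in reviews:
        # Count each word at most once per review (an assumption of this sketch).
        for word in set(tokens) & candidates:
            total[word] += 1
            if is_positive:
                positive[word] += 1
    return {w: positive[w] / total[w] for w in total if total[w] >= MIN_TOTAL}
```

For example, a word that appears in two positive reviews and one negative review gets a ratio of 2/3, while a word seen in only two reviews overall is dropped by the rarity cut-off.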
This approach yielded a more promising result for two reasons.
1) The positive bias could be reduced by using a threshold on the ratio of occurrence
in positive documents. This ensured that only words that “represented” positivity
well were selected.
2) Over-fitting was also reduced by limiting the number of positive features.
This was indicated by a lower training performance and a higher test
performance.
Table 4: List of positive words and bigrams with ratio of appearance in
positive reviews
Word         | Ratio = positive appearance/total appearance
wonderful    | 14/16 = 87.5%
excellent    | 9/10 = 90%
perfect      | 9/12 = 75%
amazing      | 4/5 = 80%
unique       | 3/3 = 100%
deep         | 8/9 = 89%
must_read    | 3/3 = 100%
recommend    | 24/27 = 89%
fortunate    | 7/9 = 78%
well_written | 4/4 = 100%
worth        | 13/17 = 76%
masterpiece  | 3/3 = 100%
insight      | 13/13 = 100%
examples     | 5/5 = 100%
correct      | 5/6 = 83%
interesting  | 13/18 = 72%
loved        | 6/7 = 86%
entertain    | 6/7 = 86%
Table 4 (continued): List of negative words with ratio of appearance in
negative reviews
Word    | Ratio = negative appearance/total appearance
tedious | 3/3 = 100%
boring  | 8/10 = 80%
disgust | 4/5 = 80%
Different threshold values were tried for the ratio. For this, a validation set was
created out of the dataset to compare the performance of different threshold ratios,
and the threshold with the best performance was then used to build the model over
the entire dataset. This was done to avoid tuning the threshold ratio on the
test data and to ensure that the test data remained truly “unseen”. The optimum
threshold value was found to be 89%. Though the exact value of the threshold is not
very reliable, given the limited number of words studied, performance increased
steadily as the selection of words became stricter. The
threshold used for negative features was kept lower to counteract the positive
bias, and also because negativity was not found to be a strong predictor of “non
helpfulness”, so having too many of those features was not found useful.
In addition, a strict threshold was not always applied to negative features. Instead,
different subsets were tried and their performance compared, since negative words
appeared infrequently overall and their ratios could therefore not be trusted as much.
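The thresholding step for positive features can be sketched in Python using the counts from Table 4. This is an illustration under one stated assumption: the ratio is rounded to a whole percentage before being compared with the threshold, because 8/9 and 24/27 (about 88.9%) are reported as 89% in Table 4. The names `TABLE_4` and `select_features` are hypothetical.

```python
# (positive appearances, total appearances), copied from Table 4.
TABLE_4 = {
    "wonderful": (14, 16), "excellent": (9, 10), "perfect": (9, 12),
    "amazing": (4, 5), "unique": (3, 3), "deep": (8, 9),
    "must_read": (3, 3), "recommend": (24, 27), "fortunate": (7, 9),
    "well_written": (4, 4), "worth": (13, 17), "masterpiece": (3, 3),
    "insight": (13, 13), "examples": (5, 5), "correct": (5, 6),
    "interesting": (13, 18), "loved": (6, 7), "entertain": (6, 7),
}

def select_features(counts, threshold_pct):
    """Keep words whose positive share, rounded to a whole percent, meets the threshold."""
    return sorted(
        word
        for word, (positive, total) in counts.items()
        if round(100 * positive / total) >= threshold_pct
    )

selected = select_features(TABLE_4, 89)
print(selected)
```

At a threshold of 89 this selects deep, examples, excellent, insight, masterpiece, must_read, recommend, unique and well_written. The word “recommended”, which appears in the 89% feature lists of this paper, has no entry in Table 4 and so is not reproduced by the sketch.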
The comparison of performances over different thresholds is as shown below.
Table 5: Performance Comparison for different ratio thresholds of positive
features
Features used | Threshold | Performance (kappa)
U:TRUE = ANY(wonderful, excellent, perfect, amazing, unique, deep, must_read, fortunate, well_written, worth, masterpiece, insight); U:FALSE = ANY(boring, tedious) | 85% | 0.36
U:TRUE = ANY(excellent, unique, deep, must_read, recommended, recommend, well_written, masterpiece, insight, examples); U:FALSE = ANY(boring, tedious) | 89% with more negative features | 0.28
U:TRUE = ANY(excellent, unique, deep, must_read, recommended, recommend, well_written, masterpiece, insight, examples); U:FALSE = ANY(boring, tedious) | 89% | 0.4
The validation set was also used to compare performance when selecting 50 and
100 features, using the optimum threshold of user-defined features (row 3 of Table 5).
It was decided not to go beyond 100 features to keep the algorithm from becoming
very slow. The performances were:
Kappa = 0.40 for N = 50
Kappa = 0.36 for N = 100
Thus N = 50 performed better and was also more efficient. This indicates
that features below rank 50 did not contribute significantly to predicting usefulness.
Final Results:
The final results are summarized below:
Table 6: Final Results
Algorithm | User Defined Features | Performance over test set (kappa)
SVM with attribute selection classifier using top 50 attributes | U:TRUE = ANY(excellent, unique, deep, must_read, recommended, recommend, well_written, masterpiece, insight, examples); U:FALSE = ANY(boring, tedious) | 0.405
Thus an improvement over the baseline kappa value of 0.3188 was observed. The
result, however, was not found to be significant by a paired t-test at the 5%
significance level.
Confusion Matrix:
a b <-- classified as
18 11 | a = false
11 40 | b = true
The confusion matrix shows the same number of errors (11) in each direction, so the
model is not biased either way.
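As a consistency check, the kappa statistic can be recomputed from the confusion matrix above using the standard definition of Cohen's kappa. The Python sketch below is illustrative; the function name `cohen_kappa` is an assumption, and the matrix is read with rows as actual classes and columns as predicted classes.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Expected agreement by chance, from the row and column totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)

confusion = [[18, 11],   # actual false
             [11, 40]]   # actual true
kappa = cohen_kappa(confusion)
print(round(kappa, 3))  # 0.405
```

The observed agreement is 58/80 = 0.725 and the chance agreement is 3442/6400, which gives a kappa of about 0.405, matching the value reported in Table 6.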
Error Analysis:
The primary source of error was found to be the inability of individual words and
pairs of words to describe sentiment completely. Positive words are found in negative
examples, used in a different sense. For example, a negative review describes a book
as “not so insightful”, which cannot be captured by either unigrams or bigrams. In
some other cases, books are described as positive in only some limited respect.
However, using a measure of exclusiveness helps minimize such errors.
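The “not so insightful” example can be made concrete with a short Python sketch. The whitespace tokenizer and the function name `ngrams` are assumptions made for illustration only.

```python
def ngrams(text, n):
    """Return the underscore-joined n-grams of a whitespace-tokenized text."""
    words = text.lower().split()
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]

phrase = "not so insightful"
unigrams = ngrams(phrase, 1)  # ['not', 'so', 'insightful']
bigrams = ngrams(phrase, 2)   # ['not_so', 'so_insightful']
trigrams = ngrams(phrase, 3)  # ['not_so_insightful']
```

No unigram or bigram contains both the negation and the positive word, so a feature that fires on “insightful” or “so_insightful” treats the phrase as positive. Only the trigram links “not” to “insightful”.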
The other sources of error were over-fitting and positive bias (in the initial
iterations). From a look at the training data, it was apparent that positivity is a
good predictor of helpfulness but negativity does not have comparable predictive
power. To capture this in the user-defined features, more positive features were used
than negative ones, which led to a positive bias. Over-fitting occurred whenever too
many non-exclusive positive features, which appeared regularly in both
positive and negative reviews, were added.
Discussion:
Though the results obtained were not found to be significant, this may be due to
inadequate capture of the positive features that indicate the “usefulness” of
reviews, rather than to a flaw in the approach of correlating positivity with
usefulness. More advanced linguistic tools for classifying positive and negative
reviews could prove effective as usefulness predictors too.
Since a lot of research has already been done on sentiment classification, it would be
beneficial if it could be extended effectively to usefulness prediction as well.
However, there are presumably several other attributes, unrelated to the polarity of
book reviews, that contribute to perceived usefulness and that have not been dealt
with in this paper.
Conclusion:
The experiments indicate the possibility of a correlation between the semantic
orientation of book reviews and their perceived usefulness. Though no statistically
significant results were obtained, the approach did improve performance over the
baseline, and this could be improved further by using more sophisticated models.
References:
1) Anindya Ghose and Panagiotis G. Ipeirotis. Designing Novel Review Ranking
Systems: Predicting the Usefulness and Impact of Reviews. ICEC'07.
2) Alistair Kennedy and Diana Inkpen. Sentiment Classification of Movie and Product
Reviews Using Contextual Valence Shifters. FINEXIN 2005.
3) Jin-Cheon Na, Haiyang Sui, Christopher Khoo, Syin Chan and Yunyun Zhou.
Effectiveness of Simple Linguistic Processing in Automatic Sentiment Classification of
Product Reviews. ISKO Conference (pp. 49-54). Wurzburg, Germany.