Diya Final Project Applied machine learning

The Effect of Review Polarity on Usefulness



Diya Gangopadhyay



Abstract:

This paper uses machine learning to predict whether a book review will be perceived as useful by users. It uses a dataset of book reviews from Amazon and focuses on the relationship between the positive or negative sentiment expressed in a review and its usefulness rating. The algorithm used is SMO (an SVM training algorithm) with an attribute selection wrapper. User-defined unigram and bigram features that represent positivity and negativity in reviews were added to the feature space to boost performance. The ranking mechanism of the attribute selector found the user-defined feature to be the most predictive, and adding these features produced a small improvement in performance (kappa value). This indicates a possible correlation between the polarity and the perceived usefulness of a book review, which can be verified in the future through more advanced experimentation.





Introduction:

There has been considerable research on sentiment mining of product reviews, using text classification algorithms to classify reviews as positive or negative. Predicting the usefulness of a review from its text has also been studied in several research papers, for instance in [1], where subjectivity is used as a measure. Usefulness prediction has several applications, one being a more efficient interface that displays reviews in decreasing order of predicted usefulness. This would help users get relevant information about a product they are interested in without having to browse through too many reviews. From the perspective of the product manufacturer, studying the reviews predicted to be most useful can help in projecting general public opinion on the product at a relatively early stage.

The focus of this study is the relationship between the positive or negative sentiment of a review and its helpfulness rating by users. If such a relationship can be established, it can give insight into patterns of user behavior. For instance, if positive reviews are generally perceived to be more useful, it indicates that users typically look for recommendations rather than warnings against specific products (books in this case) that they have in mind. It is worth noting that positive and negative reviews help users in different ways. While a positive review can be useful by convincing the user to buy any book, a negative review is useful only if it recommends against a book the user was already considering buying. Thus a negative review intuitively has a lower chance of being useful, since it has to be about a book that users are likely to have been considering beforehand.

The dataset consists of around 200 book reviews from Amazon, each with a rating on a scale of 1 to 5, the review text, and the number of people who found the review helpful out of all those who rated it. For classification, only those reviews were considered that had been rated by more than 5 users and had a majority above 60% in favor of either “helpful” or “not helpful”. This filtering ensured that no marginally helpful or marginally unhelpful review was included in the data, since such a label could arise merely by chance and the review might therefore not contain enough features to indicate “helpfulness”.
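
The filtering rule above can be sketched in Python as follows. This is a minimal illustration only: the function name, the argument names and the sample vote counts are assumptions made for the example and are not taken from the dataset or tools used in this study.

```python
from typing import Optional

MIN_VOTES = 5          # a review must have MORE than this many votes
MAJORITY_PERCENT = 60  # the winning side must hold MORE than this share


def helpfulness_label(helpful_votes: int, total_votes: int) -> Optional[bool]:
    """Return True (helpful), False (not helpful), or None (review is dropped)."""
    if total_votes <= MIN_VOTES:
        return None
    unhelpful_votes = total_votes - helpful_votes
    # Integer arithmetic avoids floating-point edge cases at exactly 60%.
    if helpful_votes * 100 > MAJORITY_PERCENT * total_votes:
        return True
    if unhelpful_votes * 100 > MAJORITY_PERCENT * total_votes:
        return False
    return None


# Hypothetical (helpful_votes, total_votes) pairs, not taken from the paper's data.
votes = [(8, 10), (1, 10), (5, 10), (3, 4), (6, 10)]
labels = [helpfulness_label(h, t) for h, t in votes]
print(labels)
```

Reviews labelled None (too few votes, or no side above 60%) would be left out of both the training and the test data.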

The idea of a possible correlation between review polarity and usefulness arose from observing a high correlation between ratings of 4 and 5 and usefulness (helpfulness). This led to the hypothesis that positivity could be a predictor of a review being useful.





Polarity detection has been studied at length in several research articles. For example, [2] uses Semantic Orientation - Pointwise Mutual Information (SO-PMI) scores of words to measure their positive or negative orientation. To obtain this value, the difference between the PMI scores of the given word with certain query words (representing positive and negative polarity) is computed. The idea is thus to check for similarity with standard representatives of positive or negative words.
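
As a rough illustration of this idea, the sketch below computes a semantic-orientation score from raw co-occurrence counts. It assumes a single positive and a single negative query word, base-2 logarithms and non-zero counts; the function names and the numbers are hypothetical and are not taken from [2].

```python
import math


def pmi(joint_count: int, count_a: int, count_b: int, total: int) -> float:
    """Pointwise mutual information (base 2) estimated from raw counts."""
    return math.log2((joint_count * total) / (count_a * count_b))


def so_pmi(word_count, pos_count, neg_count, joint_pos, joint_neg, total):
    """PMI with the positive query word minus PMI with the negative one.

    Zero counts are not handled here; a real implementation would need smoothing.
    """
    return (pmi(joint_pos, word_count, pos_count, total)
            - pmi(joint_neg, word_count, neg_count, total))


# Hypothetical counts: the word co-occurs with the positive query word
# four times as often as with the negative one.
score = so_pmi(word_count=50, pos_count=100, neg_count=100,
               joint_pos=20, joint_neg=5, total=1000)
print(score)
```

A score above zero suggests a positive orientation and a score below zero a negative one.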





In [3], an SVM is used to classify text by the sentiment expressed, using features such as unigrams, selected words and part-of-speech (POS) tags. TF-IDF weighting is applied to take into account the rarity of these features.





In this study, the “exclusiveness” of positive and negative words has been measured statistically by counting their frequency of appearance in positive versus negative reviews. The assumption is that a word with a positive meaning that appears exclusively in positive reviews is more likely to indicate the positivity of a review it appears in, and hence its helpfulness. This is similar to the idea of rarity that TF-IDF measures, but rather than measuring how exclusive a word is to a particular review, it measures how specific the word is to positive reviews in general.

Thus the approach used is to measure helpfulness indirectly through semantic orientation.





SVM has been used because of its good performance in text classification in general and in similar applications of semantic polarity detection in particular. A feature selection wrapper has been used so that only a limited number of the most predictive features are selected, to avoid generating an overly complex model that overfits the training data.





Research Method:

To get a baseline performance, SVM with a feature selection wrapper, using a ranker to select 50 features, was run in TagHelper Tools. All rare features were removed with a threshold of 2, and stemming was turned off since stemming could remove meaningful information about the semantic orientation of words by reducing them to their roots. The results were as follows:





Table 1: Baseline performance for text only

Algorithm                                  Performance - training (kappa)   Performance - test (kappa)
SVM with Attribute Selection wrapper,      0.1893                           0.3188
ranker used to select top 50 features







A separate study was performed to assess the predictive power of the review ratings using rule learners (OneR and JRIP) and a pruned decision tree (J48). All three algorithms gave comparable performance and found a higher rating to be strongly associated with usefulness. The results are summarized below:

Table 2: Performance of review rating as a predictor of helpfulness

Algorithm   Model generated             Performance (kappa)
J48         rating 1: TRUE
JRIP        rating helpful=FALSE
            => helpful=TRUE
OneR        FALSE                       0.5397
            >= 1.5 -> TRUE







The inspiration to use polarity as a usefulness predictor was derived from the above results. The initial approach was to add user-defined features in TagHelper Tools comprising positive adjectives that occur multiple (>4) times in the training dataset. Note that while a high rating was a strong predictor of helpfulness, a low rating was not as strongly correlated with non-helpfulness. Thus, initially, the focus was on using positive features only to predict helpfulness.

Though this approach gave a higher kappa on the training set, it did not generalize well to the test data. The selected words were found to be highly predictive, appearing at the top of the ranked feature list, but on the test set they gave a result biased towards “helpful”, as was evident from the confusion matrix.
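
A user-defined feature of the form U:TRUE = ANY(...) can be thought of as a single binary attribute that fires when at least one listed term occurs in the review. The sketch below illustrates that idea in Python. It is not the TagHelper Tools implementation: the whitespace tokenizer, the underscore-joined bigrams and the function names are assumptions made for the example, while the term list is the one reported for the first iteration in Table 3.

```python
from typing import Iterable, List, Set

# Positive terms from the first iteration (Table 3).
POSITIVE_TERMS = {"best", "well", "excellent", "amazing", "wonderful", "enjoy",
                  "great", "feel", "interest", "well_written"}


def unigrams_and_bigrams(tokens: List[str]) -> Set[str]:
    """Collect single tokens plus adjacent pairs joined by an underscore."""
    bigrams = {f"{a}_{b}" for a, b in zip(tokens, tokens[1:])}
    return set(tokens) | bigrams


def any_feature(text: str, terms: Iterable[str]) -> bool:
    """True if at least one listed term occurs in the text (ANY semantics)."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    return not unigrams_and_bigrams(tokens).isdisjoint(terms)


print(any_feature("A well written and moving story", POSITIVE_TERMS))
print(any_feature("The plot was dull", POSITIVE_TERMS))
```

Because the attribute is true whenever any one term matches, adding more terms makes it fire on more reviews, which is one way a long positive list can push predictions towards “helpful”.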

No changes were made to the baseline algorithm in any experiment, so as to isolate the effect of the added features. SVM was chosen for its applicability to text classification problems and its proven effectiveness in problems dealing with semantic orientation [1, 2].

The results are summarized below:

Table 3 - Iteration 1

Approach                       Features added                          Performance (kappa)
Adding positive adjectives     U:TRUE = ANY(best, well, excellent,     Training - 0.2072
as “helpfulness” indicators    amazing, wonderful, enjoy, great,       Test - 0.2643
                               feel, interest, well_written)







Some negative features were also added to prevent the model from becoming too biased toward the positive class, but no improvement was observed on the test set because the negative features had lower predictive power than the positive ones. The improved performance on the training set can be attributed to overfitting caused by the additional features.

Table 3 - Positive and negative

Approach                       Features added                              Performance (kappa)
Adding positive adjectives     U:TRUE = ANY(best, well, excellent,         Training - 0.338
as “helpfulness” indicators    amazing, wonderful, enjoy, great, feel,     Test - 0.2643
and negative ones as           interest, well_written)
“non helpfulness” indicators   U:FALSE = ANY(horrible, suffer, waste,
                               smug, no_empathy, pander, distasteful,
                               fluff, didn't_flow, wasn't_real,
                               hackneyed, rant, dud, slam, disappoint,
                               error)







The next modification was to detect the “strength” of the positive adjectives. For instance, the word “excellent” has a stronger positive connotation than the word “good”. It was hypothesized that the presence of “stronger” positive words is a better indicator of the review itself being positive. This experiment did not yield good results, firstly because of the lack of a clear definition of a strong adjective and secondly because of the lack of sufficient examples of such “strong” adjectives.

A more concrete measure of exclusiveness was needed. A different approach was therefore taken: the frequency of occurrence of the selected words in positive and in negative reviews was measured, and the ratio of appearances in positive reviews to all appearances of the word was used as the selection criterion. The rationale is that words which appear very frequently in positive reviews and very rarely in negative ones are likely to be good predictors of positivity and hence, according to the hypothesis of this study, of helpfulness.

Words occurring very rarely (< 3 times) in the entire dataset were not used.

This approach yielded a more promising result for two reasons.

1) The positive bias could be reduced by using a threshold for the ratio of occurrence

in positive documents. It ensured that only those words that “represented” positivity

well were selected.

2) Over-fitting was also reduced by deciding not to use too many positive features.

This was indicated by a reduced training performance and an enhanced test

performance.
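
The exclusiveness ratio described above can be sketched as follows. This is only an illustration under stated assumptions: every token occurrence is counted as one appearance, each review arrives already tokenized with a flag saying whether it is a positive review, and the toy corpus and function name are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

MIN_TOTAL_COUNT = 3  # words seen fewer than 3 times are ignored, as in the text


def positive_ratios(reviews: List[Tuple[List[str], bool]]) -> Dict[str, float]:
    """Map each word to (appearances in positive reviews) / (all appearances)."""
    positive_counts: Counter = Counter()
    total_counts: Counter = Counter()
    for tokens, is_positive in reviews:
        total_counts.update(tokens)
        if is_positive:
            positive_counts.update(tokens)
    return {word: positive_counts[word] / total
            for word, total in total_counts.items()
            if total >= MIN_TOTAL_COUNT}


# Toy corpus (hypothetical): tokenized review text paired with a positivity flag.
reviews = [
    (["excellent", "plot", "excellent", "prose"], True),
    (["excellent", "plot"], True),
    (["boring", "plot"], False),
    (["boring", "boring", "excellent"], False),
]
ratios = positive_ratios(reviews)
print(ratios)
```

In this toy corpus “prose” is dropped because it appears only once, while “excellent” appears in positive reviews in 3 of its 4 occurrences.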





Table 4: List of positive words and bigrams with ratio of appearance in positive reviews

Word            Ratio = positive appearances / total appearances
wonderful       14/16 = 87.5%
excellent       9/10 = 90%
perfect         9/12 = 75%
amazing         4/5 = 80%
unique          3/3 = 100%
deep            8/9 = 89%
must_read       3/3 = 100%
recommend       24/27 = 89%
fortunate       7/9 = 78%
well_written    4/4 = 100%
worth           13/17 = 76%
masterpiece     3/3 = 100%
insight         13/13 = 100%
examples        5/5 = 100%
correct         5/6 = 83%
interesting     13/18 = 72%
loved           6/7 = 86%
entertain       6/7 = 86%









Table 5: List of negative words and bigrams with ratio of appearance in negative reviews

Word            Ratio = negative appearances / total appearances
tedious         3/3 = 100%
boring          8/10 = 80%
disgust         4/5 = 80%

Different threshold values were tried for the ratio. For this, a validation set was created out of the dataset to compare the performance of different threshold ratios, and the threshold with the best performance was then chosen to build the model over the entire dataset. This was done to avoid “training” the optimal threshold ratio on the test data and to ensure that the test data remained truly “unseen”. The optimum threshold value was found to be 89%. Though the exact value of the threshold is not very reliable because of the limited number of words studied, performance was seen to increase steadily as the selection of words became more selective. The threshold used for negative features was kept lower to counteract the positive bias, and also because negativity was not found to be a strong predictor of “non-helpfulness”, so having too many of those features was not found to be useful.

Also, a strict threshold was not always maintained for negative features; instead, different subsets were tried to compare performance, since their frequency of appearance was low overall and the ratio could therefore not be trusted.
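
The thresholding step can be illustrated with the counts listed in Table 4. The sketch below is an assumption-laden illustration, not the procedure actually run in TagHelper Tools: it compares the rounded percentage with the threshold (since Table 4 reports rounded values), writes the bigram “well written” with an underscore as in the feature lists, and picks the better threshold from the two plain-threshold kappa values reported in the comparison table that follows.

```python
# (appearances in positive reviews, total appearances), copied from Table 4.
TABLE_4 = {
    "wonderful": (14, 16), "excellent": (9, 10), "perfect": (9, 12),
    "amazing": (4, 5), "unique": (3, 3), "deep": (8, 9),
    "must_read": (3, 3), "recommend": (24, 27), "fortunate": (7, 9),
    "well_written": (4, 4), "worth": (13, 17), "masterpiece": (3, 3),
    "insight": (13, 13), "examples": (5, 5), "correct": (5, 6),
    "interesting": (13, 18), "loved": (6, 7), "entertain": (6, 7),
}


def select_terms(counts, threshold_percent):
    """Keep terms whose rounded positive-appearance percentage meets the threshold."""
    return [term for term, (positive, total) in counts.items()
            if round(100 * positive / total) >= threshold_percent]


selected = select_terms(TABLE_4, 89)
print(selected)

# Kappa values reported for the 85% and 89% thresholds; the threshold is
# chosen on these validation scores, never on the test set.
VALIDATION_KAPPA = {85: 0.36, 89: 0.40}
best_threshold = max(VALIDATION_KAPPA, key=VALIDATION_KAPPA.get)
print(best_threshold)
```

Under these assumptions the selection at 89% keeps excellent, unique, deep, must_read, recommend, well_written, masterpiece, insight and examples, which matches the 89% feature list apart from “recommended”, a term that Table 4 does not list.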

The comparison of performances over different thresholds is as shown below.





Table 5: Performance comparison for different ratio thresholds of positive features

Features used                                    Threshold             Performance (kappa)
U:TRUE = ANY(wonderful, excellent, perfect,      85%                   0.36
amazing, unique, deep, must_read, fortunate,
well_written, worth, masterpiece, insight)
U:FALSE = ANY(boring, tedious)

U:TRUE = ANY(excellent, unique, deep,            89% with more         0.28
must_read, recommended, recommend,               negative features
well_written, masterpiece, insight, examples)
U:FALSE = ANY(boring, tedious)

U:TRUE = ANY(excellent, unique, deep,            89%                   0.4
must_read, recommended, recommend,
well_written, masterpiece, insight, examples)
U:FALSE = ANY(boring, tedious)







The validation set was also used to compare the performance of selecting 50 and 100 features with the optimum threshold for the user-defined features (row 3 of the performance comparison in Table 5). It was decided not to go beyond 100 features to keep the algorithm from becoming very slow.

The performances were:

Kappa = 0.40 for N = 50

Kappa = 0.36 for N = 100

Thus N = 50 was seen to perform better and to run more efficiently. This indicates that features ranked below 50 did not contribute significantly to predicting usefulness.









Final Results:

The final results are summarized below:





Table 6: Final results

Algorithm                       User-defined features                    Performance over test set (kappa)
SVM with attribute selection    U:TRUE = ANY(excellent, unique, deep,    0.405
classifier using top 50         must_read, recommended, recommend,
attributes                      well_written, masterpiece, insight,
                                examples)
                                U:FALSE = ANY(boring, tedious)







Thus an improvement over the baseline test kappa of 0.3188 was observed. The improvement, however, was not found to be significant by a paired t-test at the 5% significance level.

Confusion Matrix:

a b <-- classified as

18 11 | a = false

11 40 | b = true

It can be seen from the confusion matrix that the model is not biased either way: 11 instances are misclassified in each direction.
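
As a consistency check, the kappa statistic can be recomputed directly from the confusion matrix above. The sketch below is a plain-Python illustration of Cohen's kappa (the function name is an assumption); rows are taken as the actual class and columns as the class assigned by the model, following the layout of the matrix.

```python
from typing import List


def cohens_kappa(matrix: List[List[int]]) -> float:
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


confusion = [[18, 11],   # actual false: 18 classified false, 11 classified true
             [11, 40]]   # actual true:  11 classified false, 40 classified true
kappa = cohens_kappa(confusion)
print(round(kappa, 3))
```

With 58 of 80 instances classified correctly (72.5% observed agreement) and about 53.8% agreement expected by chance, the value comes to roughly 0.405, in line with the test-set kappa in Table 6.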





Error Analysis:

The primary source of error was found to be the inadequacy of individual words and pairs of words for describing sentiment completely. Positive words are found in negative examples in a different sense. For example, a negative review describes a book as “not so insightful”, which cannot be captured by either unigrams or bigrams. In some other cases, books have been described positively in some limited respect only. However, using a measure of exclusiveness helps minimize such errors.





The other sources of error were overfitting and positive bias (in the initial iterations). From a look at the training data, it was apparent that positivity is a good predictor of helpfulness but that negativity has weaker predictive power. To capture this in the user-defined features, more positive features than negative ones were used, leading to a positive bias. Overfitting occurred whenever too many non-exclusive positive features, which appeared regularly in both positive and negative reviews, were added.





Discussion:

Though the results obtained were not found to be significant, it is possible that this is due to inadequate capture of the positive features that could indicate the “usefulness” of reviews, rather than to a flaw in the approach of correlating positivity with usefulness and vice versa. More advanced linguistic tools for classifying positive and negative reviews could prove effective as usefulness predictors too.

Since a lot of research has already been done on sentiment classification, it would be beneficial if it could be extended effectively to usefulness prediction as well. However, there are presumably several more attributes, unrelated to the polarity of book reviews, that contribute to perceived usefulness and that have not been dealt with in this paper.





Conclusion:

The experiments indicate the possibility of a correlation between the semantic orientation of book reviews and their perceived usefulness. Though the results were not statistically significant, this approach did give improved performance, which could be improved further by using more sophisticated models.









References:

1) Anindya Ghose and Panagiotis G. Ipeirotis. Designing Novel Review Ranking Systems: Predicting the Usefulness and Impact of Reviews. ICEC'07.

2) Alistair Kennedy and Diana Inkpen. Sentiment Classification of Movie and Product Reviews Using Contextual Valence Shifters. FINEXIN 2005.

3) Jin-Cheon Na, Haiyang Sui, Christopher Khoo, Syin Chan, Yunyun Zhou. Effectiveness of Simple Linguistic Processing in Automatic Sentiment Classification of Product Reviews. ISKO Conference (pp. 49-54), Wurzburg, Germany.


