					                Issues in Creating a Sentiment Measurement Based on
                            Internet Stock Message Boards
                                        Ying (Clement) Zhang
                                  The University of Texas at Arlington


                                         Peggy E. Swanson*
                                  The University of Texas at Arlington

JEL Classification: G11; G12

Keywords: Internet Stock Message Boards, Sentiment Index, Text Classifiers, Artificial Intelligence


         Recent studies have attempted to answer the question whether sentiment extracted from stock
message boards has predictive value. A major obstacle in previous works when testing the relation
between sentiment indexes (measures of bullishness) and subsequent stock returns is how to best
construct a sentiment index. This paper focuses on two crucial aspects of index creation. The first relates
to how various messages should be evaluated, i.e., are reply messages as relevant as non-reply (original)
messages in constructing the index. The second issue relates to identifying the best text classification
algorithm to use when creating the sentiment index. Our results indicate that a reply message should be
given a 0.125 discounted weight when constructing the sentiment index and that the Maximum Entropy
text classifier is superior to the Naive Bayesian or Support Vector Machine, among others, for creating
the sentiment index from stock message boards.

*Corresponding Author
       Peggy E. Swanson
       John and Judy Goolsby Distinguished Professor
        and Professor of Finance
       Department of Finance and Real Estate
       UTA Box 19449
       The University of Texas at Arlington
       Arlington, TX 76019

        Phone: 817 272 3841
        FAX: 817 272 2252


                                                I.      Introduction

           Rapid growth in stock message boards, chat rooms, and other means for investors to

share market information makes clear the ever-increasing interest in internet stock trading.1 As

an example, a powerful and popular board tracker——tracks 50 million message postings on more than 25,000 message boards and related blog sites every day as of December 2006. In addition to the increasing number of sites, growth in the number of participants in these sites has exploded.

           Of specific interest to investors and to policymakers is the effect internet postings of

sentiment have on stock returns. The sentiments are closely watched by investors who seek input

to enhance their portfolio performance and by regulators who are concerned with stock price

manipulation and information leakage. The influence of aggregated sentiments from individual

investors has been shown to be far from just noise as evidenced by abnormal stock movements

especially around the 2000 internet bubble and by internet stock message board studies such as

those of Tumarkin and Whitelaw (2001), Antweiler and Frank (2004), Das and Sisk (2005), Das,

Martínez-Jerez and Tufano (2005), Das and Chen (2003, 2007) and Gu, Konana, Liu,

Rajagopalan and Ghosh (2006) among others. To determine whether the sentiment extracted

from online investors’ commentaries is associated with stock price movements, researchers often

create a sentiment index based on modern artificial intelligence techniques as a vehicle to test the

relationship between messages and returns. The sentiment index should be unbiased, accurate,

and sufficiently representative of individual investor consensus. However, due to problems in
1 There are many prevailing public bulletin boards among investors, such as Yahoo!Finance, RagingBull, Thelion!WallStreetPit, The Motley Fool, CBSMarketWatch, ClearStation, SiliconInvestor, etc. The most popular site based on the number of posts per day is Yahoo!Finance, followed by RagingBull. Thelion!WallStreetPit is becoming more and more popular among investors, reporting more than 1 million messages by the end of 2006, due to its organized platform, all-in-one message search engine and monitoring system that guards against account abuse. Chat rooms, such as Reedfloren and B4utrade, are also available for investors. However, unlike bulletin boards, chat rooms do not keep historical archives of messages, making retrieval of historical data impossible.

data handling (message selection) and methodology selection (text classifier selection), current

sentiment indexes are often problematic and lacking in representative power. Weaknesses of the

sentiment index directly affect the results of testing the relationship between investors’ sentiment

and stock price movements.

        This study proposes improvements over past studies in both data handling and methodology

selection for constructing a sentiment index with high accuracy and low bias to proxy for individual

investor sentiment. For data handling, we propose that reply messages have appreciably less influence on

the forum community than non-reply (original) messages, and that a discount weight should be given to a

reply message when constructing a sentiment index. We also suggest that an information imbalance

problem and inferior class assignment procedure could affect the sentiment index’s representative power.

For methodology selection, we conduct a thorough evaluation and comparison of eight different and most

popular text classification algorithms, with the result that the Maximum Entropy classifier is indicated to

be the superior algorithm for recognizing online messages that are in the format of conversational

discourse in spoken language.

        With the improvements proposed in this paper which have not been addressed in previous finance

literature, we enhance the ability to construct a quality proxy for individual investor sentiment indexes

thereby providing researchers a better vehicle for testing the relationship between individual investor

sentiment and stock price movements.

                                II.    Background and Literature Review

II.A. Does investor sentiment lead or lag security prices?

Sentiment, the feeling about or tone of a security (or the market), is reflected in the activity

and price movement of the security (market). Investors Intelligence reports the sentiments of investment

newsletter writers, Barron’s reports the sentiments of individual investors, and many other financial

service firms track both analysts’ and individual investor sentiments.

One group of researchers believes investor sentiment is an ex post phenomenon, arguing that

causality is from the security (market) movement to the sentiment change. Rising prices would indicate a
bullish market sentiment while a bearish market sentiment would be indicated by falling prices. In support,

Chen, Kan and Miller (1993) reject the claim that individual investor sentiment drives the returns of

stocks held and traded by individual investors. A similar result is found by Elton, Gruber and Busse

(1998). Chan and Fong (2004) report that the publication of individual investors’ sentiment in the column

of Individual Investors’ Sentiment Index (IISI) of Apple Daily in Hong Kong fails to predict the coming

week’s return on large, medium, or small stocks. Similarly, Brown and Cliff (2004), using survey data

and short-term stock market returns conclude that sentiment has little predictive power for short term

future stock returns, even though sentiment levels are correlated with contemporaneous market returns.

They suggest that past market returns strongly affect investors’ sentiment.

Another group of researchers rejects this argument by showing that investors’ sentiments do affect

the market and conclude that sentiment is an ex ante factor to the market. Sentiment should reflect

investors’ opinions of the future; otherwise, disclosing sentiment based on historical information has no

value to investors. Under this argument, when investors provide a sentiment, that sentiment is a leading


        Lee, Shleifer and Thaler (1991) propose that individual investor sentiment causes fluctuations in

the discount of closed-end funds. Individual investors’ pessimistic opinions cause discounts on closed-end

funds while optimistic sentiments narrow the discounts. By testing the relation between individual

investor sentiment and returns for a long period (1933 to 1993), Neal and Wheatley (1998) conclude that

two proxies for individual sentiment—discounts on closed-end funds and net mutual fund redemptions—

both have predictive power for the difference between small and large firm returns. Fisher and Statman

(2000) find the relationship between the sentiment of both small investors and large investors and future

S&P 500 returns to be significantly negative, concluding that market sentiment provides contrarian-

forecasting power for future S&P returns. Similar findings are reported by Kumar and Lee (2006), Brown

and Cliff (2004), Baker and Wurgler (2006) and Kaniel, Saar and Titman (2006). More recently, Brown and

Cliff (2005) further support the argument that investors’ sentiment affects asset valuation by showing that

overly optimistic (pessimistic) investors drive prices above (below) fundamental values and these pricing

errors tend to revert over a multi-year horizon.

II. B. Previous measures of investor sentiment

        The determination of whether investors’ sentiment is a lagging or leading factor depends upon the

specific sentiment measurement selected and its level of accuracy and power. Although there is no perfect

way to proxy broad investor sentiment, past literature has applied numerous proxies.

        An early study by Lee, Shleifer and Thaler (1991) suggested that closed-end fund discounts are a

proxy for individual investor sentiment. Other proxies proposed include those of Neal and Wheatley

(1998) who use level of discounts on closed-end funds, the ratio of odd-lot sales to purchases, and net

mutual fund redemptions; Kumar and Lee (2006) who measure the changes in retail sentiment by

calculating the buy-sell imbalance ratios; Kaniel, Saar and Titman (2004) who use NYSE’s Consolidated

Equity Audit Trail Data (CAUD) to construct a daily measure of investor sentiment; Brown and Cliff

(2005) who use survey data from Investors Intelligence (II), which tracks the number of market

newsletters that are bullish, bearish, or neutral to establish investors’ sentiment. A more exhaustive study

of sentiment is presented by Baker and Wurgler (2006) who consider six separate proxies as well as a

composite index for investor sentiment: average closed-end fund discount, NYSE share turnover, number

of IPOs (NIPO) and average first-day returns on IPOs (RIPO), equity share in new issues, and dividend

premium. This allows an examination of the robustness of the various measures. Fisher and Statman

(2000) utilize three different data sources to construct sentiments for large, medium and small investors

separately. For large investors’ sentiment, they use data from Merrill Lynch’s compilation of the

sentiment of Wall Street sell-side strategists. For medium investors, they use Investors Intelligence (II)

published by an investment service company, Chartcraft, and for small investors, measurement comes

from the American Association of Individual Investors (AAII) which conducts a sentiment survey among

its members.

II. C. Innovations in the measurement of investor sentiment.

            Even when using the same data source to construct a sentiment proxy, studies report inconsistent

results. For instance, based on closed-end fund discounts as a proxy for individual investor sentiment,

Chen, Kan and Miller (1993) disagree with Lee, Shleifer and Thaler (1991). Fisher and Statman (2000)

classify investors in three different groups and proxy the semiprofessional investors by using Investors

Intelligence (II) as their source while Brown and Cliff (2005) treat the same II as a proxy for general

investors’ opinions. Because there are diverse approaches and inconsistent results relating to how to

correctly measure investor sentiment, researchers continue to seek additional insight and new information

to identify more unbiased and efficient measures.

In recent studies, one prominent group of researchers has been using the sentiment posted by individual investors on online stock message boards to proxy for investor sentiment. Starting in the mid-1990s, many financial websites, such as Yahoo!Finance and RagingBull, set up online stock

message boards to allow people to share information and post opinions online. Moreover, starting in the

early 2000’s, most of these message boards provided a sentiment function where posters can explicitly

disclose their sentiment on a voluntary basis. For instance, Yahoo!Finance allows a poster to choose one

of the following sentiments: strong buy, buy, hold, sell, strong sell, and do not disclose (by default).

Similar functions can be found on RagingBull message boards where both short-term sentiment and long-

term sentiment options can be chosen. Moreover, in addition to the above sentiment choices,

Thelion!WallStreetPit offers Short and Scalp sentiment options.3 Unlike chat rooms or live conversation

forums, stock message boards store all the historical information of the messages and allow other users to

browse or retrieve the information later. 4 Since the historical content and sentiment of each post are

retrievable, one can calculate the average or aggregated sentiment from the posted messages during a

certain time period. This sentiment can be used as a proxy for individual investor sentiment. 5 The

3 The short sentiment means to short sell the stock, while the scalp sentiment means to buy or sell quickly for possible day-trade profit without disclosing a specific sentiment.
4 A small number of historical messages are deleted by the web administrator due to violation of posting policy.

5 While it is possible that brokerage firms, institutional investors or third parties hire people to disseminate sentiment on the forums in order to influence stock prices, this is unlikely and inefficient since it is easy for the SEC to detect these kinds of activities by using IP address techniques.
broadest measure of online investors’ sentiment would require all the available historical messages on the

internet. Such, of course, is not possible and selection must be limited by availability. Among all the

online stock message boards, a few dominate posting activities based on the number of messages posted

every day and number of users. Many studies include these dominant message boards while omitting

lesser active forums hoping the former provide adequate representativeness to derive a meaningful

sentiment index. Wysocki (1999) uses the messages exclusively from Yahoo!Finance after comparing the

posting activities between Yahoo and The Motley Fool; Tumarkin and Whitelaw (2001) collect messages

from RagingBull only; Das, Martínez-Jerez and Tufano (2005) use The Motley Fool; Antweiler and

Frank (2004) choose Yahoo!Finance and RagingBull as data sources; and Das and Chen (2007) focus on Yahoo!Finance. Instead of using only a single data source, Das and Sisk (2005) download data from four online message boards: Yahoo!Finance, The Motley Fool, RagingBull and Silicon Investor. Gu, Konana, Liu,

Rajagopalan and Ghosh (2006) select Yahoo!Finance only.

Online posters provide sentiment only voluntarily, meaning that not all messages include self-reported sentiment. Because the sentiment index should exhibit minimal bias, ideally it would include all

messages both with and without self-reported sentiment. Tumarkin and Whitelaw (2001) and Gu, Konana,

Liu, Rajagopalan and Ghosh (2006) use only self-reported sentiment messages and ignore all the

messages without self-reported sentiment, but Antweiler and Frank (2004) argue that omitting non-self-

reported-sentiment messages generates bias. They suggest including all messages when generating a

proxy for sentiment. However, messages without self-disclosed sentiment require interpretation of

sentiment based on reading the content. We know that human interpretation has bias and is subjective

since what we understand may not be consistent with the poster’s intent. Fortunately, for non-self-

reported-sentiment messages, mature artificial intelligence methods for text classification can aid in

determining the best and most unbiased sentiment based on linguistic structure and key words in the

content without human intervention. A text classifier utilizes its computational linguistic algorithm to

recognize and then memorize distinct characteristics among different classes, such as buy and sell, in the

training data. Training data is a pool of self-reported sentiment messages that are learned by the text

classification algorithm. Through this learning process, the classification algorithm builds a model that can

distinguish among classes. After the model is built, the text classifier is set as a classifier server and ready

to evaluate individual messages. After receiving an individual message from the client program, the

classifier server will compare the message’s characteristics with the model built from the training data

and return a probability score under each class for the message. For a detailed flowchart for using a text

classifier, see Das and Chen (2003).
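To make the training-then-scoring workflow concrete, the following minimal Python sketch implements a Naive Bayes classifier of the kind used by AF (2004): it learns word frequencies per sentiment class from a pool of self-reported-sentiment messages and returns a probability score under each class for a new message. The training messages and class labels are invented for illustration and are not drawn from any actual message board.

```python
import math
from collections import Counter, defaultdict

def train(messages):
    """Learn per-class word counts from (text, label) training pairs."""
    word_counts = defaultdict(Counter)   # class -> word frequencies
    class_counts = Counter()             # class -> number of training messages
    vocab = set()
    for text, label in messages:
        words = text.lower().split()
        word_counts[label].update(words)
        class_counts[label] += 1
        vocab.update(words)
    return word_counts, class_counts, vocab

def classify(text, word_counts, class_counts, vocab):
    """Return a Laplace-smoothed probability score for each class."""
    total = sum(class_counts.values())
    log_scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total)          # class prior
        denom = sum(word_counts[label].values()) + len(vocab)  # smoothing denominator
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / denom)
        log_scores[label] = score
    # convert log scores to normalized probabilities
    m = max(log_scores.values())
    exp_scores = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}

# Hypothetical training pool of self-reported-sentiment messages
training = [
    ("strong earnings going up buy", "buy"),
    ("great upside buy this dip", "buy"),
    ("dump it overvalued sell now", "sell"),
    ("bad news sell before it tanks", "sell"),
]
model = train(training)
probs = classify("buy the dip before earnings", *model)
```

The classifier server described in the text plays exactly the role of `classify` here: given a client message, it compares the message's word profile against the trained model and returns a probability under each class.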

        With the help of this new text retrieval technology and artificial intelligence methodology,

researchers in finance now have the ability to generate an improved proxy for individual investor

sentiment. The aim of this paper is to utilize these new abilities and to provide a foundation for

constructing an improved sentiment index from internet stock message boards. We suggest that

previously inconsistent and insignificant relationships between individual investor sentiment and

subsequent stock returns may be due to low representative power and measurement error of the sentiment

indices used.

II. D. Issues in Creating an Effective Sentiment Index

II.D.1: Reply messages differ from non-reply messages in the construction of a sentiment index.

Our sentiment index formula is an extension of Antweiler and Frank (2004) (AF hereafter), who

recognize that reply and non-reply messages indicate different levels of importance. They suggest giving

different weights to messages based on message characteristics such as message length, number of

messages posted by the same author, and number of citations (reply messages). However, they do not

differentiate reply from non-reply messages because the number of citations a particular message receives

is unknown at the time it is posted causing serious problems for their time sequencing analysis. However,

we propose a different approach. Rather than attempting to measure the reply posts that follow a specific

original post in a time sequence order, we simply identify all reply messages independent of the original

post to which they relate. Any further reply to an existing reply is uniformly treated as a reply. Thus, a

message is either a reply or non-reply. If reply and non-reply messages carry different levels of

importance and if we can generate a weight for reply messages to differentiate this heterogeneity, the

sentiment index should be more accurate and representative.

We propose a formula based on AF (2004)’s bullishness ratio R_t where we incorporate reply versus non-reply messages using sentiment types from Thelion!WallStreetPit. The ratio of bullish to bearish messages, R_t (see AF 2004, 1266-1267), becomes:

         R_t = [ Σ_{i=1}^{N^B_{D(t)}} (1/s_i) · (1 − ρ)^{γ_i} · x_i^B · ln(δ_i) ] / [ Σ_{j=1}^{N^S_{D(t)}} (1/s_j) · (1 − ρ)^{γ_j} · x_j^S · ln(δ_j) ]          (1)

where
         N^B_{D(t)} is the number of messages that have buy-side sentiment (such as buy and strong buy) during time period D(t);
         N^S_{D(t)} is the number of messages that have sell-side sentiment (such as sell and strong sell) during D(t);
         s_i (s_j) is the number of messages posted by the same author during D(t);
         x_i^B = 1 if the post has a buy sentiment and x_i^B = 2 if it has a strong buy sentiment;
         x_j^S = -1 if the post has a sell sentiment and x_j^S = -2 if it has a strong sell sentiment;
         ln(δ) is the natural logarithm of message length in number of characters;6
         ρ is the weight capturing the discount on reply messages;
         γ = 1 if the post is a reply and γ = 0 otherwise.
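As an illustration only, the computation of R_t can be sketched in Python as follows. The post fields and sample values are invented; the sell-side sentiment enters through its absolute value so the ratio is positive, and the per-author message count s is treated as an inverse weight. Both are reading choices made for this sketch, not necessarily the paper's exact specification.

```python
import math

def bullishness_ratio(posts, rho):
    """Compute R_t: weighted bullish score over weighted bearish score.

    Each post is a dict with keys (all hypothetical field names):
      x      : sentiment value (+1 buy, +2 strong buy, -1 sell, -2 strong sell)
      length : message length in characters (delta)
      reply  : True if the post is a reply (gamma = 1), else False
      s      : number of messages by the same author during D(t)
    """
    num = 0.0
    den = 0.0
    for p in posts:
        gamma = 1 if p["reply"] else 0
        # reply posts are discounted by (1 - rho); non-replies get full weight
        w = (1.0 / p["s"]) * (1.0 - rho) ** gamma * math.log(p["length"])
        if p["x"] > 0:
            num += w * p["x"]
        else:
            den += w * abs(p["x"])   # absolute value: a reading choice, see lead-in
    return num / den if den else float("inf")

# Hypothetical posts in one period D(t)
posts = [
    {"x": 2,  "length": 400, "reply": False, "s": 1},  # strong buy, original
    {"x": 1,  "length": 150, "reply": True,  "s": 2},  # buy, reply
    {"x": -1, "length": 300, "reply": False, "s": 1},  # sell, original
]
r_t = bullishness_ratio(posts, rho=0.875)  # reply weight 1 - rho = 0.125
```

With rho near 1, the reply post contributes almost nothing; setting rho = 0 restores AF-style equal treatment of replies and originals.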

Our focus is the computation of R_t which includes the measurement of ρ, the discount factor for

reply messages. ρ is expected to be less than 1 in value if reply messages under the same sentiment type

have less influence on the community than non-reply messages. Previous literature of financial stock

message boards has not differentiated between reply and non-reply messages assuming all the messages

(reply and non-reply) have the same construction pattern and importance. We are arguing that not

differentiating reply messages from non-reply messages by giving different weights to the two can

decrease a sentiment index’s representative power.

II. D.2. Identifying the best text classifier for internet message board messages

6 More content is associated with more information and value. A log function is used because cumulative information exhibits a concave relationship: the marginal increase in information is decreasing.

        In order to generate a highly representative sentiment index for a stock during a certain time

period D(t), we must include all the messages posted for that stock on the message board during that time.

However, since not all messages posted come with self-reported sentiment, text classifiers are essential in obtaining the best and most unbiased sentiment type for those messages without self-disclosed sentiment.

        There exist many popular text classifiers, such as Naive Bayesian (NB), Support Vector Machine

(SVM), K nearest neighbor (KNN), Term Frequency Inverse Document Frequency (TFIDF), Expectation

Maximization (EM), Kullback-Leibler divergence (KL), Probabilistic Indexing (PI), and Maximum

Entropy (ME), but they have not been tested in the context of financial message board analysis. AF (2004)

select the NB text classifier and compare the NB classifier results to those derived from the SVM

classifier and report similar findings for the two classifiers. Das and Chen (2007) create and test their own

five text classifiers and devise a voting system which selects the consensus result from at least three out of

five classifiers. In non-financial fields, different classifiers appear to be appropriate for different

circumstances. For instance, in recognizing tissue samples in the biomedical field, SVM is a superior classifier (Furey et al. (2000)), while in object recognition PI is a promising approach (Olson (1995)).

(Olson (1995)). Unlike research in other areas such as computer science, biology, physics and music,

where text classification methods have been thoroughly examined and evaluated, financial message board

studies are few and the accuracy and efficiency comparison of existing text classification algorithms

almost nonexistent. This study fills that literature gap by comparing and evaluating eight of the most

frequently used text classifiers and identifying the best text classifier for recognizing posters’ sentiment.

        NB is the most popular algorithm for text classifiers in financial message board studies, but it

relies upon a strong conditional independence assumption about the observed dataset. Such a severe

assumption often generates systemic problems that reduce the accuracy of the probability estimates

(Rennie, Shih, Teevan and Karger, (2003)). Recent research has found Maximum Entropy (ME), which

makes no conditional independence assumptions, to be a promising technique for text classification.

Ratnaparkhi (1998) reports that ME probability models for text categorization outperform the decision

tree when trained and tested under identical or similar conditions; Nigam, Lafferty and McCallum (1999)

find that ME is competitive with, and two out of three times better than, NB on experimental results; and

Raychaudhuri, Chang, Sutphin and Altman (2002) identify ME to be the most successful algorithm for

classifying biomedical articles. Similar results can be found in Macskassy, Hirsh, Banerjee and Dayanik

(2003) and Li, Wang, Chen, Tao and Hu (2005).
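To make the contrast with NB concrete: in the binary buy/sell case a maximum entropy classifier reduces to logistic regression over word-count features, with no conditional independence assumption. The sketch below, on invented example messages, trains such a model by simple gradient ascent on the log-likelihood.

```python
import math
from collections import Counter

def features(text):
    """Word-count feature vector for a message."""
    return Counter(text.lower().split())

def train_maxent(data, epochs=200, lr=0.1):
    """Fit p(buy | text) = sigmoid(sum_w weight[w] * count[w]) by gradient ascent."""
    weights = Counter()
    for _ in range(epochs):
        for text, label in data:          # label: 1 = buy, 0 = sell
            f = features(text)
            z = sum(weights[w] * c for w, c in f.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for w, c in f.items():        # gradient of the log-likelihood
                weights[w] += lr * (label - p) * c
    return weights

def prob_buy(text, weights):
    z = sum(weights[w] * c for w, c in features(text).items())
    return 1.0 / (1.0 + math.exp(-z))

# Invented training messages with self-reported sentiment
data = [
    ("going up strong buy", 1),
    ("buy the breakout", 1),
    ("sell this dog now", 0),
    ("overvalued sell it", 0),
]
w = train_maxent(data)
p = prob_buy("strong buy the breakout", w)
```

Unlike the NB sketch, nothing here assumes words occur independently given the class; the weights are fit jointly, which is the property the ME literature cited above credits for its performance on conversational text.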

         Importantly, we believe the messages posted on financial stock message boards exist in the form

of conversational discourse. Unlike articles in newspapers or reports from analysts, forum messages

posted by individual investors are style free, short, elliptical and in a dialogue-like format. Therefore,

forum messages are more appropriately treated as conversational discourse than as any other format, such

as an article or lecture. This assumption is consistent with Admati and Pfleiderer (2001) who study the

pattern of internet stock forums and treat the broadcasting messages process as communication among

people. Berger, Pietra and Pietra (1996), Choi, Cho and Seo (1999) and Pettibone and Pon-Barry (2003)

report that for the classification of conversational discourse relations, ME has consistently exhibited high

performance. 7 If forum messages exist in the form of fragmentary utterances which is a common

phenomenon in spoken language, ME should be a promising classifier with a high degree of

accuracy potential. Bender, Macherey, Och and Ney (2003) and Fernandez, Ginzburg and Lappin (2005)

support this standpoint based on their empirical results.

        A second aspect of the classifier identification process is to test consistency of performance when

separating training data into reply and non-reply messages. A text classifier is built based on the training

data which should contain as much relevant and useful information as possible without severe noise.

Noisy training data can significantly affect a text classifier’s performance. Earlier studies of financial

stock message boards have not differentiated between reply and non-reply messages when selecting the

training data thereby assuming that all the messages (reply and non-reply) have the same construction

pattern and importance. Based on this assumption, so long as a message comes with self-disclosed

sentiment, it is a good candidate for training data without considering whether the message is a reply or

non-reply. However, if a reply message is significantly different from a non-reply message in terms of

7 Although there is the chance that some posters paste articles or news, most messages posted on forums appear to be conversational discourse.
relevant information content, a classifier’s performance may be lower under noisy training data (when

reply messages are included). A robustness test separating the training data into clean (non-reply messages only) and noisy (reply messages only) sets allows determination of whether the classifier is robust to noisy training data.

          In summary, this paper argues that a reason why no significant relationship between a sentiment

index and future stock returns has been found in the literature may be the result of imperfect construction

of the sentiment index. We propose that the quality can be improved by (1) giving a discounted weight to

reply messages relative to non-reply messages in the sentiment index formula, (2) selecting a better text classifier with only non-reply messages as training data, (3) applying a better class assignment procedure, and (4) using an equal number of training messages in each of the classes to avoid information biases.
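Point (4) can be implemented, for example, by downsampling each sentiment class in the training pool to the size of the smallest class. The sketch below is an assumed implementation for illustration, not necessarily the paper's exact balancing procedure.

```python
import random
from collections import defaultdict

def balance(messages, seed=0):
    """Downsample each sentiment class to the size of the smallest class."""
    by_class = defaultdict(list)
    for text, label in messages:
        by_class[label].append((text, label))
    n = min(len(items) for items in by_class.values())  # smallest class size
    rng = random.Random(seed)                           # seeded for reproducibility
    balanced = []
    for label, items in by_class.items():
        balanced.extend(rng.sample(items, n))
    return balanced

# Hypothetical unbalanced pool: three buy messages, one sell message
pool = [("m1", "buy"), ("m2", "buy"), ("m3", "buy"), ("m4", "sell")]
train_set = balance(pool)
```

After balancing, each class contributes the same number of messages, so the classifier's priors are not dominated by whichever sentiment happens to be posted more often.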

The following section describes the methodology, Section IV sets forth construction of the data,

Section V reports descriptive statistics and Section VI presents empirical results. Summary and

conclusions are provided in Section VII.

                                           III.    Methodology
III. A. Overview

          This study tests three hypotheses which are described below.

          III. B. Hypothesis I: Reply messages differ in importance from non-reply messages and should

be treated differently when constructing the sentiment index. For this hypothesis, a “reply” discount

weight ρ is identified by use of a fixed effect regression model to capture the different influence on the

financial community of reply and non-reply messages.

          Although AF (2004) suggest heavier weights for posts which generate replies, they cannot

differentiate the replies because of the nature of the data set used. We argue that the different impact on

the community between reply and non-reply messages should not be ignored and it is crucial to

differentiate between these two different functional messages. The data set used in this paper is from

Thelion!WallStreetPit and allows this distinction. Thus, we not only capture the different influence on the

community generated by reply and non-reply messages, but also avoid the time sequence problem of AF (2004).


          Determination of the magnitude of the discount for reply messages requires an assessment of their

influence on the community. Fortunately, Thelion’s organized forum and membership credit system

allows us to test the effect of reply and non-reply messages by using the credit score as a proxy for

influence. 8 A poster’s credit consists of cumulative reward points voluntarily given by other forum

participants. A poster’s credit score represents its credibility and reputation among other participants and

the higher the score the greater the chance more people will read its post reflecting more community

influence. 9

One approach to measuring the effect of reply versus non-reply messages is to evaluate each type’s

influence on the poster’s credit. Based on past literature and observations of message board behaviors, a

poster’s credit is dependent upon many factors represented, in general, by the following model in

Equation (2). In this fixed effects regression model variables X1i through X3i are exogenous variables to

user’ credit, X4i through X8i are control variables, X9i is a dummy variable to control for days the markets

are open, and X10i is a dummy variable to test the hypothesis that reply messages have less influence on

the community than non-reply messages:

Yi = β0 + µi + β1X1i + β2X2i + β3X3i + β4X4i + β5X5i + β6X6i + β7X7i + β8X8i + β9X9i + ρX10i + εi,   i = 1, ..., n                                    (2)


Yi is a proxy for influence on the message board community for stock i
          β0 is a constant
8 Tumarkin and Whitelaw (2001) point out that Thelion!WallStreetPit is a private bulletin board. However, membership on Thelion!WallStreetPit is costless and there is no hierarchical membership system. There is no limitation on reading, posting or replying to any message on the forum by any user. Thelion!WallStreetPit has its credit system and high-end statistical tools such as an All-In-One stock information search engine. Only the high-end statistical tools require the user to have prepaid credits; these functions let paid members search information more quickly and efficiently. These features do not affect the liberty of posting, reading and replying to messages, as on other public message boards, and therefore we should not put Thelion!WallStreetPit into the “private” bulletin board category.

  9 Because a poster's credit can be given but not taken away, using the absolute poster's credit (as reported in each post after the
poster's name) to represent influence may be biased. To control for this bias, we generate a relative measure equal to the absolute
poster's credit divided by the total number of messages posted by the poster. If the poster does not add value to the community, the
total number of messages posted increases while the poster's credit remains unchanged, which dilutes its influence. Although using
the relative poster's credit may logically solve the influence bias problem empirically, it probably does not reflect actual behavior. It
is unlikely that a reader will divide a poster's credit by the poster's total number of messages every time he/she reads a post.
Moreover, under the relative measure, the influence of recent contributions with a relatively small number of posts may be diluted
by a large number of valueless posts in the past. Therefore, we prefer the absolute poster's credit as a proxy for influence, based on
the argument that forum users differentiate quality posts from inferior posts purely on the poster's absolute credit. However, for
comparative purposes we report results for both the relative and the absolute poster's credit.
           µi is a fixed effect dummy variable controlling for stock i.
           X1i is the number of messages posted by the poster since the first day of its registration.
           X2i is the number of other forum participants who put the poster into their watch list. 10
           X3i is the number of days since the first day the poster registered.
           X4i controls for message length in number of characters.
           X5i controls for corporate size and is the total outstanding shares times the closing market price on the message posting day.
           X6i is the market closing price of the recommended stock in a post.
           X7i is the sentiment scale where 2 = strong buy, 1 = buy, 0 = hold (scalp), -1 = sell, -2 = strong sell and -3 = short.
           X8i is a dummy variable to control for whether or not the message is posted during the trading session.
           X9i is a dummy variable to control for market open days (Monday through Friday). 11
           X10i is the dummy variable indicating whether the post is a reply (dummy = 1) or a non-reply (dummy = 0).
           Of specific interest is the coefficient on X10i, which indicates whether reply and non-reply posts differ in their influence on the community. If the coefficient ρ is significantly negative, the implication is that reply posts have less influence on the community than non-reply posts, and vice versa.

           III. C. Hypothesis II: A superior text classifier for classifying message board messages can be identified. We compare the text classifiers' performance using the confusion matrix technique together with the Chi-square test to identify the most consistent and accurate classifier.

           Eight different widely used text classifiers are investigated: ME (Maximum Entropy), NB (Naive

Bayesian), SVM (Support Vector Machine), KNN (K Nearest Neighbor), TFIDF (Term Frequency

Inverse Document Frequency), EM (Expectation Maximization), PRIND (Probabilistic Indexing) and KL

(Kullback-Leibler Divergence). Maximum Entropy proves to be the best classifier (see Empirical Results

section below) and it is the one discussed in detail in this section. For detailed information about the

remaining algorithms see Friedman (1997) for NB; Vapnik (1995) for SVM; Rocchio (1971) for TFIDF;

Dempster, Laird and Rubin (1977) for EM; Yang (1994) for K-NN; Kullback and Leibler (1951) for KL

and Maron and Kuhns (1960) for PRIND.

  10 Any messages posted by the posters who are in a user's watch list will be highlighted to that user as an indication of importance.
     11 We exclude all holidays during which the market is closed.
         Maximum entropy is a method for analyzing known information from training data in order to

determine a unique epistemic probability distribution that satisfies given constraints. The principle of

maximum entropy states that when nothing is known, the distribution should be as uniform as possible.

Therefore, the least biased model that encodes the given information is the one that maximizes the uncertainty measure H(p), the conditional entropy, while remaining consistent with this information.

“Maximum entropy probability models offer a clean way to combine diverse pieces of contextual

evidence in order to estimate the probability of a certain linguistic class occurring with a certain linguistic

context.” (Ratnaparkhi, 1998, page 6)

         The following parameters and terminology are adapted for demonstrating ME:

1. θj – a class or category, such as strong buy, buy, hold, sell or strong sell.

2. θ⃗ – a vector of classes, θ⃗ = {θ1, ···, θp}.

3. di – a document with self-disclosed sentiment in the training data.

4. d⃗ – a vector of documents in the training data set, d⃗ = {d1, ···, dq}.

5. dm – a document whose class is unknown and which needs to be evaluated by the text classifier.

6. xi – a feature parameter or attribute of a single word in a document. For example, the key words in
     the category "strong buy" that relate to its sentiment may include "up", "right", "nice", etc. Key
     words in the "strong sell" sentiment type may be "off", "wrong", "weak", etc.

7. x⃗ – a vector of features, x⃗ = {x1, ···, xn}.

8. w – the weight or a weight vector which describes a word's importance.

         The task is to observe linguistic contexts dm which contain xm and predict the correct linguistic

class θj to which dm belongs. This includes constructing a maximum entropy classifier that can be

implemented with a conditional probability distribution p, such that p (θ j | d m ) is the probability of class

θj given a dm.

           To build the ME classifier model, we first define θ⃗ = {θ1, θ2, . . . θp} as the set of classes of interest in the system (θ⃗ = {strong sell, sell, hold, buy, strong buy} in this paper) and d⃗ = {d1, d2, . . . dq} = {x⃗1, x⃗2, . . . x⃗q} as the set of contexts or textual materials that we can observe from the training data. The training data Tn = {(θ1, d1) . . . (θp, dq)} is a large set of contexts d1 . . . dq that have been assigned to their correct classes θ1 . . . θp, expressing certain relationships between features and outcomes.12 To build a normalized probability distribution model based on this training data, the classifier weights the features by using them in a log-linear model:

                    p(θj | di) = (1/Z(di)) · ∏_{k=1..K} wk^fk(θj, di)                     (3)

                    Z(di) = Σ_{θj} ∏_{k=1..K} wk^fk(θj, di)

where Z(di) is a normalization factor to ensure that Σ_{θj} p(θj | di) = 1. Each feature function fk(θj, di) is a binary function with the value of either 1 or 0:

                    fk(θj, di) = 1 if di belongs to word class θj, and 0 otherwise        (4)

Each parameter wk, where wk > 0, corresponds to one feature fk and can be interpreted as a "weight" for that feature. The parameters {w1, ···, wK} are found with Generalized Iterative Scaling (Darroch and Ratcliff, 1972). The probability model p(θj | di) is a normalized product of those observed features.

           To evaluate a test document dm whose correct class is unknown at the time, the classifier chooses the distribution p that maximizes the entropy H(p). To maximize entropy is to

   12 We assume the self-disclosed sentiment originated by the poster is the correct class for the message.
maximize the conditional entropy H(p) = H(θj | dm) subject to the above constraints (3) and (4) (Ratnaparkhi, 1998): 13

                    H(θj | dm) = - Σ_{θj, dm} p(θj, dm) · log_n p(θj | dm)                (5)

                    p* = argmax_p H(θj | dm)

where p* is the distribution that maximizes the entropy H(θj | dm) under the constraints of (3) and (4), making the model match feature expectations with those observed in the training data.
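Equation (5) can be sketched as follows; the two toy joint distributions illustrate the maximum entropy principle, since the uniform conditional distribution attains the higher entropy:

```python
import math

def conditional_entropy(joint):
    """H(theta | d) = -sum p(theta, d) * log p(theta | d), as in
    Equation (5); `joint` maps (theta, d) pairs to p(theta, d)."""
    p_d = {}                                  # marginal p(d)
    for (theta, d), p in joint.items():
        p_d[d] = p_d.get(d, 0.0) + p
    return -sum(p * math.log(p / p_d[d])
                for (theta, d), p in joint.items() if p > 0)

# Hypothetical two-class distributions over a single document d1
uniform = {("buy", "d1"): 0.5, ("sell", "d1"): 0.5}
skewed = {("buy", "d1"): 0.9, ("sell", "d1"): 0.1}
```

The uniform case yields H = log 2, the maximum for two classes, while the skewed case yields a strictly lower entropy.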

                  The classifier returns a p* for each class θj according to maximum entropy. Rather than choosing the class θj with the highest p*, we use our own process for selecting the most accurate class θj for dm. A better class assignment procedure is to determine at least the correct sentiment side (sentiment index sign). For example, suppose the returned value p* for each of the four sentiment types is strong buy 29%, buy 28%, sell 30% and strong sell 13%. If we follow the traditional rule of selecting the category with the highest p*, then sell would be our target. However, the sum of strong buy and buy exceeds the sum of strong sell and sell, which suggests that the aggregate buy-side sentiment is stronger than the sell-side sentiment. Therefore, buy sentiment in this example is more appropriate as our determined sentiment than sell sentiment.
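The side-aggregation rule described above can be sketched as follows, using the example probabilities from the text:

```python
def pick_side(p):
    """Aggregate same-side probabilities instead of taking the single
    class with the highest p* (the rule described in the text)."""
    buy_side = p.get("strong buy", 0.0) + p.get("buy", 0.0)
    sell_side = p.get("strong sell", 0.0) + p.get("sell", 0.0)
    return "buy" if buy_side > sell_side else "sell"

# The paper's example: argmax alone would choose "sell" (30%), but the
# aggregate buy side (57%) exceeds the sell side (43%).
p_star = {"strong buy": 0.29, "buy": 0.28, "sell": 0.30, "strong sell": 0.13}
```

A bare argmax over `p_star` returns "sell", while `pick_side(p_star)` returns "buy", matching the paper's reasoning.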

           Given a pool of available self-disclosed (pre-known) sentiment type messages, an easy way to examine a text classifier's recognition accuracy is to check the power of the classifier: how many of the messages' classifier-assigned sentiments match their self-disclosed sentiment types. The higher the matched percentage, the higher the classifier's power. In order to obtain the accuracy for each text classifier and statistically compare the accuracy among them, we use the confusion matrix technique and the Chi-square test. A confusion matrix, as described by Kohavi and Provost (1998), contains information about actual and predicted classes and provides the means for comparing the eight classifiers

  13 See Berger et al. (1996), Lau et al. (1993), and Rosenfeld (1996), who use the equivalent form H(θj | dm) = - Σ_{θj, dm} p̃(dm) p(θj | dm) · log_n p(θj | dm), where p̃(dm) is the observed probability and p(θj | dm) is the conditional probability.
tested. Performance of such systems is commonly evaluated using the data in a matrix. Each column of

the matrix represents the instances given by the text classifier, while each row represents the instances in

an actual class. A cross-tabulation of occurrence frequencies is constructed. For example, following is a

confusion matrix with 100 messages under each sentiment type. The diagonal cells show the number of

messages that are correctly assigned by a text classifier to their actual sentiment type. The accuracy for

this classifier is 80% which is the average accuracy of each sentiment type:

                            Type      buy    hold    sell    Total    Accuracy
                            buy        80      10      10      100      80.00%
                            hold        5      90       5      100      90.00%
                            sell       10      20      70      100      70.00%
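Computing the per-class and average accuracy from the example matrix above:

```python
# Rows are actual classes, columns are classifier-assigned classes,
# reproducing the 3 x 3 example matrix from the text.
matrix = [
    [80, 10, 10],   # actual buy
    [ 5, 90,  5],   # actual hold
    [10, 20, 70],   # actual sell
]

# Diagonal cells are correct assignments; accuracy per class is the
# diagonal count over the row total, and overall accuracy is their mean.
per_class = [row[i] / sum(row) for i, row in enumerate(matrix)]
average_accuracy = sum(per_class) / len(per_class)
```

This reproduces the 80%, 90% and 70% class accuracies and the 80% average reported in the text.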

            Next, to determine whether any pair of our eight distinct text classifiers are significantly different from each other, one classifier's confusion matrix is compared to another classifier's confusion matrix that has the same number of rows and columns by using the standard Chi-square statistic [Das and Chen (2007)]:

                    χ²[(n−1)²] = Σ_{i=1..n²} (Si − Ii)² / Ii,  with degrees of freedom of (n−1)²

where i indexes the corresponding cells in each classifier's confusion matrix and n is the number of classes in the confusion matrix. S represents the superior classifier, and I represents the inferior classifier. We assign the classifier with the higher confusion matrix accuracy to the superior model S and the less accurate classifier to the inferior model I. If the two models are significantly different from each other, the superior model has statistically higher performance. The better classifier of each pair is sequentially tested against the remaining classifiers, and eventually the classifier with the highest accuracy and consistency is selected as the best text classifier algorithm. 14
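A sketch of the pairwise comparison, using the example matrix from the text as the superior classifier S and a hypothetical weaker matrix as I; the critical value 9.488 for 4 degrees of freedom at the 5% level comes from standard Chi-square tables:

```python
def chi_square(superior, inferior):
    """Cell-by-cell Chi-square between two n x n confusion matrices;
    degrees of freedom are (n - 1)^2 as in the text."""
    n = len(superior)
    stat = sum((s - i) ** 2 / i
               for s_row, i_row in zip(superior, inferior)
               for s, i in zip(s_row, i_row) if i > 0)
    return stat, (n - 1) ** 2

S = [[80, 10, 10], [5, 90, 5], [10, 20, 70]]    # example matrix from the text
I = [[60, 20, 20], [20, 60, 20], [20, 20, 60]]  # hypothetical weaker classifier
stat, df = chi_square(S, I)
```

With these counts the statistic is about 60.83 on 4 degrees of freedom, well past the 5% critical value, so the two classifiers would be judged significantly different.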

  14 Das and Chen (2007) suggest using a voting system among their five classifiers, i.e., 3 of 5 classifiers must agree
on the message sentiment type; otherwise the message is not classified. Since all the classifier methods used in their
paper are analytical and do not require optimization or search algorithms, a voting system does a better job than any
single classifier. However, unlike Das and Chen, the classifier methods in this paper require optimization or search
algorithms. Additionally, the data overfitting problem is addressed by testing both in-sample and out-of-sample
performance. Thus we do not use a voting system in our classifier selection process.

          III. D. Hypothesis III: A superior classifier exhibits strong performance under both reply-only training data and non-reply-only training data. This hypothesis is an inference from hypothesis I as well as a robustness test for hypothesis II.

          In order to test whether a classifier's performance is robust to noisier training data (inclusion

of only reply messages), we separate the training data into two groups: a reply messages group and a non-

reply messages group. Each group has an equal number of messages under each sentiment type. We

compare the performance of ME under reply and non-reply groups. The null hypothesis is that ME’s

performance is not statistically different between the two groups. The comparison is made with the

confusion matrix together with the Chi-square test.

                                                    IV.       Data

         The data in this study is collected from Thelion!WallStreetPit message board through our self-

programmed Visual Basic software. 15 Only the messages that provide self-reported sentiment are

downloaded. To accommodate our three hypothesis tests, we arrange the data in three ways, each serving a special testing purpose. First, for hypothesis I, which tests

the value of the discount weight for reply messages in the sentiment index formula, the data requires real-time downloads (see below) and covers July 18, 2005 through March 31, 2006; messages are downloaded only if they have a reply or are themselves a reply to another message.16 Since this test does

not relate to text classifiers, we do not use a text classifier to generate sentiment. Second, for hypothesis II, the data is extended to the period July 2, 2001 through March 31, 2006 due to an insufficient number of messages for

  15 Thelion!WallStreetPit is a stock trading forum that allows people to post their opinions on any stock, with the
stock symbol listed before the message title. Unlike Yahoo!Finance and RagingBull, which allocate messages under
the stock symbol, Thelion shows all the messages on the same platform and sorts them by time. Messages posted on
Thelion!WallStreetPit include both self-reported and non-self-reported sentiment messages. For detail of Thelion's
forum structure, go to Our Visual Basic software filters out the
poster's disclaimer and signature appended to the message content as well as any unrecognized symbols.
   16 We know that an original post (non-reply) may later generate reply/replies, but we do not know in advance
whether or not an original post will have reply/replies in the future. This is the time-series problem that AF
encounter. However, we are able to identify whether a post is a reply at the time it is posted. We simply
identify all reply messages independent of the original posts to which they relate in order to avoid the time-series
problem. When downloading data, a reply message's title starts with "RE:". In addition, only a reply message
has a "Reply to Msg" keyword embedded in the web page, which allows us to identify which previous message
it replies to. For example, if a reply post has "Reply to Msg 1234" and post 1234 itself is not already a reply, then
post 1234 is treated as an original post. Any original post whose reply/replies occur after our time
period (later than March 31, 2006) is not included in our sample.
testing this hypothesis from the preceding data set (July 2005 to March 2006).17 Each message’s content

is stored in a text file. These text files are used as training and testing data for the text classifiers. Third,

for hypothesis III, the data is the same as for hypothesis II but separated into reply and non-reply message

groups. More detail about each data arrangement is provided below.

IV.A. Data for Hypothesis I: Fixed effect regression model

        The selection criteria for the data in this arrangement are (1) a message itself either is a reply post

or has reply/replies to it, (2) a message has a stock symbol associated with it since not all the messages

talk about specific stocks, and (3) a message provides a self-reported sentiment in order to control for

sentiment type. Although the poster’s credit shown on the web reflects its current credit and is updated

whenever new credit is given, Thelion!WallStreetPit does not offer historical member’s credit. Thus, this

test is valid only when the credit is the real time credit at the time the message is posted. Although

Thelion!WallStreetPit started in 1999 and has over one million messages to date, the valid data for our

model starts July 18, 2005 when we began retrieving real time data by downloading messages every

trading day after the market close. The poster's credit for each message before July 2005 is time-mismatched in that the downloading day is not the same as the message posting day.

IV.B. Data for Hypothesis II: Selecting the best text classifier

        As indicated above, only self-reported sentiment messages are used because the messages used as

training data must have pre-known (true) sentiment types to be categorized and compared with the

sentiment given by the text classifier.

         To generate as little bias as possible across the eight text classifiers studied, the training data for each class must be adequate and noise in the data must be minimized. Based on these criteria, Thelion!WallStreetPit message data is selected over Yahoo!Finance

data for three reasons: (1) Thelion!WallStreetPit controls duplicated identity problems by using a credit

system which prevents a poster from generating influence on the community by using multiple accounts.

(2) Thelion!WallStreetPit's forum administrator has a simpler job than the Yahoo!Finance forum

17 Thelion!WallStreetPit started its message board in 1999, with its first message on Nov. 29, 1999. However, the sentiment
function was not available until July 2, 2001, which is the starting date for the hypothesis II data.
administrator since the forum is more organized and experiences a relatively smaller message flow. (3) Users who violate forum policy or abuse their accounts will be warned or suspended, further preventing account abuse.
         As pointed out by Das and Chen (2007), a small sample of between 300 and 500 messages as

training data for each class can prevent overfitting of data which may lead to poor out-of-sample

performance in spite of high in-sample accuracy. We restrict the training data to 350 messages for each of

the five classes (strong buy, buy, hold, sell, and strong sell). 18 The number of messages for each class

must be equal in order to avoid the class imbalance problem, which causes information bias and is crucial to the accuracy of a text classifier. For further discussion of the class imbalance problem, see Japkowicz and

Stephen (2002) and Estabrooks and Japkowicz (2004). The in-sample data is a sub sample of the training

data which is randomly selected by RAINBOW (programmed by McCallum) to test a classifier’s inner

accuracy. 19 High out-of-sample accuracy is the goal since the purpose of using a text classifier is to return

a probability of each sentiment class for a testing (non-training) document. By utilizing data from

Thelion!WallStreetPit, we report not only the in-sample training data accuracy, but also and more

importantly, we can report the out-of-sample accuracy for each of the text classifiers, something not done

in earlier studies. 20 We randomly select 350 messages with self-reported sentiment for each class as

training data to build the classifier model.21 Then, from the remainder of messages in each class, another

350 messages, also with self-reported sentiment for each class, are randomly selected as out-of-sample

data for testing purposes.
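The per-class sampling described above (350 training messages plus a disjoint set of 350 out-of-sample messages, drawn at random for one class) can be sketched as follows; the file names and pool size are illustrative, not the actual data:

```python
import random

def split_sample(files, n_train, n_test, seed=0):
    """Randomly draw n_train training files and a disjoint set of
    n_test out-of-sample files for one sentiment class."""
    rng = random.Random(seed)
    shuffled = files[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]

# Hypothetical pool of stored message files for one sentiment class
files = ["msg_%04d.txt" % i for i in range(1000)]
train, test = split_sample(files, 350, 350)
```

Because the out-of-sample messages are taken from the remainder after the training draw, the two sets never overlap.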

IV.C. Data for Hypothesis III: Robustness of text classifier’s performance

  18 In order not to drop "short" sentiment messages, we combine short sentiment with sell and strong sell
into sell-side sentiment messages in a separate test.
  19 McCallum, Andrew Kachites. "Bow: A toolkit for statistical language modeling, text retrieval, classification and
clustering." 1996. RAINBOW software can be downloaded free for
academic purposes.
  20 AF (2004) use message data from Yahoo!Finance in 2000, which does not have self-reported sentiment, making it
impossible for them to compare the sentiment given by the poster with that given by the text classifier.
  21 We program software to randomly select a number of files from a folder, for example randomly selecting 500 text
files from a folder which consists of 1000 text files.
          Data in this test is the same as in the previous test. However, unlike the preceding section where

reply and non-reply messages are mixed randomly in each class, the data in this section is separated into a

reply group and a non-reply group for each class so that we have ten combination classes. The purpose is

to test whether the selected classifier consistently exhibits high recognition ability under noisier training

data. Of all the data, 140 messages qualify under the reply-sell category. Thus, in order for each class to

have an equal number of training data, 140 messages are randomly selected as training data from the

larger pool of data for each of the remaining categories.

                                           V. Descriptive Statistics

           As an introduction to the characteristics of the data, Figures A through C present various descriptive statistics of posting activities and sentiment reporting on Thelion!WallStreetPit from 2001 to 2006. 22


                                        >>>Insert figures A, B and C here<<<

          Figure A shows that the use of strong sentiment (strong buy, strong sell and short) greatly

exceeds the use of “normal” sentiment (buy and sell) indicating that when individuals elect to give out

sentiment they usually prefer strong sentiment. 23

          Figure B reveals that the number of reply messages is much smaller than original posts (non-reply

messages) when sentiment is disclosed and this pattern is consistent over time.

          Figure C reflects that posting activities for all days have been increasing from 2002 to 2005. This

pattern may be due to an increasing number of investors switching to online discount brokers or to the

recovery of the stock market since 2003. Posting activities during weekdays dramatically exceed those

during weekends. There is little difference among weekdays even though Wednesday is the peak of

posting activities.

          Many of these posting patterns shown in Figures A through C are in line with previous studies

such as in Wysocki (1999), Tumarkin and Whitelaw (2001), AF (2004), Das, Martínez-Jerez and Tufano

  22 Although our data is from July 2001 to March 2006, the figures omit 2001 and 2006 because they lack a full year
of observations.
  23 We do not include the scalp sentiment because it conveys day trading activities and does not disclose specific sentiment.
(2005) and Das and Chen (2003, 2007), etc. Because these studies use different data from different

message boards and over different time periods, similarities among them indicate online posters follow

certain patterns when posting messages online.

                                  VI. Empirical Results and Evaluations

VI.A. Importance of reply versus non-reply messages and determination of the discount weight ρ
for reply messages
        The reply effect cannot be tested directly because no direct measurement for influence on the

financial community exists to indicate the extent of influence that a message generates. Among all online

financial message boards, only TheLion’s forum offers real time poster credit which is the sum of all

previous credit voluntarily given by other forum participants. Poster credit is public information that can

be seen by readers of the message and we assume the higher the poster credit the greater the influence on

the community. Although we hypothesize that, on average, reply messages have less influence on the

community than non-reply messages, ceteris paribus, obviously, a poster’s credit is not determined solely

by whether its posts are reply or non-reply but also by other considerations such as number of total

messages the poster posts, how long the poster has participated in the community, message length of the

post, and characteristics of the stock associated with the post. Also, other posting activity variables need

to be controlled, such as whether or not the message is posted during the market trading period and which specific sentiment it carries. The following table sets forth selected characteristics of the data used in the

model to identify the appropriate discount factors to be assigned to reply messages.

                                             >>>Insert Table I here<<<

        Similar to the posting activity patterns described above, buy-side sentiment messages exceed sell-

side sentiment messages, strong sentiment messages exceed “normal” sentiment messages, most posting

activities occur during trading hours, and most posting activities occur during weekdays when the market

is open. Again, due to the design of this test, only those posts that are replies or that are original but have

reply/replies are included. Thus, because one original post can have more than one reply, the number of

non-reply messages is smaller than that of reply messages in the table. Each message has a self-reported

sentiment in order to control for sentiment type.

         Before the model in III.B can be tested, modifications are necessary to reduce multicollinearity.

The number of messages posted, membership days and number of other users watching are highly

correlated with each other and market capitalization is highly correlated with stock closing price.

Therefore the number of membership days, other users watching and stock closing price are omitted from

the fixed effect regression model. In addition, many of the quantitative variables are not normally

distributed and are included in logarithmic form. Therefore, the modified model for testing is:

LPC = β0 + µi + β1LMP + β2LML + β3LMC + β4SEN + β5MO + β6TP + ρ IsReply + ε          (7)


        β0 – the constant
        µi – the fixed effect dummy variable controlling for stock i
        LPC – the logarithm of the poster's credit
        LMP – the logarithm of the number of messages posted by the poster
        LML – the logarithm of the length of a message
        LMC – the logarithm of the stock's market capitalization
        SEN – the scale of sentiment types with -3 for short, -2 for strong sell, -1 for sell, 0 for hold
                  (scalp), 1 for buy, and 2 for strong buy
        MO = 1 if it is a weekday (market open) and MO = 0 if it is a weekend (market closed)
        TP = 1 if the message is posted during the trading period and TP = 0 otherwise
        IsReply = 1 if it is a reply post and IsReply = 0 if it is a non-reply post
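As a minimal sketch on simulated data (the TheLion sample is not reproduced here), the fixed effects model can be estimated in its least-squares dummy-variable form; every value below, including the planted coefficient -0.125, is a synthetic assumption used only to show that the estimator recovers it:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for the Equation (7) variables.
n, n_stocks = 500, 5
stock = rng.integers(0, n_stocks, size=n)
LMP = np.log1p(rng.integers(1, 500, size=n))     # log messages posted
LML = np.log1p(rng.integers(50, 2000, size=n))   # log message length
LMC = rng.normal(20.0, 2.0, size=n)              # log market cap
SEN = rng.integers(-3, 3, size=n).astype(float)  # sentiment scale
MO = rng.integers(0, 2, size=n).astype(float)    # market-open dummy
TP = rng.integers(0, 2, size=n).astype(float)    # trading-period dummy
IsReply = rng.integers(0, 2, size=n).astype(float)
LPC = (0.6 * LMP + 0.1 * LML + 0.05 * LMC + 0.02 * SEN + 0.03 * MO
       + 0.01 * TP - 0.125 * IsReply + 0.2 * stock
       + rng.normal(scale=0.1, size=n))

# One dummy per stock absorbs the fixed effects mu_i (and the constant).
D = np.eye(n_stocks)[stock]
X = np.column_stack([D, LMP, LML, LMC, SEN, MO, TP, IsReply])
beta, *_ = np.linalg.lstsq(X, LPC, rcond=None)
rho_hat = beta[-1]                               # coefficient on IsReply
```

With 500 simulated posts, the estimated reply coefficient lands close to the planted -0.125.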

          The fixed effects model is selected because it controls for within-stock effects and the Hausman test favors it over the random effects model. Results are reported in Table II:

                                        >>> Insert Table II here <<<
         Of specific interest is the dummy variable for reply/non-reply (IsReply). Its coefficient is -0.125

which is significant at the 0.01 level, indicating that reply messages are 12.5% less influential in the

forum community than non-reply messages. This suggests that the discount ρ for a reply message is 0.125

(12.5%) making the weight for a reply (in the bullishness formula) 0.875 while a non-reply message’s

weight is unity (1.0). Following Tumarkin and Whitelaw’s (2001) suggestion of eliminating messages

posted by hosts of the forums, we repeat the test excluding the webmaster’s (Lionmaster) messages and

results do not change appreciably. Moreover, five additional robustness tests are performed: (1) excluding

messages whose recommended stock is less than $1 in price, (2) including messages with scalp sentiment

that is assigned a zero score, (3) adding a book-to-market ratio to represent stock characteristics, (4) rather

than using the absolute measure as a proxy for influence, using the relative measure which is equal to
poster’s absolute credit divided by its number of messages posted, and (5) adding a time sequence

variable according to posting time and applying the Newey-West correction for serial autocorrelation. All

robustness results support the original conclusions.24
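To illustrate how the estimated discount enters the index, the following sketch applies weight 1 - 0.125 = 0.875 to replies and 1.0 to originals; the simple weighted average used here is a hypothetical stand-in, since the paper's actual bullishness formula is defined earlier in the text:

```python
def bullishness(messages, reply_discount=0.125):
    """Weighted-average sentiment score: a reply gets weight
    1 - 0.125 = 0.875 and an original gets weight 1.0.
    `messages` is a list of (score, is_reply) pairs; this simple
    average is an assumed stand-in for the paper's index formula."""
    num = den = 0.0
    for score, is_reply in messages:
        w = 1.0 - reply_discount if is_reply else 1.0
        num += w * score
        den += w
    return num / den if den else 0.0

# One original strong buy (score 2) and one reply sell (score -1)
index = bullishness([(2, False), (-1, True)])
```

Here the reply's sell score is discounted to -0.875 before averaging, so the resulting index leans toward the original's buy side.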

VI. B. Performance Comparison among Different Text classifiers

           To determine which classifier has the greatest accuracy and consistency, we use a random mix of our reply and non-reply messages as training data and create eight separate text classification models, one for each of the eight tested algorithms. We record RAINBOW's in-sample average accuracy for each classifier and then test each model's out-of-sample accuracy on a common dataset. Results of accuracy tests are comparable only when models are built from the same training sample and tested on the same testing sample.

           To conserve space, only the Naive Bayesian (NB) and Maximum Entropy (ME) confusion matrices are presented to demonstrate the accuracy calculation. Table III sets forth the original confusion matrices from the RAINBOW output. Although the content of a message under a strong buy sentiment should not differ appreciably from that of a buy sentiment message, it should differ markedly from that of a sell sentiment message. To see how well the classifier assigns messages to at least the correct sentiment side (buy/sell), guaranteeing at least the correct sentiment score sign when constructing the sentiment index, it is appropriate to "adjust" the results in Table III by combining buy and strong buy into buy-side, sell and strong sell into sell-side, and leaving hold unchanged, as shown in Table IV.26 Table IV shows the adjusted confusion matrices for NB and ME. The remaining classifiers' accuracy and adjusted accuracy are calculated in the same way.

                                             >>> Insert Table III here <<<
                                             >>> Insert Table IV here <<<

24 Robustness test results are available from the authors upon request.
25 When setting up the classifier model, AF (2004) suggest using the top 1,000 words as ranked by their importance, such that only the 1,000 words with the highest mutual information (a quantity that measures the mutual dependence of two variables) are treated as features for that class. We follow this suggestion and choose the "--prune-vocab-by-infogain=1000" option for each class when running RAINBOW.
26 AF (2004) treat the hold sentiment as neutral, and, similarly, we assume hold lies between buy-side and sell-side.
        As shown in Table III, the average accuracy for NB is 44.8%, compared with its adjusted accuracy of 62.6% in Table IV. For ME, the average accuracy is 54% (Table III) and its adjusted average accuracy is 70% (Table IV).
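The adjustment from Table III to Table IV is mechanical and can be verified directly. The sketch below (plain Python, with matrix values copied from Table III, Panel A) collapses NB's five-class confusion matrix into the three sentiment side classes and recovers the 313/500 = 62.6% adjusted accuracy of Table IV, Panel A.

```python
# NB's five-class confusion matrix from Table III, Panel A
# (rows = true class, columns = predicted class), ordered
# strong buy, buy, hold, sell, strong sell.
nb = [
    [40, 13, 31,  7,  9],
    [30, 27, 26,  8,  9],
    [ 3,  1, 78,  9,  9],
    [ 2, 11, 28, 35, 24],
    [ 2,  5, 27, 22, 44],
]

# strong buy + buy -> buy-side; hold -> hold; sell + strong sell -> sell-side
groups = [[0, 1], [2], [3, 4]]

def adjust(matrix, groups):
    """Collapse a confusion matrix by summing the given row/column groups."""
    return [[sum(matrix[r][c] for r in rows for c in cols) for cols in groups]
            for rows in groups]

nb_adj = adjust(nb, groups)
correct = sum(nb_adj[i][i] for i in range(3))
total = sum(sum(row) for row in nb_adj)
accuracy = correct / total  # 313 / 500 = 0.626, matching Table IV, Panel A
```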

        Table V reports both in-sample accuracy results from RAINBOW and out-of-sample results

calculated by the authors. The in-sample data consist of 350 files in each of five classes (strong buy, buy, hold, sell, and strong sell), where 28.57% (100 files per class) are used for testing model accuracy (totaling 500 testing files) and 71.43% (250 files per class) are used to generate the model. The out-of-sample data set also has five classes with 350 messages in each class, and all of the data are used for testing (totaling 1,750 testing files). In order not to drop "short" sentiment type messages, and consistent with previous literature, we also use three sentiment side classes: buy-side, hold, and sell-side. A buy-side class contains 350 randomly selected files from the buy and strong buy classes, a hold class contains 350 randomly selected files from the hold class, and a sell-side class contains 350 randomly selected files from the sell, strong sell, and short classes. The sentiment side out-of-sample dataset also has three classes with 350 messages in each class, and all of the data are used for testing. In-sample accuracy and standard error values are based on five independent trials. Out-of-sample values represent one trial.

                                    >>> Insert Table V here <<<
        The table reveals that Maximum Entropy consistently outperforms the other text classification algorithms in both in-sample and out-of-sample tests. As a robustness check, we verify our results by testing the most widely used sample in the classification literature, the 20NewsGroup data.27 Our finding is unaltered. Because absolute percentage comparisons do not indicate whether one classifier statistically outperforms another, a chi-square test based on the confusion matrix is employed.

        To demonstrate the procedure for calculating the chi-square statistic, we use the numbers for ME and NB in Table IV as an example. First, to test each algorithm's own interpretive ability, each classifier is compared to a random classifier, which would achieve a 33% accuracy level since there are

27 20NewsGroup is sample data that can be downloaded from McCallum's website together with the RAINBOW software. Among 20NewsGroup's many classes, in order to use conversational discourse messages, we choose the "talk.politics.*" category, which consists of three classes: talk.politics.misc, talk.politics.guns, and talk.politics.mideast, with 1,000 messages in each.

three classes of messages [for rationale, see Das and Chen 2007]. 28 Second, the standard chi-square

test, χ²(4) = (1/9) Σ_{i=1}^{9} (ME_i − NB_i)² / NB_i, with 4 degrees of freedom, is used to compare the interpretive ability

between ME and NB on the adjusted confusion matrix, where the sum averages over the nine cells of the 3×3 matrix. The hypothesis is that ME significantly outperforms NB. As an example, the test statistic computed from Table IV is 17.05 (the ME versus NB entry in Panel A of Table VI), which is significant at the 0.01 level.
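As an illustration of the statistic, the following sketch applies the cell-averaged formula to the adjusted ME and NB confusion matrices copied from Table IV; the result, about 17.05, matches the ME versus NB entry in Table VI, Panel A.

```python
# Adjusted 3x3 confusion matrices from Table IV (rows = true class,
# columns = predicted class), ordered buy-side, hold, sell-side.
me = [[145, 22, 33], [23, 58, 19], [29, 24, 147]]
nb = [[110, 57, 33], [ 4, 78, 18], [20, 55, 125]]

def chi_square(superior, inferior):
    """The paper's cell-averaged chi-square: (1/9) * sum over the nine
    cells of (S_i - I_i)^2 / I_i, evaluated with 4 degrees of freedom."""
    total = 0.0
    n = 0
    for s_row, i_row in zip(superior, inferior):
        for s, i in zip(s_row, i_row):
            total += (s - i) ** 2 / i
            n += 1
    return total / n

stat = chi_square(me, nb)  # about 17.05
```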

          Comparisons between each pair of classifiers are shown in Table VI. The table includes four

comparisons: (1) adjusted in-sample, (2) adjusted out-of-sample, (3) sentiment side in-sample, and (4)

sentiment side out-of-sample.

                                                      >>> Insert Table VI here <<<

           The out-of-sample accuracy comparison is more relevant than the in-sample comparison because the ultimate purpose of the classifier is to obtain a sentiment score for documents outside the training dataset. Therefore, Panels B and D are the focus. In Panel B, ME statistically outperforms KL, SVM, PRIND, and KNN (4 of 7), and in Panel D, ME statistically outperforms all other text classifiers except PRIND and SVM (5 of 7). Based on this performance consistency, we conclude that ME is the superior classifier.29 This finding is consistent with literature in non-financial fields suggesting that maximum entropy outperforms other classifiers, such as NB and SVM, for recognizing conversational discourse in spoken language.

VI.C. Robustness Test for Maximum Entropy's Performance
           With ME selected as the best text classifier, a robustness test of its ability to sustain high performance under noisy data is conducted. Earlier tests indicate that reply posts have less influence than
28 Our chi-square results suggest that every classifier's interpretive ability is significantly higher than the random algorithm's. Details of these results are not reported but are available from the authors upon request.
29 Due to specific message content problems (empty content, unrecognizable symbols, damaged files, etc.), classifiers drop unreadable files during the training process. Different algorithms may treat such files differently, which means the number of dropped files differs across classifiers. The chi-square test requires that the total message number of Ii (inferior) in the confusion matrix equal the total message number of Si (superior). Except for KNN, all classifiers report almost equal numbers of dropped files, allowing us to use the chi-square test between each pair of classifiers' results. For KNN, since the chi-square result is inflated by a smaller number of observations in the denominator, we compare KNN and the other classifiers based upon percentages.
non-reply posts on the community, hinting that, when building a text classifier, the content of reply messages is noisy in terms of key words and contains less relevant information for the training dataset.

           We separate the data not only by sentiment class but also by reply and non-reply type. With five sentiment categories (strong buy, buy, hold, sell, and strong sell) crossed with reply and non-reply, we have ten groups. For the in-sample test, we randomly select 140 files for each group, where training data occupy 71.43% (100 files per group) and testing data occupy 28.57% (40 files per group, totaling 400 in-sample testing files).30 Since the sell sentiment class under the reply group has only 140 files, all of which are used in the in-sample test, we do not have enough data for out-of-sample tests based on five sentiment categories. Therefore, we realign the data into three sentiment side categories (buy-side, hold, and sell-side) and into reply and non-reply groups, yielding six groups. For the in-sample sentiment side test, we randomly select 350 files for each group, where training data occupy 71.43% (250 files per group) and testing data occupy 28.57% (100 files per group, totaling 600 in-sample testing files). Then, from the remaining data, we randomly select 350 files for each of the six groups as out-of-sample testing data (totaling 2,100 out-of-sample testing files).
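The per-group partition described above can be sketched as follows; the file names and random seed are hypothetical, and the split fraction matches the 100 training / 40 testing files drawn from each 140-file group.

```python
import random

def split_group(files, train_frac=100 / 140, seed=0):
    """Randomly split one group's files into training and testing sets,
    matching the paper's 71.43% / 28.57% in-sample partition."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = files[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder file names standing in for one 140-file group.
group = [f"msg_{i:03d}.txt" for i in range(140)]
train, test = split_group(group)
# len(train) == 100, len(test) == 40
```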

           Panel A of Table VII summarizes the classification of training data by sentiment type and reply/non-reply status. Panel B compares accuracy when reply versus non-reply messages are used as training data under the same text classifier, ME. The purpose of this robustness test is to check whether the ME classifier is amenable to noisier training data. The comparison again uses the confusion matrix procedure of Section VI.B.

                                                   >>> Insert Table VII here <<<

           ME's superior performance is essentially unchanged under the reply training data in all four tests, indicating that ME is amenable to noisy training data. Although chi-square results do not support a significant difference between model A (generated from only reply messages) and model B (generated from only non-reply messages), in general model A underperforms model B by roughly 5%. The five percent gain in accuracy of model B may be economically, if not statistically, significant and should not be
30 We choose 140 files per group because the sell sentiment class under the reply group has only 140 files; we therefore use all the available data for the sell-reply group.
ignored. This suggests that, in practice, reply messages should be excluded from the training data when building a text classifier in order to increase accuracy by roughly 5%.

                                    VII. Summary and Conclusions

        Based on artificial intelligence technology and statistical methodology, this study proposes three important improvements that extend the previous literature on the construction of an online investor sentiment index. First, we explicitly consider and measure the importance of reply versus non-reply messages in generating a sentiment index for internet financial message boards. Results suggest that a 12.5% discount should apply to reply messages, such that reply messages carry a weight of 0.875 and non-reply messages a weight of 1.0 when calculating a sentiment index. Second, after addressing the class imbalance problem in the training data and using a better class assignment procedure, eight text classifiers are tested and compared, and the best text classification algorithm is determined to be Maximum Entropy. Noteworthy is that Maximum Entropy not only is superior to all classifiers tested (including Naive Bayesian and Support Vector Machine, the classifiers most commonly used in the literature) but also is amenable to noisy training data. Third, using training data that consist of only non-reply messages can improve a text classifier's recognition accuracy by approximately 5%. All three improvements aim to provide a better proxy for online investor sentiment so that researchers have a better vehicle with which to evaluate the relationship between individual investor sentiment and stock price movements, investors can make more prudent investment decisions, and policymakers can better monitor internet message board influences on equity markets.

References

Admati, A.R. & Pfleiderer, P.C. (2001) Disclosing Information on the Internet: Is It Noise or Is It News?,
Technical report, Graduate School of Business, Stanford University.

Antweiler, W. & Frank, M.Z. (2004) Is All That Talk Just Noise? The Information Content of Internet
Stock Message Boards, Journal of Finance, 59(3): 1259-1295.

Baker, M.P. & Wurgler, J. (2006) Investor Sentiment and the Cross-Section of Stock Returns, Journal of
Finance, 61(4): 1645-1680

Bender, O., Macherey, K., Och, F.J. & Ney, H. (2003) Comparison of Alignment Templates and
Maximum Entropy Models for Natural Language Understanding, Proceedings of the tenth conference on
European chapter of the Association for Computational Linguistics, Budapest, Hungary: 11–18

Berger, A.L., Della Pietra, V.J. & Della Pietra, S.A. (1996) A maximum entropy approach to natural
language processing, Computational Linguistics, 22(1):39-71

Brown, G.W. & Cliff, M.T. (2004) Investor sentiment and the near-term stock market, Journal of
Empirical Finance, 11:1-27

Brown, G.W. & Cliff, M.T. (2005) Investor Sentiment and Asset Valuation, Journal of Business, 78:405-

Chan, S.Y. & Fong, W.M. (2004) Individual Investors' Sentiment and Temporary Stock Price Pressure.
Journal of Business Finance & Accounting, 31(5-6), 823-836

Chen, N.F., Kan, R. & Miller, M.H. (1993) Are the Discounts on Closed-End Funds a Sentiment Index?,
Journal of Finance, 48:795-800

Choi, W.S., Cho, J.M. & Seo, J. (1999) Analysis system of speech acts and discourse structures using
maximum entropy model, Proceedings of the 37th conference on Association for Computational
Linguistics, Morgan Kaufman, San Francisco, 230-237

Darroch, J.N. & Ratcliff, D. (1972) Generalized iterative scaling for log-linear models, The Annals of
Mathematical Statistics, 43:1470-1480.

Das, S.R. & Chen, M.Y. (2003) Yahoo! for Amazon: Sentiment Parsing from Small Talk on the Web,
Proceedings of the 8th Asia Pacific Finance Association annual conference, Working paper.

Das, S.R. & Chen, M.Y. (2007) Yahoo! For Amazon: Sentiment extraction from small talk on the Web,
Management Science forthcoming

Das, S.R., Martinez-Jerez, A. Fa. De. & Tufano, P. (2005) eInformation: A Clinical Study of Investor
Discussion and Sentiment, Financial Management, 34(3):103-137

Das, S.R. & Sisk, J. (2005) Financial Communities, Journal of Portfolio Management, v31(4): 112-123

Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977) Maximum likelihood from incomplete data via the
EM algorithm, Journal of the Royal Statistical Society, B.39, 1:1-38.

Estabrooks, A., Jo, T. & Japkowicz, N. (2004) A Multiple Resampling Method for Learning from
Imbalanced Data Sets, Computational Intelligence, 20(1):18-36.

Elton, E.J., Gruber, M.J. & Busse, J.A. (1998) Do Investors Care About Sentiment? Journal of Business

Fisher, K.L. & Statman, M. (2000) Investor sentiment and stock returns, Financial Analysts Journal
(March-April): 16-23

Friedman, J.H. (1997) On Bias, Variance, 0/1-loss, and the Curse of Dimensionality, Data Mining and
Knowledge Discovery, 1:55-77.

Furey, T.S., Cristianini, N., Duffy, N., Bednarski, D.W., Schummer, M. & Haussler D. (2000) Support
vector machine classification and validation of cancer tissue samples using microarray expression data,
Bioinformatics, 16:906-914

Gu, B., Konana, P., Liu, A., Rajagopalan B., & Ghosh, J. (2006) Predictive Value of Stock Message
Board Sentiments, working paper, the University of Texas at Austin.

Japkowicz, N. & Stephen, S. (2002) The class imbalance problem: A systematic study, Intelligent Data
Analysis Journal, 6(5), 429-450

Kaniel, R., Saar, G. & Titman, S. (2004) Individual Investor Sentiment and Stock Returns, working paper,
Duke University.

Kullback, S. & Leibler, R.A. (1951) On information and sufficiency, Annals of Mathematical Statistics,

Kohavi, R. & Provost, F. (1998) Glossary of Terms, Machine Learning, 30:271-274.

Kumar, A. & Lee, C.M.C. (2006) Retail Investor Sentiment and Return Comovements, Journal of
Finance, 61, 2451-2486

Lee, C.M.C., Shleifer, A. & Thaler, R.H. (1991) Investor Sentiment and the Closed-End Fund Puzzle,
Journal of Finance, 46: 75-109.

Li, R.L., Wang, J.H., Chen, X.U., Tao, X.P. & Hu, Y.F. (2005) Using maximum entropy model for
Chinese text categorization, Jisuanji Yanjiu yu Fazhan (Comput. Res. Dev.), 42(1): 94-101

Macskassy, S.A., Hirsh, H., Banerjee, A.& Dayanik, A.A. (2002) Converting numerical classification into
text classification, Artificial Intelligence, 143(1):51-77

Maron, M.E. & Kuhns, J.L. (1960) On Relevance, Probabilistic Indexing and Information Retrieval,
Journal of the Association for Computing Machinery, 7:216-244.

Neal, R. & Wheatley, S.M. (1998) Do Measures of Investor Sentiment Predict Returns? Journal of
Financial and Quantitative Analysis, 33: 523-547

Nigam, K., Lafferty, J. & McCallum A.K. (1999) Using Maximum Entropy for Text Classification, In
Proceedings of IJCAI99 Workshop on Machine Learning for Information Filtering, 61-67.

Olson C.F. (1995) Probabilistic indexing for object recognition, IEEE Transactions on Pattern Analysis
and Machine Intelligence, 17(5): 518-521

Pettibone, J. & Pon-Barry, H. (2003) A Maximum Entropy Approach to Recognizing Discourse Relations
in Spoken Language, Working paper, The Stanford Natural Language Processing Group. June 6.

Ratnaparkhi, A. (1997) A simple introduction to maximum entropy models for natural language
processing, Technical Report 97-08, Institute for Research in Cognitive Science, University of

Ratnaparkhi, A. (1998) Maximum Entropy Models for Natural Language Ambiguity Resolution, PhD
thesis, University of Pennsylvania

Raychaudhuri, S., Chang, J.T., Sutphin, P.D. & Altman, R.B. (2002) Associating Genes with Gene
Ontology Codes Using a Maximum Entropy Analysis of Biomedical Literature, Genome Research,

Rennie, J.D.M., Shih, L., Teevan, J. & Karger, D.R. (2003) Tackling the Poor Assumptions of Naive
Bayes Text Classifiers, In T. Fawcett and N. Mishra (eds), International Conference on Machine
Learning, Washington D.C, Morgan Kaufmann

Rocchio, J.J. (1971) Relevance feedback in information retrieval. In G. Salton, (ED.), The SMART
Retrieval System: Experiments in Automatic Document Processing, page 313-323, Prentice-Hall.

Tumarkin, R. & Whitelaw, R.F. (2001) News or Noise? Internet Message Board Activity and Stock
Prices, Financial Analysts Journal, 57(3), 41-51.

Vapnik, V. (1995) The Nature of Statistical Learning Theory, Springer-Verlag: New York.

Wysocki, P.D. (1999) Cheap Talk on the Web: The Determinants of Postings on Stock Message Boards,
Working paper, University of Michigan, 1999

Yang, Y. (1994) Expert Network: Effective and Efficient Learning from Human Decisions
in Text Categorization and Retrieval, In Proceeding of the 17th annual international conference on
research and development in information retrieval (SIGIR’94), page 13-22.

                                                                                    Table I
                                                                      Sentiment Distributions
                 Data are downloaded from Thelion!WallStreetPit and cover the messages that meet our test requirements from July 18, 2005 through March 31,
                 2006, totaling 5,898 messages. Self-reported sentiments include strong buy, buy, hold, sell, strong sell, and short. During Market Trading
                 measures posting activity that occurs from 9:30:00 AM to 4:00:00 PM ET. Holiday posting activity is omitted.

                     Sentiment Type             Messages          Percentage               Specific Partition                 Messages         Percentage
                         strong buy                 4425             75.03%                    Weekday:                         5740              97.32%
                            buy                     730             12.38%                    Weekend:                          158               2.68%
                           hold                     124              2.10%             During Market Trading:                  4354              73.82%
                            sell                    45               0.76%            Not During Market Trading                1544              26.18%
                        strong sell                159              2.70%                      Reply:                          3398              57.61%
                           short                   415               7.04%                      Non-reply:                     2500              42.39%

                                                                                   Table II

                                        Regression Model to Test Discount Weight on IsReply variable

We include the fixed effect and random effect regression models in this table:

                                               Fixed effect: LPCi = β0+ µi+ β1LMPi+β2 LMLi+β3 LMCi+β4 SENi+β5MOi+β6TPi+ρ IsReplyi+ εi

                                              Random effect: LPCi = β0+ τi+ β1LMPi+β2 LMLi+β3 LMCi+β4 SENi+β5MOi+β6TPi+ρ IsReplyi+ εi

Where: LPCi is the logarithm of the poster's credit; µi is the fixed effect intercept dummy variable controlling for stock i; τi is the random effect error term
controlling for stock i; LMPi is the logarithm of the number of messages posted by the poster; LMLi is the logarithm of the length of a message; LMCi is the logarithm of the
stock's market capitalization; SENi is the scale of sentiment types, where -3 is short, -2 strong sell, -1 sell, 0 hold (scalp), 1 buy, and 2 strong buy; MOi = 1 for a weekday
(market open) and MOi = 0 for a weekend (market closed); TPi = 1 if posted during the trading period and TPi = 0 otherwise; IsReplyi = 1 for a reply post and
IsReplyi = 0 for a non-reply post. t-statistics are reported in parentheses. The number of observations is 5,898 with 966 groups. The fixed effect model has an R-square of 0.8.

                                                      LMPi            LMLi           LMCi           SENi            MO i            TPi          IsReplyi           Constant

    Fixed Effect                Coefficient          0.986            0.085         -0.042          -0.083         -0.017         -0.133           -0.125            -2.344

                                  T-Stat            (85.79)***       (6.94)***        (-0.8)         (-7)***        (-0.69)       (-1.98)**       (-6.13)***        (-3.78)***

  Random Effect                 Coefficient          1.005            0.087          0.004          -0.079         -0.031         -0.139           -0.114            -2.897

                                  T-Stat           (111.43)***       (8.27)***       (0.48)        (-8.98)***       (-1.42)       (-2.38)**       (-6.08)***        (-18.55)***

   Hausman Test                   (13.43)*

* Significant at 0.05 level
** Significant at 0.025 level
*** Significant at 0.01 level

                                                                         Table III
                                 Accuracy Measures for Unadjusted Confusion Matrix

Panel A: RAINBOW In-Sample for NB (224 out of 500—44.8 % accuracy).
   Class Name              0 strong buy          1 buy       2 hold        3 sell         4 strong sell         : total        Accuracy
  0 strong buy                  40                13           31            7                 9                 :100          40.00%
      1 buy                     30                27           26            8                 9                 :100          27.00%
      2 hold                     3                 1           78            9                 9                 :100          78.00%
       3 sell                    2                11           28           35                 24                :100          35.00%
  4 strong sell                  2                 5           27           22                 44                :100          44.00%

 Panel B: RAINBOW In-Sample for ME (270 out of 500—54 % accuracy).
  Class Name              0 strong buy           1 buy      2 hold         3 sell         4 strong sell         : total        Accuracy
  0 strong buy                 51                 25          9             7                  8                 :100          51.00%
     1 buy                     16                 53          13            9                  9                 :100          53.00%
      2 hold                   13                 10          58           15                  4                 :100          58.00%
      3 sell                   5                  10          12           57                  16                :100          57.00%
  4 strong sell                10                  4          12            23                 51                :100          51.00%

Note: For each trial, there are 350 files for each class; training data occupy 71.43% (250 files) and testing data occupy 28.57% (100 files). Each
cell's result is the average of five trials.

                                                                         Table IV
                                     Accuracy Measures for Adjusted Confusion Matrix

                        Panel A: NB In-Sample Adjusted Confusion Matrix
                                                  (313 out of 500—62.6% accuracy).
                          ClassName            0 buy-side       1 hold           2 sell-side    : total     Accuracy
                          0 buy-side             110                57               33             :200    55.00%
                            1 hold                 4                78               18             :100    78.00%
                          2 sell-side             20                55              125             :200    62.50%

                        Panel B: ME In-Sample Adjusted Confusion Matrix
                                                   (350 out of 500—70% accuracy).
                          ClassName            0 buy-side       1 hold           2 sell-side    : total     Accuracy
                          0 buy-side             145                22               33            :200     72.50%
                            1 hold                23                58               19            :100     58.00%
                          2 sell-side             29                24              147            :200     73.50%

                         Note: Based on the original five-class average output, the adjusted output merges strong buy
                         and buy into buy-side, strong sell and sell into sell-side, and leaves hold unchanged. After merging,
                         there are 200 files for the buy-side class, 100 files for the hold class, and 200 files for the sell-side class.

                                                                   Table V
                  Comparison of Relative Accuracy of Text Classification Algorithms

ME represents Maximum Entropy, SVM-Support Vector Machine, TFIDF-Term Frequency Inverse Document Frequency, PRIND-Probabilistic
Indexing, KL-Kullback-Leibler divergence, EM-Expectation Maximization, NB-Naive Bayesian, and KNN-K nearest neighbor. The classifiers in
this table are sorted from left to right by the percentage under “Adjusted In-Sample”. “Adjusted” means combining the reported number in buy
and strong buy recorded from RAINBOW output into buy-side sentiment, sell and strong sell into sell-side sentiment, and leaving hold
unchanged. Similar groupings are indicated as "Sentiment Side" when allocating training data. KNN drops a significant number of files during
the test, suggesting that KNN has relatively low recognition power.

   Test Accuracy / Algorithm                ME           SVM        TFIDF      PRIND         KL          EM          NB          KNN

              In-Sample                     54%         45.8%       46.4%      47.8%       45.8%       46.2%       44.8%       37.42%
      RAINBOW Standard Error               (0.57)      (1.15)        (1.06)     (0.99)      (0.55)      (0.77)     (0.49)       (0.86)

         Adjusted In-Sample                 70%         65.8%       65.6%      65.4%       65.2%        64%        62.6%      58.42%
      Calculated Standard Error            (0.95)        (1.68)      (0.93)     (1.13)      (0.89)      (1.22)     (0.87)       (0.73)

            Out-of-Sample                 44.35%       35.77%       39.66%    36.15%       39.35%      42.1%      41.92%       28.26%

      Adjusted Out-of-Sample              61.27%       51.86%       54.9%     49.79%       53.07%       58%       58.93%       48.15%

     Sentiment Side In-Sample               72%          69%        68.33%    67.67%       68.67%      70.7%      69.33%       56.74%
      RAINBOW Standard Error               (1.15)        (1.32)      (0.9)      (1.48)      (0.91)      (0.97)     (0.64)       (1.03)

       Sentiment Side Out-of-Sample        60.68%       58.48%       58.37%    55.24%       56.97%      56.9%      57.27%      53.44% b

      20NewsGroup In-sample               92.33%       91.71%       91.7%     85.44%       92.15%       91%       92.22%        84.1%
      RAINBOW Standard Error               (0.23)        (0.72)      (0.17)     (0.82)      (0.06)      (0.18)     (0.05)       (0.85)

a. KNN drops 19 files in the strong sell class.
b. KNN drops 30% of the data during testing (the classifier recognizes only 70% of files), which may inflate the result due to a small number of observations.
                                                                    Table VI
                                        Chi-square Comparison of Text Classifiers
Panel A. Adjusted In-sample: Chi-Square results between pairs of classifiers. Except KNN, all other classifiers have 200 files for buy-side
class, 100 files for hold class and 200 files for sell-side class in the confusion matrix (totaling 500 files). KNN drops 19 files in sell-side
leaving a total of 481 files in three classes.
   Accuracy               70%           65.80%         65.60%         65.40%          65.20%            64%          62.60%           58.42%
                          ME             SVM           TFIDF          PRIND             KL              EM             NB              KNN
      ME                    0            6.46        24.2***         15.61***        13.69***         11.2**        17.05***         11.88**
      SVM                                  0         16.62***        14.90***          8.01            6.96         14.70***         17.02***
     TFIDF                                              0              2.46            0.77            0.82           0.84           45.37***
     PRIND                                                               0             3.41            2.45           2.05           48.24***
       KL                                                                                0             0.27            1.5           35.49***
      EM                                                                                                 0            1.33           38.99***
       NB                                                                                                               0            43.34***
      KNN                                                                                                                               0
Panel B. Adjusted Out-of-Sample: Chi-square results between pairs of classifiers. There are 700 files in the buy-side class, 350 files in the hold
class and 700 files in the sell-side class in the confusion matrix (totaling 1750 files). Each classifier drops a small number of
unrecognizable files; except for KNN, which drops a large number of files, these differences are not significant.
    Accuracy            61.27%          58.93%         57.99%          54.90%         53.07%          51.86%         49.79%          48.15%
                          ME              NB             EM            TFIDF            KL             SVM           PRIND            KNN
       ME                   0             2.95           6.54           7.18          10.80*          10.45*        38.57***        44.63***
        NB                                  0            4.58           8.88           5.56          14.12***       24.24***        63.16***
       EM                                                  0           10.64*          2.85          16.89***       21.83***        79.46***
      TFIDF                                                              0            10.40*           3.18         33.39***        34.77***
        KL                                                                               0           16.17***       13.22**         83.72***
       SVM                                                                                               0          44.62***        20.56***
      PRIND                                                                                                            0            206.98***
      KNN                                                                                                                              0
Panel C. Sentiment side In-Sample: Chi-square results between pairs of classifiers. Except for KNN, each classifier has 100 files in the
buy-side class, 100 files in the hold class and 100 files in the sell-side class in the confusion matrix (totaling 300 files). KNN drops 18 files in the
sell-side class, leaving 282 files across the three classes.
    Accuracy              72%           70.67%         69.33%           69%           68.67%          68.33%         67.67%          56.74%
                          ME              EM             NB             SVM             KL            TFIDF          PRIND            KNN
       ME                   0             4.5            7.36           1.16            2.19           2.56            5.98          23.14***
       EM                                  0             4.63           4.86            1.37           1.75            2.55         67.13***
        NB                                                 0            3.62            2.52           4.44            2.98         69.95***
       SVM                                                                0             1.38           2.33            4.14         43.76***
        KL                                                                                0            0.74            0.82         61.01***
      TFIDF                                                                                              0             1.91         48.52***
      PRIND                                                                                                             0           100.17***
      KNN                                                                                                                              0
Panel D. Sentiment side Out-of-Sample: Chi-square results between pairs of classifiers. For each out-of-sample test, there are three classes
(buy-side, hold and sell-side), each with 350 files for testing (totaling 1050 files). Each classifier drops a small number of
unrecognizable files; except for KNN, which drops a large number of files, these differences are not significant.

    Accuracy            60.68%          58.48%         58.37%         57.27%          56.97%         56.93%          55.24%          55.86%
                          ME             SVM           TFIDF            NB              KL             EM            PRIND            KNN
       ME                   0            1.19         14.81***        9.65*          16.52***       16.53***          8.65         88.52***
       SVM                                 0          26.51***       17.95***        28.39***       28.39***        15.19***       81.44***
      TFIDF                                                    0            0.75           0.77           0.8           6.48         127.13***
        NB                                                                    0            0.55          0.555          3.11         149.73***
        KL                                                                                   0          0.0005          4.02         169.06***
        EM                                                                                                 0            3.98          35.40***
       PRIND                                                                                                              0          204.23***
        KNN                                                                                                                              0
       * Significant at the 0.05 level with degrees of freedom = 4
       ** Significant at the 0.025 level with degrees of freedom = 4
       *** Significant at the 0.01 level with degrees of freedom = 4

       Note: In each panel, KNN drops a significantly large number of files, which makes its Chi-square comparison results less meaningful. We report
       the Chi-square test results for KNN, but their significance may be due to the loss of degrees of freedom. Therefore, for comparisons between
       any other text classifier and KNN, we focus on the accuracy comparison in percentage terms.
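The pairwise statistics in the panels above are Pearson chi-square tests with four degrees of freedom. The paper does not spell out the exact contingency-table construction, but a 2 × 5 table (two classifiers by five outcome categories) yields (2 − 1)(5 − 1) = 4 degrees of freedom; the sketch below, using hypothetical counts, shows how such a statistic and its significance markers could be computed.

```python
# Pearson chi-square test for a contingency table.
# Hypothetical setup: the exact table construction is not stated in the
# paper; here each row holds one classifier's counts over five outcome
# categories, giving (2-1)(5-1) = 4 degrees of freedom.

def chi_square(table):
    """Return the Pearson chi-square statistic for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Critical values of the chi-square distribution with df = 4,
# matching the significance levels used in the panels above.
CRITICAL_DF4 = [(13.277, "***"), (11.143, "**"), (9.488, "*")]

def stars(stat):
    """Map a statistic to the paper's significance markers (df = 4)."""
    for critical, marker in CRITICAL_DF4:
        if stat > critical:
            return marker
    return ""

# Hypothetical counts for two classifiers over five categories.
table = [[50, 10, 10, 10, 20],
         [30, 20, 20, 15, 15]]
stat = chi_square(table)
print(round(stat, 2), stars(stat))  # prints: 13.38 ***
```

The marker thresholds mirror the note above: a statistic above 9.488 earns one star, above 11.143 two, and above 13.277 three.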

                                                                         Table VII
                                   Comparison of Accuracy of Reply and Non-reply Models

Panel A: Classification of Training Data

                             Training data                                                       Sentiment Categories
              Reply In-Sample with 140 files in each class:                   strong buy         buy        hold         sell a            strong sell
           Non-reply In-Sample with 140 files in each class:                  strong buy         buy        hold         sell              strong sell
       Reply sentiment side sample with 350 files in each class:              buy-side b                    hold                           sell-side c
     Non-reply sentiment side sample with 350 files in each class:            buy-side                      hold                           sell-side

  a The sell sentiment category under the Reply In-Sample has only 140 files, all of which are used as training data; for the remaining categories we randomly select 140 files.
  b Buy-side includes messages labeled strong buy and buy. 350 files are randomly selected from the pool of buy-side messages.
  c Sell-side includes messages labeled sell, strong sell and short. 350 files are randomly selected from the pool of sell-side messages.

Panel B: ME’s Accuracy Comparison between Reply and Non-reply

                                              Test 1
                  Model A: Reply In-Sample                                    54.2%
                  RAINBOW Standard Error                                       0.97
                  Model B: Non-reply In-Sample                                60.5%
                  RAINBOW Standard Error                                       0.62
                  Chi-square result between Models A and B                     2.91

                                              Test 2
                  Model A: Adjusted Reply In-Sample                          67.84%
                  Calculated Standard Error                                    1.56
                  Model B: Adjusted Non-reply In-Sample                       74.5%
                  Calculated Standard Error                                    0.97
                  Chi-square result between Models A and B                     2.12

                                              Test 3
                  Model A: Reply sentiment side In-Sample                    77.67%
                  RAINBOW Standard Error                                       0.75
                  Model B: Non-reply sentiment side In-Sample                79.33%
                  RAINBOW Standard Error                                       1.34
                  Chi-square result between Models A and B                     2.47

                                              Test 4
                  Model A: Reply sentiment side Out-of-Sample                65.95%
                  Model B: Non-reply sentiment side Out-of-Sample            69.94%
                  Chi-square result between Models A and B                     7.57

        a In-Sample tests execute five trials each, while Out-of-Sample tests execute one trial.
        * Significant at the 0.05 level with degrees of freedom = 4
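The in-sample standard errors in Panel B come from repeated trials (five per in-sample test, per note a). The paper does not state exactly how the RAINBOW standard error is computed; one standard definition is the standard error of the mean accuracy across trials, s/√n, sketched below with hypothetical per-trial accuracies.

```python
# Standard error of the mean accuracy across repeated trials.
# Assumption: the RAINBOW standard error is not defined in the paper;
# this sketch uses the common definition s / sqrt(n), where s is the
# sample standard deviation of the per-trial accuracies.
import math
import statistics

def mean_and_se(accuracies):
    """Return (mean accuracy, standard error of the mean)."""
    mean = statistics.mean(accuracies)
    se = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, se

# Hypothetical per-trial accuracies (%) for five in-sample trials.
trials = [60.0, 62.0, 61.0, 59.0, 63.0]
mean, se = mean_and_se(trials)
print(round(mean, 2), round(se, 2))  # prints: 61.0 0.71
```

Under this definition, the reported accuracy would be the mean over the five trials and the standard error would shrink with the square root of the number of trials.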

