Issues in Creating a Sentiment Measurement Based on Internet Stock Message Boards
Ying (Clement) Zhang
The University of Texas at Arlington
yzhang@uta.edu

and

Peggy E. Swanson*
The University of Texas at Arlington
swanson@uta.edu

JEL Classification: G11; G12

Keywords: Internet Stock Message Boards, Sentiment Index, Text Classifiers, Artificial Intelligence

Abstract

Recent studies have attempted to answer the question whether sentiment extracted from stock message boards has predictive value. A major obstacle in previous works when testing the relation between sentiment indexes (measures of bullishness) and subsequent stock returns is how to best construct a sentiment index. This paper focuses on two crucial aspects of index creation. The first relates to how various messages should be evaluated, i.e., are reply messages as relevant as non-reply (original) messages in constructing the index. The second issue relates to identifying the best text classification algorithm to use when creating the sentiment index. Our results indicate that a reply message should be given a 0.125 discounted weight when constructing the sentiment index and that the Maximum Entropy text classifier is superior to the Naive Bayesian or Support Vector Machine, among others, for creating the sentiment index from stock message boards.

*Corresponding Author
Peggy E. Swanson
John and Judy Goolsby Distinguished Professor and Professor of Finance
Department of Finance and Real Estate
UTA Box 19449
The University of Texas at Arlington
Arlington, TX 76019
Email: swanson@uta.edu
Phone: 817 272 3841
FAX: 817 272 2252
Electronic copy of this paper is available at: http://ssrn.com/abstract=968576
I. Introduction
Rapid growth in stock message boards, chat rooms, and other means for investors to share market information makes clear the ever-increasing interest in internet stock trading.1 As an example, a powerful and popular board tracker, TheLion.com, tracks 50 million message postings on more than 25,000 message boards and related blog sites every day as of December 2006.2 In addition to the increasing number of sites, participation in these sites has exploded. Of specific interest to investors and to policymakers is the effect internet postings of sentiment have on stock returns. The sentiments are closely watched by investors who seek input to enhance their portfolio performance and by regulators who are concerned with stock price manipulation and information leakage. The influence of aggregated sentiments from individual investors has been shown to be far from just noise, as evidenced by abnormal stock movements, especially around the 2000 internet bubble, and by internet stock message board studies such as those of Tumarkin and Whitelaw (2001), Antweiler and Frank (2004), Das and Sisk (2005), Das, Martínez-Jerez and Tufano (2005), Das and Chen (2003, 2007) and Gu, Konana, Liu, Rajagopalan and Ghosh (2006), among others.

To determine whether the sentiment extracted from online investors’ commentaries is associated with stock price movements, researchers often create a sentiment index based on modern artificial intelligence techniques as a vehicle to test the relationship between messages and returns. The sentiment index should be unbiased, accurate, and sufficiently representative of individual investor consensus. However, due to problems in data handling (message selection) and selection methods (text classifier selection), current sentiment indexes are often problematic and lacking in representative power. Weaknesses of the sentiment index directly affect the results of testing the relationship between investors’ sentiment and stock price movements.

1 There are many prevailing public bulletin boards among investors, such as Yahoo!Finance, RagingBull, Thelion!WallStreetPit, Motley, CBSMarketWatch, ClearStation, SiliconInvestor, etc. The most popular site based on the number of posts per day is Yahoo!Finance, followed by RagingBull. Thelion!WallStreetPit is becoming more and more popular among investors, reporting more than 1 million messages by the end of 2006, due to its organized platform, all-in-one messages search engine and account abuse-free monitoring system. Chat rooms, such as Reedfloren and B4utrade, are also available for investors. However, unlike bulletin boards, chat rooms do not keep historical archives of messages, making retrieval of historical data impossible.

2 For information on Thelion.com, please see http://www.thelion.com/aboutus/ and http://thelion.com/aboutus/ir/
This study proposes improvements over past studies in both data handling and methodology selection for constructing a sentiment index with high accuracy and low bias to proxy for individual investor sentiment. For data handling, we propose that reply messages have appreciably less influence on the forum community than non-reply (original) messages, and that a discount weight should be given to a reply message when constructing a sentiment index. We also suggest that an information imbalance problem and an inferior class assignment procedure could affect the sentiment index’s representative power. For methodology selection, we conduct a thorough evaluation and comparison of eight of the most popular text classification algorithms, with the result that the Maximum Entropy classifier is indicated to be the superior algorithm for recognizing online messages that are in the format of conversational discourse in spoken language. With the improvements proposed in this paper, which have not been addressed in previous finance literature, we enhance the ability to construct a quality proxy for individual investor sentiment, thereby providing researchers a better vehicle for testing the relationship between individual investor sentiment and stock price movements.

II. Background and Literature Review
II.A. Does investor sentiment lead or lag security prices?

Sentiment, known as the feeling about or tone of a security (a market), is reflected in the activity and price movement of the security (market). Investors Intelligence reports the sentiments of investment newsletter writers, Barron’s reports the sentiments of individual investors, and many other financial service firms track both analysts’ and individual investor sentiments. One group of researchers believes investors’ sentiment is an ex post phenomenon, arguing that causality is from the security (market) movement to the sentiment change. Rising prices would indicate a
bullish market sentiment while a bearish market sentiment would be indicated by falling prices. In support, Chen, Kan and Miller (1993) reject the claim that individual investor sentiment drives the returns of stocks held and traded by individual investors. A similar result is found by Elton, Gruber and Busse (1998). Chan and Fong (2004) report that the publication of individual investors’ sentiment in the column of Individual Investors’ Sentiment Index (IISI) of Apple Daily in Hong Kong fails to predict the coming week’s return on large, medium, or small stocks. Similarly, Brown and Cliff (2004), using survey data and short-term stock market returns conclude that sentiment has little predictive power for short term future stock returns, even though sentiment levels are correlated with contemporaneous market returns. They suggest that past market returns strongly affect investors’ sentiment. Another group of researchers reject this argument by showing that investors’ sentiments do affect the market and conclude that sentiment is an ex ante factor to the market. Sentiment should reflect investors’ opinions of the future; otherwise, disclosing sentiment based on historical information has no value to investors. Under this argument, when investors provide a sentiment, that sentiment is a leading indicator. Lee, Shleifer and Thaler (1991) propose that individual investor sentiment causes fluctuations in the discount of closed-end funds. Individual investors’ pessimistic opinions cause discounts on closed-end funds while optimistic sentiments narrow the discounts. By testing the relation between individual investor sentiment and returns for a long period (1933 to 1993), Neal and Wheatley (1998) conclude that two proxies for individual sentiment—discounts on closed-end funds and net mutual fund redemptions— both have predictive power for the difference between small and large firm returns. 
Fisher and Statman (2000) find the relationship between the sentiment of both small investors and large investors and future S&P 500 returns to be significantly negative, concluding that market sentiment provides contrarian forecasting power for future S&P returns. Similar findings are reported by Kumar and Lee (2006), Brown and Cliff (2004), Baker and Wurgler (2006) and Kaniel, Saar and Titman (2006). More recently, Brown and Cliff (2005) further support the argument that investors’ sentiment affects asset valuation by showing that
overly optimistic (pessimistic) investors drive prices above (below) fundamental values and these pricing errors tend to revert over a multi-year horizon.

II. B. Previous measures of investor sentiment

The determination of whether investors’ sentiment is a lagging or leading factor depends upon the specific sentiment measurement selected and its level of accuracy and power. Although there is no perfect way to proxy broad investor sentiment, past literature has applied numerous proxies. An early study by Lee, Shleifer and Thaler (1991) suggested that closed-end fund discounts are a proxy for individual investor sentiment. Other proxies proposed include those of Neal and Wheatley (1998), who use the level of discounts on closed-end funds, the ratio of odd-lot sales to purchases, and net mutual fund redemptions; Kumar and Lee (2006), who measure the changes in retail sentiment by calculating the buy-sell imbalance ratios; Kaniel, Saar and Titman (2004), who use NYSE’s Consolidated Equity Audit Trail Data (CAUD) to construct a daily measure of investor sentiment; and Brown and Cliff (2005), who use survey data from Investors Intelligence (II), which tracks the number of market newsletters that are bullish, bearish, or neutral, to establish investors’ sentiment. A more exhaustive study of sentiment is presented by Baker and Wurgler (2006), who consider six separate proxies as well as a composite index for investor sentiment: average closed-end fund discount, NYSE share turnover, number of IPOs (NIPO) and average first-day returns on IPOs (RIPO), equity share in new issues, and dividend premium. This allows an examination of the robustness of the various measures. Fisher and Statman (2000) utilize three different data sources to construct sentiments for large, medium and small investors separately. For large investors’ sentiment, they use data from Merrill Lynch’s compilation of the sentiment of Wall Street sell-side strategists.
For medium investors, they use Investors Intelligence (II) published by an investment service company, Chartcraft, and for small investors, measurement comes from the American Association of Individual Investors (AAII) which conducts a sentiment survey among its members.

II. C. Innovations in the measurement of investor sentiment
Even when using the same data source to construct a sentiment proxy, studies report inconsistent results. For instance, based on closed-end fund discounts as a proxy for individual investor sentiment, Chen, Kan and Miller (1993) disagree with Lee, Shleifer and Thaler (1991). Fisher and Statman (2000) classify investors in three different groups and proxy the semiprofessional investors by using Investors Intelligence (II) as their source, while Brown and Cliff (2005) treat the same II as a proxy for general investors’ opinions. Because there are diverse approaches and inconsistent results relating to how to correctly measure investor sentiment, researchers continue to seek additional insight and new information to identify more unbiased and efficient measures.

In recent studies, one prominent group of researchers has been using the sentiment posted by individual investors on online stock message boards to proxy for investor sentiment. Starting in the mid 1990s, many financial websites, such as Yahoo!Finance, RagingBull and Thelion.com, set up online stock message boards to allow people to share information and post opinions online. Moreover, starting in the early 2000s, most of these message boards provided a sentiment function where posters can explicitly disclose their sentiment on a voluntary basis. For instance, Yahoo!Finance allows a poster to choose one of the following sentiments: strong buy, buy, hold, sell, strong sell, and do not disclose (by default). Similar functions can be found on RagingBull message boards, where both short-term sentiment and long-term sentiment options can be chosen. Moreover, in addition to the above sentiment choices, Thelion!WallStreetPit offers Short and Scalp sentiment options.3 Unlike chat rooms or live conversation forums, stock message boards store all the historical information of the messages and allow other users to browse or retrieve the information later.4 Since the historical content and sentiment of each post are retrievable, one can calculate the average or aggregated sentiment from the posted messages during a certain time period. This sentiment can be used as a proxy for individual investor sentiment.5

The broadest measure of online investors’ sentiment would require all the available historical messages on the internet. Such, of course, is not possible and selection must be limited by availability. Among all the online stock message boards, a few dominate posting activities based on the number of messages posted every day and the number of users. Many studies include these dominant message boards while omitting less active forums, hoping the former provide adequate representativeness to derive a meaningful sentiment index. Wysocki (1999) uses the messages exclusively from Yahoo!Finance after comparing the posting activities between Yahoo and The Motley Fool; Tumarkin and Whitelaw (2001) collect messages from RagingBull only; Das, Martínez-Jerez and Tufano (2005) use The Motley Fool; Antweiler and Frank (2004) choose Yahoo!Finance and RagingBull as data sources; and Das and Chen (2007) focus on Yahoo!Finance. Instead of using only a single data source, Das and Sisk (2005) download data from four online message boards: Yahoo!Finance, The Motley Fool, RagingBull and Silicon Investor. Gu, Konana, Liu, Rajagopalan and Ghosh (2006) select Yahoo!Finance only.

Online posters may voluntarily provide sentiment, meaning that not all messages include self-reported sentiment. Because the sentiment index should exhibit minimal bias, ideally it would include all messages both with and without self-reported sentiment. Tumarkin and Whitelaw (2001) and Gu, Konana, Liu, Rajagopalan and Ghosh (2006) use only self-reported sentiment messages and ignore all the messages without self-reported sentiment, but Antweiler and Frank (2004) argue that omitting messages without self-reported sentiment generates bias. They suggest including all messages when generating a proxy for sentiment. However, messages without self-disclosed sentiment require interpretation of sentiment based on reading the content.

3 The short sentiment means to short sell the stock, while the scalp sentiment means to buy or sell quickly with possible day trade profit but does not disclose specific sentiment.

4 A small number of historical messages are deleted by the web administrator due to violation of posting policy.

5 While it is possible that brokerage firms, institutional investors or third parties hire people to disseminate sentiment on the forums in order to influence stock price, this is hopefully unlikely and inefficient since it is easy for the SEC to detect these kinds of activities by using IP address techniques.
We know that human interpretation has bias and is subjective since what we understand may not be consistent with the poster’s intent. Fortunately, for messages without self-reported sentiment, mature artificial intelligence methods for text classification can aid in determining the best and most unbiased sentiment based on linguistic structure and key words in the content without human intervention. A text classifier utilizes its computational linguistic algorithm to recognize and then memorize distinct characteristics among different classes, such as buy and sell, in the training data. Training data is a pool of self-reported sentiment messages that are learned by the text
classification algorithm. Based on what it has learned, the classification algorithm builds a model that can distinguish among classes. After the model is built, the text classifier is set up as a classifier server and is ready to evaluate individual messages. After receiving an individual message from the client program, the classifier server compares the message’s characteristics with the model built from the training data and returns a probability score under each class for the message. For a detailed flowchart for using a text classifier, see Das and Chen (2003). With the help of this new text retrieval technology and artificial intelligence methodology, researchers in finance now have the ability to generate an improved proxy for individual investor sentiment. The aim of this paper is to utilize these new abilities and to provide a foundation for constructing an improved sentiment index from internet stock message boards. We suggest that previously inconsistent and insignificant relationships between individual investor sentiment and subsequent stock returns may be due to low representative power and measurement error of the sentiment indices used.

II. D. Issues in Creating an Effective Sentiment Index

II.D.1: Reply messages differ from non-reply messages in the construction of a sentiment index.

Our sentiment index formula is an extension of Antweiler and Frank (2004) (AF hereafter), who recognize that reply and non-reply messages indicate different levels of importance. They suggest giving different weights to messages based on message characteristics such as message length, number of messages posted by the same author, and number of citations (reply messages). However, they do not differentiate reply from non-reply messages because the number of citations a particular message receives is unknown at the time it is posted, causing serious problems for their time sequencing analysis. We propose a different approach.
Rather than attempting to measure the reply posts that follow a specific original post in a time sequence order, we simply identify all reply messages independent of the original post to which they relate. Any further reply to an existing reply is uniformly treated as a reply. Thus, a message is either a reply or non-reply. If reply and non-reply messages carry different levels of
importance and if we can generate a weight for reply messages to differentiate this heterogeneity, the sentiment index should be more accurate and representative. We propose a formula based on AF (2004)’s bullishness ratio R_t where we incorporate reply versus non-reply messages using sentiment types from Thelion!WallStreetPit. The ratio of bullish to bearish messages, R_t (see AF 2004, 1266-1267), becomes:

$$
R_t = \frac{\sum_{i=1}^{N^{b}_{D(t)}} \frac{1}{t_i}\,(1-\rho)^{\gamma}\, x_i^{B}\, \ln(\delta_i)}{\sum_{j=1}^{N^{s}_{D(t)}} \frac{1}{t_j}\,(1-\rho)^{\gamma}\, x_j^{S}\, \ln(\delta_j)} \qquad (1)
$$

where
$N^{b}_{D(t)}$ is the number of messages that have buy-side sentiment (such as buy and strong buy) during time period D(t).
$N^{s}_{D(t)}$ is the number of messages that have sell-side sentiment (such as sell and strong sell) during D(t).
t is the number of messages posted by the same author during D(t).
$x_i^{B}$ = 1 if the post has a buy sentiment and $x_i^{B}$ = 2 if it has a strong buy sentiment.
$x_j^{S}$ = -1 if the post has a sell sentiment, $x_j^{S}$ = -2 if it has a strong sell sentiment, etc.
ln(δ) is the natural logarithm of message length in number of characters.6
ρ is the weight to capture the discount on reply messages.
γ has two values: γ = 1 if the post is a reply, otherwise γ = 0.

Our focus is the computation of R_t, which includes the measurement of ρ, the discount factor for reply messages. ρ is expected to be less than 1 in value if reply messages under the same sentiment type have less influence on the community than non-reply messages. Previous literature on financial stock message boards has not differentiated between reply and non-reply messages, assuming all the messages (reply and non-reply) have the same construction pattern and importance. We argue that failing to differentiate reply messages from non-reply messages by giving different weights to the two can decrease a sentiment index’s representative power.

II. D.2. Identifying the best text classifier for internet message board messages
6 More content is associated with more information and value. A log function is used because accumulated information exhibits a concave relationship, in which the marginal increase of information is decreasing.
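Before turning to classifier selection, the bullishness ratio of Equation (1) can be made concrete with a short sketch. The code below is only an illustration under stated assumptions, not the authors' implementation: the `Post` record and the function name `bullishness_ratio` are hypothetical, and the per-author count t is assumed to be taken over all of that author's posts in the period D(t).

```python
import math
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical record for one message posted during D(t)."""
    author: str
    sentiment: int   # x: +1 buy, +2 strong buy, -1 sell, -2 strong sell
    length: int      # delta: message length in characters
    is_reply: bool   # gamma = 1 for a reply, 0 for a non-reply


def bullishness_ratio(posts, rho):
    """Sketch of R_t in Equation (1) for the posts of one period D(t)."""
    # t: number of messages each author posted during D(t)
    # (assumption: counted over all of the author's posts in the period).
    per_author = {}
    for p in posts:
        per_author[p.author] = per_author.get(p.author, 0) + 1

    def term(p):
        gamma = 1 if p.is_reply else 0
        return ((1.0 / per_author[p.author]) * (1.0 - rho) ** gamma
                * p.sentiment * math.log(p.length))

    buy_side = sum(term(p) for p in posts if p.sentiment > 0)
    sell_side = sum(term(p) for p in posts if p.sentiment < 0)
    if sell_side == 0:
        raise ValueError("no sell-side messages in this period")
    return buy_side / sell_side


# A reply with strong-buy sentiment against one non-reply sell message.
example = [Post("alice", 2, 100, True), Post("bob", -1, 100, False)]
ratio = bullishness_ratio(example, rho=0.125)
```

Because the sell-side scale x^S is negative as defined in the text, the ratio computed this way is negative; its magnitude is what compares buy-side to sell-side weight. Setting rho to 0 recovers the case in which replies and non-replies are weighted equally.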
In order to generate a highly representative sentiment index for a stock during a certain time period D(t), we must include all the messages posted for that stock on the message board during that time. However, since not all messages posted come with self-reported sentiment, text classifiers are essential in obtaining the best and most unbiased sentiment type for those messages without self-disclosed sentiment. There exist many popular text classifiers, such as Naive Bayesian (NB), Support Vector Machine (SVM), K nearest neighbor (KNN), Term Frequency Inverse Document Frequency (TFIDF), Expectation Maximization (EM), Kullback-Leibler divergence (KL), Probabilistic Indexing (PI), and Maximum Entropy (ME), but they have not been tested in the context of financial message board analysis. AF (2004) select the NB text classifier, compare the NB classifier results to those derived from the SVM classifier, and report similar findings for the two classifiers. Das and Chen (2007) create and test their own five text classifiers and devise a voting system which selects the consensus result from at least three out of five classifiers. In non-financial fields, different classifiers appear to be appropriate for different circumstances. For instance, in recognizing tissue samples in the biomedical field, SVM outperforms other classifiers (Furey, et al. (2000)), while in object recognition, PI is a promising approach (Olson (1995)). Unlike research in other areas such as computer science, biology, physics and music, where text classification methods have been thoroughly examined and evaluated, financial message board studies are few, and comparisons of the accuracy and efficiency of existing text classification algorithms are almost nonexistent. This study fills that literature gap by comparing and evaluating eight of the most frequently used text classifiers and identifying the best text classifier for recognizing posters’ sentiment.
NB is the most popular algorithm for text classifiers in financial message board studies, but it relies upon a strong conditional independence assumption about the observed dataset. Such a severe assumption often generates systemic problems that reduce the accuracy of the probability estimates (Rennie, Shih, Teevan and Karger, (2003)). Recent research has found Maximum Entropy (ME), which makes no conditional independence assumptions, to be a promising technique for text classification. Ratnaparkhi (1998) reports that ME probability models for text categorization outperform the decision tree when trained and tested under identical or similar conditions; Nigam, Lafferty and McCallum (1999)
find that ME is competitive with, and two out of three times better than, NB on experimental results; and Raychaudhuri, Chang, Sutphin and Altman (2002) identify ME to be the most successful algorithm for classifying biomedical articles. Similar results can be found in Macskassy, Hirsh, Banerjee and Dayanik (2003) and Li, Wang, Chen, Tao and Hu (2005).

Importantly, we believe the messages posted on financial stock message boards exist in the form of conversational discourse. Unlike articles in newspapers or reports from analysts, forum messages posted by individual investors are style free, short, elliptical and in a dialogue-like format. Therefore, forum messages are more appropriately treated as conversational discourse than as any other format, such as an article or lecture. This assumption is consistent with Admati and Pfleiderer (2001), who study the pattern of internet stock forums and treat the process of broadcasting messages as communication among people. Berger, Pietra and Pietra (1996), Choi, Cho and Seo (1999) and Pettibone and Pon-Barry (2003) report that for the classification of conversational discourse relations, ME has consistently exhibited high performance.7 If forum messages exist in the form of fragmentary utterances, which is a common phenomenon in spoken language, ME should be a promising classifier with high potential accuracy. Bender, Macherey, Och and Ney (2003) and Fernandez, Ginzburg and Lappin (2005) support this standpoint based on their empirical results.

A second aspect of the classifier identification process is to test consistency of performance when separating training data into reply and non-reply messages. A text classifier is built based on the training data, which should contain as much relevant and useful information as possible without severe noise. Noisy training data can significantly affect a text classifier’s performance.
Earlier studies of financial stock message boards have not differentiated between reply and non-reply messages when selecting the training data, thereby assuming that all the messages (reply and non-reply) have the same construction pattern and importance. Based on this assumption, so long as a message comes with self-disclosed sentiment, it is a good candidate for training data without considering whether the message is a reply or non-reply. However, if a reply message is significantly different from a non-reply message in terms of relevant information of content, a classifier’s performance may be lower under noisy training data (when reply messages are included). A robustness test of separating the training data into clean (non-reply messages only) and noisy (reply messages only) allows determination of whether the classifier is amenable to noisy training data.

7 Although there is the chance that some posters paste articles or news, most messages posted on forums appear to be conversational discourse.

In summary, this paper argues that one reason no significant relationship between a sentiment index and future stock returns has been found in the literature may be imperfect construction of the sentiment index. We propose that the quality can be improved by (1) giving a discounted weight to reply messages relative to non-reply messages in the sentiment index formula, (2) selecting a better text classifier with only non-reply messages as training data, (3) applying a better class assignment procedure, and (4) using an equal number of training messages in each one of the classes to avoid information biases. The following section describes the methodology, Section IV sets forth construction of the data, Section V reports descriptive statistics and Section VI presents empirical results. Summary and conclusions are provided in Section VII.

III. Methodology

III. A. Overview

This study tests three hypotheses which are described below.

III. B. Hypothesis I: Reply messages differ in importance from non-reply messages and should be treated differently when constructing the sentiment index.

For this hypothesis, a “reply” discount weight ρ is identified by use of a fixed effect regression model to capture the different influence on the financial community of reply and non-reply messages. Although AF (2004) suggest heavier weights for posts which generate replies, they cannot differentiate the replies because of the nature of the data set used. We argue that the different impact on the community between reply and non-reply messages should not be ignored and that it is crucial to differentiate between these two functionally different types of messages. The data set used in this paper is from Thelion!WallStreetPit and allows this distinction. Thus, we not only capture the different influence on the community generated by reply and non-reply messages, but also avoid the time sequence problem of AF (2004).
Determination of the magnitude of the discount for reply messages requires an assessment of their influence on the community. Fortunately, Thelion’s organized forum and membership credit system allows us to test the effect of reply and non-reply messages by using the credit score as a proxy for influence. 8 A poster’s credit consists of cumulative reward points voluntarily given by other forum participants. A poster’s credit score represents its credibility and reputation among other participants and the higher the score the greater the chance more people will read its post reflecting more community influence. 9 One approach to measuring the effect of reply versus non-reply messages is to evaluate each’s influence on the poster’s credit. Based on past literature and observations of message board behaviors, a poster’s credit is dependent upon many factors represented, in general, by the following model in Equation (2). In this fixed effects regression model variables X1i through X3i are exogenous variables to user’ credit, X4i through X8i are control variables, X9i is a dummy variable to control for days the markets are open, and X10i is a dummy variable to test the hypothesis that reply messages have less influence on the community than non-reply messages: Yi = β0+µi+ β1X1i+ β2X2i+ β3X3i+ β4X4i+ β5X5i+ β6X6i+ β7X7i + β8X8i i = 1, ..., n Where: Y i is a proxy for influence to the message board community for stock i β0 is a constant
Tumarkin and Whitelaw (2001) point it out that Thelion!WallStreetPit is a private bulletin board. However, membership on Thelion!WallStreetPit is costless and there is no hierarchy membership system. There is no limitation to read, to post and to reply to any message on the forum by any user. Thelion!WallStreetPit has its credit system and high-end statistical tools such as All-InOne stock information search engine. Only high-end statistical tools require the user to have prepaid credits, but these valuable functions are to let paid members search information more quickly and efficiently. These features do not affect the liberty of posting, reading and replying to messages like other public message boards and therefore we should not put Thelion!WallStreetPit into the “private” bulletin category.
9 8
+
β9X9i
+
ρX10i,t
+
εi
,
(2)
According to the poster credit’s property that it can be given but not be taken, using the absolute poster’s credit (as reported in each post after the poster’s name) to represent influence may be biased. In order to control this bias, we generate a relative measure that is equal to the absolute poster’s credit divided by the number of total message posted by the poster. If the poster does not add value to the community, the total number of messages posted increases but the poster’s credit remains unchanged which dilutes its influence. Although using relative poster’s credit may logically solve the influence bias problem empirically, it probably does not reflect actual events. It is unlikely a reader will divide a poster’s credit by the poster’s total number of messages every time he/she reads a post. Moreover, by using the relative credit measure, it is possible that the influence of recent contributions with a relatively small number of posts is diluted by a large number of valueless posts in its past. Therefore, we prefer the absolute poster’s credit as a proxy for influence based on the argument that forum users differentiate quality posts from inferior posts purely based on the poster’s absolute credit. However, for comparative purposes we report results for both relative poster’s credit and absolute poster’s credit.
µi is a fixed effect dummy variable controlling for stock i.
X1i is the number of messages posted by the poster since the first day of its registration.
X2i is the number of other forum participants who put the poster into their watch list.10
X3i is the number of days since the first day the poster registered.
X4i controls for message length in number of characters.
X5i controls for corporate size and is the total outstanding shares times the closing market price on the message posting day.
X6i is the market closing price of the recommended stock in a post.
X7i is the sentiment scale, where 1 = buy, 2 = strong buy, 0 = hold (scalp), -1 = sell, -2 = strong sell and -3 = short.
X8i is a dummy variable to control for whether or not the message is posted during the trading session.
X9i is a dummy variable to control for market open days (Monday through Friday).11
X10i is the dummy variable indicating whether the post is a reply (dummy = 1) or a non-reply (dummy = 0).
Of specific interest is the X10i coefficient, which indicates whether a reply and a non-reply post have a different influence on the community. If the coefficient ρ is significantly negative, the implication is that reply posts have less influential power on the community than non-reply posts, and vice versa.
III. C. Hypothesis II: A superior text classifier for classifying message board messages can be identified.
We demonstrate the process of comparing text classifiers’ performance with the confusion matrix technique together with the Chi-square test to identify the most consistent and accurate classifier. Eight different widely used text classifiers are investigated: ME (Maximum Entropy), NB (Naive Bayesian), SVM (Support Vector Machine), KNN (K Nearest Neighbor), TFIDF (Term Frequency Inverse Document Frequency), EM (Expectation Maximization), PRIND (Probabilistic Indexing) and KL (Kullback-Leibler Divergence).
Maximum Entropy proves to be the best classifier (see Empirical Results section below) and it is the one discussed in detail in this section. For detailed information about the remaining algorithms see Friedman (1997) for NB; Vapnik (1995) for SVM; Rocchio (1971) for TFIDF; Dempster, Laird and Rubin (1977) for EM; Yang (1994) for K-NN; Kullback and Leibler (1951) for KL and Maron and Kuhns (1960) for PRIND.
10 Any messages posted by the posters who are in a user’s watch list will be highlighted to that user as an indication of important messages.
11 We exclude all holidays during which the market is closed.
Maximum entropy is a method for analyzing known information from training data in order to determine a unique epistemic probability distribution that satisfies given constraints. The principle of maximum entropy states that when nothing is known, the distribution should be as uniform as possible. Therefore, the least biased model that encodes the given information is the one that maximizes the uncertainty measure H(p), the conditional entropy, while remaining consistent with this information. “Maximum entropy probability models offer a clean way to combine diverse pieces of contextual evidence in order to estimate the probability of a certain linguistic class occurring with a certain linguistic context.” (Ratnaparkhi, 1998, page 6)
The following parameters and terminology are adopted for demonstrating ME:
1. $\theta_j$: a class or category, such as strong buy, buy, hold, sell or strong sell.
2. $\vec{\theta}_j$: a vector of classes, $\vec{\theta}_j = \{\theta_1, \cdots, \theta_p\}$.
3. $d_i$: a document with self-disclosed sentiment in the training data.
4. $\vec{d}_j$: a vector of documents in the training data set, $\vec{d}_j = \{d_1, \cdots, d_q\}$.
5. $d_m$: a document whose class is unknown and which needs to be evaluated by the text classifier.
6. $x_i$: a feature parameter or attribute of a single word in a document. For example, the key words in the category “strong buy” that relate to its sentiment may include “up”, “right”, “nice”, etc. Key words in the “strong sell” sentiment type may be “off”, “wrong”, “weak”, etc.
7. $\vec{x}_i$: a vector of features, $\vec{x}_i = \{x_1, \cdots, x_n\}$.
8. $w$: the weight or a weight vector which describes the word’s importance.
The task is to observe linguistic contexts $d_m$ which contain $\vec{x}_m$ and predict the correct linguistic class $\theta_j$ to which $d_m$ belongs. This includes constructing a maximum entropy classifier that can be implemented with a conditional probability distribution p, such that $p(\theta_j \mid d_m)$ is the probability of class $\theta_j$ given $d_m$.
To build the ME classifier model, we first define $\vec{\theta}_j = \{\theta_1, \theta_2, \ldots, \theta_p\}$ as the set of classes of interest in the system ($\vec{\theta}_j$ = {strong sell, sell, hold, buy, strong buy} in this paper) and $\vec{d}_j = \{d_1, d_2, \ldots, d_q\} = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_q\}$ as the set of contexts or textual materials that we can observe from training data. The training data $T_n = \{(\theta_1, d_1), \ldots, (\theta_p, d_q)\}$ is a large set of contexts $d_1, \ldots, d_q$ that have been assigned to their correct classes $\theta_1, \ldots, \theta_p$ that express certain relationships between features and outcomes.12 To build a normalized probability distribution model based on this training data, the classifier weights the features by using them in a log-linear model:

$$p(\theta_j \mid d_i) = \frac{1}{Z(d_i)} \prod_{k=1}^{K} w_k^{f_k(\theta_j, d_i)} \qquad (3)$$

$$Z(d_i) = \sum_{\theta_j} \prod_{k=1}^{K} w_k^{f_k(\theta_j, d_i)} \qquad (4)$$

where $Z(d_i)$ is a normalization factor to ensure that $\sum_{\theta_j} p(\theta_j \mid d_i) = 1$. Each feature function $f_k(\theta_j, d_i)$ is a binary function with the value of either 1 or 0:

$$f_k(\theta_j, d_i) = \begin{cases} 1, & \text{if } d_i \text{ belongs to a word class } \theta_j \\ 0, & \text{otherwise} \end{cases}$$

Each parameter $w_k$, where $w_k > 0$, corresponds to one feature $f_k$ and can be interpreted as a “weight” for that feature. The parameters $\{w_1, \cdots, w_K\}$ are found with Generalized Iterative Scaling (Darroch and Ratcliff (1972)). The probability model $p(\theta_j \mid d_i)$ is a normalized product of those observed features.
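As a rough illustration of equations (3) and (4), the following Python sketch computes the normalized class probabilities from a set of weights and binary feature values. The function name `me_probabilities`, the three example weights, and the feature assignments are hypothetical and chosen only for illustration; the weights would in practice be estimated with Generalized Iterative Scaling, which is not reproduced here.

```python
from math import prod

def me_probabilities(weights, features):
    """Evaluate equations (3) and (4) for one document.

    weights  -- list of K positive weights w_k (assumed already estimated)
    features -- dict mapping each class to its K binary values f_k(class, d)
    """
    # Unnormalized score per class: product over k of w_k ** f_k
    scores = {c: prod(w ** f for w, f in zip(weights, fs))
              for c, fs in features.items()}
    z = sum(scores.values())  # normalization factor Z(d), equation (4)
    return {c: s / z for c, s in scores.items()}

# Hypothetical weights and feature activations for a single document
weights = [3.0, 2.0, 0.5]
features = {"buy": [1, 1, 0], "hold": [0, 0, 0], "sell": [0, 0, 1]}
probs = me_probabilities(weights, features)
```

With these illustrative numbers the unnormalized scores are 6, 1 and 0.5, so the buy class receives 6/7.5 = 0.8 of the probability mass and the three probabilities sum to one.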
12 We assume the self-disclosed sentiment originated by the poster is the correct class for the message.

To evaluate a test document $d_m$ whose correct class is unknown at the time, the classifier chooses a distribution p that maximizes the entropy H(p). To maximize entropy is to maximize the conditional entropy $H(p) = H(\theta_j \mid d_m)$ subject to the above constraints (3) and (4) (Ratnaparkhi (1997)):13

$$H(\theta_j \mid d_m) = -\sum_{\theta_j, d_m} p(\theta_j, d_m) \cdot \log_n p(\theta_j \mid d_m) \qquad (5)$$
p* = argmax H(θj | dm), where p* is the distribution p(θj | dm) that maximizes the entropy H(θj | dm) under the constraints of (3) and (4), making the model match feature expectations with those observed in the training data. The classifier returns a p* for each class θj according to maximum entropy. Rather than choosing the class θj with the highest p*, we use our own process for selecting the most accurate class θj for dm. A better class assignment procedure is to determine at least the correct sentiment side (sentiment index sign). For example, suppose the returned value p* for each of the four sentiment types is strong buy 29%, buy 28%, sell 30% and strong sell 13%. If we follow the traditional rule and select the category with the highest p*, then sell is our target. However, the sum of strong buy and buy (57%) exceeds the sum of strong sell and sell (43%), which suggests that the aggregate buy-side sentiment is stronger than the sell-side sentiment. Therefore, buy sentiment is more appropriate than sell sentiment as our determined sentiment in this example.
Given a pool of available self-disclosed (pre-known) sentiment type messages, an easy way to examine a text classifier’s recognition accuracy is to check the power of the classifier: how many of the sentiments assigned to messages by the classifier match the messages’ self-disclosed sentiment types. The higher the matched percentage, the higher the classifier’s power. In order to obtain the accuracy of each text classifier and statistically compare accuracy among them, we use the confusion matrix technique and the Chi-square test. A confusion matrix, as provided by Kohavi and Provost (1998), contains information about actual classes and evaluated classes and provides the means for comparing the eight classifiers
tested. Performance of such systems is commonly evaluated using the data in a matrix. Each column of the matrix represents the instances given by the text classifier, while each row represents the instances in an actual class. A cross-tabulation of occurrence frequencies is constructed. For example, following is a confusion matrix with 100 messages under each sentiment type. The diagonal cells show the number of messages that are correctly assigned by a text classifier to their actual sentiment type. The accuracy for this classifier is 80%, which is the average of the accuracies for the individual sentiment types:

Sentiment Type    0 buy    1 hold    2 sell    : total    Accuracy
0 buy                80        10        10      : 100      80.00%
1 hold                5        90         5      : 100      90.00%
2 sell               10        20        70      : 100      70.00%

13 See Berger et al. (1996), Lau et al. (1993), and Rosenfeld (1996), who use the equivalent format of $H(\theta_j \mid d_m) = -\sum_{\theta_j, d_m} \tilde{p}(d_m)\, p(\theta_j \mid d_m) \cdot \log_n p(\theta_j \mid d_m)$, where $\tilde{p}(d_m)$ is the observed probability and $p(\theta_j \mid d_m)$ is the conditional probability.
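The sentiment-side assignment rule and the confusion-matrix accuracy described above can be sketched in Python as follows. This is a minimal sketch rather than the authors' software: the function names `sentiment_side` and `per_class_accuracy` are hypothetical, the handling of an exact tie is an assumption (the text does not specify it), and the matrix is entered with rows as actual classes and columns as classifier assignments, as the text states.

```python
def sentiment_side(p):
    """Choose the side whose aggregated class probabilities are larger."""
    buy = p.get("strong buy", 0.0) + p.get("buy", 0.0)
    sell = p.get("strong sell", 0.0) + p.get("sell", 0.0)
    if buy > sell:
        return "buy"
    if sell > buy:
        return "sell"
    return "tie"  # assumption: ties are not discussed in the text

def per_class_accuracy(matrix):
    """matrix[r][c]: count of actual class r assigned to class c."""
    return [row[r] / sum(row) for r, row in enumerate(matrix)]

# Probabilities from the four-class example in the text
p_star = {"strong buy": 0.29, "buy": 0.28, "sell": 0.30, "strong sell": 0.13}
side = sentiment_side(p_star)  # buy side: 0.57 versus 0.43

# Example confusion matrix (rows: actual buy, hold, sell)
confusion = [[80, 10, 10],
             [5, 90, 5],
             [10, 20, 70]]
accuracies = per_class_accuracy(confusion)
average_accuracy = sum(accuracies) / len(accuracies)
```

Selecting the single highest probability would return sell (30%), whereas the aggregated rule returns the buy side, which is the distinction the example in the text draws.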
Next, to determine whether any pair of our eight distinct text classifiers are significantly different from each other, one classifier’s confusion matrix is compared to another classifier’s confusion matrix that has the same number of rows and columns by using the standard Chi-square statistic [Das and Chen (2007)]:

$$\chi^2_{[(n-1)^2]} = \frac{1}{n^2} \sum_{i=1}^{n^2} \frac{(S_i - I_i)^2}{I_i}, \quad \text{with } (n-1)^2 \text{ degrees of freedom} \qquad (6)$$

where i indexes the corresponding cell in each classifier’s confusion matrix and n is the number of classes in the confusion matrix. S represents the superior classifier, and I represents the inferior classifier. We assign the classifier with the higher confusion matrix accuracy to the superior model S and the less accurate classifier to the inferior model I. If the two models are significantly different from each other, the superior model has statistically higher performance. The better of the two classifiers is then tested sequentially against the other classifiers, and eventually the classifier with the highest accuracy and consistency is selected as the best text classifier algorithm.14
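A minimal Python sketch of the statistic in equation (6) is given below. The function name `chi_square_stat` and the "superior" matrix are hypothetical; the "inferior" matrix reuses the illustrative three-class example from the confusion matrix discussion. The sketch assumes every cell of the inferior matrix is positive, since the formula divides by it, and it compares the result with 9.488, the standard 5% critical value for 4 degrees of freedom, instead of computing a p-value.

```python
def chi_square_stat(superior, inferior):
    """Return (statistic, degrees of freedom) following equation (6)."""
    n = len(superior)
    total = 0.0
    for s_row, i_row in zip(superior, inferior):
        for s, i in zip(s_row, i_row):
            total += (s - i) ** 2 / i  # requires every inferior cell > 0
    return total / n ** 2, (n - 1) ** 2

# Illustrative matrices (rows: actual class, columns: assigned class)
inferior = [[80, 10, 10], [5, 90, 5], [10, 20, 70]]
superior = [[90, 5, 5], [5, 90, 5], [5, 10, 85]]  # hypothetical counts

stat, dof = chi_square_stat(superior, inferior)
CRITICAL_5PCT_DF4 = 9.488  # chi-square critical value, 4 d.o.f., 5% level
significant = stat > CRITICAL_5PCT_DF4
```

For these made-up counts the statistic is roughly 1.88 with 4 degrees of freedom, so the two hypothetical classifiers would not be judged significantly different at the 5% level.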
14 Das and Chen (2007) suggest using a vote system among their five classifiers, i.e., 3 of 5 classifiers should agree on the message sentiment type, otherwise the message is not classified. Since all the classifier methods used in their paper are analytical and do not require optimization or search algorithms, a vote system does a better job than any single classifier. However, unlike those of Das and Chen, the classifier methods in this paper require optimization or search algorithms. Additionally, the problem of overfitting the data is addressed by testing both the In-Sample and Out-of-Sample performance. Thus we do not use a vote system in our classifier selection process.
III. D. Hypothesis III: A superior classifier exhibits strong performance under both reply messages only training data and non-reply messages only training data.
This hypothesis is an inference from hypothesis I as well as a robustness test for hypothesis II. In order to test whether a classifier’s performance holds up under noisier training data (inclusion of only reply messages), we separate the training data into two groups: a reply messages group and a non-reply messages group. Each group has an equal number of messages under each sentiment type. We compare the performance of ME under the reply and non-reply groups. The null hypothesis is that ME’s performance is not statistically different between the two groups. The comparison is made with the confusion matrix together with the Chi-square test.
IV. Data
The data in this study is collected from the Thelion!WallStreetPit message board through our self-programmed Visual Basic software.15 Only the messages that provide self-reported sentiment are downloaded. Moreover, in line with our three hypothesis tests, we arrange our data in three ways. Each arrangement of data serves a specific testing purpose. First, for hypothesis I, which tests the value of the discount weight for reply messages in the sentiment index formula, the data requires real time downloads (see below) and covers July 18, 2005 through March 31, 2006; messages are downloaded only if they have a reply or are themselves a reply to another message.16 Since this test does not relate to text classifiers, we do not use a text classifier to generate sentiment. Second, for hypothesis II, the data is extended to cover July 2, 2001 through March 31, 2006 due to an insufficient number of messages for
15 Thelion!WallStreetPit is a stock trading forum that allows people to post their opinions for any stock with the stock symbol listed before the message title. Unlike Yahoo!Finance and RagingBull, which allocate messages under the stock symbol, Thelion shows all the messages on the same platform and sorts them by time. Messages posted on Thelion!WallStreetPit include both self-reported and non-self-reported sentiment messages. For details of Thelion’s forum structure, go to http://thelion.com/bin/forum.cgi?tf=wall_street_pit. Our Visual Basic software filters out the poster’s disclaimer and signature appended to the message content as well as any unrecognized symbols.
16 We know that an original post (non-reply) may later generate reply/replies, but we do not know in advance whether or not an original post will have reply/replies in the future. This is the time series problem that A&F encounter. However, we are able to identify whether a post is a reply or not at the time it is posted. We simply identify all reply messages independent of the original post to which they relate in order to avoid the time series problem. When downloading data, if a message is a reply, its title starts with “RE:”. In addition, only a reply message has a “Reply to Msg” key word embedded in the web page, which allows us to identify the previous message to which it replies. For example, if a reply post has “Reply to Msg 1234” and post 1234 itself is not already a reply, then post 1234 is treated as an original post. Any original posts whose reply/replies occur after our time period (later than March 31, 2006) are not included in our sample.
testing this hypothesis from the preceding data set (July 2005 to March 2006).17 Each message’s content is stored in a text file. These text files are used as training and testing data for the text classifiers. Third, for hypothesis III, the data is the same as for hypothesis II but separated into reply and non-reply message groups. More detail about each data arrangement is provided below.
IV.A. Data for Hypothesis I: Fixed effect regression model
The selection criteria for the data in this arrangement are (1) a message itself either is a reply post or has reply/replies to it, (2) a message has a stock symbol associated with it, since not all the messages talk about specific stocks, and (3) a message provides a self-reported sentiment in order to control for sentiment type. Although the poster’s credit shown on the web reflects its current credit and is updated whenever new credit is given, Thelion!WallStreetPit does not offer historical member credit. Thus, this test is valid only when the credit is the real time credit at the time the message is posted. Although Thelion!WallStreetPit started in 1999 and has over one million messages to date, the valid data for our model starts July 18, 2005, when we began retrieving real time data by downloading messages every trading day after the market close. The poster’s credit for each message before July 2005 is time mismatched in that the downloading day is not the same day as the message posting day.
IV.B. Data for Hypothesis II: Selecting the best text classifier
As indicated above, only self-reported sentiment messages are used because the messages used as training data must have pre-known (true) sentiment types to be categorized and compared with the sentiment given by the text classifier. To generate as little bias as possible from the eight text classifiers studied, the classifiers need training data for each of the classes, and noise in that data needs to be minimized.
Based on these criteria, Thelion!WallStreetPit message data is selected over Yahoo!Finance data for three reasons: (1) Thelion!WallStreetPit controls duplicated identity problems by using a credit system, which prevents a poster from generating influence on the community by using multiple accounts. (2) Thelion!WallStreetPit’s forum administrator has a simpler job than the Yahoo!Finance forum
17 Thelion.com started its message board in 1999 with its first message on Nov. 29, 1999. However, the sentiment function was not available until July 2, 2001, which is the starting date for the hypothesis II data.
administrator since the forum is more organized and experiences a relatively smaller message flow. (3) Users who violate forum policy or abuse their accounts will be warned or suspended, further preventing account abuse. As pointed out by Das and Chen (2007), a small sample of between 300 and 500 messages as training data for each class can prevent overfitting of data, which may lead to poor out-of-sample performance in spite of high in-sample accuracy. We restrict the training data to 350 messages for each of the five classes (strong buy, buy, hold, sell, and strong sell).18 The number of messages for each class must be equal in order to avoid the class imbalance problem, which causes information bias and is crucial to the accuracy of a text classifier. For further discussion of the class imbalance problem, see Japkowicz and Stephen (2002) and Estabrooks and Japkowicz (2004). The in-sample data is a subsample of the training data which is randomly selected by RAINBOW (programmed by McCallum) to test a classifier’s inner accuracy.19 High out-of-sample accuracy is the goal since the purpose of using a text classifier is to return a probability of each sentiment class for a testing (non-training) document. By utilizing data from Thelion!WallStreetPit, we report not only the in-sample training data accuracy but also, and more importantly, the out-of-sample accuracy for each of the text classifiers, something not done in earlier studies.20 We randomly select 350 messages with self-reported sentiment for each class as training data to build the classifier model.21 Then, from the remainder of messages in each class, another 350 messages, also with self-reported sentiment for each class, are randomly selected as out-of-sample data for testing purposes.
IV.C. Data for Hypothesis III: Robustness of text classifier’s performance
18 In order not to drop “Short” sentiment messages, we combine short sentiment together with sell and strong sell into sell-side sentiment messages in a separate test.
19 McCallum, Andrew Kachites. "Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering." http://www.cs.cmu.edu/~mccallum/bow. 1996. RAINBOW software can be downloaded free for academic purposes.
20 AF (2004) use message data from Yahoo!Finance in 2000, which does not have self-reported sentiment, making it impossible for them to compare the sentiment given by the poster with that given by the text classifier.
21 We program software to randomly select a number of files from a folder, such as randomly selecting 500 text files from a folder which consists of 1000 text files.
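The random file selection described in footnote 21, combined with the 350 training and 350 out-of-sample messages per class described in Section IV.B, could be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' software: the function name `split_class`, the fixed seed, and the file names are hypothetical. Drawing 700 distinct files at once and splitting them is equivalent to drawing the out-of-sample set from the remainder after the training set is chosen.

```python
import random

def split_class(files, n_train=350, n_test=350, seed=0):
    """Randomly pick disjoint training and out-of-sample sets for one class."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    picked = rng.sample(files, n_train + n_test)  # sampling without replacement
    return picked[:n_train], picked[n_train:]

# Hypothetical pool of message files for a single sentiment class
pool = [f"strong_buy_{k}.txt" for k in range(1000)]
train_files, test_files = split_class(pool)
```

The same call would be repeated for each sentiment class so that every class contributes the same number of messages, which is the balance requirement discussed in Section IV.B.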
Data in this test is the same as in the previous test. However, unlike the preceding section, where reply and non-reply messages are mixed randomly in each class, the data in this section is separated into a reply group and a non-reply group for each class so that we have ten combination classes. The purpose is to test whether the selected classifier consistently exhibits high recognition ability under noisier training data. Of all the data, 140 messages qualify under the reply-sell category. Thus, in order for each class to have an equal number of training messages, 140 messages are randomly selected as training data from the larger pool of data for each of the remaining categories.
V. Descriptive Statistics
As an introduction to the characteristics of the data, Figures A through C present various descriptive statistics of posting activities and sentiment reporting on Thelion!WallStreetPit from 2001 to 2006.22
>>>Insert figures A, B and C here<<<
Figure A shows that the use of strong sentiment (strong buy, strong sell and short) greatly exceeds the use of “normal” sentiment (buy and sell), indicating that when individuals elect to give out sentiment they usually prefer strong sentiment.23 Figure B reveals that the number of reply messages is much smaller than the number of original posts (non-reply messages) when sentiment is disclosed and that this pattern is consistent over time. Figure C shows that posting activities for all days increased from 2002 to 2005. This pattern may be due to an increasing number of investors switching to online discount brokers or to the recovery of the stock market since 2003. Posting activities during weekdays dramatically exceed those during weekends. There is little difference among weekdays even though Wednesday is the peak of posting activities. Many of the posting patterns shown in Figures A through C are in line with previous studies such as Wysocki (1999), Tumarkin and Whitelaw (2001), AF (2004), Das, Martínez-Jerez and Tufano
22 Although our data is from July 2001 to March 2006, the figures omit 2001 and 2006 because they lack a full year of observations.
23 We do not include the Scalp sentiment because it conveys day trade activities and does not disclose specific sentiment.
(2005) and Das and Chen (2003, 2007), etc. Because these studies use different data from different message boards and over different time periods, similarities among them indicate that online posters follow certain patterns when posting messages online.
VI. Empirical Results and Evaluations
VI.A. Importance of reply versus non-reply messages and determination of the discount weight ρ for reply messages
The reply effect cannot be tested directly because no direct measurement of influence on the financial community exists to indicate the extent of influence that a message generates. Among all online financial message boards, only TheLion’s forum offers real time poster credit, which is the sum of all previous credit voluntarily given by other forum participants. Poster credit is public information that can be seen by readers of the message, and we assume that the higher the poster credit, the greater the influence on the community. Although we hypothesize that, on average, reply messages have less influence on the community than non-reply messages, ceteris paribus, a poster’s credit is obviously not determined solely by whether its posts are reply or non-reply but also by other considerations such as the total number of messages the poster posts, how long the poster has participated in the community, the message length of the post, and characteristics of the stock associated with the post. Also, other posting activity variables need to be controlled, such as whether or not the message is posted during the market trading period and under which specific sentiment. The following table sets forth selected characteristics of the data used in the model to identify the appropriate discount factors to be assigned to reply messages.
>>>Insert Table I here<<<
Similar to the posting activity patterns described above, buy-side sentiment messages exceed sell-side sentiment messages, strong sentiment messages exceed “normal” sentiment messages, most posting activities occur during trading hours, and most posting activities occur during weekdays when the market is open. Again, due to the design of this test, only those posts that are replies or that are original but have reply/replies are included. Thus, because one original post can have more than one reply, the number of non-reply messages is smaller than that of reply messages in the table. Each message has a self-reported sentiment in order to control for sentiment type.
Before the model in III.B can be tested, modifications are necessary to reduce multicollinearity. The number of messages posted, membership days and number of other users watching are highly correlated with each other, and market capitalization is highly correlated with stock closing price. Therefore the number of membership days, other users watching and stock closing price are omitted from the fixed effect regression model. In addition, many of the quantitative variables are not normally distributed and are included in logarithmic form. Therefore, the modified model for testing is:

$$LPC = \beta_0 + \mu_i + \beta_1 LMP + \beta_2 LML + \beta_3 LMC + \beta_4 SEN + \beta_5 MO + \beta_6 TP + \rho\, IsReply + \varepsilon \qquad (7)$$

where:
β0 is the constant
µi is the fixed effect dummy variable controlling for stock i
LPC is the logarithm of the poster’s credit
LMP is the logarithm of the number of messages that are posted by the poster
LML is the logarithm of the length of a message
LMC is the logarithm of the stock’s market capitalization
SEN is the scale of sentiment types, with -3 for short, -2 for strong sell, -1 for sell, 0 for hold (scalp), 1 for buy, and 2 for strong buy
MO = 1 if it is a weekday (market open) and MO = 0 if it is a weekend (market closed)
TP = 1 if it is during the trading period and TP = 0 if it is not during the trading period
IsReply = 1 if it is a reply post and IsReply = 0 if it is a non-reply post.
The fixed effect model is selected because it controls for within-stock effects and the Hausman test favors it over the random effect model. Results are reported in Table II:
>>> Insert Table II here <<<
Of specific interest is the dummy variable for reply/non-reply (IsReply). Its coefficient is -0.125, which is significant at the 0.01 level, indicating that reply messages are 12.5% less influential in the forum community than non-reply messages. This suggests that the discount ρ for a reply message is 0.125 (12.5%), making the weight for a reply (in the bullishness formula) 0.875 while a non-reply message’s weight is unity (1.0).
Following Tumarkin and Whitelaw’s (2001) suggestion of eliminating messages posted by hosts of the forums, we repeat the test excluding the webmaster’s (Lionmaster) messages, and the results do not change appreciably. Moreover, five additional robustness tests are performed: (1) excluding messages whose recommended stock is less than $1 in price, (2) including messages with scalp sentiment that is assigned a zero score, (3) adding a book-to-market ratio to represent stock characteristics, (4) rather than using the absolute measure as a proxy for influence, using the relative measure which is equal to
poster’s absolute credit divided by its number of messages posted, and (5) adding a time sequence variable according to posting time and applying the Newey-West correction for serial autocorrelation. All robustness results support the original conclusions.24
VI. B. Performance Comparison among Different Text Classifiers
To determine which classifier has the greatest accuracy and consistency, we use a random mix of our reply and non-reply messages for training data and create eight separate text classification models, one for each of the eight tested algorithms. We record RAINBOW’s in-sample average accuracy output for each classifier and then test each model’s out-of-sample accuracy on the same dataset.25
Results of accuracy tests are comparable only when models are built from the same training sample and tested on the same testing sample. To conserve space, only the Naive Bayesian (NB) and Maximum Entropy (ME) confusion matrices are shown to demonstrate how accuracy is calculated. Table III sets forth the original confusion matrices from the RAINBOW output. Although the meaning of a message’s content under a strong buy sentiment should not be appreciably different from that under a buy sentiment, it should be very different from that of a sell sentiment message. In order to see how well the classifier assigns messages to at least the correct sentiment side (buy/sell), guaranteeing at least the correct sentiment score sign when constructing the sentiment index, it is appropriate to “adjust” the tests in Table III by combining buy and strong buy into buy-side, sell and strong sell into sell-side, and leaving hold unchanged, as shown in Table IV.26 Table IV shows the adjusted confusion matrices for NB and ME. The remaining classifiers’ accuracy and adjusted accuracy are calculated in the same way.
>>> Insert Table III here <<<
>>> Insert Table IV here <<<
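The adjustment from a five-class confusion matrix to a three-class sentiment-side matrix can be sketched in Python as below. The counts are hypothetical and are not taken from Table III; the function name `collapse` and the mapping `SIDE` are likewise illustrative. Rows are taken to be actual classes and columns the classes assigned by the classifier.

```python
SIDE = {"strong buy": "buy-side", "buy": "buy-side", "hold": "hold",
        "sell": "sell-side", "strong sell": "sell-side"}

def collapse(matrix, labels, side=SIDE):
    """Merge rows and columns of a confusion matrix by sentiment side."""
    sides = []
    for label in labels:  # keep sides in order of first appearance
        if side[label] not in sides:
            sides.append(side[label])
    index = {s: k for k, s in enumerate(sides)}
    out = [[0] * len(sides) for _ in sides]
    for r, row_label in enumerate(labels):
        for c, col_label in enumerate(labels):
            out[index[side[row_label]]][index[side[col_label]]] += matrix[r][c]
    return sides, out

labels = ["strong buy", "buy", "hold", "sell", "strong sell"]
five_class = [[50, 30, 10, 5, 5],   # hypothetical counts, 100 per actual class
              [25, 45, 15, 10, 5],
              [10, 10, 60, 10, 10],
              [5, 10, 15, 40, 30],
              [5, 5, 10, 30, 50]]
sides, adjusted = collapse(five_class, labels)
adjusted_accuracy = [row[k] / sum(row) for k, row in enumerate(adjusted)]
```

In this made-up case a strong buy message classified as buy counts as an error in the five-class matrix but as a correct buy-side assignment after the adjustment, which is why adjusted accuracy is at least as high as the original accuracy.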
24 Robustness test results are available from the authors upon request.
25 When setting up the classifier model, AF (2004) suggest using the top 1000 words as ranked by their importance such that only the 1000 words with the highest mutual information (a quantity that measures the mutual dependence of the two variables) are treated as features for that class. We follow this suggestion and choose the “--prune-vocab-by-infogain=1000” option for each class when running RAINBOW.
26 AF (2004) treat the hold sentiment as neutral, and, similarly, we assume hold is between buy-side and sell-side.
In Table III, the average accuracy for NB is 44.8%, compared with its adjusted accuracy of 62.6% in Table IV. For ME, the average accuracy is 54% (Table III) and its adjusted average accuracy is 70% (Table IV). Table V reports both in-sample accuracy results from RAINBOW and out-of-sample results calculated by the authors. The in-sample data consists of 350 files in each of five classes (strong buy, buy, hold, sell, and strong sell), where 28.57% (100 files for each class) are used for testing model accuracy (totaling 500 testing files), while 71.43% (250 files for each class) are used to generate the model. The out-of-sample data set also has five classes with 350 messages in each class, and all of the data is used for testing purposes (totaling 1750 testing files). In order not to drop “Short” sentiment type messages, and consistent with previous literature, we use three sentiment side classes: buy-side, hold and sell-side. The buy-side class contains 350 randomly selected files from the buy and strong buy classes, the hold class contains 350 randomly selected files from the hold class, and the sell-side class contains 350 randomly selected files from the sell, strong sell and short classes. The sentiment side out-of-sample dataset also has three classes with 350 messages in each class, and all of the data is used for testing. In-sample accuracy and standard error values are reported from five independent trials. Out-of-sample values represent one trial.
>>> Insert Table V here <<<
The table reveals that Maximum Entropy consistently outperforms the other text classification algorithms for both in-sample and out-of-sample tests. As a robustness check, we verify our results by testing the most widely used sample in the classification literature, the 20NewsGroup data.27 Our finding is unaltered. Because absolute percentage comparisons do not tell whether one classifier statistically outperforms another, the Chi-square test based on the confusion matrix is employed.
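As a small illustration of how an in-sample accuracy and its standard error might be summarized over five independent trials, consider the Python sketch below. The five accuracy values are hypothetical and are not taken from Table V, and the standard error is assumed here to be the sample standard deviation divided by the square root of the number of trials; the text does not state which formula was used.

```python
from math import sqrt
from statistics import mean, stdev

def summarize_trials(accuracies):
    """Return (mean accuracy, standard error of the mean) across trials."""
    n = len(accuracies)
    return mean(accuracies), stdev(accuracies) / sqrt(n)  # assumed SE formula

# Hypothetical in-sample accuracies from five independent trials
trials = [0.54, 0.56, 0.52, 0.55, 0.53]
avg_accuracy, std_error = summarize_trials(trials)

# Share of each class's 350 in-sample files held out for testing
test_share = 100 / 350  # about 28.57%; the other 250 files build the model
```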
To demonstrate the procedure of calculating the Chi-square statistic, we use the numbers for ME and NB in Table IV as an example. First, to test each algorithm’s own interpretive ability, each classifier is compared to a random classifier that would result in a 33% accuracy level since there are
27 20NewsGroup is sample data that can be downloaded from McCallum’s website together with the RAINBOW software. Among 20NewsGroup’s many classes, in order to use conversational discourse messages, we choose the “talk.politics.*” category, which consists of three different classes: talk.politics.misc, talk.politics.guns and talk.politics.mideast, with 1,000 messages in each.
three classes of messages [for rationale, see Das and Chen (2007)].28 Second, the standard Chi-square test,

$$\chi^2_{(4)} = \frac{1}{9} \sum_{i=1}^{9} \frac{(ME_i - NB_i)^2}{NB_i},$$

with 4 degrees of freedom, is used to compare the interpretive ability between ME and NB on the adjusted confusion matrix. The hypothesis is that ME significantly outperforms NB. As an example, the computed test statistic from Table IV is 9.66, which is significant at the 0.05 level. Comparisons between each pair of classifiers are shown in Table VI. The table includes four comparisons: (1) adjusted in-sample, (2) adjusted out-of-sample, (3) sentiment side in-sample, and (4) sentiment side out-of-sample.
>>> Insert Table VI here <<<
The out-of-sample accuracy comparison is more relevant than the in-sample comparison because the purpose of using the classifier ultimately is to obtain a sentiment score for a document that is outside the training dataset. Therefore, Panels B and D are the focus. In Panel B, ME statistically outperforms KL, SVM, PRIND, and KNN (4 of 7), and in Panel D, ME statistically outperforms all other text classifiers except PRIND and SVM (5 of 7). Based on the performance consistency, we conclude that ME is the superior classifier.29 This finding is consistent with the literature in non-financial fields that suggests that maximum entropy outperforms other classifiers such as NB and SVM for recognizing conversational discourse in spoken language.
VI.C. Robustness test for the Maximum Entropy’s Performance
With ME being selected as the best text classifier, a robustness test for its ability to sustain high performance under noisy data is conducted. Earlier tests indicate that reply posts have less influence than
28 Our Chi-square results suggest that every classifier's interpretive ability is significantly higher than that of the random algorithm. Details of these results are not reported but are available from the authors upon request.

29 Due to specific message content problems such as empty content, unrecognizable symbols, or damaged files, classifiers drop unreadable files during the training process. Different algorithms may treat such files differently, which means the number of dropped files differs across classifiers. The Chi-square test requires that the total number of messages in the confusion matrix of the inferior classifier (Ii) equal that of the superior classifier (Si). Except for KNN, all classifiers report almost equal numbers of dropped files, allowing us to use the Chi-square test between each pair of classifiers' results. For KNN, since the Chi-square result is inflated due to a smaller number of observations in the denominator, we make the comparison between KNN and other classifiers based upon the accuracy percentage.
non-reply posts on the community and hint that, when building a text classifier, the content of reply messages is noisy in terms of key words and contains less relevant information for the training dataset. We separate the data not only by sentiment class but also by reply and non-reply type. With five sentiment categories (strong buy, buy, hold, sell, and strong sell) and two message types (reply and non-reply), we have ten groups. For the in-sample test, we randomly select 140 files for each group, of which 71.43% (100 files for each group) are training data and 28.57% (40 files for each group, totaling 400 in-sample testing files) are testing data.30 Since the sell sentiment class under the reply group has only 140 files, all of which are used as training data, we do not have enough data for out-of-sample tests based on five sentiment categories. Therefore, we realign the data and separate it into three sentiment side categories (buy-side, hold, and sell-side) and into reply and non-reply groups, yielding six groups. For the sentiment side in-sample test, we randomly select 350 files for each group, of which 71.43% (250 files in each group) are training data and 28.57% (100 files in each group, totaling 600 in-sample testing files) are testing data. Then, from the remaining data, we randomly select 350 files for each of the six groups as out-of-sample testing data (totaling 2,100 out-of-sample testing files).

Panel A in Table VII gives an overview of the classification of training data by sentiment type and reply/non-reply status. Panel B compares the accuracy obtained when reply and non-reply messages are used as training data under the same text classifier, ME. The purpose of this robustness test is to check whether the ME classifier is amenable to noisier training data. The comparison procedure is again based on the confusion matrix, as in Section VI.B.

>>> Insert Table VII here <<<

ME's superior performance is essentially unchanged under the reply training data in all four tests, indicating that ME is amenable to noisy training data. Although the Chi-square results do not support a significant difference between model A (generated from only reply messages) and model B (generated from only non-reply messages), model A in general underperforms model B by roughly 5%. The five percent gain in accuracy of model B may be economically, if not statistically, significant and should not be
30 The reason we choose 140 files per folder is that the sell sentiment class under the reply group has only 140 files. Therefore, we use all the available data for the sell-reply group.
ignored. This suggests that, in practice, reply messages should be excluded from the training data when building a text classifier in order to increase accuracy by roughly 5%.

VII. Summary and Conclusions

Based on artificial intelligence technology and statistical methodology, this study proposes three important improvements that extend the previous literature on the construction of an online investors' sentiment index. First, we explicitly consider and measure the importance of reply versus non-reply messages in generating a sentiment index for internet financial message boards. Results suggest that a 12.5% discount should apply to reply messages, such that reply messages carry a weight of 0.875 and non-reply messages a weight of 1.0 when calculating a sentiment index. Second, after addressing the class imbalance problem in the training data and using a better class assignment procedure, eight text classifiers are tested and compared, and the best text classification algorithm is determined to be Maximum Entropy. Noteworthy is that Maximum Entropy not only is superior to all classifiers tested (including Naive Bayesian and Support Vector Machine, classifiers commonly used in the literature) but also is amenable to noisy training data. Third, using training data that consist of only non-reply messages can improve a text classifier's recognition by approximately 5%. All three of these improvements aim to provide a better proxy for an online investors' sentiment index, so that researchers have a better vehicle with which to evaluate the relationship between individual investor sentiment and stock price movements, helping investors make more prudent investment decisions and helping policymakers better monitor internet message board influences on equity markets.
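As an illustration of how the reply discount might enter an index calculation, the sketch below computes a weighted average of message-level sentiment scores with weights of 0.875 for reply messages and 1.0 for non-reply messages. The aggregation rule (a weighted mean), the numeric score coding, and the function name are assumptions made for this example only; the paper's own index formula is not restated in this section.

```python
REPLY_WEIGHT = 0.875      # 1.0 minus the 12.5% discount reported in the paper
NON_REPLY_WEIGHT = 1.0


def weighted_sentiment_index(messages):
    """Weighted mean of sentiment scores (illustrative aggregation only).

    messages: iterable of (score, is_reply) pairs, where score is a numeric
    sentiment value (here +1 buy-side, 0 hold, -1 sell-side; an assumption).
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for score, is_reply in messages:
        weight = REPLY_WEIGHT if is_reply else NON_REPLY_WEIGHT
        weighted_sum += weight * score
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no messages to aggregate")
    return weighted_sum / total_weight


# Two non-reply posts (buy-side, hold) and two reply posts (buy-side, sell-side).
sample = [(1, False), (1, True), (-1, True), (0, False)]
index_value = weighted_sentiment_index(sample)
```

In this toy sample the two reply posts cancel each other, so the index is driven by the single non-reply buy-side post divided by the total weight of 3.75.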
References
Admati, A.R. & Pfleiderer, P.C. (2001) Disclosing Information on the Internet: Is It Noise or Is It News?, Technical report, Graduate School of Business, Stanford University.
Antweiler, W. & Frank, M.Z. (2004) Is All That Talk Just Noise? The Information Content of Internet Stock Message Boards, Journal of Finance, 59(3): 1259-1295.
Baker, M.P. & Wurgler, J. (2006) Investor Sentiment and the Cross-Section of Stock Returns, Journal of Finance, 61(4): 1645-1680.
Bender, O., Macherey, K., Och, F.J. & Ney, H. (2003) Comparison of Alignment Templates and Maximum Entropy Models for Natural Language Understanding, Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics, Budapest, Hungary: 11-18.
Berger, A.L., Della Pietra, V.J. & Della Pietra, S.A. (1996) A maximum entropy approach to natural language processing, Computational Linguistics, 22(1): 39-71.
Brown, G.W. & Cliff, M.T. (2004) Investor sentiment and the near-term stock market, Journal of Empirical Finance, 11: 1-27.
Brown, G.W. & Cliff, M.T. (2005) Investor Sentiment and Asset Valuation, Journal of Business, 78: 405-440.
Chan, S.Y. & Fong, W.M. (2004) Individual Investors' Sentiment and Temporary Stock Price Pressure, Journal of Business Finance & Accounting, 31(5-6): 823-836.
Chen, N.F., Kan, R. & Miller, M.H. (1993) Are the Discounts on Closed-End Funds a Sentiment Index?, Journal of Finance, 48: 795-800.
Choi, W.S., Cho, J.M. & Seo, J. (1999) Analysis system of speech acts and discourse structures using maximum entropy model, Proceedings of the 37th conference on Association for Computational Linguistics, Morgan Kaufman, San Francisco: 230-237.
Darroch, J.N. & Ratcliff, D. (1972) Generalized iterative scaling for log-linear models, The Annals of Mathematical Statistics, 43: 1470-1480.
Das, S.R. & Chen, M.Y. (2003) Yahoo! for Amazon: Sentiment Parsing from Small Talk on the Web, Proceedings of the 8th Asia Pacific Finance Association annual conference, Working paper.
Das, S.R. & Chen, M.Y. (2007) Yahoo! For Amazon: Sentiment extraction from small talk on the Web, Management Science, forthcoming.
Das, S.R., Martinez-Jerez, A. Fa. De. & Tufano, P. (2005) eInformation: A Clinical Study of Investor Discussion and Sentiment, Financial Management, 34(3): 103-137.
Das, S.R. & Sisk, J. (2005) Financial Communities, Journal of Portfolio Management, 31(4): 112-123.
Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B, 39(1): 1-38.
Estabrooks, A., Jo, T. & Japkowicz, N. (2004) A Multiple Resampling Method for Learning from Imbalanced Data Sets, Computational Intelligence, 20(1): 18-36.
Elton, E.J., Gruber, M.J. & Busse, J.A. (1998) Do Investors Care About Sentiment?, Journal of Business, 71: 477-500.
Fisher, K.L. & Statman, M. (2000) Investor sentiment and stock returns, Financial Analysts Journal (March-April): 16-23.
Friedman, J.H. (1997) On Bias, Variance, 0/1-loss, and the Curse of Dimensionality, Data Mining and Knowledge Discovery, 1: 55-77.
Furey, T.S., Cristianini, N., Duffy, N., Bednarski, D.W., Schummer, M. & Haussler, D. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data, Bioinformatics, 16: 906-914.
Gu, B., Konana, P., Liu, A., Rajagopalan, B. & Ghosh, J. (2006) Predictive Value of Stock Message Board Sentiments, Working paper, The University of Texas at Austin.
Japkowicz, N. & Stephen, S. (2002) The class imbalance problem: A systematic study, Intelligent Data Analysis Journal, 6(5): 429-450.
Kaniel, R., Saar, G. & Titman, S. (2004) Individual Investor Sentiment and Stock Returns, Working paper, Duke University.
Kullback, S. & Leibler, R.A. (1951) On information and sufficiency, Annals of Mathematical Statistics, 22: 79-86.
Kohavi, R. & Provost, F. (1998) Glossary of Terms, Machine Learning, 30: 271-274.
Kumar, A. & Lee, C.M.C. (2006) Retail Investor Sentiment and Return Comovements, Journal of Finance, 61: 2451-2486.
Lee, C.M.C., Shleifer, A. & Thaler, R.H. (1991) Investor Sentiment and the Closed-End Fund Puzzle, Journal of Finance, 46: 75-109.
Li, R.L., Wang, J.H., Chen, X.U., Tao, X.P. & Hu, Y.F. (2005) Using maximum entropy model for Chinese text categorization, Jisuanji Yanjiu yu Fazhan (Comput. Res. Dev.), 42(1): 94-101.
Macskassy, S.A., Hirsh, H., Banerjee, A. & Dayanik, A.A. (2002) Converting numerical classification into text classification, Artificial Intelligence, 143(1): 51-77.
Maron, M.E. & Kuhns, J.L. (1960) On Relevance, Probabilistic Indexing and Information Retrieval, Journal of the Association for Computing Machinery, 7: 216-244.
Neal, R. & Wheatley, S.M. (1998) Do Measures of Investor Sentiment Predict Returns?, Journal of Financial and Quantitative Analysis, 33: 523-547.
Nigam, K., Lafferty, J. & McCallum, A.K. (1999) Using Maximum Entropy for Text Classification, In Proceedings of IJCAI99 Workshop on Machine Learning for Information Filtering: 61-67.
Olson, C.F. (1995) Probabilistic indexing for object recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(5): 518-521.
Pettibone, J. & Pon-Barry, H. (2003) A Maximum Entropy Approach to Recognizing Discourse Relations in Spoken Language, Working paper, The Stanford Natural Language Processing Group, June 6.
Ratnaparkhi, A. (1997) A simple introduction to maximum entropy models for natural language processing, Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Ratnaparkhi, A. (1998) Maximum Entropy Models for Natural Language Ambiguity Resolution, PhD thesis, University of Pennsylvania.
Raychaudhuri, S., Chang, J.T., Sutphin, P.D. & Altman, R.B. (2002) Associating Genes with Gene Ontology Codes Using a Maximum Entropy Analysis of Biomedical Literature, Genome Research, 12: 203-214.
Rennie, J.D.M., Shih, L., Teevan, J. & Karger, D.R. (2003) Tackling the Poor Assumptions of Naive Bayes Text Classifiers, In T. Fawcett and N. Mishra (eds.), International Conference on Machine Learning, Washington D.C., Morgan Kaufmann.
Rocchio, J.J. (1971) Relevance feedback in information retrieval, In G. Salton (ed.), The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313-323, Prentice-Hall.
Tumarkin, R. & Whitelaw, R.F. (2001) News or Noise? Internet Message Board Activity and Stock Prices, Financial Analysts Journal, 57(3): 41-51.
Vapnik, V. (1995) The Nature of Statistical Learning Theory, Springer-Verlag: New York.
Wysocki, P.D. (1999) Cheap Talk on the Web: The Determinants of Postings on Stock Message Boards, Working paper, University of Michigan.
Yang, Y. (1994) Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval, In Proceedings of the 17th annual international conference on research and development in information retrieval (SIGIR'94), pages 13-22.
Table I
Sentiment Distributions

Data are downloaded from Thelion!WallStreetPit and cover the messages that meet our test requirements from July 18, 2005 through March 31, 2006, totaling 5,898 messages. Self-reported sentiments include strong buy, buy, hold, sell, strong sell, and short. "During Market Trading" measures the posting activities that occur from 9:30:00 AM to 4:00:00 PM ET. Holiday posting activities are omitted.

Sentiment Type               Messages    Percentage
strong buy                   4425        75.03%
buy                          730         12.38%
hold                         124         2.10%
sell                         45          0.76%
strong sell                  159         2.70%
short                        415         7.04%

Specific Partition           Messages    Percentage
Weekday                      5740        97.32%
Weekend                      158         2.68%
During Market Trading        4354        73.82%
Not During Market Trading    1544        26.18%
Reply                        3398        57.61%
Non-reply                    2500        42.39%
Table II
Regression Model to Test Discount Weight on IsReply Variable

We include the fixed effect and random effect regression models in this table:

Fixed effect: LPCi = β0 + µi + β1 LMPi + β2 LMLi + β3 LMCi + β4 SENi + β5 MOi + β6 TPi + ρ IsReplyi + εi
Random effect: LPCi = β0 + τi + β1 LMPi + β2 LMLi + β3 LMCi + β4 SENi + β5 MOi + β6 TPi + ρ IsReplyi + εi

Where: LPCi is the logarithm of the poster's credit; µi is the fixed effect intercept dummy variable controlling for stock i; τi is the random effect error term dummy variable controlling for stock i; LMPi is the logarithm of the number of messages that are posted by the poster; LMLi is the logarithm of the length of a message; LMCi is the logarithm of the stock's market capital; SENi is the scale of sentiment types, where -3 is for short, -2 for strong sell, -1 for sell, 0 for hold (scalp), 1 for buy, and 2 for strong buy; MOi = 1 if it is a weekday (market open) and MOi = 0 if it is a weekend (market closed); TPi = 1 if it is during the trading period and TPi = 0 if it is not; IsReplyi = 1 if it is a reply post and IsReplyi = 0 if it is a non-reply post. t-statistics are reported in parentheses. The number of observations is 5,898 with 966 groups. The fixed effect model has an R-square of 0.8.

              Fixed Effect                   Random Effect
Variable      Coefficient    T-Stat          Coefficient    T-Stat
LMPi          0.986          (85.79)***      1.005          (111.43)***
LMLi          0.085          (6.94)***       0.087          (8.27)***
LMCi          -0.042         (-0.8)          0.004          (0.48)
SENi          -0.083         (-7)***         -0.079         (-8.98)***
MOi           -0.017         (-0.69)         -0.031         (-1.42)
TPi           -0.133         (-1.98)**       -0.139         (-2.38)**
IsReplyi      -0.125         (-6.13)***      -0.114         (-6.08)***
Constant      -2.344         (-3.78)***      -2.897         (-18.55)***

Hausman Test: (13.43)*

* Significant at 0.05 level ** Significant at 0.025 level *** Significant at 0.01 level
Table III
Accuracy Measures for Unadjusted Confusion Matrix

Panel A: RAINBOW In-Sample for NB (224 out of 500, 44.8% accuracy).

Class Name       0 strong buy    1 buy     2 hold    3 sell    4 strong sell
0 strong buy     40              30        3         2         2
1 buy            13              27        1         11        5
2 hold           31              26        78        28        27
3 sell           7               8         9         35        22
4 strong sell    9               9         9         24        44
: total          :100            :100      :100      :100      :100
Accuracy         40.00%          27.00%    78.00%    35.00%    44.00%

Panel B: RAINBOW In-Sample for ME (270 out of 500, 54% accuracy).

Class Name       0 strong buy    1 buy     2 hold    3 sell    4 strong sell
0 strong buy     51              16        13        5         10
1 buy            25              53        10        10        4
2 hold           9               13        58        12        12
3 sell           7               9         15        57        23
4 strong sell    8               9         4         16        51
: total          :100            :100      :100      :100      :100
Accuracy         51.00%          53.00%    58.00%    57.00%    51.00%

Note: For each trial, there are 350 files for each class, of which training data occupy 71.43% (250 files) and testing data occupy 28.57% (100 files). Each cell's result is the average of five trials.
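The adjustment that leads from Table III to Table IV (merging strong buy with buy and sell with strong sell, leaving hold unchanged) can be reproduced with a few lines of code. The sketch below copies the two panels of Table III row by row as printed; the helper names `merge_classes` and `accuracy` are illustrative and are not part of RAINBOW.

```python
# Table III, Panel A (NB) and Panel B (ME), copied row by row as printed.
NB5 = [
    [40, 30, 3, 2, 2],
    [13, 27, 1, 11, 5],
    [31, 26, 78, 28, 27],
    [7, 8, 9, 35, 22],
    [9, 9, 9, 24, 44],
]
ME5 = [
    [51, 16, 13, 5, 10],
    [25, 53, 10, 10, 4],
    [9, 13, 58, 12, 12],
    [7, 9, 15, 57, 23],
    [8, 9, 4, 16, 51],
]

# Class indices merged into each sentiment side:
# strong buy + buy -> buy-side, hold -> hold, sell + strong sell -> sell-side.
GROUPS = [[0, 1], [2], [3, 4]]


def merge_classes(matrix, groups):
    """Sum the cells of a square confusion matrix over groups of classes."""
    return [
        [sum(matrix[r][c] for r in row_group for c in col_group)
         for col_group in groups]
        for row_group in groups
    ]


def accuracy(matrix):
    """Share of files on the diagonal, i.e. correctly classified files."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


NB3 = merge_classes(NB5, GROUPS)
ME3 = merge_classes(ME5, GROUPS)
print(NB3, accuracy(NB5), accuracy(NB3))
print(ME3, accuracy(ME5), accuracy(ME3))
```

The merged matrices coincide with the two panels of Table IV, and the accuracies rise from 44.8% to 62.6% for NB and from 54% to 70% for ME, because confusing buy with strong buy (or sell with strong sell) no longer counts as an error.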
Table IV
Accuracy Measures for Adjusted Confusion Matrix

Panel A: NB In-Sample Adjusted Confusion Matrix (313 out of 500, 62.6% accuracy).

Class Name     0 buy-side    1 hold    2 sell-side
0 buy-side     110           4         20
1 hold         57            78        55
2 sell-side    33            18        125
: total        :200          :100      :200
Accuracy       55.00%        78.00%    62.50%

Panel B: ME In-Sample Adjusted Confusion Matrix (350 out of 500, 70% accuracy).

Class Name     0 buy-side    1 hold    2 sell-side
0 buy-side     145           23        29
1 hold         22            58        24
2 sell-side    33            19        147
: total        :200          :100      :200
Accuracy       72.50%        58.00%    73.87%

Note: Based on the original five-class average output, the adjusted average output merges strong buy and buy into buy-side, merges strong sell and sell into sell-side, and leaves hold unchanged. After merging, there are 200 files for the buy-side class, 100 files for the hold class, and 200 files for the sell-side class.
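The chi-square comparison of Section VI.B can be illustrated with the two matrices above. The sketch assumes that the statistic is the sum over the nine cells of the squared difference between the ME and NB counts divided by the NB count, scaled by 1/9; this reading of the formula, and the function name, are assumptions of the example.

```python
# Adjusted confusion matrices from Table IV (Panel A: NB, Panel B: ME).
NB = [[110, 4, 20], [57, 78, 55], [33, 18, 125]]
ME = [[145, 23, 29], [22, 58, 24], [33, 19, 147]]


def chi_square_between(superior, inferior):
    """(1/n) * sum over cells of (S_i - I_i)^2 / I_i, with n the cell count.

    Assumes both matrices have the same shape and that no cell of the
    inferior (baseline) matrix is zero.
    """
    pairs = [
        (s, b)
        for s_row, b_row in zip(superior, inferior)
        for s, b in zip(s_row, b_row)
    ]
    return sum((s - b) ** 2 / b for s, b in pairs) / len(pairs)


stat = chi_square_between(ME, NB)
print(round(stat, 2))
```

Under this reading the statistic evaluates to about 17.05, which is the ME versus NB entry in Panel A of Table VI (the running text quotes 9.66 for this example, which this reading does not reproduce).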
Table V
Comparison of Relative Accuracy of Text Classification Algorithms

ME represents Maximum Entropy; SVM, Support Vector Machine; TFIDF, Term Frequency Inverse Document Frequency; PRIND, Probabilistic Indexing; KL, Kullback-Leibler divergence; EM, Expectation Maximization; NB, Naive Bayesian; and KNN, K nearest neighbor. The classifiers in this table are sorted from left to right by the percentage under "Adjusted In-Sample". "Adjusted" means combining the reported numbers for buy and strong buy recorded from RAINBOW output into buy-side sentiment, sell and strong sell into sell-side sentiment, and leaving hold unchanged. Similar groupings are indicated as "Sentiment Side" when allocating training data. KNN drops a significant number of files during the test, suggesting that KNN has relatively low recognition power.

Test Accuracy / Algorithm       ME        SVM       TFIDF     PRIND     KL        EM        NB        KNN
In-Sample                       54%       45.8%     46.4%     47.8%     45.8%     46.2%     44.8%     37.42%
  RAINBOW Standard Error        (0.57)    (1.15)    (1.06)    (0.99)    (0.55)    (0.77)    (0.49)    (0.86)
Adjusted In-Sample              70%       65.8%     65.6%     65.4%     65.2%     64%       62.6%     58.42% [a]
  Calculated Standard Error     (0.95)    (1.68)    (0.93)    (1.13)    (0.89)    (1.22)    (0.87)    (0.73)
Out-of-Sample                   44.35%    35.77%    39.66%    36.15%    39.35%    42.1%     41.92%    28.26%
Adjusted Out-of-Sample          61.27%    51.86%    54.9%     49.79%    53.07%    58%       58.93%    48.15%
Sentiment Side In-Sample        72%       69%       68.33%    67.67%    68.67%    70.7%     69.33%    56.74%
  RAINBOW Standard Error        (1.15)    (1.32)    (0.9)     (1.48)    (0.91)    (0.97)    (0.64)    (1.03)
Sentiment Side Out-of-Sample    60.68%    58.48%    58.37%    55.24%    56.97%    56.9%     57.27%    53.44% [b]
20NewsGroup In-Sample           92.33%    91.71%    91.7%     85.44%    92.15%    91%       92.22%    84.1%
  RAINBOW Standard Error        (0.23)    (0.72)    (0.17)    (0.82)    (0.06)    (0.18)    (0.05)    (0.85)

[a] KNN drops 19 files in the strong sell class.
[b] KNN drops 30% of the data during testing (the classifier only recognizes 70% of files), which may inflate the result due to a small denominator.
Table VI
Chi-square Comparison of Text Classifiers

Panel A. Adjusted In-Sample: Chi-square results between pairs of classifiers. Except for KNN, all classifiers have 200 files for the buy-side class, 100 files for the hold class, and 200 files for the sell-side class in the confusion matrix (totaling 500 files). KNN drops 19 files in the sell-side class, leaving a total of 481 files in three classes.

Accuracy    Classifier    ME          SVM         TFIDF       PRIND       KL          EM          NB          KNN
70%         ME            0
65.80%      SVM           6.46        0
65.60%      TFIDF         24.2***     16.62***    0
65.40%      PRIND         15.61***    14.90***    2.46        0
65.20%      KL            13.69***    8.01        0.77        3.41        0
64%         EM            11.2**      6.96        0.82        2.45        0.27        0
62.60%      NB            17.05***    14.70***    0.84        2.05        1.5         1.33        0
58.42%      KNN           11.88**     17.02***    45.37***    48.24***    35.49***    38.99***    43.34***    0

Panel B. Adjusted Out-of-Sample: Chi-square results between pairs of classifiers. There are 700 files for the buy-side class, 350 files for the hold class, and 700 files for the sell-side class in the confusion matrix (totaling 1,750 files). Different classifiers drop different small numbers of unrecognizable files and, except for KNN, which drops a large number of files, the difference is not significant.

Accuracy    Classifier    ME          NB          EM          TFIDF       KL          SVM         PRIND        KNN
61.27%      ME            0
58.93%      NB            2.95        0
57.99%      EM            6.54        4.58        0
54.90%      TFIDF         7.18        8.88        10.64*      0
53.07%      KL            10.80*      5.56        2.85        10.40*      0
51.86%      SVM           10.45*      14.12***    16.89***    3.18        16.17***    0
49.79%      PRIND         38.57***    24.24***    21.83***    33.39***    13.22**     44.62***    0
48.15%      KNN           44.63***    63.16***    79.46***    34.77***    83.72***    20.56***    206.98***    0

Panel C. Sentiment Side In-Sample: Chi-square results between pairs of classifiers. Except for KNN, all classifiers have 100 files for the buy-side class, 100 files for the hold class, and 100 files for the sell-side class in the confusion matrix (totaling 300 files). KNN drops 18 files in the sell-side class, leaving 282 files for three classes.

Accuracy    Classifier    ME          EM          NB          SVM         KL          TFIDF       PRIND        KNN
72%         ME            0
70.67%      EM            4.5         0
69.33%      NB            7.36        4.63        0
69%         SVM           1.16        4.86        3.62        0
68.67%      KL            2.19        1.37        2.52        1.38        0
68.33%      TFIDF         2.56        1.75        4.44        2.33        0.74        0
67.67%      PRIND         5.98        2.55        2.98        4.14        0.82        1.91        0
56.74%      KNN           23.14***    67.13***    69.95***    43.76***    61.01***    48.52***    100.17***    0

Panel D. Sentiment Side Out-of-Sample: Chi-square results between pairs of classifiers. For each out-of-sample test, there are three classes (buy-side, hold, and sell-side), each with 350 files for testing (totaling 1,050 files). Different classifiers drop different small numbers of unrecognizable files and, except for KNN, which drops a large number of files, the difference is not significant.

Accuracy    Classifier    ME          SVM         TFIDF        NB           KL           EM          PRIND        KNN
60.68%      ME            0
58.48%      SVM           1.19        0
58.37%      TFIDF         14.81***    26.51***    0
57.27%      NB            9.65*       17.95***    0.75         0
56.97%      KL            16.52***    28.39***    0.77         0.55         0
56.93%      EM            16.53***    28.39***    0.8          0.555        0.0005       0
55.24%      PRIND         8.65        15.19***    6.48         3.11         4.02         3.98        0
55.86%      KNN           88.52***    81.44***    127.13***    149.73***    169.06***    35.40***    204.23***    0

* Significant at 0.05 level with degrees of freedom = 4 ** Significant at 0.025 level with degrees of freedom = 4 *** Significant at 0.01 level with degrees of freedom = 4

Note: In each panel, KNN drops a significantly large number of files, which makes the Chi-square comparison results not meaningful. We report the Chi-square test results for KNN, but the significance may be due to a loss of degrees of freedom. Therefore, for comparisons between any other text classifier and KNN, we focus on the accuracy comparison in percentage terms.
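The significance markers in Table VI follow from comparing each statistic with the upper-tail critical values of the chi-square distribution with 4 degrees of freedom. A small sketch of that lookup is given below; the thresholds are the standard table values rounded to three decimals, and the function name is illustrative.

```python
# Upper-tail critical values of the chi-square distribution with 4 degrees
# of freedom (standard table values, rounded to three decimals), ordered
# from the strictest level (0.01) to the loosest (0.05).
CRITICAL_VALUES_DF4 = [(13.277, "***"), (11.143, "**"), (9.488, "*")]


def significance_stars(statistic):
    """Return the star marker used in Table VI for a chi-square statistic."""
    for threshold, stars in CRITICAL_VALUES_DF4:
        if statistic >= threshold:
            return stars
    return ""


# A few entries from Panels A and B as a spot check.
for value in (17.05, 11.2, 10.80, 8.88):
    print(value, significance_stars(value))
```

For instance, 17.05 receives three stars, 11.2 two stars, 10.80 one star, and 8.88 none, in line with the corresponding cells of Panels A and B.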
Table VII
Comparison of Accuracy of Reply and Non-reply Models

Panel A: Classification of Training Data

Training data                                                    Sentiment Categories
Reply In-Sample with 140 files in each class:                    strong buy    buy    hold    sell [a]    strong sell
Non-reply In-Sample with 140 files in each class:                strong buy    buy    hold    sell        strong sell
Reply sentiment side sample with 350 files in each class:        buy-side [b]         hold                sell-side [c]
Non-reply sentiment side sample with 350 files in each class:    buy-side             hold                sell-side

[a] The sell sentiment class under reply In-Sample has only 140 files, which are all used as training data; for the remaining categories we randomly select 140 files.
[b] Buy-side includes messages with strong buy and buy. 350 files are randomly selected from the pool of buy-side messages.
[c] Sell-side includes messages with sell, strong sell, and short. 350 files are randomly selected from the pool of sell-side messages.

Panel B: ME's Accuracy Comparison between Reply and Non-reply

Test 1
Model A: Reply In-Sample                               54.2%
  RAINBOW Standard Error                               0.97
Model B: Non-reply In-Sample                           60.5%
  RAINBOW Standard Error                               0.62
Chi-square results between Model A and B               2.91

Test 2
Model A: Adjusted Reply In-Sample                      67.84%
  Calculated Standard Error                            1.56
Model B: Adjusted Non-reply In-Sample                  74.5%
  Calculated Standard Error                            0.97
Chi-square results between Model A and B               2.12

Test 3
Model A: Reply sentiment side In-Sample                77.67%
  RAINBOW Standard Error                               0.75
Model B: Non-reply sentiment side In-Sample            79.33%
  RAINBOW Standard Error                               1.34
Chi-square results between Model A and B               2.47

Test 4 [a]
Model A: Reply sentiment side Out-of-Sample            65.95%
Model B: Non-reply sentiment side Out-of-Sample        69.94%
Chi-square results between Model A and B               7.57

[a] The In-Sample tests execute five trials while the Out-of-Sample test executes one trial.
* Significant at 0.05 level with degrees of freedom = 4