					A Structural, Content-Similarity Measure for Detecting
            Spam Documents on the Web
                                     Maria Soledad Pera
                                        Yiu-Kai Ng∗
                           Computer Science Department
                             Brigham Young University
                                Provo, Utah, U.S.A.
                     Email: {mpera@cs.byu.edu, ng@cs.byu.edu}


                                            Abstract
        Purpose - The Web provides its users with abundant information. Unfortunately,
        when a Web search is performed, both users and search engines must deal with an
        annoying problem: the presence of spam documents that are ranked among legitimate
        ones. The mixed results downgrade the performance of search engines and frustrate
        users who are required to filter out useless information. To improve the quality of
        Web searches, the number of spam documents on the Web must be reduced, if they
        cannot be eradicated entirely.
        Design/methodology/approach - In this paper, we present a novel approach for
identifying spam Web documents, which have mismatched titles and bodies and/or
a low percentage of hidden content in their markup data structure.
        Findings - By considering the content and markup of Web documents, we develop
        a spam-detection tool that is (i) reliable, since we can accurately detect 84.5% of
        spam/legitimate Web documents, and (ii) computationally inexpensive, since the word-
        correlation factors used for content analysis are precomputed.
        Research limitations/implications - Since the bigram-correlation values employed
        in our spam-detection approach are computed by using the unigram-correlation fac-
        tors, this imposes additional computational time during the spam-detection process and
        could generate a higher number of misclassified spam Web documents.
        Originality/value - We have verified that our spam-detection approach outperforms
        existing anti-spam methods by at least 3% in terms of F -measure.

Keywords: Spam detection, structural content, content-similarity measure, accuracy and
error rates, F-measure
Paper type: Research paper
  ∗
      Corresponding Author


1    Introduction
The Web is populated with documents in many different subject areas from personal health
care to constitutional laws to religious beliefs as presented in online news articles, research
papers, and customer-generated media (e.g., blogs), to name a few, and virtually all kinds
of information can be found on the Web. With the huge amount of information to sort
through, users turn to Web search engines for assistance in locating information of interest.
Hence, analyzing the content of (relevant) Web documents and ranking them according
to the user’s information need is a crucial process in Web information retrieval (IR). As a
result, a search engine spider crawling the Web not only helps gather useful information,
but can also bring significant financial rewards to the owners of various Web sites, such
as an increased number of commercial transactions on those sites, which results from the
search engine's ranking of (the contents of) the retrieved Web documents posted at the
sites [26]. Hence, there is an economic incentive for manipulating a search engine's rankings
[1] by creating Web documents that score high independently of their real contents, even
though the intention is unethical. Gradually, more documents are introduced on the Web
that appear legitimate when in fact they are not, and are ranked high by existing search
engines when they should not be. As reported in [7] and [23], a significant portion of
existing Web documents, between 14% and 22% in 2006, are spam.
    Spamming is a serious Web IR problem, since it (i) affects the quality of Web searches,
(ii) damages the search engine’s reputation, and (iii) weakens the user’s confidence in re-
trieved results. In general, Web spamming is treated as an attempt by spammers to receive an unjusti-
fiably favorable relevance score or high ranking for their Web documents, regardless of the true
values of the documents. Spamming can also be defined as an attempt to deceive a search
engine’s relevancy ranking algorithm [25].
    A number of existing spamming approaches rely on links within Web documents [11] to
manipulate Web ranking algorithms, such as PageRank [6], whereas others rely on infecting
the content of Web documents [23], e.g., stuffing popular words into, or concatenating
words within, Web documents to increase their chance of matching Web queries. However,
due to their complexity, neither analyzing links nor computing multiple statistical features
of Web document contents is an effective approach for identifying spam Web documents,
as the volume of Web documents is huge.
    Fetterly et al. [12] use a statistical technique, i.e., actual word counts, on Web documents
for detecting spam. Instead, we consider the actual content (i.e., words) of Web documents
and use the word-similarity values to identify and eliminate spam Web documents. We
will show that by (i) considering the degree of similarity among the words in the title and
body of a Web document D, which is computed by using their word-correlation factors, (ii)
using the percentage of hidden content in the markup data structure within D, and/or (iii)
considering the bigram or trigram phrase-similarity values of D, we can determine whether
D is spam with high accuracy.
    The remaining sections are organized as follows. In Section 2, we present existing anti-
spam methods and discuss their differences with ours. In Section 3, we introduce our
spam-detection approach. In Section 4, we propose an alternative method which further
enhances the accuracy in detecting spam Web documents. In Section 5, we analyze the
experimental results, which verify the performance of our spam-detection method, and in


Section 5.6 we include a case study which demonstrates various degrees of accuracy in
detecting spam using different proposed spam-detection methods. In Section 6, we provide
the time complexity analysis of our spam-detection algorithm and its implementation. In
Section 7, we give a conclusion and present the future work.


2    Related Work
Previous anti-spam work focuses on two different strategies: content analysis and link
analysis. Ntoulas et al. [23] introduce and combine several anti-spam heuristics
based on the content of a Web document D, which include within D (i) the number of
words, (ii) the average length of the words, (iii) the amount of anchor text, (iv) the fraction
of visible content, (v) the fraction of globally popular words, and (vi) the likelihood of
independent n-grams. These heuristics are treated as features in classifying spam Web
documents.
     Gyongyi et al. [15] also present several anti-spam heuristics according to the content of
a document D that include (i) detecting the inclusion of terms in the anchor text of D that
are unrelated to the referenced Web documents, (ii) computing the amount of repetitive
terms in D introduced by a spammer with the intention to increase its relevance score,
(iii) verifying the existence of a large number of unrelated terms, and (iv) identifying the
presence of phrase stitching in D, which should increase its degree of relevance to several
posted queries. Likewise, Fetterly et al. [12] analyze the statistical features of the host
component of an URL and the excessive replication of content to establish content-based
features that can be used by Web spam classifiers.
     As a link-analysis approach, Becchetti et al. [2] introduce a damping function, which
classifies spam and non-spam Web documents using (incoming and outgoing) links within
a document without considering its content. Other well-known anti-spam techniques that
rely on link analysis are described in [14, 16]. Gyongyi et al. [14] present a semi-automatic
algorithm, TrustRank, which uses the link structure of the Web to discover documents that
are likely to be legitimate with respect to a small set of (seed) legitimate documents that
were manually identified by experts. Since this approach requires human assistance, it is
not fully automated. A spam mass metric, on the other hand, is defined in [16] that reflects
the impact of link spamming on the ranking of a document and is used for determining
Web documents that are significant beneficiaries of link spamming. In yet another
link-based method, Benczúr et al. [4] introduce SpamRank, which is based on PageRank,
to identify documents linked by a large number of other documents intended for misleading
search engines to rank their target higher.
     Following the premise that the purpose of spam Web sites is to obtain financial gain,
Benczúr et al. [5] classify Web documents by extracting features from the documents,
which are known to have high advertisement or spam value, to capture their semantic con-
tent based on the keyword occurrences. These features include (i) the Online Commercial
Intention Value given to a Web site by Microsoft adCenter Labs Demonstration, (ii) the
distribution of Google AdSense words on a Web site, and (iii) the Yahoo! Mindset classifica-
tion to a particular Web site as (non-)commercial, etc. The extracted features are combined
with the set of features defined in [7] for identifying (content- and link-based) spam Web


documents. According to [5], determining the commercial intent of Web documents can
significantly enhance the performance of the decision tree used in [5, 7] for detecting which
ones are spam. As opposed to our spam-detection approach, the approach in [5] relies
solely on the assumption that spammers seek profit from their Web sites; however, there is
no verification on its effectiveness when the Web sites under evaluation are not set up for
financial gain, but they exist for other purposes instead.
    Becchetti et al. [3] propose a link-based approach for detecting spam Web documents.
After conducting a statistical study on the (outgoing and incoming) links in a large collec-
tion of Web documents, Becchetti et al. extract features such as in-degree, out-degree,
reciprocity, and the average in-degree of out-neighbors, which are (some of) the features
used in a C4.5 decision tree for classifying spam Web documents, and claim that the
classifier is as effective as other existing content-based Web spam-detection approaches. However,
they have not considered combining content-based classifiers and their (or any other ex-
isting) link-based approach, which could enhance the performance of their spam-detecting
method for correctly identifying spam Web documents.
    Urvoy et al. [27] present a spam Web document detection method based on the internal
structure of HTML documents rather than their contents. The authors develop an algorithm
that relies on the style similarity measure computed using both the textual and extra-
textual features in the HTML source code of a Web page, such as the average number of
occurrences of a particular HTML tag and of anchor links in a large collection of Web documents, for
clustering Web documents and identifying the spam clusters. Just like [3], Urvoy et al.
emphasize the necessity of combining their spam-detection method with content-based spam
Web document classifiers. Although in our approach we also consider the HTML markup
content in existing Web documents for spam detection, the effectiveness of our approach
largely depends on the content (i.e., words) of the Web documents, which does not require
any previous analysis of document structures in spam detection.
    Caverlee and Liu [9] introduce a ranking algorithm which, as opposed to PageRank
or TrustRank, relies on the credibility of Web documents to assess their quality. The
credibility value of a given Web document is computed according to the quality of the
links specified in the document. What is more, the algorithm in [9] allows its users to
determine different levels of “spam-tolerance” according to their preferences. Although the
experimental results in [9] seem favorable, the evaluations were conducted by using only
pornography-related Web documents as spam, which excludes the large number of Web
pages that are spam but not pornography-related.
    The spam-detection approach of Liu et al. [20] is neither content- nor link-based; instead,
it relies on user-behaviors and Bayes learning. The proposed method analyzes user-behavior
patterns as shown in a collected Web-access log and uses three different features, i.e., the
search-engine-oriented visiting ratio, the number of clicks on hyperlinks in a Web document,
and the number of sessions in which a user visits fewer than a previously defined number
of pages within a given Web site, for training a naive Bayes learner to detect spam
Web documents. Since the approach is based on a Bayes learner that must be trained to
accurately detect spam Web documents, it differs from our spam-detection approach,
which does not require any training process.
    Goodstein and Vassilevska [13] adapt the approach in [28] to develop a spam-detection
mechanism, which is set up as a game in which users, i.e., players, label the results re-


trieved by a search engine as relevant or non-relevant with respect to a particular query.
Using the provided answers, a voting algorithm is invoked for classifying Web documents as
(non-)spam. It is assumed that previously generated Web document rankings are modified
according to the voting results from the game, after which Web documents labeled as
spam are removed entirely from the rankings, preventing spammers from bypassing Web spam
filters. As opposed to our spam-detection approach, the detection method in [13] depends
entirely on users' classifications of Web documents, which may not always be reliable,
since the users are only given a snapshot of the Web pages and a short period of time for
deciding if they are (non-)spam. As a result, their answers might not be accurate, which
could jeopardize the quality of the modified rankings.
    None of the approaches discussed above relies on an actual word-semantic measure over
the content of a given Web document to detect spam Web documents, which is the basis
of our spam-detection approach. In [24] we demonstrate that using the content of emails
for detecting junk emails is effective, and we adapt the same strategy for finding spam
Web documents in this paper.


3     Our spam-document detection approach
As discussed earlier, spam Web documents are a burden for (i) the Web servers, since
(among others) undetected spam pages waste storage space and processing time to index
and maintain, and (ii) the users who must deal with low-quality retrieved results (caused
by spamming) when performing Web searches. To neutralize existing spamming tricks, we
rely on the content (i.e., words and phrases) and/or proportion of markup structure of a
given Web document D to determine whether D should be treated as spam.

3.1    Title and body of a Web document
In [17, 19] the authors claim that the title of a document often reflects its content, and we
are confident that this same concept applies to Web documents as well, since a legitimate
Web document is a regular document with a title that describes its content, whereas the
title of a spam Web document often does not reflect its content [21]. Consider the legitimate
Web document (http://www.mothersbliss.co.uk/shopping/index) in Figure 1 in which the
title reflects the content of the document, whereas Figure 2 shows a spam Web document
(http://extc.co.uk) in which its content and its title mismatch. We analyze the content of
a Web document to determine how closely its title is related to its body (as discussed in
Sections 3.2 and 3.4) and calculate the fraction of hidden content (i.e., markup data struc-
ture) of a Web document (in Section 3.3), if necessary, in detecting spam Web documents.
Hereafter, we discuss an enhanced content-similarity measure using bigrams and trigrams
(in Section 3.4).

3.2    Degrees of similarity of Web documents
We rely on the similarity measure between the contents, each represented by a sequence
of words, of the title T and the body B of a Web document to identify spam
documents. We determine the degree of similarity between T and B using the correlation

        Figure 1: The title and (portion of the) content of a legitimate Web document




          Figure 2: The title and (portion of the) content of a spam Web document

factors of words in T and B as defined in a precomputed word-correlation matrix. The
matrix is generated by analyzing a set of 880,000 Wikipedia documents (downloaded from
http://www.wikipedia.org/) to calculate the correlation factor (i.e., similarity value) of any
two words1 based on their (i) frequency of co-occurrence and (ii) relative distance as defined
below.
c_{i,j} = \sum_{w_i \in V(i)} \sum_{w_j \in V(j)} \frac{1}{d(w_i, w_j)}                (1)

where d(wi , wj ) denotes the distance between any two words wi and wj in any Wikipedia
document D, and V (i) (V (j), respectively) is the set of stem variations of word i (j,
respectively) in D. If wi and wj are consecutive words, then d(wi , wj ) = 1, whereas if wi
and wj do not co-occur in D, then d(wi , wj ) = ∞, i.e., 1/d(wi , wj ) = 1/∞ = 0. To avoid bias that
occurs in documents of large size, we normalize the word-correlation factors as follows:
C_{i,j} = \frac{c_{i,j}}{|V(i)| \times |V(j)|}                (2)

where ci,j is as defined in Equation 1.
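Equations 1 and 2 can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the input is assumed to be already stemmed and stopword-free, so each stem has a single variation and |V(i)| = |V(j)| = 1; the function name is ours.

```python
from collections import defaultdict

def correlation_factors(doc_words):
    """Compute word-correlation factors (Equations 1 and 2) for one
    document. `doc_words` is the stemmed, stopword-free word sequence;
    in this sketch every stem has one variation, so the Equation 2
    normalization |V(i)| * |V(j)| equals 1."""
    positions = defaultdict(list)
    for pos, word in enumerate(doc_words):
        positions[word].append(pos)
    C = defaultdict(float)
    for wi in positions:
        for wj in positions:
            total = 0.0
            for pi in positions[wi]:
                for pj in positions[wj]:
                    d = abs(pi - pj)   # consecutive words: d = 1
                    if d > 0:          # non-co-occurring pairs add 1/inf = 0
                        total += 1.0 / d
            C[(wi, wj)] = total        # divide by |V(i)| * |V(j)| = 1 here
    return C
```

In the paper the matrix is precomputed once over the 880,000 Wikipedia documents; the sketch above shows only the per-document contribution of the summation.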

3.2.1     Content similarity of the title and body
Using the normalized word-correlation factors, we can compute the degree of similarity
(between the words) of the title T and the body B of a Web document D, denoted SimT B.
We focus only on T and B of D, since as shown in the Experimental Results section
(Section 5), SimT B of D can accurately determine whether D is spam or legitimate2 .
   To determine the SimT B value of D, we calculate the similarity value of each word t in
T with respect to each word b in B of D. The higher the correlation factors among t and
the words in B, the higher the similarity value between t and B, denoted µt,B .
   1
     Words in the documents were stemmed (i.e., reduced to their grammatical roots, e.g., “computer”,
“computing”, and “computation” are converted to “compute”) after all the stopwords (i.e., words that
carry little meaning, e.g., articles, prepositions, and conjunctions) were removed, which minimized the
number of words to be considered.
   2
     In the case in which there is no title in D, we consider the hidden content (i.e., the markup structure)
of D. (See Section 3.3.)

                Names     Announce       Time           Diet         Baby         Answer    ...    µ value
 Baby          8.9×10−8   1.6×10−7     9.8×10−8      7.1×10−8          1        7.3×10−8    ...       1
 Pregnancy     5.8×10−4   4.1×10−5     1.4×10−7      6.1×10−7      3.7×10−3     2.5×10−10   ...       1
 Discover      6.1×10−8   7.1×10−2     1.4×10−8      3.3×10−8      8.9×10−8     8.1×10−9    ...   2.3×10−4
 Magic         6.9×10−8   5.8×10−8     9.5×10−8      4.9×10−8      1.3×10−7     6.1×10−8    ...   8.7×10−5
 ...              ...        ...          ...           ...           ...          ...      ...      ...

Table I: Word-correlation factors and the µ values of some of the words in the title with
respect to the words in the body of the legitimate Web document as shown in Figure 1

              Computers    Internet   Electronics     Mortgages        Credit     Flights   ...    µ value
Compare       9.1×10−8    2.5×10−7    5.9×10−8        2.3×10−8       1.1×10−8    6.1×10−8   ...   2.9×10−7
Find          2.2×10−7    2.1×10−7    1.4×10−8        5.3×10−9       1.6×10−8    2.0×10−8   ...   1.9×10−7
Resources     6.2×10−8    4.0×10−8    5.4×10−8        4.5×10−8       1.7×10−7    2.0×10−8   ...   4.7×10−7

Table II: Word-correlation factors and the µ values of the words in the title with respect to
the words in the body of the spam Web document as shown in Figure 2



\mu_{t,B} = 1 - \prod_{b \in B} (1 - C_{t,b})                (3)


where µt,B is defined as the complement of a negated algebraic product, instead of the
algebraic sum, and Ct,b is as defined in Equation 2.
   Once the µ value of each word in T with respect to all the words in B is calculated, we
can determine the degree of similarity between T and B, which yields the average similarity
value of each word t in T with respect to all the words in B, as
SimTB(T, B) = \frac{\mu_{1,B} + \mu_{2,B} + \ldots + \mu_{n,B}}{n}                (4)
where n is the number of words in T . A high (low, respectively) SimT B(T, B) value reflects
a high (low, respectively) degree of similarity between (the contents of) T and B.

Example 1 Table I (Table II, respectively) shows the correlation factors among the words
in the title and the body of the Web document D1 in Figure 1 (D2 in Figure 2, respectively).
The degree of similarity, i.e., SimT B, of the title T1 and the body B1 of D1 is 0.88, whereas
the degree of similarity between the title T2 and the body B2 of D2 is 3.2 × 10−7 . According
to the computed SimT B value, D1 (D2 , respectively) is highly likely a legitimate (spam,
respectively) document. 2

3.2.2       Similarity threshold value
Having computed the SimT B value of a given Web document D, we must determine an
appropriate word-similarity threshold value V so that if SimT B of D ≥ V , then D is
considered legitimate; otherwise, D is treated as spam. An ideal similarity threshold should
(i) reduce to the minimum the number of spam Web documents identified as legitimate,
i.e., false negatives (F Ns), and (ii) avoid treating legitimate Web documents as spam,


           (a) SimT B threshold values                      (b) Hidden-content threshold values


Figure 3: Numbers of False Positives (F P s) and False Negatives (F Ns) computed by using
different possible similarity threshold values on the Web documents in the Threshold set

i.e., false positives (F P s). To determine the correct similarity threshold, we consider (i)
Web documents in a set, called the Threshold set, and (ii) a number of possible similarity
threshold values, which yield different numbers of F P s and F N s. The Threshold set is a
collection of 370 previously classified spam and non-spam Web documents (170 spam and
200 non-spam) randomly selected from the WEBSPAM-UK2006 dataset (http://www.yr-
bcn.es/Webspam/datasets/), which is a well-known, publicly available reference collection
for Web spam research that consists of 77.9 million spam and non-spam Web documents.
As shown in Figure 3(a), the optimal word-similarity threshold is 0.80, since at 0.80 the
total number of F P s and F N s is reduced to a minimum and neither the number of F P s
nor that of F N s dominates the other3 . Hence, we declare a Web document D as

Status(D) = \begin{cases} \text{Legitimate} & \text{if } SimTB(T_D, B_D) \geq 0.80 \\ \text{Spam} & \text{otherwise} \end{cases}                (5)

where TD (BD , respectively) denotes the (content of) title (body, respectively) of D.
   Using Equation 5, we classify the Web document D1 in Figure 1 as legitimate, since
SimTB(T1 , B1 ) = 0.88 ≥ 0.80, and the Web document D2 in Figure 2 as spam, since
SimTB(T2 , B2 ) = 3.2 × 10−7 < 0.80.

3.3     Fraction of the hidden content
Even if a Web document D lacks a title, we can still determine whether D is spam by
considering the fraction of hidden content in D. Ntoulas et al. [23] define the visible content
of a Web document D as the length (in characters) of all non-markup words in D divided
   3
    We verified the correctness of the similarity threshold value, as well as other threshold values in this
paper, using another Threshold set S, which consists of 100 (38 spam and 62 non-spam) documents from
WEBSPAM-UK2006, and S yields the same threshold values. Thus, we are confident that the threshold
values are accurately defined.


by the total size (in characters) of D and claim that spam Web documents often contain
less markup than legitimate documents. We adapt this heuristic, but instead compute the
fraction of hidden content, denoted HC, i.e., proportion of markup content, of D as
HC(D) = \frac{\text{Size of markup content of } D}{\text{Total size of } D}                (6)
where the size of markup content and the total size of D are in characters.
    Again, upon computing the HC value of a Web document D, we must apply an appro-
priate threshold value, denoted HC-threshold, so that if HC(D) ≥ HC-threshold, then
D is considered legitimate; otherwise, D is treated as spam. To determine an appropriate
HC-threshold value, we used the same Threshold set previously described and computed
the number of F P s and F Ns according to each potential HC-threshold value. Figure 3(b)
shows that the ideal HC-threshold value is 0.75, since the total number of F P s and F N s
is reduced to a minimum at 0.75. Excluding the title in the HTML document D1 , a
legitimate Web document (D2 , a spam Web document, respectively) as shown in Figure 1
(Figure 2, respectively), the fraction of hidden content of D1 (D2 , respectively) is 0.87
(0.23, respectively). Using the chosen HC-threshold value, i.e., 0.75, we correctly classify
D1 and D2 as legitimate and spam, respectively.
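Equation 6 and the HC-threshold decision can be sketched as below. The tag-stripping regex is our simplification: the paper measures markup versus non-markup size in characters but does not specify the tokenizer, and the function names are ours.

```python
import re

def hidden_content_fraction(html):
    """Equation 6: size of the markup content divided by the total
    size of the document, both measured in characters. Markup is
    approximated here as everything inside HTML tags."""
    total = len(html)
    if total == 0:
        return 0.0
    visible = re.sub(r"<[^>]*>", "", html)  # drop tag content
    markup = total - len(visible)
    return markup / total

def classify_by_hc(html, hc_threshold=0.75):
    """Section 3.3: a document with HC >= 0.75 (the tuned
    HC-threshold) is treated as legitimate, otherwise as spam."""
    return "legitimate" if hidden_content_fraction(html) >= hc_threshold else "spam"
```

A real extractor would also need to count scripts, styles, and comments as hidden content; the regex above only captures tag markup.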

3.4       The use of bigrams and trigrams
We have observed that whenever there is at least one word4 in the title T of a spam
Web document D that also appears in the body B of D, then SimT B(T, B) is high,
which causes our spam-detection approach to misclassify D as legitimate. As a result, our
spam-detection method yields a higher-than-expected number of false negatives. In order
to further enhance our spam-detection approach, we consider bigram and trigram phrase-
correlation factors of T and B, instead of unigram (i.e., single-word, as presented in
Section 3.2.1) correlation factors, in determining the content similarity between T and B. We consider bigrams and
trigrams (as opposed to phrases of longer length), since as claimed by [22] and verified by
us, short phrases (i.e., bigrams and trigrams) increase the retrieval effectiveness, whereas
using phrases of 4 or more words tends to retrieve unreliable results.

3.4.1      The phrase-similarity value
In computing the phrase-correlation factors of any two n-grams (2 ≤ n ≤ 3), we apply the
Odds [18] ratio, i.e., Odds(H) = p(H)/(1 − p(H)), on the normalized word-similarity factors as defined
in Equation 2. Odds measures the predictive or prospective support based on a hypothesis
in Equation 2. Odds measures the predictive or prospective support based on a hypothesis
H (i.e., n-grams in our case) using prior knowledge p(H), i.e., the word-correlation factors
of the n-grams, to determine the strength of a belief, which is the phrase-correlation factor
in our case. We determine the phrase-correlation factor, denoted pcf , between any two
n-grams (2 ≤ n ≤ 3) p1 and p2 as
pcf_{p_1,p_2} = \frac{\prod_{i=1}^{n} C_{p_{1i},p_{2i}}}{1 - \prod_{i=1}^{n} C_{p_{1i},p_{2i}}}                (7)

  4
      After stopwords are removed and the remaining words are reduced to their stems.

                        Baby name   Name birth   Birth announce   Announce ready   ...    µ value
 Baby pregnancy         4.1×10−5    1.2×10−14    1.3×10−11        5.9×10−10        ...   3.6×10−4
 Pregnancy motherhood   7.7×10−11   2.4×10−11    2.8×10−15        3.7×10−12        ...   3.7×10−9
 Motherhood discover    6.9×10−15   8.1×10−13    4.2×10−8         2.9×10−13        ...   7.4×10−9
 Discover magic         6.1×10−15   1.3×10−16    3.5×10−10        3.1×10−9         ...   1.8×10−10
 ...                    ...         ...          ...              ...              ...      ...

Table III: The phrase-correlation factors and µ values of some of the bigrams in the title with
respect to some of the bigrams in the body of the legitimate Web document in Figure 1

                                 Baby name    Name birth   Birth announce   Announce ready
                                 birth        announce     ready            baby             ...    µ value
 Baby pregnancy motherhood       2.4×10−11    2.4×10−22    7.8×10−17        6.7×10−17        ...       1
 Pregnancy motherhood discover   4.7×10−16    1.7×10−12    3.9×10−20        1.2×10−19        ...   1.4×10−12
 Motherhood discover magic       1.4×10−23    4.7×10−17    1.8×10−15        3.6×10−16        ...   6.1×10−13
 Discover magic mother           3.7×10−21    2.5×10−24    2.1×10−15        3.5×10−16        ...   8.9×10−17
 ...                             ...          ...          ...              ...              ...       ...

Table IV: The phrase-correlation factors and µ values of some of the trigrams in the title
with respect to some of the trigrams in the body of the legitimate Web document in Figure 1



where p1i and p2i are the ith (1 ≤ i ≤ n) words in p1 and p2 , respectively, and Cp1i ,p2i is the
normalized word-similarity value as defined in Equation 2.
    Using the computed phrase-correlation factors, we can replace Ct,b in Equation 3 by
pcfp1 ,p2 to determine (i) the µ value between an n-gram (2 ≤ n ≤ 3) in T and all the
n-grams in B, and (ii) the degree of similarity between T and B, i.e., SimT B, using the
computed µ values for n-grams and Equation 4, which overcomes the unigram problem that
arises when a unigram in T appears in B. In adopting Equation 4 to compute the degree
of similarity, n in the equation represents the number of bigrams (trigrams, respectively),
instead of unigrams, in T . Table III (Table IV, respectively) shows some of the phrase-
correlation factors between the bigrams (trigrams, respectively) in the title and body of the
legitimate document in Figure 1.
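Equation 7 can be sketched directly. Here C is a lookup of the normalized word-correlation factors (Equation 2), and the function name is ours; the guard for a product of exactly 1 is our addition, since the Odds ratio is undefined there.

```python
def pcf(p1, p2, C):
    """Equation 7: the phrase-correlation factor between two n-grams
    p1 and p2 (each a tuple of 2 or 3 stemmed words), computed as an
    Odds ratio over the product of the pairwise normalized
    word-correlation factors."""
    prod = 1.0
    for w1, w2 in zip(p1, p2):
        prod *= C.get((w1, w2), 0.0)
    # Odds(H) = p(H) / (1 - p(H)); guard the degenerate prod = 1 case
    return prod / (1.0 - prod) if prod < 1.0 else float("inf")
```

Because the correlation factors are typically tiny (see Tables III and IV), the product shrinks rapidly with n, which is consistent with restricting the method to bigrams and trigrams.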
Example 2 Figure 4 shows a spam Web document D (http://khs.co.uk) in which the word
KHS in its title T is repeated in its body B, yielding SimT B(T, B) = 0.84 when Equation 4
is applied to the (unigram) word-similarity measures. Using the word-similarity threshold
value of 0.80, as defined in Section 3.2.2, D is misclassified as legitimate. However, when
considering the bigrams in T

Figure 4: A sample spam Web document that is misclassified as legitimate when the
(unigram) word-similarity measure is applied, but is correctly classified as spam when the
(bigram) phrase-similarity value is considered




        (a) Bigram-similarity threshold values        (b) Trigram-similarity threshold values


Figure 5: Number of FPs and FNs computed by using different possible bigram- and
trigram-similarity threshold values on the Web documents in the Threshold set

and B, SimT B(T, B) = 0.57; using the bigram-similarity threshold value (defined below),
D is correctly classified as spam. □

3.4.2   The phrase-similarity threshold value
Prior to using the phrase-correlation factors, we define the bigram- (trigram-, respectively)
similarity threshold value V so that for any Web document D, if SimT B(TD , BD ) ≥ V ,
where SimT B(TD , BD ) is computed by using the bigram- or trigram-correlation factors,
then D is considered legitimate; otherwise, D is treated as spam.
    In determining an ideal phrase-similarity threshold, we use the same Threshold set (in
Section 3.2.2) to compute the number of FPs and FNs according to different potential
phrase-similarity threshold values and choose the value V such that the total number of
FPs and FNs at V is reduced to a minimum. Figure 5(a) shows that the optimal
bigram-similarity threshold value is 0.75, whereas Figure 5(b) indicates that the optimal
trigram-similarity threshold value is 0.65.
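The threshold search described above can be sketched as a scan over candidate values that counts the misclassifications each candidate would produce on the Threshold set; the score lists in the usage example below are hypothetical.

```python
def best_threshold(legit_scores, spam_scores, candidates):
    """Pick the similarity-threshold value V that minimizes FPs + FNs on
    a Threshold set. A document is labeled legitimate when its score is
    >= V, so a legitimate document scoring below V is a false positive
    (flagged as spam) and a spam document scoring at or above V is a
    false negative."""
    def errors(v):
        fps = sum(1 for s in legit_scores if s < v)
        fns = sum(1 for s in spam_scores if s >= v)
        return fps + fns
    return min(candidates, key=errors)
```

For instance, with hypothetical SimT B scores `[0.9, 0.8, 0.85, 0.76]` for legitimate pages and `[0.3, 0.5, 0.72]` for spam pages, scanning the candidates `[0.5, 0.65, 0.75, 0.85]` selects 0.75, which separates the two groups with no misclassifications.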




4     An enhanced similarity-measure method
We have considered alternative approaches to augment the use of phrase-correlation factors
in computing the SimT B value of a Web document that can further enhance the perfor-
mance of our spam-detection approach, i.e., minimizing the number of misclassified Web
documents. An alternative approach is to determine the similarity among n-gram phrases5
(2 ≤ n ≤ 3) in the title T with respect to the ones in the body B of a Web document D
and penalize D with a lower SimT B value if B contains phrases that are similar to only
a few phrases in T and reward D with a higher SimT B value if B contains phrases which
are closely related to a number of phrases in T .

4.1     The enhanced phrase-similarity approach
The enhanced phrase-similarity approach assures that if each phrase in the title T is closely
related to (or matches exactly with) a phrase in the body B, then the corresponding
Web document D is more likely legitimate; otherwise, D is likely spam. We compute
the enhanced similarity value between T and B of D by calculating the sum of the phrase-
correlation factor of each bigram (trigram, respectively) pt in T with respect to each bigram
(trigram, respectively) phrase in B, i.e.,
                                  spcfpt,B = Σ_{j=1}^{m} pcfpt,j                                    (8)


where pcfpt,j is the phrase-correlation factor as defined in Equation 7 and m is the total
number of the bigrams (trigrams, respectively) in B.
   Once the spcf value of each bigram (trigram, respectively) in T has been calculated, we
can compute the enhanced degree of similarity between T and B, denoted enSimT B, as
                               enSimT B(T, B) = Σ_{i=1}^{n} Min(spcfi,B , 1)                        (9)

where n is the total number of bigrams (trigrams, respectively) in T .
    In calculating the enSimT B value, we take the minimum of 1 and the spcf value
of each n-gram phrase pti in T . We do so in order to restrict the similarity value of each
pti in T with respect to the ones in B to 1, which is the similarity value for an exact match;
otherwise, the spcf value could be given too much weight relative to an exact match, which
could raise the enSimT B value much higher than necessary based on a “few” good (or
exact) matches.
    To avoid the length bias in T , we normalize an enSimT B value as

                               EnSimT B(T, B) = enSimT B(T, B) / n                                 (10)
where n is the total number of bigrams (trigrams, respectively) in T , and 0 ≤ EnSimT B(T ,
B) ≤ 1.
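Equations 8 through 10 can be combined into a single routine; the `pcf` argument below stands in for the phrase-correlation factor of Equation 7 and is a hypothetical callable in this sketch.

```python
def en_sim_tb(title_phrases, body_phrases, pcf):
    """EnSimTB following Equations 8-10. `pcf` is a hypothetical function
    returning the phrase-correlation factor of two n-grams (Equation 7)."""
    if not title_phrases:
        return 0.0
    total = 0.0
    for t in title_phrases:
        spcf = sum(pcf(t, b) for b in body_phrases)  # Equation 8
        total += min(spcf, 1.0)                      # capped term of Equation 9
    return total / len(title_phrases)                # normalization of Equation 10
```

The cap at 1 keeps a single exactly matching title phrase from dominating the score, and the final division removes the bias toward long titles.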
   5
     In the case when no bigrams or trigrams are available in the title T , i.e., when only one word remains
in T after stopword removal and stemming, the (single) word-similarity will be considered.

  (a) Bigram EnSimT B threshold values       (b) A sample spam Web document misclassified
                                             by the SimT B value


Figure 6: Determining the ideal bigram EnSimT B threshold value and a classification
example using the SimT B versus EnSimT B value based on the bigram-similarity value

4.2    The EnSimT B threshold value
We define the appropriate threshold value for EnSimT B, which yields the cut-off value
between spam and legitimate Web documents. Using the same Threshold set and different
possible threshold values, we determine the number of FPs and FNs for each of the possible
thresholds for EnSimT B. As shown in Figure 6(a), the bigram EnSimT B-threshold value
should be 0.67, which yields the minimal sum of FPs and FNs. (Note that the trigram
EnSimT B-threshold value is not computed, since, as shown in Section 5, bigrams outperform
trigrams in the similarity measure; hence, we only consider bigrams from here on.)

Example 3 Table V shows how closely related some of the bigrams in the title of the spam
Web document D in Figure 6(b) are to the ones in its body. Using the bigram SimT B
value of D, which is 0.75, D is misclassified as legitimate, since SimT B(TD , BD ) ≥ 0.75,
the bigram SimT B threshold value. However, when the EnSimT B value is considered
instead, D is correctly classified as spam, since EnSimT B(TD , BD ) = 0.5 < 0.67, the
bigram EnSimT B threshold value. □


4.3    Verifying the threshold values
To validate the correctness of the threshold values, i.e., HC-threshold, word-similarity
threshold, bigram-similarity threshold, trigram-similarity threshold, and bigram En-
SimT B-threshold, used in our spam-detection approach, we conducted an empirical study
using six disjoint subsets with 2,000 Web pages each randomly-selected from WEBSPAM-
UK2006. Out of the six collections, two contain 50% of spam Web pages, whereas the
remaining four include 20%, 30%, 60%, and 80% spam Web pages, respectively.
As shown in Table VI, with the exception of the trigram-similarity threshold, the accuracy
ratio generated by each of the six subsets using the HC-threshold, word-similarity

Bigrams in                                      Bigrams in the Body
the Title         Cash advance   Advance payday   Credit report   Card debt    ...     spcf value   Min(spcf , 1)
Loan advice       1.5×10−9       2.1×10−14        5.0×10−16       6.5×10−16    ...     2.14            1.00
Advice site       2.7×10−16      2.6×10−15        4.5×10−15       2.9×10−15    ...     7.2×10−12     7.2×10−12
Site loan         4.5×10−13      1.5×10−16        1.3×10−15       2.3×10−15    ...     9.7×10−10     9.7×10−10
Loan quote        4.4×10−15      4.1×10−13        2.4×10−16       5.9×10−14    ...     3.4×10−14     3.4×10−14
...               ...            ...              ...             ...          ...     ...              ...

Table V: The bigram-similarity values and spcf values for some of the bigrams in the title
T of the Web document in Figure 6(b) with respect to the bigrams in its body B

threshold, bigram-similarity threshold, or bigram EnSimT B-threshold, respectively, for
detecting (non-)spam Web pages remains relatively consistent. The HC-threshold, how-
ever, has generated a lower accuracy ratio6 (with respect to its maximum and minimum,
as well as its accuracy ratio range, i.e., between 52% and 65%). For this reason, we have
re-computed and re-evaluated the HC-threshold. The re-computed HC-threshold value,
which is set to 0.60, is based on the numbers of FPs and FNs generated by different
potential threshold values using a set of 1,000 Web pages from WEBSPAM-UK2006, out of
which 450 are spam Web pages, matching the percentage of spam used in the Threshold Set
(introduced in Section 3.2.2) for defining different threshold values. Furthermore, adopting
the same threshold evaluation strategy presented in Section 3.2.2, we verified the correctness
of the new HC-threshold value using a new threshold set with the same settings as
Threshold Set S. (See Footnote 3.) We then re-evaluated the new HC-threshold using
the same six subsets of Web pages. As shown in Table VI, using the new
HC-threshold value on the six subsets of Web pages, the accuracy remains consistent. In
addition, the appropriateness of the established threshold values7 is further confirmed by
the high accuracy value in detecting spam Web pages achieved using various large test sets.
(See Section 5 for details.)
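The per-threshold summary reported in Table VI, including the largest difference MAX(Ave-Min, Max-Ave), can be reproduced with a small helper; the accuracy ratios in the usage example below are illustrative, not the measured ones.

```python
def consistency(accuracy_ratios):
    """Summarize per-subset accuracy ratios as in Table VI: the minimum,
    maximum, average, and the largest difference from the average,
    i.e., MAX(Ave - Min, Max - Ave)."""
    lo, hi = min(accuracy_ratios), max(accuracy_ratios)
    ave = sum(accuracy_ratios) / len(accuracy_ratios)
    return lo, hi, ave, max(ave - lo, hi - ave)
```

A small largest difference, as in the New HC row of Table VI, indicates that a threshold behaves consistently across subsets with very different spam percentages.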

4.4     The overall spam-detection process
Figure 7 shows the overall process of our spam-detection approach, which is described as
follows: (i) when analyzing a Web document D, if D is detected without a title, then (ii)
the HC (i.e., Hidden Content) value v of D is computed so that if v ≥ HC-threshold (<
HC-threshold, respectively), i.e., 0.60, then D is treated as legitimate (spam, respectively).
Otherwise, i.e., D contains a title, (iii) the SimT B value e of D using the chosen type of
n-gram (1 ≤ n ≤ 3) is computed and if e ≥ the corresponding n-gram similarity threshold,
i.e., 0.80 for unigrams, 0.75 for bigrams, and 0.65 for trigrams, then (iv) the HC value v of
D is calculated. (The HC value is evaluated at this stage as an additional step to provide
further evidence for classifying D as legitimate or spam.) If v ≥ HC-threshold (< HC-
   6
     The maximum, minimum, and average accuracy ratios generated by the HC-threshold value are compa-
rable to the corresponding ones generated by the trigram-threshold value, which is not a reliable measure
in detecting spam Web pages and thus is eventually excluded from consideration in our spam-detection
process. (See the experimental results presented in Section 5 for a detailed discussion.)
   7
     From now on, whenever we refer to HC-threshold value, we mean the new HC-threshold value.

     Threshold                    Accuracy Ratio                  Largest Difference, i.e.,
                        Min(imum)  Max(imum)  Ave(rage)          MAX(Ave-Min, Max-Ave)
HC                        0.52       0.65       0.61                      0.09
Word-similarity           0.61       0.73       0.66                      0.07
Bigram-similarity         0.66       0.81       0.73                      0.08
Trigram-similarity        0.51       0.68       0.66                      0.15
Bigram EnSimT B           0.62       0.78       0.72                      0.10
New HC                    0.74       0.77       0.75                      0.02

Table VI: The minimum, maximum, and average accuracy values generated for each
threshold value using the six disjoint subsets of randomly-selected Web pages from the
WEBSPAM-UK2006 dataset




                    Figure 7: The overall Web spam-detection process

threshold, respectively), then D is classified as legitimate (spam, respectively). Otherwise,
i.e., e < the corresponding n-gram similarity threshold, (v) the EnSimT B value E of D (on
bigrams) is computed. (We calculate, as part of our spam-detection process, the EnSimT B
value as an extra step for reducing the number of FPs and FNs that could be generated.)
If E ≥ the bigram EnSimT B-threshold, i.e., 0.67, then D is considered legitimate; otherwise,
D is categorized as spam. Note that since using bigrams considerably reduces the number
of FPs, as well as FNs, compared with using unigrams or trigrams (see Section 5), we do
not consider unigrams or trigrams any further in Step (v).
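The decision flow above can be sketched as follows, assuming bigrams are the chosen n-gram type; the document record and its field names are hypothetical, while the default thresholds (0.60, 0.75, and 0.67) are the values established earlier.

```python
def classify(doc, hc_t=0.60, sim_t=0.75, en_t=0.67):
    """Sketch of the overall detection flow of Figure 7, assuming bigrams.

    `doc` is a hypothetical record holding a `title` flag, the hidden-content
    value `hc`, and precomputed `sim_tb` and `en_sim_tb` scores; the default
    thresholds are the ones derived in Sections 3 and 4.
    """
    if not doc["title"]:                   # steps (i)-(ii): no usable title
        return "legitimate" if doc["hc"] >= hc_t else "spam"
    if doc["sim_tb"] >= sim_t:             # steps (iii)-(iv): title matches body
        return "legitimate" if doc["hc"] >= hc_t else "spam"
    # step (v): fall back on the enhanced bigram similarity
    return "legitimate" if doc["en_sim_tb"] >= en_t else "spam"
```

Note that the HC check in step (iv) serves as additional evidence even when the title and body are similar, so a high SimT B score alone does not guarantee a legitimate label.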


5    Experimental results
In this section we discuss the two datasets (in Section 5.1) used for our empirical study and
show the accuracy of using n-gram (1 ≤ n ≤ 3) phrases in our spam-detection approach

(in Section 5.2), which verifies the effectiveness of our approach in detecting spam Web
documents (in Section 5.3). In addition, we compare the performance of our spam-detection
approach with other well-known anti-spam methods (in Section 5.4) and evaluate and verify
the consistency of our approach in accurately detecting spam documents using corpora
samples of Web pages with different percentages of spam (in Section 5.4). We include a
case study (in Section 5.6) to demonstrate the degrees of accuracy in detecting spam Web
documents at various steps of the entire spam-detection process as shown in Figure 7.

5.1     Web document dataset
In verifying the effectiveness of our spam-detection approach in terms of accuracy, which
is measured by the number of correctly classified Web documents as spam or legitimate
versus the number of F P s and F Ns, we used nine subsets of randomly-selected Web pages
in the WEBSPAM-UK2006 dataset8 , which (as stated in Section 3.2.2) contains 77.9 million
Web documents. Each of the nine subsets consists of 1,500 Web pages, and the
percentage of spam in each subset varies from 10% to 90% in 10% increments. In order to
further evaluate the overall performance of our spam-detection approach, in Section 5.3 we
used another nine randomly-sampled subsets of 1,500 Web documents each, with a different
percentage of spam (in the range of 10% to 90%), from the WEBSPAM-UK2007 dataset
[30]. The WEBSPAM-UK2007 dataset contains 105,896,555 Web pages downloaded from
114,529 hosts in the .UK domain in May 2007 that were previously labeled as spam and
non-spam (see http://barcelona.research.yahoo.net/webspam/datasets/uk2007 for details).
The reported measures in Section 5.3, i.e., the number of F P s and F Ns, accuracy, and er-
ror rate, are computed by averaging the corresponding measures generated by each of the
nine corresponding subsets of the WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets.
    As stated in [7], WEBSPAM-UK2006 is appropriate and widely used for verifying the
accuracy of a given spam-detection approach, since the collection (i) includes a large variety
of spam and non-spam Web documents, (ii) represents a uniform random sample, (iii)
consists of spam Web documents created by using different spam techniques, and (iv) is
freely available to be used as a benchmark in detecting spam Web documents.
These properties also apply to WEBSPAM-UK2007.

5.2     Accuracy of our approach in using SimT B with(out) the HC-
        value
We first verify (i) the effectiveness of our spam-detection approach in using n-grams (1 ≤
n ≤ 3) and (ii) the most accurate n-gram phrases in determining the SimT B value between
the title and the body of a Web document using the following measures:

                  Accuracy = Correctly identified Web documents / Total number of Web documents    (11)
                  Error Rate = 1 − Accuracy                                                        (12)
   8
     Web pages in different subsets of WEBSPAM-UK2006 are different from the ones used earlier for setting
and verifying the appropriateness of the threshold values.

       (a) The accuracy and error rates           (b) The number of FPs and FNs


Figure 8: Experimental results on using n-gram (1 ≤ n ≤ 3) phrases based on the computed
SimT B values of the Web documents in the subsets of the WEBSPAM-UK2006 dataset



where correctly identified Web documents is the total number of analyzed Web documents
minus the FPs and FNs.
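Equations 11 and 12 reduce to the following helper; the FP and FN counts in the usage example are hypothetical.

```python
def accuracy_and_error(total_docs, fps, fns):
    """Equations 11 and 12: the correctly identified documents are the
    analyzed documents minus the FPs and FNs."""
    accuracy = (total_docs - fps - fns) / total_docs
    return accuracy, 1.0 - accuracy
```

For example, a subset of 1,500 documents with 200 FPs and 190 FNs yields an accuracy of 74% and an error rate of 26%.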
    As shown in Figure 8(a), using bigram SimT B values of the Web documents in the
various subsets of WEBSPAM-UK2006, our spam-document detection approach yields the
(average) accuracy and error rate of 74% and 26%, respectively, which outperforms the
unigram and trigram approaches. Furthermore, Figure 8(b) shows the number of FPs and
FNs of different n-grams (1 ≤ n ≤ 3) in misclassifying Web documents. According to the
figure, bigrams significantly reduce the number of FPs and FNs compared with the numbers
generated by using unigrams or trigrams based on their corresponding
computed SimT B values.
    We have observed that bigrams outperform trigrams because (closely) related 3-word
phrases in the title and body of a Web document occur less often than (closely) related
2-word phrases. As a result, the degree of similarity between the title and the body of a
Web document is lower using trigrams than bigrams, causing a higher number of FPs and
FNs.
    We have further compared how well our spam-detection approach performs when con-
sidering both bigram-similarity among the words in the title T and the body B of a given
Web document D and the fraction of hidden content of D (Method A), i.e., Steps (iii) and
(iv) in Figure 7, as opposed to only considering the bigram-similarity measure between T
and B of D (Method B), i.e., Step (iii) only, using the SimT B values. Figure 9(a) shows
that the accuracy of our approach is increased by 5% when applying Method A rather than
Method B.

5.3    The overall accuracy of our spam-detection approach
We have conducted further comparisons using the bigram phrase-SimT B measure with(out)
the EnSimT B values on the subsets of Web pages in the WEBSPAM-UK2006 dataset, i.e.,
Step (iii) with(out) Step (v) in Figure 7. As shown in Figure 9(b), the accuracy ratio is
increased by 6%, yielding an accuracy ratio of 80%, in detecting spam Web pages using


 (a) Accuracy using Method A (bigram SimT B +        (b) Accuracy and Error Rates of the SimT B
 HC) and Method B (bigram SimT B only)               values with(out) the EnSimT B values


Figure 9: Experimental results computed on the samples of Web documents in the
WEBSPAM-UK2006 dataset

bigrams based on the SimT B and EnSimT B values, instead of using solely the SimT B
values.
    Moreover, by considering the HC-value, as well as the EnSimT B value (i.e.,
Steps (iv) and (v) in Figure 7), in addition to the SimT B value, we further reduce the
number of FPs and FNs (as shown in Figure 10(a)) and obtain an overall accuracy ratio
of 85% for the nine subsets of Web pages from the WEBSPAM-UK2006 dataset (as shown
in Figure 10(b)) without significantly increasing the computational cost, since it requires
only O(n) time to calculate the HC value, where n is the number of characters in a Web
document D, and O(m²) time to compute the EnSimT B value, where m is the number of
bigrams in D.
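A single-pass sketch of the HC computation is shown below; the paper defines HC as the ratio of markup size to document size, and treating every character inside angle brackets as markup is a simplifying assumption of this sketch.

```python
def hidden_content_value(html):
    """O(n) sketch of the HC value: the fraction of the n characters in
    the document that belong to markup, here approximated as the
    characters inside angle-bracketed tags, counted in one pass."""
    markup, in_tag = 0, False
    for ch in html:
        if ch == "<":
            in_tag = True
        if in_tag:
            markup += 1
        if ch == ">":
            in_tag = False
    return markup / len(html) if html else 0.0
```

A single scan suffices because each character is inspected exactly once, which is what keeps the HC step linear in the document size.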
    The accuracy of our proposed approach for detecting spam Web pages using the nine
subsets of Web documents from the WEBSPAM-UK2007 is 84%. As confirmed by the
empirical study conducted on (the samples of) two well-known datasets, i.e., WEBSPAM-
UK2006 and WEBSPAM-UK2007, our detection approach achieves an average accuracy
ratio of 84.5%, which verifies the effectiveness of our spam-detection approach.
    Besides assessing the performance of our detection approach using subsets of Web docu-
ments with different percentages of spam, we conducted yet another empirical study using a
subset of 20,000 randomly-selected Web pages from WEBSPAM-UK2006, denoted W S06,
and another subset of 20,000 Web pages extracted from the WEBSPAM-UK2007, denoted
W S07. Since the percentage of spam documents on the Web these days is between 14%
and 22%, as stated in [7] and [23], both W S06 and W S07 contain an average percentage
of spam Web pages, which is 18% (i.e., (14% + 22%)/2). Based on the experimental results, the
averaged accuracy yielded for W S06 and W S07 is 83%, which is comparable with the one
generated by using smaller subsets of documents and further verifies the efficiency and
scalability of our spam-detection method.



           (a) Computed F P s and F N s               (b) Computed Accuracy-Error Rates


Figure 10: Experimental results of applying various steps in our spam detection approach
on the (sampled) Web documents in the WEBSPAM-UK2006 dataset

5.4    Comparing the performance of our spam-detection approach
       with other anti-spam methods
We further compare the performance (in terms of precision and recall) of our spam-detection
approach with other well-known anti-spam methods in [8], which consider link-based [1]
and content-based [23] features, and the combination of both. The features described in
[8], which include the degree-related measures, PageRank, TrustRank [14], and features
described in [23], such as the number of words and average word length in a document,
serve as inputs to the C4.5 decision tree. Furthermore, the authors of [8] use the
aggregation of spam hosts to enhance the spam-detection accuracy by (i) implementing a
graph clustering algorithm that evaluates whether the majority of hosts in a cluster C are
spam and if so all the hosts in C are considered spam; (ii) applying the graph topology
to smooth “spamicity” predictions by propagating them using random walks [31]; and (iii)
using a stacked graphical learning scheme [10], which derives initial predictions for all the
objects in a group of Web documents and generates extra features for each object to improve
the quality of the original predictions.
    In comparing the existing anti-spam methods listed earlier with ours, we consider the
evaluation method defined in [8], which adapts the following confusion matrix:

                                               Prediction
                                            Non-Spam Spam
                          True     Non-Spam     a         b
                          Label    Spam         c         d
Castillo et al. [8] compute the True Positive Rate (or recall) = d/(c + d), the False Positive
Rate = b/(a + b), and the F-Measure = (2 × precision × recall)/(precision + recall), where
precision is defined as d/(b + d). High recall and precision translate into a high F-measure,
whereas low precision and recall yield a low F-measure. Furthermore, high (low, respectively)
recall and low (high, respectively) precision generate a low F-measure.
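Using the confusion-matrix cells a, b, c, and d defined above, the evaluation measures of [8] can be computed as follows; the cell counts in the usage example are hypothetical.

```python
def spam_metrics(a, b, c, d):
    """Metrics from the confusion matrix in [8]: a and b are the non-spam
    documents predicted non-spam and spam, respectively; c and d are the
    spam documents predicted non-spam and spam, respectively."""
    recall = d / (c + d)       # True Positive Rate
    fpr = b / (a + b)          # False Positive Rate
    precision = d / (b + d)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, fpr, precision, f_measure
```

The harmonic-mean form of the F-measure explains the behavior noted above: it stays low unless precision and recall are both high.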

Figure 11: The False Positive Rate (FPR), True Positive Rate (TPR), and F -Measure
computed by using the WEBSPAM-UK2006 dataset applied to the approaches in [8] and
ours

    Figure 11 shows the experimental results reported in [8] for different Web anti-spam
detection methods using the respective classifier with the highest F -Measure, as well as
the results generated by using our approach, on the WEBSPAM-UK2006 dataset. Our
spam-detection method outperforms the other anti-spam methods by at least 5% in True
Positive Rate and at least 3% in terms of F-measure, which indicates that we obtain high
recall and comparable precision with respect to other approaches in detecting spam Web
documents. The empirical study has verified that our spam-detection approach correctly
identifies (almost) all spam Web documents and avoids misclassifying many legitimate Web
documents.
    To further assess the performance of our Web spam-detection approach we compare
our accuracy ratios with the ones generated by the spam-prediction method introduced
in [29] using another nine9 different subsets of documents. As opposed to our detection
approach, which relies on the content and structure of Web pages to detect spam documents,
the spam-prediction method in [29] relies on HTTP session information. The
prediction method analyzes hosting IP addresses, as well as HTTP session headers, such as
“Content-Type”, “Server”, “X-Powered-By”, “Content-Language”, or “Pragma”, to train
classification algorithms, such as C4.5, HyperPipes, Logistic regression, or Support Vector
Machine (SVM), to identify (non-)spam Web pages [29].
    To perform a compatible evaluation between our spam-detection approach and the one
in [29], each of the nine subsets with the corresponding percentage of spam Web pages
used for evaluation contains 1,486 Web documents from the WEBSPAM-UK2006 dataset,
the same number of pages used for conducting the evaluation measures in [29]. Figure 12
shows, for each subset of Web pages, the corresponding accuracy ratios of the two spam-
detection/prediction methods. As stated in [29], the classifier's performance is relatively
consistent for subsets that contain between 30% and 70% spam Web pages, but varies
considerably at the extremes. The accuracy ratios of our proposed approach, on the other hand,
  9
      Once again the percentage of spam Web pages in each subset varies from 10% to 90%.


Figure 12: Accuracy ratios achieved by our spam-detection approach and the Web spam-
prediction method in [29] on different corpus samples extracted from the WEBSPAM-
UK2006 dataset

steadily increase as the percentage of spam pages in a collection increases. Furthermore,
the overall accuracy of our spam-detection approach (as shown in Figure 12) is higher than
the averaged accuracy of the approach in [29] by 3%. Since the spam-prediction method in
[29] performs better than our approach for collections of Web pages with low percentages of
spam, i.e., below 40%, we could consider using the HTTP session information, in addition
to our content- and structure-based analysis approach in classifying spam Web pages, which
should further enhance our spam-detection method.

5.5    Observations
The anti-spam methods in [8] combine widely-used algorithms, such as C4.5 decision tree
or (graph) clustering, with known link-based and content-based spam-detection strategies,
which are representative of the commonly-used methods in classifying (non-)spam Web
documents. Compared with the performance of these anti-spam methods, which were
used for verifying the higher degree of accuracy of our spam-detection approach
in Section 5.4, we conclude that ours is more effective and simpler. Our
approach is effective, since on average we can correctly identify 84.5% (the accuracy
ratio as reported in Section 5.3) of the evaluated Web documents, and is simple, since our
detection approach only requires computing the (enhanced) similarity values among the
words in the title and body of a given Web page and/or its percentage of hidden content
to identify spam Web documents. The cost-sensitive decision tree [8], on the other hand,


requires computing different link-based measures (as discussed in [1]) and considers many
content-based features (as presented in [23]) to construct a decision tree that classifies
(non-)spam Web pages and needs to be trained using previously labeled data (i.e., Web
documents labeled as spam/legitimate) so that it can later be used as a classifier. Hence,
the cost-sensitive decision tree requires additional pre-processing time for detection, which
is a constraint.
    Another anti-spam method, clustering [8], not only requires a classification step, but
also applies a graphical clustering algorithm that considers the hosts that contain the Web
pages to be evaluated. In one case, propagation [8] is used for smoothing the probabilities
that are associated with each host, which establish the likelihood of the host being spam.
In another case, the stacked graphical learning strategy [8] is applied, which considers a
stacked graphical learning scheme that iteratively yields new features that describe hosts
and are later used as additional evidence to improve the accuracy of correctly detecting
spam Web pages. Unfortunately, these methods require one or more additional steps
to determine hosts' influence on (i.e., provide additional information to) the decision-tree
classifier for identifying (non-)spam Web pages. Moreover, these additional steps do not
generate a higher degree of accuracy compared with our spam-detection approach in detecting
spam documents.
    Furthermore, for each of the anti-spam methods presented in [8], the authors apply
bagging, a technique that combines available classifiers, i.e., decision trees. This
technique requires building and training an ensemble of classifiers, which translates into
yet another process to be implemented with the purpose of augmenting the accuracy
in detecting spam Web pages. It is worth mentioning that the initial step, i.e., building
a classifier, in all of the anti-spam methods in [8] requires training to construct a decision
tree, which is not required by ours.
    As shown in Figure 11, among all the anti-spam approaches in [8], even though the
stacked graphical learning approach achieves the highest degree of accuracy, it is at least
5% less accurate (in terms of True Positive Rate) than ours. While Castillo et al. [8]
claim that the stacked graphical learning approach is scalable and can be used on large Web
datasets of any size, it is clear that it is not as simple (in terms of implementation) as our
proposed spam-detection approach, which depends solely on the content of Web documents
and the availability of a pre-defined word-correlation matrix.

5.6     A case study
In designing our spam-detection approach, we rely on several methods—percentage of hid-
den content, unigram similarity, n-gram similarity (2 ≤ n ≤ 3), and enhanced n-gram
similarity measures. In this section we present a case study conducted to analyze the
effectiveness of applying the content-similarity methods, which depend on the (words in
the) content of Web documents, and the structure-based method, which relies on the
mark-up content of Web pages, in accurately identifying spam Web pages. We con-
structed another nine disjoint subsets with 1,000 randomly-selected Web pages each from
the WEBSPAM-UK2006 dataset10 . Again, each of the subsets contains a different percent-
  10
    The randomly selected Web pages are different from the subsets used for establishing and verifying the
different threshold values (in Sections 3 and 4) and the subsets of Web pages used in Section 5.1.


Figure 13: (Average) False Positives, False Negatives, and overall error rate computed using
nine subsets of randomly-selected pages from the WEBSPAM-UK2006 dataset

age of spam, varying from 10% to 90% in 10% increments, to assess the consistency of our
spam-detection approach. On average, among all the subsets, 883 Web pages come with
a title, whereas the remaining 117 do not. Of the average 117 Web pages with no title, 76
are spam and 41 are legitimate.
    Figure 13 depicts the (average) false positives, false negatives, and the overall error rate
using different tactics in our spam-detection approach. As shown in the figure, by applying
both the semantic-based and structure-based methods for detecting spam Web pages, we
can reduce the number of false positives and false negatives and obtain a low (average)
error rate of 15% (i.e., 85% accuracy), which verifies that the semantic and structural
analyses complement each other in classifying spam/legitimate Web documents.
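The overall error rate quoted above is simply the fraction of misclassified pages. The following minimal sketch makes the computation explicit; the individual false-positive/false-negative counts in the usage comment are illustrative assumptions, only the 15% total is from the case study:

```python
def error_rate(false_positives: int, false_negatives: int, total: int) -> float:
    """Overall error rate as reported in Figure 13: the fraction of
    pages misclassified in either direction."""
    return (false_positives + false_negatives) / total

# With 1,000 pages per subset, any split of 150 misclassifications
# (e.g., a hypothetical 90 FP + 60 FN) yields the reported 15% error
# rate, i.e., 85% accuracy.
```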


6     Complexity analysis and implementation
In this section, we analyze the complexity of our spam-detection algorithm, SpamDe, which
includes all the steps in the overall spam-detection process shown in Figure 7, and discuss
its implementation.

    Algorithm: SpamDe—Detecting (non-)spam Web documents

  Input: A Web document D, the word-correlation matrix M, HC threshold,
     SimT B threshold, EnSimT B threshold, the n-gram indicator n (1 ≤ n ≤ 3)
  Output: Classified D as (non-)spam
  1. Let S be the size (in characters) of D
  2. Let P be the size of the markup content of D
  3. Let HC be the percentage of hidden content in D /* HC = P /S */
  4. Let V := n be the variable used for altering the n-gram indicator n, if needed
  5. IF (the title T in D is missing) OR (there is no non-stop, stemmed word in T ), THEN
       5.1. IF HC ≥ HC threshold, THEN
          Label D as Legitimate
       5.2. ELSE
          Label D as Spam
     ELSE
  6. Let L be the number of non-stop, stemmed words in the title T of D
  7. IF L < n, THEN V := L /* there are insufficient words in T to perform
       n-gram phrase comparisons */
  8. FOR each V -gram g in the title of D
       8.1. IF V > 1, THEN /* Detection based on bigrams or trigrams */
          8.1.1. Compute and store the phrase-correlation factor, pcf , for g and each
               V -gram in the body of D using Equation 7 and M
          8.1.2. Compute the similarity µ value of g with respect to the V -grams in
               the body of D using Equation 3 and the pcf s computed in Step 8.1.1
     8.2. ELSE
          8.2.1. Compute the similarity µ value of g with respect to the unigrams in the
            body of D using Equation 3 and the word-correlation factors, cf s, in M
  9. Compute the degree of similarity of the title T and body B of D using Equation 4
     9.1. IF SimT B(T, B) ≥ SimT B threshold, THEN
       9.1.1. IF HC ≥ HC threshold, THEN
               Label D as Legitimate
       9.1.2. ELSE
               Label D as Spam
       ELSE
     9.2. FOR each V -gram g in the title of D
       9.2.1. Compute the sum of the pcf s (or cf s for the unigram case) for g with respect
            to the V -grams in the body of D using Equation 8 /* pcf s of g and V -grams
            were computed in Step 8.1.1 */
     9.3. Compute the enhanced degree of similarity of T and B using Equation 10
     9.4. IF EnSimT B(T, B) ≥ EnSimT B threshold, THEN
       9.4.1. Label D as Legitimate
          ELSE
       9.4.2. Label D as Spam
  END
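To make Steps 1 through 3 of SpamDe concrete, the Python sketch below computes the hidden-content percentage HC = P/S. How exactly the markup content P is delimited is an assumption made here for illustration (every character inside <...> tags is counted as markup); the algorithm above does not fix that detail:

```python
import re

def hidden_content_percentage(document: str) -> float:
    """Sketch of Steps 1-3 of SpamDe: HC = P / S, where S is the size of
    the document in characters and P is the size of its markup content.
    Counting every character inside <...> tags as markup is an assumption."""
    s = len(document)                                   # Step 1: S
    if s == 0:
        return 0.0
    # Step 2: P, the total number of characters occupied by markup tags
    p = sum(len(tag) for tag in re.findall(r"<[^>]*>", document))
    return p / s                                        # Step 3: HC = P / S
```

The resulting HC value would then be compared against HC threshold, as in Steps 5.1, 9.1.1, and 9.1.2.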

    The complexity analysis of SpamDe is based on the n-gram (1 ≤ n ≤ 3) detection
strategies presented in Section 3.4, even though we have already verified in Section 5 that
the use of bigrams outperforms the use of unigrams and trigrams. By supporting any n-
grams, SpamDe can handle the detection of (non-)spam Web documents according to the
user’s preference on which n-grams to use. However, if the number of non-stop, stemmed
words in the existing title of a given Web document D is less than n chosen by the user,
SpamDe resets the value of n to be the number of non-stop, stemmed words in the title
of D (in Step 7 of SpamDe). As a result, the user's choice of the n-grams used for spam
detection on D could be overridden by SpamDe. This is a legitimate strategy, because
n-grams cannot be used for spam detection unless the title of the document includes at
least one non-stop, stemmed n-gram. Furthermore, although the computed EnSimT B-
threshold value mentioned in Section 4.2 is for bigrams only, the EnSimT B-threshold
value for unigrams or trigrams can be computed in the same manner using the same (or
different) set of previously labeled (non-)spam Web documents and selecting the threshold
value that yields the minimal number of false positives and false negatives, i.e., misclassified
Web documents.
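The threshold-selection procedure just described (over a labeled set, choose the value that minimizes false positives plus false negatives) can be sketched as follows. The (score, is_spam) pairs, the candidate grid, and the rule that a score below the threshold means spam are assumptions for illustration, not details fixed by the paper:

```python
def pick_threshold(labeled_scores, candidates):
    """Choose the (En)SimTB threshold that minimizes the number of
    misclassified documents over previously labeled (non-)spam pages.
    labeled_scores: list of (similarity_score, is_spam) pairs.
    Assumption: a document scoring below the threshold is labeled spam."""
    def misclassified(t):
        # A page is misclassified when its predicted label (score >= t
        # means legitimate) disagrees with its true label.
        return sum(1 for score, is_spam in labeled_scores
                   if (score >= t) == is_spam)
    return min(candidates, key=misclassified)
```

For example, over a small labeled sample, the candidate threshold with the fewest misclassifications is returned as the operating threshold.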
    Steps 1 through 7 of SpamDe take constant time, since they either perform an assignment
or an (in)equality comparison, whereas Steps 8 and 9 are the
dominant steps in SpamDe in terms of time complexity.
    The dominant sub-step in Step 8 is the step that determines the µ value (the complement
of the negated products) between the m different n-grams (1 ≤ n ≤ 3) in the title of a Web
document with respect to the k different n-grams (1 ≤ n ≤ 3) in its body, i.e., Step 8.2.1
when unigrams are considered and Step 8.1.2 when bigrams or trigrams are considered,
which is O(k). Note that computing the phrase-correlation factors, i.e., Step 8.1.1, also
requires O(k), since it involves using the corresponding word-correlation factors of the n-
gram in the title being considered and each n-gram in the body of D, where n (2 ≤ n ≤ 3)
is insignificant. Thus, the overall complexity of Step 8 is O(m × k), where, in general, k is
significantly larger than m, i.e., k ≫ m, because the number of non-stop, stemmed words
in the body of D is often much larger than that in its title.
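The dominant computation in Step 8 can be sketched as below, reading Equation 3 as the complement of the product of negated correlation factors, as described above. The correlation lookup cf is a stand-in for the precomputed matrix M (an assumption for illustration):

```python
def mu(title_gram, body_grams, cf):
    """Similarity of one title n-gram against the k n-grams in the body,
    sketched as mu = 1 - prod over body grams b of (1 - cf(title_gram, b)).
    Each call iterates over the k body grams, i.e., O(k)."""
    product = 1.0
    for b in body_grams:
        product *= 1.0 - cf(title_gram, b)
    return 1.0 - product

def title_body_similarities(title_grams, body_grams, cf):
    # m title grams, each compared against k body grams: O(m * k) overall.
    return [mu(g, body_grams, cf) for g in title_grams]
```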
    The dominant sub-step in Step 9, which requires complexity O(m × k), involves com-
puting the sum of the word-correlation factors (pcf s, respectively) between the m different
n-grams (1 ≤ n ≤ 3) in the title of a Web document with respect to the k different n-grams
(1 ≤ n ≤ 3) in its body, i.e., Step 9.2.1.
    In the worst-case scenario, which occurs when both the degree of similarity, i.e., SimT B,
and the enhanced degree of similarity, i.e., EnSimT B, among the n-grams (1 ≤ n ≤ 3) in
the title and the body of D must be computed before SpamDe can determine whether D
is spam, the complexity of SpamDe is O(2 × (m × k)) = O(m × k).
    SpamDe was implemented in the Perl programming language on an Intel Dual Core
workstation with dual 2.66 GHz processors, 3 GB of RAM, and a 300 GB hard disk, running
the Windows XP operating system.


7    Conclusions and future work
We have presented a spam-document detection approach that can effectively identify spam
Web documents to aid search engines in performing intelligent searches. Our anti-spam
approach minimizes the user’s time in looking through documents that are deceitful and
do not contain useful information. In designing our anti-spam method, we consider (i) the
(enhanced) similarity measures of the n-gram (1 ≤ n ≤ 3) phrases in the title with respect
to the ones in the body of a Web document D, and (ii) the fraction of hidden content of D,
if necessary, to determine whether D is spam. Experiments conducted on two well-known
Web spam-detection datasets, i.e., WEBSPAM-UK2006 and WEBSPAM-UK2007, show
that our spam-detection approach classifies spam Web documents with 84.5% accuracy on
average. Moreover, our detection approach outperforms existing anti-spam approaches by
at least 3% in F -measure. Furthermore, our approach is computationally inexpensive, since
(i) the word-correlation factors used for computing the
phrase-correlation factors are precomputed and (ii) the computational time to calculate the
fraction of hidden content is insignificant.
    We have observed that the use of bigrams significantly enhances the performance of
our spam-document detection approach. Since the bigram-correlation values employed
in our approach are computed by using the unigram-correlation factors, it is our belief
that constructing a phrase-correlation matrix directly from the Wikipedia documents could
further enhance the performance of our approach in terms of (i) minimizing misclassified
spam Web documents and (ii) reducing the computational time required to determine the
(En)SimT B value, i.e., the (enhanced) similarity value between the title and body, of each
Web document to be examined.

