Docstoc

dias

Document Sample
dias Powered By Docstoc
					                                   Topic Segmentation

                                              Gaël Dias and Elsa Alves

                         Human Language Technology Interest Group
                                  Department of Computer Science
                                University of Beira Interior - Portugal

                                                          http://hultig.di.ubi.pt




Ljubljana, November 18                     JOTA - Faculty of Arts                   1
Guidelines
    Introduction
    TextTiling
    Lexical Cohesion Profile
    DotPlotting
    Link Set Median
    Problems
    Asas
    Conclusions
    Problems and Future Work
Ljubljana, November 18   JOTA - Faculty of Arts   2
Introduction


  The concept of Topic
Segmentation is the task
       of breaking
     documents into                                 Topic Segmentation
    topically coherent
     multi-paragraph
        subparts.




Ljubljana, November 18     JOTA - Faculty of Arts                  3
Introduction
           Topic Segmentation is important for applications like:

                         Information Retrieval
           Indeed, you should prefer a document in which the
          occurrences of a word are concentrated into one or two
          paragraphs since such a concentration is most likely to
                  contain a definition of what is the word.

                       Text Summarization
   Topic Segmentation is one of the three phases of Automatic
     Abstracting: Topic Segmentation, Sentence Extraction and
                      Sentence Compression.

Ljubljana, November 18             JOTA - Faculty of Arts           4
 TextTiling
      TextTiling (Hearst and Plaunt 1993) is one of the most famous
                      system for Topic Segmentation.

The basic idea of this algorithm is to search for parts of a text where
   the vocabulary shifts from one subtopic to another. These points
    are then interpreted as the boundaries of multi-paragraph units.

                          TextTiling is divided into 4 phases:
                                (1) Segmentation of the Text
                                    (2) Cohesion Scorer
                                      (3) Depth Scorer
                                   (4) Boundary Selector



 Ljubljana, November 18                   JOTA - Faculty of Arts          5
TextTiling: Segmentation

         Sentence length can vary considerably.
         Therefore the text is first divided into small
       fixed units (20 words), the token sequences.

     Indeed, you should prefer a document in which the occurrences of a
     word are concentrated into one or two paragraphs since such a
     concentration is most likely to contain a definition of what is the
     word. Topic Segmentation is one of the three phases of Automatic
     Abstracting: Topic Segmentation, Sentence Extraction and
     Sentence Compression. ….

    20 words                                                    gaps

Ljubljana, November 18            JOTA - Faculty of Arts                   6
TextTiling: Cohesion

The cohesion scorer measures the amount of “topic continuity”
    or cohesion at each gap, i.e. the amount of evidence that
   the same is prevalent on both sides of the gap. Intuitively,
    we want to consider gaps with low cohesion as possible
                      segmentation points.

       In fact, we want to infer the breaking points from the
        distribution of the words in the text i.e. when there is a
     change of topic, previous words seem to disappear from the
                             rest of the text.



Ljubljana, November 18          JOTA - Faculty of Arts               7
TextTiling: Cohesion




Ljubljana, November 18   JOTA - Faculty of Arts   8
TextTiling: Cohesion

    In order to calculate this cohesion, Textiling uses
    the cosine similarity measure where the weights
        of the words are the frequency in the text.
                                                      p

                                                     X       ik    Xjk
                     Sij  cos Xi, Xj             k 1
                                                p                      p

                                             k 1
                                                     Xik 2           
                                                                      k 1
                                                                             Xjk 2



Ljubljana, November 18                       JOTA - Faculty of Arts                  9
TextTiling: Cohesion

        The similarity calculation can be illustrated as
       the following figure where the x-axis is the gap
                            number.




Ljubljana, November 18        JOTA - Faculty of Arts       10
TextTiling: Depth Scorer
     The depth scorer assigns a depth score to each
     gap depending on how low its cohesion score is
           compared to the surrounding gaps.

           If cohesion at the gap is lower than at
      surrounding gaps, then the depth score is high.
       Conversely, if cohesion is about the same at
      surrounding gaps, then the depth score is low.

     However, this situation is not so linear. There
    may be slight shifts and big shifts. Only the latter
        may be considered as breaking points.
Ljubljana, November 18      JOTA - Faculty of Arts         11
TextTiling: Depth Scorer
                                                                s1        s3




                                                                     s2
                                                                g1 g2 g3

     Indeed, you should prefer a document in which the occurrences of a
     word are concentrated into one or two paragraphs since such a
     concentration is most likely to contain a definition of what is the
     word. Topic Segmentation is one of the three phases of Automatic
     Abstracting: Topic Segmentation, Sentence Extraction and
     Sentence Compression. ….
                    s3                                     s2             s1
Ljubljana, November 18            JOTA - Faculty of Arts                       12
TextTiling: Depth Scorer
       The gap depth score is computed by summing
        the heights of the two sides of the valley it is
                          located in.

               For instance, for text 1, we would have:
                     depth(g2)=(s1-s2)+(s3-s2).

        In a text with rapid fluctuations of topic
    vocabulary, only the most radical changes will be
      accorded the status of segment boundaries.


Ljubljana, November 18           JOTA - Faculty of Arts    13
TextTiling: Depth Scorer
                                                    s1
                                      s3                                  s5
                         s1
                                                                s3

                                                         s2




                              s2                                     s4


                         g1 g2 g3                     g1 g2 g3 g4 g5

                              text1                             text2

Ljubljana, November 18                 JOTA - Faculty of Arts                  14
TextTiling: Depth Scorer

            For a practical implementation, several
           enhancements of the basic algorithm are
          needed. First, we need to smooth cohesion
           scores to address situations like in text 2.
            Intuitively, the difference (s1-s2) should
              contribute to the depth score of g4.

      This is achieved by smoothing scores using a
     low pass filter. For example, the cohesion score
         si for gi is replaced by (si-1 + si + si+1)/3.

Ljubljana, November 18         JOTA - Faculty of Arts     15
TextTiling: Depth Scorer

                              s1               s2                s3


                         b1        b2                       b3        b4


                              g1               g2                g3



                 smoothing                                        smoothing
                                   s2=(s1+s2+s3)/3



Ljubljana, November 18                  JOTA - Faculty of Arts                16
TextTiling: Boundary Selector

    The boundary selector is the module that looks
   at the depth scores and selects the gaps that are
             the best segmentation points.

     The module estimates the average μ and the
      standard deviation σ of the depth scores and
    selects all gaps as boundaries that have a depth
         score higher that μ-c.σ where c is some
                        constant.


Ljubljana, November 18     JOTA - Faculty of Arts      17
Lexical Cohesion Profile

   The Lexical Cohesion Profile (LCP) has been proposed by
                       (Kozima, 1993).

   The basic idea is that the words in a segment are linked
    together via lexical cohesion relations. So, LCP records
        mutual similarity of words in a sequence of text.

The similarity of words, which represents their cohesiveness,
      is computed using a predefined semantic network
   automatically built from the LDOCE (English Dictionary).

Ljubljana, November 18      JOTA - Faculty of Arts              18
Lexical Cohesion Profile
       The basic idea is that when a block shifting from left to right
        looses in lexical cohesion, then, it should evidence a topic
                                 change.




Ljubljana, November 18            JOTA - Faculty of Arts                 19
Lexical Cohesion Profile
                         Some Results …

         Human Reader                                 LCP




Ljubljana, November 18       JOTA - Faculty of Arts         20
Dotplotting

    The Dotplotting methodology has been proposed for
      finding discourse boundaries by (Reynar 1994)

           The idea is based on lexical item repetition.

If a particular word appears in position x and y in a text,
    then 4 points corresponding to the Cartesian product
    should be plotted on the dotplot: (x,x) (x,y) (y,x) (y,y).


Ljubljana, November 18         JOTA - Faculty of Arts            21
Dotplotting
                                                      Text Segment = Lexical
                                                      Cohesion based on repetition




y



x




                         x   y

Ljubljana, November 18           JOTA - Faculty of Arts                              22
Dotplotting
     In order to find the |P| desired boundaries, we repeat the
               minimization of the following measure:




       where Va,b represents a vector containing the word counts
                         between a through b.

     This technique has the advantage to compare all possible
     segments with all the other ones and not just the surrounding
                                 ones.
Ljubljana, November 18           JOTA - Faculty of Arts              23
Dotplotting




Ljubljana, November 18   JOTA - Faculty of Arts   24
Link Set Median Procedure

 The idea of the Link Set Median (LSM), proposed
   by (Sardinha1999) is to look at the similarity
    between all the sentences with which each
      adjacent sentence shares lexical items.

 The set of sentences with which each sentence
    has links can be seen to form a link set.


Ljubljana, November 18   JOTA - Faculty of Arts     25
Link Set Median Procedure




Ljubljana, November 18   JOTA - Faculty of Arts   26
Link Set Median Procedure
                               ALGORITHM

1. Identify the links for all sentences in the text
2. Create the link sets
3. Compute the median for each link set
4. Calculate the median difference for each link set and its
predecessor
5. Compute the average (mean) median difference for the text
6. Compare each (link set) median difference to the (text) average
median difference
7. If the median difference is higher than the average, insert a
segment boundary
8. Locate the section boundaries in the text and disregard sections
starting with sentence 1
9. Compare segment and section boundaries

Ljubljana, November 18           JOTA - Faculty of Arts               27
Link Set Median Procedure




Ljubljana, November 18   JOTA - Faculty of Arts   28
Problems

    Lexical repetition shows reliability problems.

    Systems based on lexical cohesion use existing
     linguistic resources (dictionary, thesaurus,
     ontology) that are usually available only for
     dominating languages like English, French or
     German, and as a consequence do not apply to
     less favored languages.

Ljubljana, November 18   JOTA - Faculty of Arts       29
      Asas
          The idea of the Informative Similarity-based Topic
           Segmentation System (Asas), proposed by (Dias
                    and Alves, 2005) is to look at:


(1)     The importance of words in global context
(2)     The importance of words in local context
(3)     The global importance of words
(4)     The Informative Similarity between Sentences and
        Blocks of sentences
(5)     The Selection of Boundaries
      Ljubljana, November 18     JOTA - Faculty of Arts        30
Asas
              Text to Segment



                                                         Segmented Text


                                        Asas




           Texts of Context
Ljubljana, November 18          JOTA - Faculty of Arts                    31
Asas: Global Importance

First, it is necessary to find what are the important
  words in the text to segment. Only these should
 be taken into account. For that reason, we apply
    the well-known tf.idf measure introduced by
                    (Salton, 1975).

                                       tf w; d 
                     tf .idf w, d  
                                                            N
                                                   log 2
                                          |d |            df ( w)


Ljubljana, November 18                 JOTA - Faculty of Arts       32
Asas: Local importance
      Then, it is necessary to find what are the
    important words for the sake of segmentation.
    Indeed, if a word occurs in all the sentences of
          the text, it is of no use for the task of
     segmentation. For that reason, we apply the
    well-known tf.idf measure to sentences that we
                           call tf.isf.

                                           stf w; s 
                         tf .isf w, s  
                                                                 Ns
                                                        log 2
                                              |s|              sf ( w)

Ljubljana, November 18                      JOTA - Faculty of Arts       33
Asas: Local importance




                                              Useless for Topic Segmentation

Ljubljana, November 18   JOTA - Faculty of Arts                                34
Asas: Local Importance
Finally, the more dense a word is, the more important it
     is for the sake of segmentation. Indeed, if a word
   occurs many times in a small portion of the text, it is
      of great use for segmentation. For that purpose,
   (Dias and Alves, 2005) proposed a density measure
        based on the distance (in terms of words) of
             consecutive occurrences of a word.

                                 |w|1
                dens w, d     
                                                            1
                                 k 1
                                         lndist occur k , occur k  1  e 

Ljubljana, November 18                        JOTA - Faculty of Arts                 35
Asas: Local Importance


                                                              not so strong




                                                  From moon to star




Ljubljana, November 18   JOTA - Faculty of Arts                        36
Asas: Weight
So, in order to give a weight to each word for each
 sentence in the text to segment, we propose this
                   simple measure:


      weight w, d   tf .idf w, d   tf .isf w, s   dens w, d 


       where all measure have been normalized so that
           they can be joined into a single measure.


Ljubljana, November 18            JOTA - Faculty of Arts                  37
 Asas: Similarity
Unlike in TextTiling and LCP, we use the sentence as the basic unit of
      segmentation. The basic idea is to observe whether a given
    sentence is more similar than the previous group of sentences or
               more similar to the next group of sentences.

  Si-3                                                             Similarity
  Si-2                                                              Block 1
  Si-1
  Si
  Si+1                                                              Similarity
  Si+2                                                               Block 2
  Si+3


 Ljubljana, November 18         JOTA - Faculty of Arts                   38
Asas: Similarity
 For that purpose, we introduce an informative similarity measure
      (Dias and Alves, 2005) in order to avoid the need of lexical
                 databases like thesauri or dictionaries.
    The informative similarity measure is based on the cosine
      similarity measure but integrating any word co-occurrence
                    association measure in its body.

The cosine measure

                                 p

                                X     ik    Xjk                   Where:
Sij  cos Xi, Xj             k 1
                                                                    Xi=the vector representing a webpage
                          p                    p

                         
                         k 1
                                Xik 2       k 1
                                                     Xjk 2
                                                                    Xj=the vector representing the website
                                                                    Xik=the weight of word in index k in the vector Xi

Ljubljana, November 18                                 JOTA - Faculty of Arts                                       39
Asas: Similarity



                         (1) Ronaldo defeated the goalkeeper once more.
                              (2) Real Madrid striker scored again.




Ljubljana, November 18                    JOTA - Faculty of Arts          40
 Asas: Similarity
Each block and the sentence in focus are represented as vectors
        of weights and the association measure used is the
              Equivalence Index (Muller et al., 1997).

The informative similarity measure
                                                                 p       p

                                                                X           ik    Xjl  EI Wik ,Wjl 
 Sij  infosimba Xi, Xj                                      k 1 l 1
                                          p      p                                           p      p

                                          Xik  EI Wik,Wil 
                                         k 1 l  k 1
                                                                                    2
                                                                                             Xjk  EI Wjk,Wjl 
                                                                                            k 1 l  k 1
                                                                                                                    2




                                                       f w, q 
                                                                     2
 where,          EI w, q   pw | q   pq | w 
                                                     f w  f q 

 Ljubljana, November 18                                  JOTA - Faculty of Arts                                         41
Asas: Selection of Boundary
     In order to compare the similarity between blocks and the
         sentence in focus, we propose the following solution.


                                           infosimba  Xi, Xi  1
                         ps  Xi   log
                                           infosimba  Xi, Xi  1


          So, if Xi is more similar to Xi-1, ps(Xi) will give a positive
          number. On the contrary, if Xi is more similar to Xi-2, ps(Xi)
          will give a negative number. In the case, where it is similar
                   to both blocks, the value of ps(Xi) will be 0.


Ljubljana, November 18                      JOTA - Faculty of Arts         42
Asas: Selection of Boundary
                         Results with Equivalence Index




Ljubljana, November 18              JOTA - Faculty of Arts   43
Asas: Selection of Boundary
   Each time the link value goes from positive to
    negative between two consecutive sentences,
       there exits a topic shift. We will call this
              phenomenon a downhill.




                                                      downhills



Ljubljana, November 18   JOTA - Faculty of Arts           44
Asas: Selection of Boundary

       A downhill is simply defined as follows
     whenever the value of the ps score goes from
     positive to negative between two consecutive
                  sentences Xi and Xi+1.


             downhillXi, Xi  1  psXi   psXi  1



Ljubljana, November 18          JOTA - Faculty of Arts     45
Asas: Selection of Boundary

      Once all downhills in the text have been
     calculated, their mean and standard deviation
     are evaluated. The topic boundaries are then
     elected if they satisfy the following constraint


                         downhillXi, Xi  1  x  2



Ljubljana, November 18                JOTA - Faculty of Arts   46
Asas: Flexibility
         This architecture has the advantage to accept different
        association measures so that better tuning can be obtained.

And by choosing different context length (one word, k words, k
     sentences, k paragraphs, k texts) for the calculation of
      association measures between two words, different
       applications can be obtained for our architecture:

    -    Segmenting different news articles from a list of articles
                            (larger context).
   -     Segmenting a technical text about one topic where each
             segment is about a subtopic (small context).

Ljubljana, November 18             JOTA - Faculty of Arts             47
Conclusion

    Language-independent unsupervised Topic
     Segmentation system based on word-co-occurrence.

    Avoids the accessibility to existing linguistic resources
     such as electronic dictionaries or lexico-semantic
     databases such as thesauri or ontology.

    Solves the problems evidenced systems based uniquely
     on lexical repetition that show reliability problems.


Ljubljana, November 18      JOTA - Faculty of Arts               48
Problems and Future Work

    Existence of three main parameters: the block size, the window size to
     calculate the association measure and the topic discovery threshold.

    Exhaustive Evaluation: different association measures, different similarity
     measures, different languages etc …

    Comparison with systems that use linguistic resources.

    More work must be done on the automatic boundary detection algorithm.
     We are convinced that better algorithms may be proposed based on the
     transformation of the representation of ps function into a graph or network.

    The system will be downloadable at http://asas.di.ubi.pt/ under GPL License
     when completely tested.


Ljubljana, November 18             JOTA - Faculty of Arts                           49
The End


             Thanks for your attention and patience 




Ljubljana, November 18       JOTA - Faculty of Arts     50