Extracting Meaningful Insights From Blogs by shoppingdiscussions


									                                                        1. INTRODUCTION

One of the main objectives is to automate the summarisation of large texts, in this case
blogs, and to make the summaries as structured and meaningful as those written manually. This will
allow us to gain useful insights from blogs without having to read all the text
ourselves. The comments posted in response to a blog can also be analysed to determine
the mindset and opinions of readers. Another issue is the classification of blogs
according to their content: blogs have to be tagged by genre.
Typical text mining tasks include
                     text clustering,
                     text categorization,
                     concept/entity extraction,
                     sentiment analysis,
                     and document summarization.

    Develop an automated multi-document summarization system that extracts
       information from a collection of documents related to a topic.

       Input: Collection of blogs
       Output: Summary of the topic

1.2 Background
Much research has been conducted in the area of automatic text summarization.
Specifically, research using lexical chains and related techniques has received much attention.
Early methods using word frequency counts did not consider the relations between
similar words. Finding the aboutness of a document requires finding these relations. How
these relations occur within a document is referred to as cohesion (Halliday and Hasan,
1976). First introduced by Morris and Hirst (1991), lexical chains represent lexical
cohesion among related terms within a corpus. These relations can be recognized by
identifying arbitrary size sets of words which are semantically related (i.e., have a sense
flow). These lexical chains provide an interesting method for summarization because
their recognition is easy within the source text and vast knowledge sources are not
required in order to compute them.

1.3 Objectives:

    1. To develop an automated multi-document summarization system, that extracts
        information from a collection of documents related to a topic.
    2. To extract meaningful insights from the summarization of different kinds of blogs
        available on the WWW.
    3. To analyse the results obtained from the given algorithm and determine its
    4. To design a reasonably domain independent system which outputs a short
        summary based on the most salient concepts from the original document.
    5. To design a system in which the length of the extracted summary can be either
        controlled automatically or manually based on length or percentage of

1.4 Motivation:

Researchers and students constantly face the same problem: it is almost impossible to read
most, let alone all, of the newly published papers to stay informed of the latest progress, and
when they work on a research project, the time spent reading for the literature review seems
endless. The goal of this
project is to design a domain-independent, automatic text extraction system to alleviate, if
not totally solve, this problem.

We score the sentences in the given text linguistically and generate a summary
comprising the highest-scoring ones. The program takes its input from a text
file and writes the summary to a similar text file.
The most daunting task was to devise an efficient scoring algorithm that would
produce good results for a wide range of text types. The only way to arrive at it was
to summarize texts manually and then examine the chosen sentences for common traits, which were then
encoded in the program.

1.5 Structure of the report:
Chapter 2 presents the literature survey, Chapter 3 deals with system design, Chapter 4
covers detailed system design, Chapter 5 discusses software implementation,
Chapter 6 covers testing, Chapter 7 covers deployment and maintenance, Chapter 8 gives the conclusion and
scope, and Chapter 9 lists the references.

                                               2. LITERATURE SURVEY

Early ideas for systems that automatically condense or summarize documents date
back to the 1950s (Luhn, 1958). Recently, there has been a shift in the Information
Extraction paradigm (Kuhn 1970). Experts are slowly moving away from
research motivated by linguistic theory and towards applied technology.
       Research on IBM's statistical machine translation (MT) system (Brown et al., 1990)
showed that, contrary to predictions, a system which employed essentially no
linguistic knowledge not only worked, but did surprisingly well. At the same time, many
corpus-based methods and approaches were developed and implemented, such as
automatic part-of-speech (POS) tagging.
       Information Extraction is a relatively new field in NLP. Its prime objectives are
       i)      to divide text into “relevant” and “irrelevant” sections (filtering)

       ii)     to fill pre-specified templates with information extracted from the text

       These filled templates can be entered in standard relational databases, to serve as
automatic triggers for informing human data analysts of events of interest or to provide a
basis for generating text abstracts or condensed versions of source texts.

2.1 Lexical Chaining
A lexical chain is a way to represent the lexical cohesion present in a text: it is
a list (group) of semantically related words within the text. Because lexical chains group
related words, they indicate the different concepts
discussed in the text. In addition, the beginning of a new chain indicates the beginning of a new
topic, and the end of a chain indicates the end of that topic.
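
As a toy illustration of chains marking where a topic begins and ends, the sketch below groups words by concept and records the first and last sentence in which each chain occurs. The table `CONCEPT` and the function `chain_spans` are hypothetical; the table merely stands in for a real thesaurus lookup.

```python
# Hypothetical word-to-concept table standing in for a thesaurus lookup.
CONCEPT = {
    "car": "vehicle", "wheel": "vehicle", "engine": "vehicle",
    "bank": "finance", "loan": "finance", "money": "finance",
}


def chain_spans(sentences):
    """Map each concept to (chain words, first sentence index, last sentence index)."""
    spans = {}
    for idx, sentence in enumerate(sentences):
        for token in sentence.lower().split():
            word = token.strip(".,")
            concept = CONCEPT.get(word)
            if concept is None:
                continue  # word belongs to no chain in this toy table
            words, first, _ = spans.get(concept, ([], idx, idx))
            spans[concept] = (words + [word], first, idx)
    return spans


text = [
    "The car needs a new wheel.",
    "Its engine is fine.",
    "The bank approved the loan.",
    "The money arrived.",
]
spans = chain_spans(text)
```

In this example the "vehicle" chain spans sentences 0 to 1 and the "finance" chain spans sentences 2 to 3, so the chain boundary coincides with the change of topic.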
       With the availability of machine-readable thesauri such as WordNet and Roget’s,
the computation of lexical chains and its application to different text processing problems
have gained significant interest. Cohesive units found by lexical chains are useful in
different natural language processing applications such as text segmentation, text
summarization, multi-document summarization, information retrieval and so on.
       After Morris and Hirst (1991) introduced lexical chaining for discourse analysis,
it became one of the basic techniques for
addressing various problems in text analysis. Starting from early research on
natural language processing problems, the technique was later explored as a
knowledge-based approach to the traditional text summarization task, and it was used to solve
effectively a few text analysis problems that are otherwise solved by traditional text summarization
techniques.
       One interesting application of lexical chaining is that lexical chains can be
used as an alternative scoring mechanism based on the semantics of the text rather than on the
traditional word-frequency-based vector space model of text summarization. The lexical
chaining technique has been tested in applications such as word sense
disambiguation, text segmentation and text summarization.
       Though computationally (CPU time) intensive, the technique is robust enough to
be applied successfully to the problems mentioned above and to many other text analysis
problems.
Some of the important applications of the lexical chaining are:
      Lexical chains for text segmentation/topic boundary detection

      Lexical cohesion for detection of malapropism

      Generation of intra-document and inter-document links

      Text summarization

There are three approaches for text summarization using lexical chaining.

2.1.1 First approach
       This approach assumes that the most important (strongest) chain among all chains represents
the summary. That is, of all the concepts, the most prevalent one is treated as the
summary, and appropriate text representing that concept is chosen as the summary.
How to decide the chain strength:
Length based
Cohesion based (Stokes; Barzilay also uses the same notion but with a different formula)

This approach is suitable for generating headlines. It can be tuned to generate a long
summary. Some concepts are likely to be missed: they might not be important for
the text but could be a valid part of the summary.
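
A minimal sketch of the length-based variant of this approach, assuming chains are already available as plain lists of words (one entry per occurrence). The function names `strongest_chain` and `headline` are illustrative, and choosing the sentence with the most chain members is an assumption rather than a rule stated in the text.

```python
def strongest_chain(chains):
    """Length-based strength: the chain with the most word occurrences wins."""
    return max(chains, key=len)


def headline(sentences, chains):
    """Pick the sentence containing the most members of the strongest chain."""
    members = set(strongest_chain(chains))

    def hits(sentence):
        tokens = (t.strip(".,") for t in sentence.lower().split())
        return sum(1 for t in tokens if t in members)

    return max(sentences, key=hits)


sentences = [
    "The car has a new engine and wheel.",
    "The bank gave a loan.",
    "A car is useful.",
]
chains = [["car", "engine", "wheel", "car"], ["bank", "loan"]]
best = headline(sentences, chains)
```

Here the vehicle chain is the longest, so the first sentence, which contains three of its members, is returned as the headline.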

2.1.2 Second approach
       To summarize the text, consider all the chains. That is, all the
concepts discussed in the text are considered (irrespective of their importance) and
appropriate representative text for each concept is chosen as the summary. It can be tuned to
produce a short summary. All the concepts are represented, hence it is a true summary. Chain
pruning may be necessary; otherwise single-word chains and other spurious chains will play a part
in the summarization.

2.1.3 Third approach
       Rank the chains by importance and choose a number of chains, based on
some threshold, for the summary. That is, out of the many concepts appearing in the text,
prioritize the concepts to be considered for the summary.
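
The ranking step might be sketched as follows. Using chain length as the importance measure and the mean length as the default threshold are both assumptions made for illustration; the text does not fix either choice.

```python
def select_chains(chains, threshold=None):
    """Rank chains by length and keep those at or above the threshold.

    When no threshold is given, the mean chain length is used (an assumption).
    """
    if not chains:
        return []
    if threshold is None:
        threshold = sum(len(c) for c in chains) / len(chains)
    ranked = sorted(chains, key=len, reverse=True)
    return [c for c in ranked if len(c) >= threshold]


chains = [["bank", "loan"], ["car", "engine", "wheel", "car"], ["tree"]]
default_pick = select_chains(chains)
loose_pick = select_chains(chains, threshold=2)
```

Raising the threshold moves this approach towards the first one (only the strongest chain), while lowering it moves towards the second (all chains).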

2.1.4 A Comparison of the approaches
Implications of these three different strategies for chaining are important with respect to
chaining dynamics. They are discussed below:
       The last strategy will create fewer chains than any other
strategy. Because a queue of all the words in the text is passed
through the XSR (extra-strong), SR (strong) and MSR (medium-strength) relation searches before a new chain is generated, there is only
one chain in the stack to start with. Even if chains are later added to the stack, searching
through the chain stack is redundant, as all the possibly related words for the
earlier chain (or chains) have already been checked throughout the text. The net effect is that at any given time only one
chain is active in seeking words. The chains are created one by one
(one loop of XSR, SR and MSR relations completes one chain, and so on) and
their order never changes.

2.2 Cohesion, Coherence and Discourse Analysis
When reading any text it is obvious that it is not merely made up of a set of unrelated
sentences, but that these sentences are in fact connected to each other through
two linguistic phenomena, namely cohesion and coherence.

        As Morris and Hirst (1991) point out, cohesion relates to the fact that the elements
of a text (e.g. clauses) ‘tend to hang together’; while coherence refers to the fact that
‘there is sense (or intelligibility) in a text’.

        Observing the interaction between textual units in terms of these properties is one
way of analysing the discourse structure of a text. Most theories of discourse result in a
hierarchical tree-like structure that reflects the relationships between sentences or clauses
in a text. These relationships may, for example, highlight sentences in a text that
elaborate, reiterate or contradict a certain theme. Meaningful discourse analysis like
this requires a true understanding of textual coherence which in turn often involves
looking beyond the context of the text, and drawing from real-world knowledge of events
and the relationships between them.

        Hasan, in her paper on ‘Coherence and Cohesive Harmony’ (1984), hypothesises
that the coherence of a text can be indirectly measured by analyzing the degree of
interaction between cohesive chains in a text. Analysing cohesive relationships in this
manner is a more manageable and less computationally expensive solution to discourse
analysis than coherence analysis.

        For example, Morris and Hirst (1991) note that, unlike research into cohesion,
there has been no widespread agreement on the classification of different types of
coherence relationships. Furthermore, they note that even humans find it more difficult to
identify and agree on textual coherence because, although identifying cohesion and
coherence are subjective tasks, coherence requires a definite ‘interpretation of meaning’,
while cohesion requires only an understanding that terms are about ‘the same thing’.

2.3 Semantic Networks, Thesauri and Lexical Cohesion
2.3.1 Longmans Dictionary of Contemporary English
The Longmans Dictionary of Contemporary English (LDOCE) was the first available
machine-readable dictionary. Its popularity as a lexical resource can also be attributed to
the simplicity of its design, since it was created with non-native English speakers in
mind. More specifically, the dictionary was written so that all gloss definitions were
described with respect to a controlled vocabulary of 2,851 words referred to as the
Longmans Defining Vocabulary (LDV).

       In their paper, which looks at calculating the similarity between words based on
spreading activation in a dictionary, Kozima and Furugori (1993a) took advantage of this
design feature, and generated a semantic network from it using the gloss entries of a
subset of the words in the dictionary. This subset of the dictionary is called Glossème and
consists of all words included in the LDV.

2.3.2 Roget’s Thesaurus
Roget’s Thesaurus is one of a category of thesauri (like the Macquarie Thesaurus) that
were custom built as aids to writers who wish to replace a particular word or phrase with
a synonymous or near-synonymous alternative. Unlike a dictionary, they contain no
gloss definitions, instead they provide the user with a list of possible replacements for a
word and leave it up to the user to decide which sense is appropriate. The structure of the
thesaurus provides two ways of accessing a word:
1. By searching for it in a list of over 1,042 pre-defined categories, e.g. Killing,
Organisation, Amusement, Physical Pain, Zoology, Mankind, etc.
2. By searching for the word in an alphabetical index that lists all the different categories
in which the word occurs, i.e. analogous to defining sense distinctions.

         Lexical cohesive relationships between words can also be determined using a
resource of this nature, since words that co-exist in the same category are semantically
related. There is a hierarchical structure above (classes, sub-classes and sub-sub-classes)
and below (sub-categories) these categories in the thesaurus, and it is this structure that
facilitates the inferring of a small range of semantic strengths between words; however,
unlike LDOCE, ‘no numerical value for semantic distance can be obtained’ (Budanitsky,

2.3.3 WordNet
WordNet (Miller et al., 1990; Fellbaum, 1998a) is an online lexical database whose
design is inspired by current psycholinguistic theories of human lexical memory.
         WordNet is divided into 4 distinct word categories: nouns, verbs, adverbs and
adjectives. The most important relationship between words in WordNet is synonymy.
The WordNet definition of synonymy also includes near-synonymy. Hence, WordNet
synonyms are only interchangeable in certain contexts (Miller, 1998). A unique label
called a synset number identifies each synonymous set of words (a synset) in WordNet.
Each node or synset in the hierarchy represents a single lexical concept and is linked to
other nodes in the semantic network by a number of relationships. Different relationships
are defined between synsets depending on which semantic hierarchy they belong to. For
example, most verbs are organised around entailment (synonymy and a type of verb
hyponymy called troponymy (Fellbaum, 1998b)), adjectives and adverbs around
antonymy (opposites such as big-small and beautifully-horribly) and synonymy.
         Nouns, on the other hand, are predominantly related through synonymy and
hyponymy/hypernymy. In addition, 9 other lexicographical relationships are also defined
between nodes in the noun hierarchy.
WordNet Noun Relationship: Example
Hyponymy (KIND_OF)
    Specialisation: apple is a hyponym of fruit since apple is a kind of fruit.
Hypernymy (IS_A)
    Generalisation: celebration is a hypernym of birthday since birthday is a type of celebration.
Holonymy (HAS_PART)
    HAS_PART_COMPONENT: tree is a holonym of branch.
    HAS_PART_MEMBER: church is a holonym of parishioners.
    IS_MADE_FROM_OBJECT: tyre is a holonym of rubber.
Meronymy (PART_OF)
    OBJECT_IS_PART_OF: leg is a meronym of table.
    OBJECT_IS_A_MEMBER_OF: sheep is a meronym of flock.
    OBJECT_MAKES_UP: air is a meronym of atmosphere.
Antonymy (OPPOSITE_OF)
    Girl is an antonym of boy.
        Unfortunately, there is little interconnectivity between the noun, verb, adverb and
adjective files in the WordNet taxonomy: the verb file has no relations with any of the
other files, the adverb file has only unidirectional relations with the adjective file, and
there are only a limited number of ‘pertains to’ relationships linking adjectives to nouns.
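
The noun relationships listed above come in inverse pairs (hyponymy/hypernymy and meronymy/holonymy). The following sketch shows this with two small hand-made tables built from the examples in the text; `HYPERNYM`, `PART_OF` and `relation` are hypothetical names and do not correspond to the WordNet API.

```python
# Hypothetical fragment of a noun hierarchy, using the examples from the text.
HYPERNYM = {"apple": "fruit", "birthday": "celebration"}        # word -> generalisation
PART_OF = {"branch": "tree", "leg": "table", "sheep": "flock"}  # part -> whole


def relation(a, b):
    """Return the relationship of a to b, or None if the tables hold none."""
    if HYPERNYM.get(a) == b:
        return "hyponym"    # a is a kind of b
    if HYPERNYM.get(b) == a:
        return "hypernym"   # b is a kind of a
    if PART_OF.get(a) == b:
        return "meronym"    # a is part of b
    if PART_OF.get(b) == a:
        return "holonym"    # b is part of a
    return None


def hyponyms(word):
    """All words whose generalisation is `word` (the inverse of HYPERNYM)."""
    return sorted(w for w, h in HYPERNYM.items() if h == word)
```

Storing only one direction of each pair and deriving the other keeps the two views consistent, which is one simple way to model such a hierarchy.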

2.4 Morris and Hirst: The Origins of Lexical Chain Creation
Morris and Hirst (1991) used lexical chains to determine the intentional structure of a
discourse using Grosz and Sidner's discourse theory (1986). In this theory, Grosz and Sidner
propose a model of discourse based on three interacting textual elements: linguistic
structure (segments indicate changes in topic), intentional structure (how segments are
related to each other), and attentional structure (shifts of attention in text that are based on
linguistic and intentional structure). Obviously, any attempt to automate this process will
require a method of identifying linguistic segments in a text. Morris and Hirst believed
these discourse segments could be captured using lexical chaining, where each segment is
represented by the span of a lexical chain in the text.

        Morris and Hirst manually generated lexical chains using Roget’s Thesaurus,
which consists of an index entry for each word that lists synonyms and near-synonyms
for each of its coarse-grained senses followed by a list of category numbers that are
related to these senses. A category in this context consists of a list of related words and
pointers to related categories. They used the following rules to glean semantic
associations from the thesaurus during the chain generation process, where two words are
related if any of the following relationship rules apply:
1. They have a common category in their index entries.
2. One word has a category in its index entry that contains a pointer to a category of the
other word.
3. A word is either a label in the other word’s index entry or it is listed in a category of
the other word.
4. Both words have categories in their index entries that are members of the same
5. Both words have categories in their index entries that point to a common category.
Morris and Hirst also introduced a general algorithm for generating chains, on which
most other chaining implementations are based.
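
A rough sketch of how rules 1, 2 and 5 above could be checked against a miniature thesaurus. The tables `INDEX` and `POINTERS` are invented for illustration and are not taken from Roget’s Thesaurus; rules 3 and 4 are left out because they need category word lists and groupings that this toy data does not model.

```python
# Hypothetical miniature thesaurus: index entries (word -> category numbers)
# and category pointers (category -> categories it points to).
INDEX = {"car": {1}, "truck": {1}, "road": {2},
         "driver": {3}, "pilot": {4}, "poem": {9}}
POINTERS = {1: {2}, 3: {5}, 4: {5}}


def pointed_to(categories):
    """All categories that any of the given categories point to."""
    return set().union(*(POINTERS.get(c, set()) for c in categories))


def related_rule(a, b):
    """Return the number of the first rule (1, 2 or 5) relating a and b, else None."""
    cats_a, cats_b = INDEX.get(a, set()), INDEX.get(b, set())
    if cats_a & cats_b:
        return 1  # rule 1: a common category in their index entries
    ptr_a, ptr_b = pointed_to(cats_a), pointed_to(cats_b)
    if ptr_a & cats_b or ptr_b & cats_a:
        return 2  # rule 2: a category of one word points to a category of the other
    if ptr_a & ptr_b:
        return 5  # rule 5: categories of both words point to a common category
    return None
```

For example, "car" and "truck" share category 1 (rule 1), while "driver" and "pilot" are related only because both of their categories point to category 5 (rule 5).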

2.5 Greedy WordNet-based Chaining Algorithms
The generic algorithm is an example of a greedy chaining algorithm where each addition
of a term t_i to a chain c_m is based only on those words that occur before it in the text. A
non-greedy approach, on the other hand, postpones assigning a term to a chain until it has
seen all possible combinations of chains that could be generated from the text. One might
expect that there is only one possible set of lexical chains that could be generated for a
specific text, i.e. the correct set. However, in reality, terms have multiple senses and
could be added into a variety of different chains.
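
The greedy behaviour can be illustrated with a toy sense inventory. `SENSES` is a hypothetical table, and falling back to the first listed sense when no existing chain matches is an assumption of this sketch; the point is only that each decision uses the words seen so far.

```python
# Hypothetical sense inventory: word -> candidate senses, in listed order.
SENSES = {
    "jaguar": ["animal", "vehicle"],
    "engine": ["vehicle"],
    "road": ["vehicle"],
}


def greedy_chains(words):
    """Assign each word to a chain using only the chains built from earlier words."""
    chains = {}  # sense label -> words chained under that sense
    for word in words:
        senses = SENSES.get(word, [])
        if not senses:
            continue  # unknown word: not chained
        # Prefer a sense that already has a chain; otherwise commit to the first sense.
        sense = next((s for s in senses if s in chains), senses[0])
        chains.setdefault(sense, []).append(word)
    return chains


early = greedy_chains(["jaguar", "engine", "road"])
late = greedy_chains(["engine", "jaguar", "road"])
```

When "jaguar" comes first it is committed to the "animal" sense before any vehicle words are seen, whereas when it follows "engine" it joins the vehicle chain. The result therefore depends on word order.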

2.6 Non-Greedy WordNet-based Chaining Algorithms
As stated previously, a non-greedy approach to lexical chaining postpones resolving
ambiguous words in a text until it has analysed the entire context of the document.
       Barzilay and Elhadad (1997) were the first to discuss the advantages of a non-
greedy chaining approach. They argued that disambiguating a term after all possible links
between it and the other candidate terms in the text have been considered was the only
way to ensure that the optimal set of lexical chains for that text would be generated. In
other words, the relationships between the terms in each chain will only be valid if they
conform to the intended interpretation of the terms when they are used in the text. For
example, chaining ‘jaguar’ with ‘animal’ is only valid if ‘jaguar’ is being referred to in
the text as a type of cat and not a type of car.

Barzilay and Elhadad

Barzilay and Elhadad (1997) were the first to coin the phrase ‘a non-greedy or dynamic
solution to lexical chain generation’. They proposed that the most appropriate sense of a
word could only be chosen after examining all possible lexical chain combinations that
could be generated from a text. Their dynamic algorithm begins by extracting nouns and
noun compounds.

        Barzilay reduces both non-WordNet and WordNet noun compounds to their head
noun e.g. ‘elementary_school’ becomes ‘school’. As each target word arrives, a record
of all possible chain interpretations is kept and the correct sense of the word is decided
only after all chain combinations have been completed.
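
A simplified sketch of the non-greedy idea: every combination of senses is enumerated, and the interpretation with the most within-chain links is kept. The table `SENSES` is hypothetical, and scoring an interpretation by its number of same-chain word pairs is an assumption made for illustration, not Barzilay and Elhadad's actual scoring.

```python
from itertools import product

# Hypothetical sense inventory: word -> candidate senses.
SENSES = {
    "jaguar": ["animal", "vehicle"],
    "engine": ["vehicle"],
    "road": ["vehicle"],
    "school": ["institution"],
}


def head_noun(term):
    """Reduce a compound such as 'elementary_school' to its head noun."""
    return term.split("_")[-1]


def best_interpretation(terms):
    """Try every sense assignment and keep the one with the most chain links."""
    words = [head_noun(t) for t in terms if head_noun(t) in SENSES]
    best, best_links = {}, -1
    for assignment in product(*(SENSES[w] for w in words)):
        chains = {}
        for word, sense in zip(words, assignment):
            chains.setdefault(sense, []).append(word)
        # Number of word pairs that share a chain under this interpretation.
        links = sum(len(c) * (len(c) - 1) // 2 for c in chains.values())
        if links > best_links:
            best, best_links = chains, links
    return best


chosen = best_interpretation(["jaguar", "engine", "road"])
```

Unlike a greedy pass, "jaguar" is placed in the vehicle chain here even though it is the first word, because its sense is decided only after all combinations have been compared. The cost is that the number of combinations grows multiplicatively with the number of ambiguous words.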


3.1     Introduction

3.1.1 Project Scope

The project aims to implement an effective algorithm that will enable summarization of
blogs and comments. It basically filters the text in order to extract relevant information
that will be useful in analysis.

        The software will take text from blogs as input and will extract the different
concepts present in the text. It will then grade each concept and determine the central
concept and other important concepts depicted by the blog. The summary will cover all
the important concepts with due weightage to the central concept.

The main tasks of an IE system are:
                1. Indexing the information items, i.e. assigning a number of
                    discriminative keywords (index words) to all items and possibly also
                    weighting them.
                2. Mapping the user’s request onto an internal representation language.

3.1.2 User Classes and Characteristics

Since it is a web based application there are no restrictions in terms of location and
hardware or software in use.

The application will be used by:
       People who want short and concise information on a particular topic
       Companies and businesses that want to analyse their product sales
      Testers who will conduct timely checks for performance parameters
      System designers who will keep upgrading the algorithm

3.1.3 Operating Environment

The application can run on any system with Web access for obtaining input in the form of
blogs and a database to store blogs and data from social networking sites.

The software shall be designed to run on at least one of the following platforms:

              Microsoft Windows: Windows implementations shall be portable to all
               versions of Windows up to and including Windows XP.
              UNIX: UNIX implementations shall be portable to any version of UNIX
               that supports the user interface libraries.
              Linux: Linux implementations shall run on at least one version of Linux.

3.1.4 Design and Implementation Constraints

           1. Web browser support
               The system shall support users using Microsoft IE version 4 or higher,
               Netscape Navigator version 4 or higher, Mozilla Firefox version 1.4 or
               higher or Google Chrome.

           2. Internationalization/ Localization
               The system uses internationalization techniques that allow it to be
               configured for natural languages, and shall use UTF-32 or a multi-width
               character encoding.

           3. User Interface

               The product must have a user friendly interface that is simple enough for
               all types of users to understand.

           4. Response Time
               Because text analysis involves heavy computation, the system's processing ability
               places a constraint on response time. An unreliable internet
               connection may also limit the interaction between the user and the system.

           5. Backend Management
           The external database servers should be updated regularly. This updating and
           replication of data from central database server can introduce additional
           latency in the working of the system.

           6. Database Management
               The replication from central server to the backup server has to be
               asynchronous as these solutions also provide a greater amount of
               protection by extending the distance between the primary and secondary
               locations of data.

3.1.5 Assumptions and Dependencies

1. The system designers can assume that users of the system are using Microsoft Internet
Explorer version 4 or above or Netscape Navigator version 4 or above or Mozilla Firefox
or Google Chrome.

2. The system designers can assume that the web server, application server and the
database server software and hardware are available to support the requirements cited in
the supplementary specifications.

3. The system designers can assume that necessary system analysis, design,
implementation and test tools will be available.
4. The system designers can assume that the input will be readily available in the form of
blogs and comments or any other text.

3.2       System Features

This system has a direct application in text summarization.

3.2.1 Text Summarization

Summarization considering chain strength
A gist is a very short summary of a text, ranging from a phrase to a sentence. Such a short
summary generally indicates the essence of the given text and can be considered
a headline for it. A summary this short is very useful when broadcasting news.
Stokes (2004) uses lexical cohesion to generate very short summaries of news articles.
The methodology is as follows:
      1. Generate the lexical chains for the given news article.
      2. Identify the highest scoring noun and proper noun chains using the following
         formula:

         \[ \mathrm{score}(\mathrm{chain}) = \sum_{i=1}^{n} \left( \mathrm{reps}(i) + \mathrm{rel}(i, j) \right) \]

                 i = current word of the chain
                 n = number of words in the chain, i.e. the length of the chain in terms of words
                 reps(i) = number of repetitions of word i
                 rel(i, j) = strength or weight of the relation between i and j
                 weights for the relations:
                            o Extra strong = 1.0
                            o Strong relation (synonym) = 0.9
                            o Strong relation (hypernym, hyponym, meronym,
                            o medium strength relation = 0.4
                            o statistical relation = 0.4

   3. Select the highest scoring proper noun chain and noun chain. Select all the chains
          sharing the same highest score in case there is more than one chain with the highest
          score.
   4. The chain so selected is the most important chain for the given text, and the words
          in the chain represent the most important nouns and proper nouns (key words) for the
          given text.
   5. The next step is to score the sentences in the text according to the key words using
          the following formula:

          \[ \mathrm{Score}(\mathrm{sentence}) = \sum_{i=1}^{N} \mathrm{score}(\mathrm{chain})_i \]

                   N = number of words in the sentence
                   score(chain)_i = 0 if the i-th word does not belong to the key chain;
                   if the word belongs to the key chain, it equals the score of the chain
                   calculated earlier.
                   The intuition here is that the more words in the sentence belong
                   to the key chain, the more important the sentence is.
   6. Score all the sentences accordingly and choose the highest scoring sentence as the
          gist/headline.

Summarization using all the chains
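
Before the all-chains strategy is described, the chain-strength gisting steps (1 to 6) can be sketched as follows. This is a simplified illustration: chains are supplied by hand as (word, repetitions, relation) triples, noun and proper noun chains are pooled together, and `REL_WEIGHT` contains only the weights that the text states in full. All names are hypothetical.

```python
# Relation weights as listed in the text; the weight for the
# hypernym/hyponym/meronym group is not given there, so it is omitted here.
REL_WEIGHT = {"extra_strong": 1.0, "synonym": 0.9, "medium": 0.4, "statistical": 0.4}


def chain_score(chain):
    """Sum reps(i) + rel(i, j) over the (word, reps, relation) triples of a chain."""
    return sum(reps + REL_WEIGHT[relation] for _, reps, relation in chain)


def sentence_score(sentence, key):
    """Add the key-chain score for every word of the sentence found in `key`."""
    words = (t.strip(".,").lower() for t in sentence.split())
    return sum(key.get(w, 0.0) for w in words)


def gist(sentences, chains):
    """Return the sentence that scores highest against the top-scoring chain(s)."""
    scores = [chain_score(c) for c in chains]
    top = max(scores)
    # Key words map to the score of their chain; ties for the top score are all kept.
    key = {w: s for c, s in zip(chains, scores) if s == top for w, _, _ in c}
    return max(sentences, key=lambda s: sentence_score(s, key))


chains = [
    [("car", 2, "extra_strong"), ("automobile", 1, "synonym")],
    [("bank", 1, "extra_strong"), ("money", 1, "medium")],
]
sentences = ["The bank lent money.", "The car is an automobile with a car engine."]
headline = gist(sentences, chains)
```

In this example the first chain scores (2 + 1.0) + (1 + 0.9) = 4.9 against 3.4 for the second, so the sentence with three key-word occurrences is chosen as the gist.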
The alternative strategy for summarization is given by Barzilay (1997):
Heuristic 1:
The intuition is that, because the beginning sentence of a particular chain introduces the concept
identified by the chain, that sentence will have the necessary information to summarize the
concept. Additionally, other sentences can be seen as text repeating the concept in
the sentence where the particular chain began. Hence the starting sentence is a candidate
for representing the summary of the concept represented by a particular chain. To
summarize the given text:
      1. Generate the lexical chains
      2. For each chain, the sentence at which the chain begins is chosen as a part of
         the summary
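
A minimal sketch of Heuristic 1, assuming each chain is given as a list of (word, sentence index) pairs. The function name is hypothetical, and returning the chosen sentences in document order is a presentation choice of this sketch.

```python
def summarize_first_sentences(sentences, chains):
    """Pick, for every chain, the sentence in which the chain begins."""
    starts = {min(idx for _, idx in chain) for chain in chains}
    return [sentences[i] for i in sorted(starts)]  # document order, no duplicates


sentences = ["S0 about cars.", "S1 about banks.", "S2 more cars.", "S3 more banks."]
chains = [
    [("car", 0), ("vehicle", 2)],
    [("bank", 1), ("loan", 3)],
]
summary = summarize_first_sentences(sentences, chains)
```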

Heuristic 2:
The second heuristic is based on the assumption that the chain words represent the concept to
different extents. Therefore, it is necessary to find the most prevalent concept sense, that is,
the word that most appropriately represents the chain. The most prevalent concept
sense in a chain is identified by the frequency of each word in the chain. To summarize the
text:
      1. Generate the lexical chains for the given text
      2. Choose the most appropriate word (the most prevalent sense of the concept
         represented by a particular chain) from the chain. To do that, calculate the
         frequency of the chain words. Choose the word (or words) having more than
         average frequency.
      3. Choose as part of the summary the sentence where the chain's representative word
         appears first. Choose sentences in this manner for all the chains.
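
Heuristic 2 can be sketched in the same style, again assuming chains are lists of (word, sentence index) pairs. The function names are hypothetical. Note that a chain whose words all have the same frequency has no word above the average, so in this sketch it contributes no sentence.

```python
from collections import Counter


def representative_words(chain):
    """Words whose frequency in the chain is above the chain's average frequency."""
    counts = Counter(word for word, _ in chain)
    average = sum(counts.values()) / len(counts)
    return {word for word, n in counts.items() if n > average}


def summarize_representatives(sentences, chains):
    """Pick the sentence where each representative word first appears."""
    picked = set()
    for chain in chains:
        for rep in representative_words(chain):
            picked.add(min(idx for word, idx in chain if word == rep))
    return [sentences[i] for i in sorted(picked)]


sentences = ["S0", "S1", "S2", "S3", "S4"]
chains = [
    [("car", 0), ("vehicle", 1), ("car", 2), ("car", 3), ("wheel", 3)],
    [("bank", 1), ("loan", 2), ("loan", 4)],
]
summary = summarize_representatives(sentences, chains)
```

In the first chain "car" occurs three times against an average of about 1.67, so its first sentence (index 0) is selected; in the second chain "loan" is the representative word and first appears in sentence 2.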

3.3      External Interface Requirements

3.3.1 User Interfaces

            The user will be provided with an input text area in which he can enter or pass
             the text to be processed.

          The user may also be allowed to pass information in the form of a file or a
           link to a website.

          The user can choose the strength of the relations with which to form the
           lexical chains.

          The testers will be allowed to choose different parameters with which to
           summarize the text so that they can analyse the response time.

3.3.2 Hardware Interfaces

              The only hardware required is a functional terminal (with console and
               input devices such as keyboard and mouse) along with an internet
               connection.

              The product requires very limited graphics usage with just a simple
               keypad for taking the user input.

              The product does not require the use of sound or animation. The hardware
               and operating system require a minimum screen resolution of 400 * 300
               pixels (owing to the small form factor).

              Also a high capacity storage device would be needed in order to maintain
               the backend databases.

3.3.3 Software Interfaces

          The system shall make use of the operating system calls to the file
           management system to store and retrieve files from the database.

          The application will also interact with SQL server in order to perform all
           backend operations on the data related to the blogs stored in the database.

          The application will make use of some of the components defined in the
           Microsoft .NET Framework 2.0 or higher.

3.3.4 Communications Interfaces

          The system shall make use of communication standards such as FTP and
           HTTP in order to exchange data over the internet.

3.4    Other Nonfunctional Requirements

3.4.1 Performance Requirements

          The software will support simultaneous user access only if there are multiple

          Only textual information will be handled by the software. Amount of
           information handled can vary from user to user.

          For normal conditions, 95% of the transactions should be processed in less
           than 5 seconds.

          The software can run from a standalone desktop PC with access to the

          The software should run on a machine with at least a 1 GHz processor and
           128 MB of memory.

3.4.2 Safety Requirements

No safety requirements have been identified.

3.4.3 Security Requirements

There are no specific security requirements, other than those generally governing
login into accounts. The files containing information regarding the source code and
libraries should be secured against malicious modification. This can be done by:
         Utilizing certain cryptographic techniques.
         Keeping certain logs and history data sets.
         Restricting communications between some areas of the program.
         Checking data integrity for certain variables.
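
One way the integrity check on program files could look is sketched below in Python. This is only an illustration under stated assumptions: the report does not name a technique, so SHA-256 digests compared against a stored manifest are an assumed choice, and the names sha256_of and modified_files are hypothetical.

```python
import hashlib

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()

def modified_files(manifest):
    """manifest maps a file path to its expected digest (an assumed format).

    Returns the paths whose current content no longer matches the digest.
    """
    return [path for path, expected in manifest.items()
            if sha256_of(path) != expected]
```

The manifest would be recorded once at installation time and checked at start-up; an empty result means no listed file has changed.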

3.4.4 Software Quality Attributes

Web Access
All user functionality shall be available in a web browser with no additional software
installed on the client machines.

Usability Requirements

User Guidance
The system shall guide the user on the meaning of the information that they enter and that
the system displays, on the allowed values and format of the information they enter, and
on how to correct information entries that are not correct.

The system shall meet federal requirements for accessibility to users with disabilities.

Ease of Use
The system shall be designed for “ease of use” by users with knowledge of browsing the
web and completing online portfolios.

Reliability Requirements

The system shall be available 24 hours a day, 365 days a year. There shall be no more
than 5% down time.

Mean time to failure and repair
Value to be determined.

Supportability Requirements

The system shall support localization of all user presented information (the system shall
be configurable to provide information to users in their native language).

              Usability: The software shall be readily accessible and easy to use, with a
               simple GUI that is self-explanatory.

              Reliability: The software shall maintain a backup of the input data until
               the automated summary is complete.

              Portability: The software should be easily portable from one environment
               to another without loss of data.

              Maintainability: The software shall be easy for the developers to update
               and edit in order to increase its efficiency.

3.5    Other Requirements

3.5.1 Database Requirements

The system will require a large database that can store a corpus of blogs that can be
queried and from which information can be extracted.

3.5.2 Legal Requirements

The software’s source code cannot be freely distributed without the express permission of
the owner. Any imitation, copying of the code and graphics without authorization is
strictly prohibited. The software shall be the sole product of the Tata Consultancy Group.

                                                       4. SYSTEM DESIGN

4.1 The Algorithm

Lexical chains represent sequences of related words. They indicate important
concepts/topics in the document. The beginning of a chain indicates the beginning of a new
topic and the end of a chain indicates the end of a topic. The chain return indicates intentional

Candidate words:
       Nouns or pronouns within the text
Stop words:
       High frequency uninformative words and words responsible to form spurious
Token:
       Word remaining after removing stop words from the candidate words. TOKENs
       take part in the chaining process
Node:
       Member of the chain
Sentence queue:
       Queue of tokens from the sentences
Chain stack:
       Stack holding the chains formed during the chaining process. The most recently
       updated chain is given preference when seeking relations. The stack is
       implemented to give access to the last updated chain.
Relations:
       Extra strong relation (XSR): repetition of the word
       Strong relation (SR): synonym, hypernym, hyponym, meronym, holonym
       Medium strength relation (MSR): a transitive relation between two words
Word distance:
       Distance between Node and Token in terms of how many sentences they are away
       from each other
Path length:
       Distance between Node and Token in terms of number of edges or number of
       nodes in MS relation
Weight of the relation:
       Relative strength of the link between two words
       Weights for different relations are:
       XSR = 1.0, SR = 0.9, MSR = 0.8 / 0.7
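
The terms defined above can be pictured with a small data model. The sketch below is in Python for illustration only (the project is written in C#); the class and method names are hypothetical and do not correspond to the project's own types. It shows a token that remembers its sentence, the word distance measured in sentences, and a chain stack that moves an updated chain to the top so that it is searched first.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    word: str
    sentence_index: int

@dataclass(eq=False)          # chains are compared by identity, not by content
class Chain:
    nodes: list = field(default_factory=list)

def word_distance(node, token):
    """Number of sentences separating a chain node from a token."""
    return abs(token.sentence_index - node.sentence_index)

class ChainStack:
    """Chains ordered so that the most recently updated one is visited first."""

    def __init__(self):
        self._chains = []

    def push(self, chain):
        self._chains.append(chain)

    def touch(self, chain):
        # Move an updated chain to the top of the stack.
        self._chains.remove(chain)
        self._chains.append(chain)

    def __iter__(self):
        # Iterate from the top (last updated) down to the oldest chain.
        return reversed(self._chains)
```

Iterating over the stack after a call to touch yields the updated chain first, which is the preference described under "Chain stack".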

Algorithm to determine chain strength
Input = text
Output = chains
Create the stack to hold the chains
Choose the candidate words
Form a queue of candidate words per sentence
Create the chain with the first candidate word
for every sentence queue do
       //loop for XSR relations
       for every TOKEN in the sentence queue do
               if word distance < threshold for XSR then
                       //try to find if TOKEN is related to any of the NODEs, starting
                       //with the last updated chain in the chain stack
                       if XSR relation (TOKEN) then
                               attach the token to the chain
                               move that chain on top of the stack
                       end if
               end if
       end for
       //loop for SR relations
       for every TOKEN in the sentence queue do
               if word distance < threshold for SR then
                       //try to find if TOKEN is related to any of the NODEs, starting
                       //with the last updated chain in the chain stack
                       if XSR relation (TOKEN) or SR relation (TOKEN) then
                               attach the token to the chain
                               move that chain on top of the stack
                       end if
               end if
       end for
       //loop for MS relations
       for every TOKEN in the sentence queue do
               if word distance < threshold for MS then
                       //try to find if TOKEN is related to any of the NODEs, starting
                       //with the last updated chain in the chain stack
                       if XSR relation (TOKEN) or SR relation (TOKEN) or MS relation
                       (TOKEN) then
                               attach the token to the chain
                               move that chain on top of the stack
                       end if
               end if
       end for
end for //loop for sentence queues
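
The three passes above can be sketched in runnable form. The following Python code is an illustration only, not the project's C# implementation. A small dictionary of strongly related words stands in for WordNet, the thresholds are hypothetical parameters, and two details that the pseudocode leaves open are assumed: a token attached in an earlier pass is not reconsidered, and a token that matches no chain starts a new one.

```python
KINDS = ("XSR", "SR", "MS")

def build_chains(sentences, strong, thresholds):
    """sentences: list of token lists; strong: word -> set of strongly related
    words (toy stand-in for WordNet); thresholds: max sentence distance per kind."""
    links = {}
    for word, others in strong.items():
        for other in others:                      # make the toy relation symmetric
            links.setdefault(word, set()).add(other)
            links.setdefault(other, set()).add(word)

    def related(kind, node, token):
        if kind == "XSR":                         # repetition of the word
            return node == token
        if kind == "SR":                          # direct link in the toy thesaurus
            return token in links.get(node, set())
        # MS: transitive relation through one shared neighbour
        return bool(links.get(node, set()) & links.get(token, set()))

    chains = []                                   # chain stack, top is the last element
    for index, queue in enumerate(sentences):
        pending = list(queue)
        for level, kind in enumerate(KINDS, start=1):
            allowed = KINDS[:level]               # each pass also accepts stronger relations
            unattached = []
            for token in pending:
                home = next(
                    (chain for chain in reversed(chains)
                     if any(index - seen < thresholds[kind]
                            and any(related(k, node, token) for k in allowed)
                            for node, seen in chain)),
                    None)
                if home is None:
                    unattached.append(token)
                else:
                    home.append((token, index))
                    # Move the updated chain to the top of the stack.
                    chains = [c for c in chains if c is not home] + [home]
            pending = unattached
        # Assumption: tokens related to no chain each start a new chain.
        chains.extend([(token, index)] for token in pending)
    return [[node for node, _ in chain] for chain in chains]
```

The returned list is ordered from the oldest chain to the most recently updated one, mirroring the chain stack from bottom to top.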

Lexical chaining algorithm
1. Choose a set of candidate terms for chaining, t1…tn
   (open-class nouns, i.e. highly informative words as opposed to stopwords).
2. Initialize: The first candidate term in the text, t1, becomes the head of the first chain,
        3. for each remaining term ti do
               4. for each chain cm do
                          5. Find the chain that is most strongly related to ti
                          with respect to the following chaining constraints:
                          a. Chain Salience (recently updated chain given higher
                          b. Thesaural Relationships (X strong, strong, medium strength)
                          c. Transitivity (path length for MS relation)
                          d. Allowable Word Distance (In terms of sentences)
                          6. If the relationship between a chain and ti adheres to these
                          constraints then ti becomes a member of cm, otherwise ti becomes
                          the head of a new chain.
               7. end for
        8. end for

        The constraints listed in statement 5 of the algorithm are critical in controlling the
scope, size and in many cases the validity of the relationships within a chain. If these
constraints are not adhered to or if suitable parameters are not chosen for each of them,
then the occurrence of spurious chains (chains that contain weakly related or incorrect
terms) will be greatly increased. We now look at each of these constraints in turn:
° Chain Salience: This constraint refers to the notion that words should be added to the
most recently updated chain. This intuitive rule appeals to the notion that terms are best
disambiguated with respect to active chains, i.e. active themes or speaker intentions in the
° Thesaural Relationships: Regardless of the knowledge source used to deduce semantic
similarity between terms, a set of appropriate knowledge source relationships must be
decided on. Morris and Hirst state that their relationship rules 1 and 2, defined above,
based on Roget’s thesaural structure, account for nearly 90% of relationships between
chain words. On the other hand, in WordNet-based chaining the
specialisation/generalisation hierarchy of the noun taxonomy is responsible for the
majority of associations found between nouns.
° Transitivity: Another factor to consider when searching for relationships between words
is transitivity. In particular, although weaker transitive relationships (such as a is related
to c because a is related to b and b is related to c) increase the coverage of possible word
relationships in the taxonomy, they also increase the likelihood of spurious chains, as
they tend to be even more context specific than strong relationships such as synonymy.
For example, consider the following tentative relationship found in WordNet: ‘foundation
stone’ is indirectly related to ‘porch’ since ‘house’ is directly related to both ‘foundation
stone’ and ‘porch’. Deciding whether these transitive relationships are useful is a difficult
decision as one must also consider the loss of possible valuable relationships if they are
ignored, e.g. ‘cheese’ is indirectly related to ‘perishable’ since ‘dairy product’ is directly
related to both words according to WordNet.
° Allowable Word Distance: This constraint works on a similar assumption as Chain
Salience, where the relationships between words are best disambiguated with respect to
the words that they lie nearest to in the text. The general rule is that relationships between
words that are situated far apart in a text are only permitted if they exhibit a very strong
semantic relationship such as repetition or synonymy.
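
The transitivity constraint can be illustrated with a path-length check on a small graph. The sketch below is in Python for illustration only. The graph simply encodes the two examples given above rather than real WordNet lookups, and the function names and the max_path parameter are hypothetical.

```python
from collections import deque

# Toy graph of direct relations, copied from the two examples in the text.
GRAPH = {
    "house": {"foundation stone", "porch"},
    "foundation stone": {"house"},
    "porch": {"house"},
    "dairy product": {"cheese", "perishable"},
    "cheese": {"dairy product"},
    "perishable": {"dairy product"},
}

def path_length(graph, start, goal):
    """Number of edges on the shortest path between two words, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        word, distance = queue.popleft()
        for neighbour in graph.get(word, ()):
            if neighbour == goal:
                return distance + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None

def is_medium_strength(graph, a, b, max_path=2):
    """True when a and b are linked only indirectly, within max_path edges."""
    length = path_length(graph, a, b)
    return length is not None and 1 < length <= max_path
```

Lowering max_path makes chaining stricter and reduces spurious chains, at the cost of missing useful indirect links such as the one between 'cheese' and 'perishable'.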

                                       Figure 1

       4.3    UML Diagrams

                              FIG 4: ACTIVITY DIAGRAM

                         FIG 5: SEQUENCE DIAGRAM

                     FIG 6: COLLABORATION DIAGRAM

                     FIG 7: COMPONENT DIAGRAM

                     FIG 8: DEPLOYMENT DIAGRAM

                                                   5. IMPLEMENTATION




   C# is a multi-paradigm programming language encompassing imperative, functional,
   generic, object-oriented (class-based), and component-oriented programming disciplines.
   It was developed by Microsoft within the .NET initiative and later approved as a standard
   by Ecma (ECMA-334) and ISO (ISO/IEC 23270). C# is one of the programming
   languages designed for the Common Language Infrastructure.

   C# is intended to be a simple, modern, general-purpose, object-oriented programming
   language. Its development team is led by Anders Hejlsberg. At the time of writing, the
   most recent version is C# 4.0, which was released on April 12, 2010.


Our system basically consists of three modules:
                      a) Sentence splitter
                      b) Document clustering
                      c) Sentence Extraction

       a) Sentence splitter

                                            Figure 9a

                                            Figure 9b

                                            Figure 9c

b) Document Clustering

                                            Figure 10

c) Sentence Extractor

                                            Figure 11a

                                            Figure 11b

           a. Sentence Splitter
   Features: -
   Software: SplitSentences.proj
   The sentence splitter takes as input a collection of documents related to a
   particular topic. It first preprocesses the text: it eliminates the delimiters inside
   abbreviations by replacing the abbreviated words with their expansions, removes
   extra white space from the text, and so on. It then splits each document into its
   constituent sentences on the basis of delimiters such as ‘.’, ‘,’, ‘!’ and ‘?’.

   Functions Used :-
   Signature: private void Form_Sentences(string text)
   It calls CleanText for preprocessing. It then replaces the delimiters with
   a special marker, “SEN_END”. After that, it splits the text into sentences wherever
   it finds an occurrence of “SEN_END”.

   Signature: private string CleanText(string InputText)
   It removes extra spaces between words, white spaces at the beginning, multiple tabs
   and spaces before the newline character. It also expands abbreviations on the basis of
   a given list.
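
The two functions described above can be sketched as follows. This is a minimal Python illustration, not the project's C# code: the names clean_text and form_sentences, the abbreviation list and the regular expressions are all assumptions made for the example. The default delimiter set follows the one listed in the text, including the comma.

```python
import re

# Hypothetical abbreviation list; the project reads its own list.
ABBREVIATIONS = {"Dr.": "Doctor", "e.g.": "for example"}

def clean_text(text, abbreviations=ABBREVIATIONS):
    """Expand abbreviations and normalise white space."""
    for short, full in abbreviations.items():
        text = text.replace(short, full)
    text = re.sub(r"[ \t]+", " ", text)       # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)      # drop spaces around newlines
    return text.strip()

def form_sentences(text, delimiters=".,!?"):
    """Replace each delimiter with SEN_END, then split on that marker."""
    pattern = "[" + re.escape(delimiters) + "]"
    marked = re.sub(pattern, "SEN_END", clean_text(text))
    return [part.strip() for part in marked.split("SEN_END") if part.strip()]
```

Expanding "Dr." before splitting is what prevents the period inside the abbreviation from being treated as the end of a sentence.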

            b. Document Clustering

    This is the most important module as it is responsible for the formation of lexical
    chains and the scoring of the chains and sentences. It relies on the semantic dictionary
    WordNet 2.1 to tag words and obtain their word sense. Lexical chains can be formed
    using either greedy, semi-greedy or semi-semi greedy approach and the optimum path
    length can be set by the user. It stores the score of various chains passing through a
    sentence in a .mat file whereas the individual chain scores are stored in a text file
    titled “ChainScores.txt”.
    It also calculates various indices that help determine the similarity between the given

      i)       Brill Tagger
License: GNU licensed Copyright 1993 MIT
It assigns a part-of-speech tag to each word in a sentence. This version
uses the combined Wall Street Journal and Brown corpora supplied by Brill. It has
classified more than 90,000 individual words from more than 1 million words of text.
If the word is in the lexicon, it is tagged with its first (most likely) tag.

Signature: private static void DoContextualTagging()
The norm is to check for a substitution only if the new tag already exists in the list of
possible tags. If the word is unknown then it is probably best to try the substitution.

Signature: private static void DoLexicalTagging()
Goes through each of the rules. Each rule in turn goes through every word in the sentence
to see if it applies. This is a lot of work, but it has to be done. The code assumes that
the rules are all perfectly formed, so the rule file should not be edited by hand.

Signature: private static void DoBasicTagging()
If the word is in the lexicon, it is tagged with its first (most likely) tag; if not, it is
tagged as NN, or as NNP if it has a capital letter. Everything that does not start with a
letter of the alphabet is ignored, except that something which starts with a number is
tagged CD; this is changed to JJ if it contains a 'd' (e.g. 2nd) or a 't' (e.g. 31st).

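
The basic tagging rules above can be written down compactly. The Python sketch below is an illustration only, not the project's C# code. The function name basic_tag and the lexicon format (a word mapped to its list of possible tags, most likely first) are assumptions, as is the reading of "has a capital letter" as "starts with a capital letter".

```python
def basic_tag(word, lexicon):
    """Initial tag for one non-empty word, following the rules described above."""
    if word in lexicon:
        return lexicon[word][0]            # first (most likely) tag
    if word[0].isdigit():
        # Numbers are CD; ordinals such as 2nd or 31st become JJ.
        return "JJ" if ("d" in word or "t" in word) else "CD"
    if not word[0].isalpha():
        return None                        # ignored, e.g. punctuation
    # Assumption: a capital first letter marks a proper noun.
    return "NNP" if word[0].isupper() else "NN"
```

These initial tags are only a starting point; the lexical and contextual rule passes may later replace them.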
Signature: public static void GetLexiconEtc(string LexiconFile, string LexicalRuleFile,
string ContextualRuleFile)
Long files are usually read with SR.ReadToEnd followed by a Split on VBNewLine, but in
this case that is much slower than reading line by line with ReadLine. Reading and hashing
the lexicon this way is also much faster than saving the serialized hash table, especially
for a very long lexicon. Serializing a big hash table is very slow.

Signature: public static string BrillTagged(string TheSentence, bool DoLexical, bool
DoContextual, bool DoClean, string LexiconFile, string LexicalRuleFile, string
The BrillTagged function performs the various kinds of tagging: DoBasicTagging,
DoContextualTagging and DoLexicalTagging.

       ii)     Lexical Chainer
The main sub-block within the summary generator block, it is responsible for the
formation of chains from the queue of sentences. It contains the lexical chaining
algorithm which is applied to the entire text.
  Signature: public List<int> doclist = new List<int>();
  List of documents through which the chain passes. It is the
  same as the doclist in the chain stack; this just inherits the list for docstat.

Signature: public ChainMembers(String chainMemberValue, int documentno, int
docfreq, int XSRfreq, int SSRfreq, int SRfreq, int MSfreq, int IDfreq)
   Identifies and records the chain member’s value, document no, document frequency
   and XSR, SR or the MSR relationship.

   Signature: public Tokens(string Token_Value, int Sentence_index, int
   Token_index_inDoc, int Doc_index, string Tag, Hashtable HtWordStructure)
   Identifies and records the token’s value, sentence index, document index, tag and the

      iii)   WordNet
   License : Princeton University
   It is an online semantic dictionary that provides the different senses of the given word
   and the paths generated between the words.

   Signature : private void setWordNet_Path()
   Description :
   Methods to find the depth of synsets in WordNet taxonomies

   Signature : private void initialize_StopWords()
   Description :
   Function to eliminate very common words from the text which do not form part of
   the central concepts.

   Signature : public string tagging(string sentence)
   Description :
   POS tagging is required to understand the contextual relationship of token with the
   chain content.

   Signature :private List<string> Clean_Text_Form_Senstences(string text)
   Description :
   After the POS tagging is done, stop words are eliminated and other text preprocessing
   is done to clean the text.

           c. Sentence Extractor
   This module scores each sentence by summing the scores of all the chains passing
   through that sentence. It then selects the sentences for the summary using either a
   concept-based or a sentence-based approach. In the concept approach, the highest
   scoring sentences are selected from the top concepts, whereas in the sentence
   approach the top scoring sentences are picked directly.
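
The scoring and the two selection approaches can be sketched as follows. This Python code is an illustration under stated assumptions, not the project's C# implementation: the function names, the dictionary-based inputs and the tie-breaking by sentence position are all hypothetical choices.

```python
def sentence_scores(chain_scores, chain_sentences, num_sentences):
    """Score of a sentence = sum of the scores of all chains passing through it."""
    scores = [0.0] * num_sentences
    for chain, sentences in chain_sentences.items():
        for index in set(sentences):
            scores[index] += chain_scores[chain]
    return scores

def select_by_sentence(scores, count):
    """Sentence approach: pick the top scoring sentences directly."""
    top = sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:count]
    return sorted(top)

def select_by_concept(chain_scores, chain_sentences, scores, num_concepts, per_concept):
    """Concept approach: best sentences from each of the top scoring chains."""
    top_chains = sorted(chain_scores, key=lambda c: -chain_scores[c])[:num_concepts]
    picked = set()
    for chain in top_chains:
        best = sorted(set(chain_sentences[chain]), key=lambda i: (-scores[i], i))
        picked.update(best[:per_concept])
    return sorted(picked)
```

Both selectors return sentence indices in document order, so the summary reads in the same order as the source text. The concept approach can return fewer sentences than requested when the top chains share their best sentence.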

   Signature : public string findscore(Int32 i,string path)
   Description :
   Reads the score of the chain with the given number from the ChainScores.txt file.

   Signature : public void sel_con_or_sent(int num_sen, int num_sent, int fl)
   Description :
   Selects either the concept-based or the sentence-based approach. The concept
   approach selects the highest scoring sentences from the top scoring concepts; the
   sentence approach selects sentences on the basis of sentence scores.

   Signature : public string selsent(String path, int no)
   Description :
   Selects sentences using the concept approach: it picks the highest scoring
   sentences from the top scoring concepts. It maintains a linked list to store the
   sentences and scores for each chain.

   Signature : public void selsent(String path)
   Description :
   Selects sentences using the sentence approach: it simply picks the highest scoring
   sentences. It maintains an array to store the sentence numbers and scores of the
   top scoring sentences.

                                                                              6. TESTING


       Functional Testing

       Functional testing is performed on the entire system in the context of a Functional
       Requirement Specification(s) (FRS) and/or a System Requirement Specification (SRS).
       System testing is an investigatory testing phase, where the focus is to have almost a
       destructive attitude and tests not only the design, but also the behavior and even the
       believed expectations of the customer. It is also intended to test up to and beyond the
       bounds defined in the software/hardware requirements specification(s).


       We have performed FUNCTIONAL TESTING for our project to ensure that each
       functionality implemented above is correctly called by the USER INTERFACE (UI
       Elements) and correctly executed.

       A few of the test cases for our project are as follows:

Test     Test case           UI element in        ILAYER functions            Input fields   Expected output          Actual output            Remarks
case no  description         PLAY                 called

1        Input file name     Button5              openFileDialog1.Open        .mat file      File path displayed in   File path displayed in   Test passed
                                                                                             the textbox              the textbox

2        Selecting an        Radio button         btnprint.PerformClick()     Approach       Datagrid is populated    Datagrid is populated    Test passed
         approach            rdbtncon and

3        Choosing Advanced   checkBox1            none                        Click check    Textbox displaying       Textbox displaying       Test passed
         options                                                              box            default values of the    default values of the
                                                                                             parameters               parameters

4        Generating summary  button4              sel_con_or_sent(num_sen,    none           MessageBox showing the   MessageBox showing the   Test passed
                                                  num_sent, 0)                               generated summary        generated summary
                                                                                             txtfile                  txtfile

5        Exporting to Excel  button3              (Excel.Worksheet)           none           MessageBox showing the   MessageBox showing the   Test passed
                                                  xlWorkBook.Worksheets.g                    message “Excel file      message “Excel file
                                                                                             created”                 created”

6        Inputting file name Button5              openFileDialog1.OpenFile()  Other than     MessageBox showing       MessageBox showing       Test passed
                                                                              .mat file      “Input string not in     “Input string not in
                                                                                             format”                  format”

7        Input parameters in Textbox              none                        Non-integral   MessageBox showing       MessageBox showing       Test passed
         advanced options    Txtsenpercon,                                    value          “Arguments not in        “Arguments not in
                             txtperc,                                                        correct format”          correct format”

Test case no: 1
Test case: Input file name

                             FIGURE 12a : Input File Name

                             FIGURE 12b : Input File Name

Test case no: 2
Test case: Selecting an approach

                            FIGURE 13: Selecting an approach

Test case no: 3
Test case: Choosing Advanced options

                        FIGURE 14: Choosing Advanced options

Test case no: 4a
Test case: Generating summary

                          FIGURE 15a: Generating summary

Test case no: 4b
Test case: Generating summary

                          FIGURE 15b: Generating summary

Test case no: 5
Test case: Exporting to Excel

                             FIGURE 16: Exporting to Excel

Test case no: 6
Test case: Inputting file name

                             FIGURE 17: Inputting file name

Test case no: 7
Test case: Input parameters in advanced options

                   FIGURE 18: Input parameters in advanced options

                   7. DEPLOYMENT AND MAINTENANCE

   The system can be deployed on any desktop/workstation which has the following
   Windows XP, Vista, 7
   WordNet 2.1 Dictionary

   7.1 Introduction
   Software deployment is all of the activities that make a software system available for
   use. These activities can occur at the producer site or at the consumer site or both.
   Because every software system is unique, the precise processes or procedures within
   each activity can hardly be defined. Therefore, "deployment" should be interpreted as
   a general process that has to be customized according to specific requirements or
   characteristics. A brief description of each activity will be presented later.

   7.2 Installation Process
   The installation package consists of an executable file titled “Setup.exe” for the
   DocumentClustering module. On clicking the Install button, the package extracts
   itself into the Program Files folder in the root drive.
   After that the Microsoft Office Interop Assemblies are installed in the Program Files

                    FIGURE 19a: Welcome to the Sankshipt Setup Wizard

                          FIGURE 19b: Select Installation Folder

                             FIGURE 19c: Installing Sankshipt

                            FIGURE 19d: Installation Complete

                      FIGURE 19e: Microsoft Office Interop Assemblies

                   FIGURE 19f: Select the Installation folder for assemblies

7.3 Uninstallation
An uninstaller, also called a deinstaller, is application software designed to
remove all or parts of a specific application. The uninstaller is used to reverse
changes in the log.
To remove all the parts of the software, go to Start -> All Programs -> Sankshipt ->
Uninstall.

7.4 User Help
It is a technical communication document intended to give assistance to people using a
particular system.
A Readme file is provided with the installer to give assistance to people using the system
which covers the instructions regarding the software, its installation and uninstallation.

                                                              8. CONCLUSION

       We present a domain independent summarization system which uses lexical
       chains for automatic text summarization of large documents. This system builds
       on previous research by implementing a lexical chain extraction algorithm in
       linear time. The system is reasonably domain independent and takes as input any
       text or HTML document. The system outputs a short summary based on the most
       salient concepts from the original document. The length of the extracted summary
       can be either controlled automatically, or manually based on length or percentage
       of compression. The system provides useful summaries which compare well in
       information content to human generated summaries. Additionally, the system
       provides a robust test bed for future summary generation research.

       8.1 FUTURE SCOPE

      Business analysts can use a text summarizer in order to derive user opinions on
       products and services as expressed through blogs. They can observe the impact of
       changes in business strategies on their sales and understand the customers’ needs

      News feeds from several websites on a particular topic can be summarized into a
       short paragraph, thus providing the viewer with the required information without
       having to go through all the content.

      Search engine hit summarization resulting in a summary of all the information
       returned by the hit list could be an important application.

      Physicians could employ a text summarizer to summarize and compare the
       recommended treatments for a patient