Posted on: 6/7/2013. Public Domain.
Project report: Extracting Meaningful Insights From Blogs (a final-year project).
1. INTRODUCTION

1.1 PROBLEM DEFINITION

One of the main objectives is to automate the summarisation of large texts, in this case blogs, and make the result as structured and meaningful as a manually written summary. This allows us to gain useful insights from blogs without having to read all the text ourselves. The comments generated in response to a blog can also be analysed to determine the mindset and opinions of readers. Another issue is the classification of blogs according to their content: blogs have to be tagged by genre. Typical text mining tasks include text clustering, text categorization, concept/entity extraction, sentiment analysis, and document summarization.

PROBLEM STATEMENT

Develop an automated multi-document summarization system that extracts information from a collection of documents related to a topic.
Input: a collection of blogs
Output: a summary of the topic

1.2 Background

Much research has been conducted in the area of automatic text summarization. In particular, research using lexical chains and related techniques has received much attention. Early methods based on word frequency counts did not consider the relations between similar words, yet finding the aboutness of a document requires finding these relations. How these relations occur within a document is referred to as cohesion (Halliday and Hasan, 1976). First introduced by Morris and Hirst (1991), lexical chains represent lexical cohesion among related terms within a corpus. These relations can be recognized by identifying sets of words, of arbitrary size, which are semantically related (i.e., have a sense flow). Lexical chains provide an interesting basis for summarization because they are easy to recognize within the source text and vast knowledge sources are not required in order to compute them.

1.3 Objectives:

1. To develop an automated multi-document summarization system that extracts information from a collection of documents related to a topic.
2. To extract meaningful insights from the summarization of different kinds of blogs available on the WWW.
3. To analyse the results obtained from the given algorithm and determine its effectiveness.
4. To design a reasonably domain-independent system which outputs a short summary based on the most salient concepts in the original document.
5. To design a system in which the length of the extracted summary can be controlled either automatically or manually, based on length or percentage of compression.

1.4 Motivation:

Researchers and students constantly face this scenario: it is almost impossible to read most, if not all, of the newly published papers in order to stay informed of the latest progress, and when they work on a research project the time spent reading the literature seems endless. The goal of this project is to design a domain-independent, automatic text extraction system to alleviate, if not totally solve, this problem. We score the sentences of the given text linguistically and generate a summary comprising the most important ones. The program takes input from a text file and outputs the summary to a similar text file. The most daunting task at hand was to devise an efficient scoring algorithm that would produce good results for a wide range of text types. The only way to arrive at it was to summarize texts manually, evaluate the chosen sentences for common traits, and then encode those traits in the program.
1.5 Structure of the report:

Chapter 2 covers the literature survey, Chapter 3 deals with system design, Chapter 4 with detailed system design, Chapter 5 with software implementation, Chapter 6 with testing, and Chapter 7 with deployment and maintenance. Chapter 8 gives the conclusion and scope, followed by Chapter 9, which lists the references.

2. LITERATURE SURVEY

Early ideas of systems for automatically condensing and/or summarizing documents date back to the fifties (Luhn, 1958). More recently, there has been a shift in the Information Extraction paradigm (Kuhn, 1970): experts are slowly moving away from research motivated by linguistic theory and towards applied technology. Research on IBM's statistical machine translation (MT) system (Brown et al., 1990) proved that, contrary to predictions, a system which employed essentially no linguistic knowledge not only worked, but did surprisingly well. At the same time, many corpus-based methods and approaches were developed and implemented, such as automatic part-of-speech (POS) tagging.

Information Extraction is a relatively new field in NLP. Its prime objectives are twofold:
i) to divide text into "relevant" and "irrelevant" sections (filtering);
ii) to fill pre-specified templates with information extracted from the text.
These filled templates can be entered into standard relational databases, serve as automatic triggers for informing human data analysts of events of interest, or provide a basis for generating text abstracts or condensed versions of source texts.

2.1 Lexical Chaining

A lexical chain is a way to represent the lexical cohesion present in a text: it is a list (group) of semantically related words within the text. Because lexical chains group related words, they are indicators of the different concepts discussed in the text.
In addition, the beginning of a new chain indicates the beginning of a new topic, and the end of a lexical chain indicates the end of that topic. With the availability of machine-readable thesauri such as WordNet and Roget's, the computation of lexical chains and their application to different text processing problems have gained significant interest. The cohesive units found by lexical chains are useful in different natural language processing applications such as text segmentation, text summarization, multi-document summarization, information retrieval and so on.

Subsequent to the introduction of lexical chaining for discourse analysis by Morris and Hirst (1991), lexical chaining has become one of the basic techniques for addressing various problems in text analysis. Beginning with early research on natural language processing problems, the technique was later explored as a knowledge-based approach to the traditional text summarization task, and it has been used to solve effectively several text analysis problems traditionally handled by frequency-based summarization techniques. One interesting application is that lexical chains support an alternative scoring mechanism based on the semantics of the text rather than the traditional word-frequency-based vector space model of text summarization. The lexical chaining technique has been put to the test in applications such as word sense disambiguation, text segmentation and text summarization. Though computationally (CPU time) intensive, the technique is robust enough to be applied successfully to the variety of problems mentioned above and to many other text analysis problems.
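As an illustration of the topic-boundary role of chains described above, a minimal sketch might derive topic boundaries from where chains begin and end. The chain spans below are invented, not output of the system described in this report:

```python
# Each chain is represented by the span of sentences it covers.
# A topic boundary is proposed wherever a chain begins or just ends.

def topic_boundaries(chain_spans):
    """Return candidate boundary positions (sentence indices)."""
    starts = {s for s, _ in chain_spans}
    after_ends = {e + 1 for _, e in chain_spans}
    return sorted(starts | after_ends)

spans = [(0, 3), (4, 7), (8, 9)]   # three toy chains over sentences 0-9
print(topic_boundaries(spans))     # [0, 4, 8, 10]
```

A real chainer would produce overlapping spans, so a practical segmenter would additionally merge boundaries that fall close together.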
Some of the important applications of lexical chaining are:
- text segmentation / topic boundary detection
- detection of malapropisms via lexical cohesion
- generation of intra-document and inter-document links
- text summarization

There are three approaches to text summarization using lexical chaining.

2.1.1 First approach

Assume that the most important, or strongest, chain among all chains represents the summary. That is, of all the concepts, the most prevailing concept is treated as the summary, and appropriate text representing that main concept is chosen as the summary. Chain strength can be decided in two ways:
- length based
- cohesion based (Stokes and Barzilay use the same notion but different formulas)

This approach is suitable for generating headlines. It can be tuned to generate a longer summary, but some concepts are likely to be missed out: they might not be central to the text yet still be a valid part of the summary.

2.1.2 Second approach

To summarize the text, consider all the chains. That is, all the concepts discussed in the text are considered (irrespective of their importance) and appropriate representative text for each concept is chosen as the summary. This approach can be tuned to produce a short summary, and since all the concepts are represented, it yields a true summary. Chain pruning is, however, essential; otherwise single-word chains and other spurious chains will take part in the summarization.

2.1.3 Third approach

Rank the chains by importance and choose a number of chains, based on some threshold, for the summary. That is, out of the many concepts appearing in the text, prioritize the concepts to be considered for the summary.

2.1.4 A Comparison of the approaches

The implications of these three strategies are important with respect to chaining dynamics. They are discussed below. The last strategy will create fewer (the minimum number of) chains than any other strategy.
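The three chain-selection strategies above can be sketched as follows. The chains and their scores here are invented toy values, not output of the system described in this report:

```python
# Three ways of selecting chains for a summary, mirroring approaches 1-3.

def strongest_chain(chains):
    """Approach 1: keep only the single strongest chain (headline-style)."""
    return max(chains, key=lambda c: c["score"])

def all_chains(chains, min_len=2):
    """Approach 2: keep every chain, pruning spurious one-word chains."""
    return [c for c in chains if len(c["words"]) >= min_len]

def top_chains(chains, threshold):
    """Approach 3: rank chains and keep those above a score threshold."""
    return sorted((c for c in chains if c["score"] >= threshold),
                  key=lambda c: c["score"], reverse=True)

chains = [
    {"words": ["blog", "post", "article"], "score": 2.6},
    {"words": ["summary", "gist", "headline"], "score": 1.9},
    {"words": ["weather"], "score": 0.4},        # spurious one-word chain
]

print(strongest_chain(chains)["words"])            # the dominant concept
print([c["words"] for c in all_chains(chains)])    # all non-spurious concepts
print([c["words"] for c in top_chains(chains, 1.0)])
```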
Because we pass a queue of all the words in the text through the XSR, SR and MSR searches before considering new chain generation, there is only one chain in the stack to start with. Even when further chains are later added to the stack, searching through the chain stack is redundant, since all the possible related words for the earlier chain(s) have already been checked throughout the text. The net effect is that at any given point in time only one chain is the active chain seeking words. Chains are created one by one (one loop of the XSR, SR and MSR relations completes one chain, and so on) and they never change their order.

2.2 Cohesion, Coherence and Discourse Analysis

When reading any text it is obvious that it is not merely made up of a set of unrelated sentences: these sentences are in fact connected to each other through two linguistic phenomena, namely cohesion and coherence. As Morris and Hirst (1991) point out, cohesion relates to the fact that the elements of a text (e.g. clauses) 'tend to hang together', while coherence refers to the fact that 'there is sense (or intelligibility) in a text'. Observing the interaction between textual units in terms of these properties is one way of analysing the discourse structure of a text. Most theories of discourse result in a hierarchical, tree-like structure that reflects the relationships between sentences or clauses in a text. These relationships may, for example, highlight sentences that elaborate, reiterate or contradict a certain theme. Meaningful discourse analysis of this kind requires a true understanding of textual coherence, which in turn often involves looking beyond the context of the text and drawing on real-world knowledge of events and the relationships between them.
Hasan, in her paper on 'Coherence and Cohesive Harmony' (1984), hypothesises that the coherence of a text can be indirectly measured by analysing the degree of interaction between cohesive chains in the text. Analysing cohesive relationships in this manner is a more manageable and less computationally expensive approach to discourse analysis than coherence analysis. For example, Morris and Hirst (1991) note that, unlike research into cohesion, there has been no widespread agreement on the classification of different types of coherence relationships. Furthermore, they note that even humans find it more difficult to identify and agree on textual coherence because, although identifying cohesion and coherence are both subjective tasks, coherence requires a definite 'interpretation of meaning', while cohesion requires only an understanding that terms are about 'the same thing'.

2.3 Semantic Networks, Thesauri and Lexical Cohesion

2.3.1 Longmans Dictionary of Contemporary English

The Longmans Dictionary of Contemporary English (LDOCE) was the first available machine-readable dictionary. Its popularity as a lexical resource can also be attributed to the simplicity of its design, since it was created with non-native English speakers in mind. More specifically, the dictionary was written so that all gloss definitions were described with respect to a controlled vocabulary of 2,851 words referred to as the Longmans Defining Vocabulary (LDV). In their paper on calculating the similarity between words based on spreading activation in a dictionary, Kozima and Furugori (1993a) took advantage of this design feature and generated a semantic network from the gloss entries of a subset of the words in the dictionary. This subset of the dictionary is called Glossème and consists of all words included in the LDV.
2.3.2 Roget's Thesaurus

Roget's Thesaurus is one of a category of thesauri (like the Macquarie Thesaurus) that were custom built as aids for writers who wish to replace a particular word or phrase with a synonymous or near-synonymous alternative. Unlike a dictionary, such thesauri contain no gloss definitions; instead they provide the user with a list of possible replacements for a word and leave it to the user to decide which sense is appropriate. The structure of the thesaurus provides two ways of accessing a word:
1. By searching for it in a list of over 1,042 pre-defined categories, e.g. Killing, Organisation, Amusement, Physical Pain, Zoology, Mankind, etc.
2. By searching for the word in an alphabetical index that lists all the different categories in which the word occurs, i.e. analogous to defining sense distinctions.

Lexical cohesive relationships between words can also be determined using a resource of this nature, since words that co-exist in the same category are semantically related. There is a hierarchical structure above (classes, sub-classes and sub-sub-classes) and below (sub-categories) these categories in the thesaurus, and it is this structure that facilitates inferring a small range of semantic strengths between words; however, unlike LDOCE, 'no numerical value for semantic distance can be obtained' (Budanitsky, 1999).

2.3.3 WordNet

WordNet (Miller et al., 1990; Fellbaum, 1998a) is an online lexical database whose design is inspired by current psycholinguistic theories of human lexical memory. WordNet is divided into four distinct word categories: nouns, verbs, adverbs and adjectives. The most important relationship between words in WordNet is synonymy. The WordNet definition of synonymy also includes near-synonymy; hence WordNet synonyms are only interchangeable in certain contexts (Miller, 1998). A unique label called a synset number identifies each synonymous set of words (a synset) in WordNet.
Each node or synset in the hierarchy represents a single lexical concept and is linked to other nodes in the semantic network by a number of relationships. Different relationships are defined between synsets depending on the semantic hierarchy they belong to. For example, most verbs are organised around entailment (synonymy and a type of verb hyponymy called troponymy (Fellbaum, 1998b)), while adjectives and adverbs are organised around antonymy (opposites such as big-small and beautifully-horribly) and synonymy. Nouns, on the other hand, are predominantly related through synonymy and hyponymy/hypernymy. In addition, nine other lexicographical relationships are defined between nodes in the noun hierarchy.

WordNet noun relationships, with examples:
- Hyponymy (KIND_OF), specialisation: apple is a hyponym of fruit, since an apple is a kind of fruit.
- Hypernymy (IS_A), generalisation: celebration is a hypernym of birthday, since a birthday is a type of celebration.
- Holonymy (HAS_PART):
  o HAS_PART_COMPONENT: tree is a holonym of branch.
  o HAS_PART_MEMBER: church is a holonym of parishioners.
  o IS_MADE_FROM_OBJECT: tyre is a holonym of rubber.
- Meronymy (PART_OF):
  o OBJECT_IS_PART_OF: leg is a meronym of table.
  o OBJECT_IS_A_MEMBER_OF: sheep is a meronym of flock.
  o OBJECT_MAKES_UP: air is a meronym of atmosphere.
- Antonymy (OPPOSITE_OF): girl is an antonym of boy.

Unfortunately, there is little interconnectivity between the noun, verb, adverb and adjective files in the WordNet taxonomy: the verb file has no relations with any of the other files, the adverb file has only unidirectional relations with the adjective file, and there are only a limited number of 'pertains to' relationships linking adjectives to nouns.

2.4 Morris and Hirst: The Origins of Lexical Chain Creation

Morris and Hirst (1991) used lexical chains to determine the intentional structure of a discourse using Grosz and Sidner's discourse theory (1986).
In this theory, Grosz and Sidner propose a model of discourse based on three interacting textual elements: linguistic structure (segments indicate changes in topic), intentional structure (how segments are related to each other), and attentional structure (shifts of attention in the text that are based on linguistic and intentional structure). Obviously, any attempt to automate this process requires a method of identifying linguistic segments in a text. Morris and Hirst believed these discourse segments could be captured using lexical chaining, where each segment is represented by the span of a lexical chain in the text.

Morris and Hirst manually generated lexical chains using Roget's Thesaurus, which consists of an index entry for each word listing synonyms and near-synonyms for each of its coarse-grained senses, followed by a list of category numbers related to these senses. A category in this context consists of a list of related words and pointers to related categories. They used the following rules to glean semantic associations from the thesaurus during the chain generation process; two words are related if any of the following relationship rules apply:
1. They have a common category in their index entries.
2. One word has a category in its index entry that contains a pointer to a category of the other word.
3. One word is either a label in the other word's index entry or is listed in a category of the other word.
4. Both words have categories in their index entries that are members of the same class/group.
5. Both words have categories in their index entries that point to a common category.
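A minimal sketch of how the first two of these rules might be checked against a Roget-style index follows. The words, category numbers and pointers below are invented for illustration, not taken from the actual thesaurus:

```python
# Toy Roget-style resources: each word maps to the category numbers in its
# index entry; each category may point to related categories.

index = {
    "car":     {295, 273},
    "vehicle": {273},
    "drive":   {267},
}
pointers = {
    295: {273},
    267: {273},
}

def related(w1, w2):
    """Rules 1 and 2 of Morris and Hirst's thesaural relatedness test."""
    cats1, cats2 = index.get(w1, set()), index.get(w2, set())
    # Rule 1: the words share a common category in their index entries.
    if cats1 & cats2:
        return True
    # Rule 2: a category of one word points to a category of the other.
    if any(pointers.get(c, set()) & cats2 for c in cats1):
        return True
    return any(pointers.get(c, set()) & cats1 for c in cats2)

print(related("car", "vehicle"))   # True via rule 1 (shared category 273)
print(related("drive", "car"))     # True via rule 2 (267 points to 273)
print(related("car", "banana"))    # False (banana has no index entry)
```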
Morris and Hirst also introduced a general algorithm for generating chains, on which most other chaining implementations are based.

2.5 Greedy WordNet-based Chaining Algorithms

The generic algorithm is an example of a greedy chaining algorithm, where each addition of a term ti to a chain cm is based only on those words that occur before it in the text. A non-greedy approach, on the other hand, postpones assigning a term to a chain until it has seen all possible combinations of chains that could be generated from the text. One might expect that there is only one possible set of lexical chains that could be generated for a specific text, i.e. the correct set. In reality, however, terms have multiple senses and could be added to a variety of different chains.

2.6 Non-Greedy WordNet-based Chaining Algorithms

As stated previously, a non-greedy approach to lexical chaining postpones resolving ambiguous words in a text until it has analysed the entire context of the document. Barzilay and Elhadad (1997) were the first to discuss the advantages of a non-greedy chaining approach. They argued that disambiguating a term only after all possible links between it and the other candidate terms in the text have been considered was the only way to ensure that the optimal set of lexical chains for that text would be generated. In other words, the relationships between the terms in each chain will only be valid if they conform to the intended interpretation of the terms as they are used in the text. For example, chaining 'jaguar' with 'animal' is only valid if 'jaguar' is being referred to in the text as a type of cat and not a type of car.

Barzilay and Elhadad

Barzilay and Elhadad (1997) were the first to coin the phrase 'a non-greedy or dynamic solution to lexical chain generation'.
They proposed that the most appropriate sense of a word can only be chosen after examining all possible lexical chain combinations that could be generated from a text. Their dynamic algorithm begins by extracting nouns and noun compounds. Barzilay reduces both non-WordNet and WordNet noun compounds to their head noun, e.g. 'elementary_school' becomes 'school'. As each target word arrives, a record of all possible chain interpretations is kept, and the correct sense of the word is decided only after all chain combinations have been completed.

3. SOFTWARE REQUIREMENTS SPECIFICATION

3.1 Introduction

3.1.1 Project Scope

The project aims to implement an effective algorithm that enables summarization of blogs and comments. It essentially filters the text in order to extract relevant information that will be useful in analysis. The software will take text from blogs as input and extract the different concepts present in the text. It will then grade each concept and determine the central concept and other important concepts expressed in the blog. The summary will cover all the important concepts, with due weightage to the central concept. The main tasks of an IE system are:
1. Indexing the information items, i.e. assigning a number of discriminative keywords (index words) to all items and possibly also weighting them.
2. Mapping the user's request onto an internal representation language.

3.1.2 User Classes and Characteristics

Since it is a web-based application, there are no restrictions in terms of location or the hardware and software in use.
The application will be used by:
- people who want short and concise information on a particular topic;
- companies and businesses that want to analyse their product sales;
- testers who will conduct timely checks of performance parameters;
- system designers who will keep upgrading the algorithm.

3.1.3 Operating Environment

The application can run on any system with Web access for obtaining input in the form of blogs, and a database in which to store blogs and data from social networking sites. The software shall be designed to run on at least one of the following platforms:
- Microsoft Windows: Windows implementations shall be portable to all versions of Windows up to and including Windows XP.
- UNIX: UNIX implementations shall be portable to any version of UNIX that supports the user interface libraries.
- Linux: Linux implementations shall run on at least one version of Linux.

3.1.4 Design and Implementation Constraints

1. Web browser support: The system shall support users using Microsoft IE version 4 or higher, Netscape Navigator version 4 or higher, Mozilla Firefox version 1.4 or higher, or Google Chrome.
2. Internationalization/localization: The system shall use internationalization techniques that allow it to be configured for natural languages, and shall use Unicode-32 or a multi-width character encoding.
3. User interface: The product must have a user-friendly interface that is simple enough for all types of users to understand.
4. Response time: Due to the large amount of computation involved in text analysis, there is a response time constraint on the system. An unreliable internet connection may also limit the interaction between the user and the system.
5. Backend management: The external database servers should be updated regularly. This updating and replication of data from the central database server can introduce additional latency in the working of the system.
6.
Database management: The replication from the central server to the backup server has to be asynchronous, as such solutions also provide a greater amount of protection by extending the distance between the primary and secondary locations of the data.

3.1.5 Assumptions and Dependencies

1. The system designers can assume that users of the system are using Microsoft Internet Explorer version 4 or above, Netscape Navigator version 4 or above, Mozilla Firefox, or Google Chrome.
2. The system designers can assume that the web server, application server and database server software and hardware are available to support the requirements cited in the supplementary specifications.
3. The system designers can assume that the necessary system analysis, design, implementation and test tools will be available.
4. The system designers can assume that the input will be readily available in the form of blogs and comments or any other text.

3.2 System Features

This system has a direct application in text summarization.

3.2.1 Text Summarization

3.2.1.1 Summarization considering chain strength

The gist is a very short summary of the text, ranging from a phrase to a sentence. The short summary so generated generally indicates the essence of the given text and can be considered a headline for it. Such a short summary is very useful when broadcasting news. Stokes (2004) uses lexical cohesion to generate very short summaries of news articles. The methodology is as follows:
1. Generate the lexical chains for the given news article.
2.
Identify the highest scoring noun and proper noun chains using the following formula:

    score(chain) = sum for i = 1..n of reps(i) * rel(i, j)

where:
- i = current word of the chain;
- n = number of words in the chain, i.e. the length of the chain in words;
- reps(i) = number of repetitions of word i;
- rel(i, j) = strength or weight of the relation between words i and j.

The weights for the relations are:
o extra strong relation = 1.0
o strong relation (synonym) = 0.9
o strong relation (hypernym, hyponym, meronym, holonym, antonym) = 0.7
o medium strength relation = 0.4
o statistical relation = 0.4

3. Select the highest scoring proper noun chain and noun chain. If more than one chain shares the highest score, select all the chains with that score.
4. The chain so selected is the most important chain for the given text, and the words in the chain represent the most important nouns and proper nouns (key words) for the given text.
5. The next step is to score the sentences in the text according to the key words, using the following formula:

    Score(sentence) = sum for i = 1..N of score(chain)_i

where:
- N = number of words in the sentence;
- score(chain)_i = 0 if the ith word does not belong to the key chain; if the word belongs to the key chain, it equals the score of the chain calculated earlier.

The intuition here is that the more words in a sentence belong to the key chain, the more important the sentence is.
6. Score all the sentences accordingly and choose the highest scoring sentence as the gist/headline.

3.2.1.2 Summarization using all the chains

The alternative strategy for summarization, given by Barzilay (1997), is:

Heuristic 1: The intuition is that, since the beginning sentence (of a particular chain) introduces the concept identified by the chain, that sentence will contain the information necessary to summarize the concept.
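The chain and sentence scoring of 3.2.1.1 above can be sketched in code. The chain contents, repetition counts and relation labels below are toy assumptions, not the report's actual data:

```python
# Stokes-style gisting sketch: score a chain, then score sentences by
# membership of their words in the key chain.

WEIGHTS = {"extra_strong": 1.0, "synonym": 0.9, "hypernym": 0.7,
           "medium": 0.4, "statistical": 0.4}

def chain_score(chain):
    """score(chain) = sum of reps(i) * rel(i) over the chain's words."""
    return sum(reps * WEIGHTS[rel] for _, reps, rel in chain)

def sentence_score(words, key_words, key_score):
    """A word contributes the key chain's score if it belongs to the chain."""
    return sum(key_score for w in words if w in key_words)

# Toy key chain: (word, repetitions, relation by which it joined the chain).
chain = [("storm", 3, "extra_strong"), ("gale", 1, "synonym"),
         ("weather", 2, "hypernym")]
score = chain_score(chain)             # 3*1.0 + 1*0.9 + 2*0.7 = 5.3
key_words = {w for w, _, _ in chain}

sentences = ["the storm brought severe weather".split(),
             "traffic was light on monday".split()]
best = max(sentences, key=lambda s: sentence_score(s, key_words, score))
print(round(score, 1))   # 5.3
print(" ".join(best))    # the storm brought severe weather
```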
Additionally, the other sentences can be seen as text repeating the concept introduced in the sentence where the particular chain began. Hence the starting sentence is a candidate for representing the summary of the concept represented by that chain. To summarize the given text:
1. Generate the lexical chains.
2. For each chain, choose the sentence at which the chain begins as part of the summary.

Heuristic 2: The second heuristic is based on the assumption that the chain words represent the concept to different extents. It is therefore necessary to find the most prevailing sense of the concept, that is, the word that can most appropriately represent the chain. The most prevailing concept sense of a chain is identified by word frequency within the chain. To summarize the text:
1. Generate the lexical chains for the given text.
2. Choose the most appropriate word (the most prevailing sense of the concept represented by a particular chain) from the chain. To do this, calculate the frequency of the chain words and choose the word (or words) with above-average frequency.
3. Choose as part of the summary the sentence where the chain-representing word first appears. Choose sentences in this manner for all the chains.

3.3 External Interface Requirements

3.3.1 User Interfaces

The user will be provided with an input text area in which to enter or paste the text to be processed. The user may also be allowed to pass information in the form of a file or a link to a website. The user can choose the strength of the relations with which to form the lexical chains. Testers will be allowed to choose the different parameters with which to summarize the text so that they can analyse the response time.

3.3.2 Hardware Interfaces

The only hardware required is a functional terminal (with console and input devices such as keyboard and mouse) along with an internet connection.
The product requires very limited graphics usage, with just a simple keypad for taking user input, and does not require sound or animation. The hardware and operating system require a minimum screen resolution of 400 x 300 pixels (owing to the small form factor). A high-capacity storage device will also be needed in order to maintain the backend databases.

3.3.3 Software Interfaces

The system shall make use of operating system calls to the file management system to store and retrieve files from the database. The application will also interact with SQL Server in order to perform all backend operations on the data related to the blogs stored in the database. The application will make use of some of the components defined in the Microsoft .NET Framework 2.0 or higher.

3.3.4 Communications Interfaces

The system shall make use of communication standards such as FTP and HTTP in order to exchange data over the internet.

3.4 Other Nonfunctional Requirements

3.4.1 Performance Requirements

The software will support simultaneous user access only if there are multiple terminals. Only textual information will be handled by the software; the amount of information handled can vary from user to user. Under normal conditions, 95% of transactions should be processed in less than 5 seconds. The software can run from a standalone desktop PC with access to the internet, and should run on at least a 1 GHz machine with 128 MB of RAM.

3.4.2 Safety Requirements

No safety requirements have been identified.

3.4.3 Security Requirements

There are no specific security requirements other than those generally governing login to accounts.
The files containing the source code and libraries should be secured against malicious deformation. This can be done by:
- utilizing certain cryptographic techniques;
- keeping certain logs and history data sets;
- restricting communications between some areas of the program;
- checking data integrity for certain variables.

3.4.4 Software Quality Attributes

3.4.4.1 Web Access
All user functionality shall be available in a web browser with no additional software installed on the client machines.

3.4.4.2 Usability Requirements
User guidance: The system shall guide the user on the meaning of the information they enter, display the allowed values and format of that information, and suggest how to correct entries that are invalid.
Accessibility: The system shall meet federal requirements for accessibility to handicapped users.
Ease of use: The system shall be designed for ease of use by users with knowledge of browsing the web and completing online portfolios.

3.4.4.3 Reliability Requirements
Availability: The system shall be available 24 hours a day, 365 days a year, with no more than 5% downtime.
Mean time to failure and repair: Value to be determined.

3.4.4.4 Supportability Requirements
The system shall support localization of all user-presented information (the system shall be configurable to provide information to users in their native language).
Usability: The software shall be readily accessible and easy to use, with a simple, self-explanatory GUI.
Reliability: The software shall maintain a backup of the input data until the automated summary is complete.
Portability: The software shall be easily portable from one environment to another without loss of data.
Maintainability: The software shall be easy for the developers to update and edit in order to increase its efficiency.

3.5 Other Requirements

3.5.1 Database Requirements
The system will require a large database that can store a corpus of blogs which can be queried and from which information can be extracted.

3.5.2 Legal Requirements
The software's source code cannot be freely distributed without the express permission of the owner. Any imitation or copying of the code and graphics without authorization is strictly prohibited. The software shall be the sole product of the Tata Consultancy Group.

4. SYSTEM DESIGN

4.1 The Algorithm
Lexical chains represent sequences of related words and indicate important concepts or topics in the document. The beginning of a chain indicates the start of a new topic and the end of a chain indicates the end of a topic; a chain return indicates intentional boundaries.

DEFINITIONS:
Candidate words: Nouns or pronouns within the text.
Stop words: High-frequency uninformative words, and words responsible for forming spurious chains.
TOKEN: A word remaining after stop words are removed from the candidate words; TOKENs take part in the chaining process.
NODE: A member of a chain.
Sentence queue: A queue of the tokens from a sentence.
Chain stack: A stack holding the chains formed during the chaining process. The most recently updated chain is given preference when seeking relations; the stack is implemented to access the last updated chain first.
XSR: Extra-strong relation, i.e. repetition of the word.
SR: Strong relation, i.e. synonym, hypernym, hyponym, meronym or holonym.
MSR: Medium-strength relation, i.e. a transitive relation between two words.
Word distance: The distance between a NODE and a TOKEN, in terms of how many sentences apart they are.
Path length: The distance between a NODE and a TOKEN in an MSR relation, in terms of the number of edges (or nodes) on the path.
Weight of a relation: The relative strength of the link between two words. The weights for the different relations are: XSR = 1.0, SR = 0.9, MSR = 0.8 / 0.7.

Algorithm to determine chain strength
Input: text
Output: chains

create the stack to hold the chains
choose the candidate words
form a queue of candidate words per sentence
create the chain with the first candidate word
for every sentence queue do
    // loop for XSR relations
    for every TOKEN in the sentence queue do
        if word distance < threshold for XSR then
            // try to find whether TOKEN is related to any of the NODEs,
            // starting with the last updated chain in the chain stack
            if XSR relation(TOKEN) then
                attach the token to the chain
                move that chain to the top of the stack
            end if
        end if
    end for
    // loop for SR relations
    for every TOKEN in the sentence queue do
        if word distance < threshold for SR then
            if XSR relation(TOKEN) or SR relation(TOKEN) then
                attach the token to the chain
                move that chain to the top of the stack
            end if
        end if
    end for
    // loop for MSR relations
    for every TOKEN in the sentence queue do
        if word distance < threshold for MSR then
            if XSR relation(TOKEN) or SR relation(TOKEN) or MSR relation(TOKEN) then
                attach the token to the chain
                move that chain to the top of the stack
            end if
        end if
    end for
end for  // loop over sentence queues

The lexical chaining
algorithm is as follows:

1. Choose a set of candidate terms for chaining, t1 ... tn (open-class nouns, i.e. highly informative words, as opposed to stop words).
2. Initialize: the first candidate term in the text, t1, becomes the head of the first chain, c1.
3. for each remaining term ti do
4.     for each chain cm do
5.         Find the chain that is most strongly related to ti with respect to the following chaining constraints:
           a. Chain salience (the most recently updated chain is given higher preference)
           b. Thesaural relationships (extra-strong, strong, medium-strength)
           c. Transitivity (path length, for the MSR relation)
           d. Allowable word distance (in terms of sentences)
6.         If the relationship between a chain cm and ti adheres to these constraints, then ti becomes a member of cm; otherwise ti becomes the head of a new chain.
7.     end for
8. end for

The constraints listed in step 5 of the algorithm are critical in controlling the scope, size and, in many cases, the validity of the relationships within a chain. If these constraints are not adhered to, or if suitable parameters are not chosen for each of them, the occurrence of spurious chains (chains that contain weakly related or incorrect terms) will be greatly increased. We now look at each of these constraints in turn:

- Chain Salience: This constraint refers to the notion that words should be added to the most recently updated chain. This intuitive rule appeals to the notion that terms are best disambiguated with respect to active chains, i.e. active themes or speaker intentions in the text.

- Thesaural Relationships: Regardless of the knowledge source used to deduce semantic similarity between terms, a set of appropriate knowledge-source relationships must be decided on. Morris and Hirst state that their relationship rules 1 and 2, defined above, based on Roget's thesaural structure, account for nearly 90% of relationships between chain words.
On the other hand, in WordNet-based chaining the specialisation/generalisation hierarchy of the noun taxonomy is responsible for the majority of associations found between nouns.

- Transitivity: Another factor to consider when searching for relationships between words is transitivity. In particular, although weaker transitive relationships (such as a being related to c because a is related to b and b is related to c) increase the coverage of possible word relationships in the taxonomy, they also increase the likelihood of spurious chains, as they tend to be even more context-specific than strong relationships such as synonymy. For example, consider the following tentative relationship found in WordNet: 'foundation stone' is indirectly related to 'porch', since 'house' is directly related to both 'foundation stone' and 'porch'. Deciding whether these transitive relationships are useful is difficult, as one must also consider the loss of possibly valuable relationships if they are ignored; e.g. 'cheese' is indirectly related to 'perishable', since 'dairy product' is directly related to both words according to WordNet.

- Allowable Word Distance: This constraint works on a similar assumption to chain salience: the relationships between words are best disambiguated with respect to the words that lie nearest to them in the text. The general rule is that relationships between words situated far apart in a text are permitted only if they exhibit a very strong semantic relationship such as repetition or synonymy.
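As a rough illustration, the chain-stack procedure described above can be sketched as follows. The project itself is written in C# against WordNet; in this Python sketch a hardcoded synonym table stands in for WordNet, the medium-strength (transitive) relation is omitted for brevity, and all names, thresholds and data are illustrative, not the project's actual values.

```python
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}  # toy WordNet stand-in

# allowable word distance (in sentences) per relation strength
XSR_DIST, SR_DIST = 7, 3

def xsr(token, node):
    return token == node                          # extra-strong relation: repetition

def sr(token, node):
    return node in SYNONYMS.get(token, set())     # strong relation: synonymy

def build_chains(sentence_queues):
    """sentence_queues: one list of candidate tokens per sentence."""
    stack = []   # each chain is a list of (token, sentence index) nodes
    for s_idx, queue in enumerate(sentence_queues):
        for token in queue:
            placed = False
            # try relations strongest-first, as in the three loops above
            for related, max_dist in ((xsr, XSR_DIST), (sr, SR_DIST)):
                # scan chains starting with the last updated (top of stack)
                for ci in range(len(stack) - 1, -1, -1):
                    node, n_idx = stack[ci][-1]
                    if s_idx - n_idx <= max_dist and related(token, node):
                        stack[ci].append((token, s_idx))
                        stack.append(stack.pop(ci))   # move chain to top
                        placed = True
                        break
                if placed:
                    break
            if not placed:
                stack.append([(token, s_idx)])        # token heads a new chain
    return stack

chains = build_chains([["car", "road"], ["automobile"], ["road"]])
print([[t for t, _ in c] for c in chains])
```

Here 'automobile' joins the 'car' chain via the strong (synonymy) relation, while the repeated 'road' extends its own chain via the extra-strong relation.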
4.2 SYSTEM ARCHITECTURE
Figure 1

4.3 UML Diagrams
FIG 2: CLASS DIAGRAM FOR LEXICAL ANALYSER
FIG 3: CLASS DIAGRAM FOR DOCUMENT CLUSTERING
FIG 4: ACTIVITY DIAGRAM
FIG 5: SEQUENCE DIAGRAM
FIG 6: COLLABORATION DIAGRAM
FIG 7: COMPONENT DIAGRAM
FIG 8: DEPLOYMENT DIAGRAM

5. IMPLEMENTATION

5.1 INTRODUCTION
TECHNOLOGY DETAILS USED IN THE PROJECT
C#: C# is a multi-paradigm programming language encompassing imperative, functional, generic, object-oriented (class-based), and component-oriented programming disciplines. It was developed by Microsoft within the .NET initiative and later approved as a standard by Ecma (ECMA-334) and ISO (ISO/IEC 23270). C# is one of the programming languages designed for the Common Language Infrastructure. It is intended to be a simple, modern, general-purpose, object-oriented programming language, and its development team is led by Anders Hejlsberg. The most recent version is C# 4.0, released on April 12, 2010.

5.2 SAMPLE SCREENSHOTS
Our system consists of three modules:
a) Sentence splitter
b) Document clustering
c) Sentence extraction

a) Sentence splitter: Figures 9a, 9b, 9c
b) Document clustering: Figure 10
c) Sentence extractor: Figures 11a, 11b

5.3 IMPORTANT MODULES AND THEIR DESCRIPTION

a.
Sentence Splitter
Features:
Software: SplitSentences.proj

The sentence splitter is given an input in the form of a collection of documents related to a particular topic. It first preprocesses the text: it eliminates delimiters hidden inside abbreviations by replacing the abbreviated words with their expansions, removes extra white space from the text, and so on. It then splits the document into its constituent sentences on the basis of delimiters such as '.', ',', '!' and '?'.

Functions used:

Signature: private void Form_Sentences(string text)
Description: Calls an instance of CleanText for preprocessing. It then replaces the delimiters with a special marker, "SEN_END", and splits the text into sentences wherever it finds an occurrence of "SEN_END".

Signature: private string CleanText(string InputText)
Description: Removes extra spaces between words, white space at the beginning of the text, multiple tabs, and spaces before the newline character. It also expands abbreviations on the basis of a given list.

b. Document Clustering
This is the most important module, as it is responsible for the formation of the lexical chains and the scoring of the chains and sentences. It relies on the semantic dictionary WordNet 2.1 to tag words and obtain their word senses. Lexical chains can be formed using a greedy, semi-greedy or semi-semi-greedy approach, and the optimum path length can be set by the user. The module stores the scores of the various chains passing through each sentence in a .mat file, whereas the individual chain scores are stored in a text file titled "ChainScores.txt". It also calculates various indices that help determine the similarity between the given sentences.

i) Brill Tagger
License: GNU licensed, Copyright 1993 MIT
It tags each word in a sentence with its part of speech. This version uses the combined Wall Street Journal and Brown corpora supplied by Brill.
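Returning briefly to module (a), the CleanText/Form_Sentences pipeline of the sentence splitter can be sketched in Python roughly as follows. This is an illustrative rendering, not the project's C# code: the abbreviation list here is made up, and only sentence-final punctuation is used as a delimiter.

```python
import re

# hypothetical abbreviation list; the project loads its own from a file
ABBREVIATIONS = {"Dr.": "Doctor", "e.g.": "for example", "etc.": "et cetera"}

def clean_text(text):
    # expand abbreviations so their periods are not mistaken for delimiters
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    # collapse extra spaces and tabs, trim leading/trailing white space
    return re.sub(r"\s+", " ", text).strip()

def form_sentences(text):
    text = clean_text(text)
    # mark every sentence delimiter with SEN_END, then split on the marker
    marked = re.sub(r"[.!?]", "SEN_END", text)
    return [s.strip() for s in marked.split("SEN_END") if s.strip()]

print(form_sentences("Dr. Smith writes blogs.   They are long! Really?"))
```

Expanding "Dr." before splitting prevents the splitter from ending a sentence at the abbreviation's period, which is exactly the preprocessing role CleanText plays.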
The tagger has classified more than 90,000 individual words from more than 1 million words of text. If a word is in the lexicon, it is tagged with its first (most likely) tag.

Signature: private static void DoContextualTagging()
Description: The norm is to check for a substitution only if the new tag already exists in the list of possible tags. If the word is unknown, then it is probably best to try the substitution.

Signature: private static void DoLexicalTagging()
Description: Goes through each of the rules; each rule is checked against every word in the sentence to see whether it applies. This is a lot of work, but it has to be done. The code assumes that the rules are all perfectly formed, so the rule file should not be hand-edited.

Signature: private static void DoBasicTagging()
Description: If the word is in the lexicon, it is tagged with its first (most likely) tag; if not, it is tagged as NN, or NNP if it has a capital letter. Everything that does not start with a letter of the alphabet is ignored, except for tokens that start with a number, which are tagged CD; a CD tag is changed to JJ if the token contains a 'd' (e.g. 2nd) or a 't' (e.g. 31st).

Signature: public static void GetLexiconEtc(string LexiconFile, string LexicalRuleFile, string ContextualRuleFile)
Description: Long files are usually read with SR.ReadToEnd followed by a Split on VBNewLine, but in this case that is much slower than reading via ReadLine. Reading and hashing this way is also much faster than saving the serialized hash table, especially for a very long lexicon; serializing a big hash table is very slow.

Signature: public static string BrillTagged(string TheSentence, bool DoLexical, bool DoContextual, bool DoClean, string LexiconFile, string LexicalRuleFile, string ContextualRuleFile)
Description: The various kinds of tagging (DoBasicTagging, DoContextualTagging and DoLexicalTagging) are performed in the BrillTagged function.
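The DoBasicTagging fallback rules just described can be sketched as follows. This is an illustrative Python rendering, not the project's code (which uses the C# port of Brill's tagger with its full lexicon); the three-entry lexicon here is made up.

```python
# word -> first (most likely) tag; tiny illustrative stand-in for Brill's lexicon
LEXICON = {"the": "DT", "dog": "NN", "runs": "VBZ"}

def basic_tag(word):
    if word.lower() in LEXICON:
        return LEXICON[word.lower()]          # lexicon hit: take first tag
    if word[0].isdigit():
        # number-initial tokens are CD, but ordinals like "2nd"/"31st"
        # contain a 'd' or 't' and behave like adjectives (JJ)
        return "JJ" if ("d" in word or "t" in word) else "CD"
    if not word[0].isalpha():
        return None                           # ignored by the tagger
    return "NNP" if word[0].isupper() else "NN"   # capitalized -> proper noun

print([basic_tag(w) for w in ["the", "dog", "Rover", "2nd", "42", "blog"]])
```

In the real tagger these basic tags are only a starting point; the lexical and contextual rule passes then revise them.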
ii) Lexical Chainer
The main sub-block within the summary generator block, it is responsible for the formation of chains from the queue of sentences. It contains the lexical chaining algorithm, which is applied to the entire text.

Functions:

Signature: public List<int> doclist = new List<int>();
Description: The list of documents through which a chain passes. It is the same as the doclist in the chain stack; this one just inherits the list for docstat.

Signature: public ChainMembers(String chainMemberValue, int documentno, int docfreq, int XSRfreq, int SSRfreq, int SRfreq, int MSfreq, int IDfreq)
Description: Identifies and records the chain member's value, document number, document frequency, and XSR, SR or MSR relationship.

Signature: public Tokens(string Token_Value, int Sentence_index, int Token_index_inDoc, int Doc_index, string Tag, Hashtable HtWordStructure)
Description: Identifies and records the token's value, sentence index, document index, tag and WordStructure.

iii) WordNet
License: Princeton University
WordNet is an online semantic dictionary that provides the different senses of a given word and the paths generated between words.

Signature: private void setWordNet_Path()
Description: Methods to find the depth of synsets in the WordNet taxonomies.

Signature: private void initialize_StopWords()
Description: Function to eliminate very common words from the text which do not form part of the central concepts.

Signature: public string tagging(string sentence)
Description: POS tagging is required to understand the contextual relationship of a token with the chain content.

Signature: private List<string> Clean_Text_Form_Senstences(string text)
Description: After the POS tagging is done, stop words are eliminated and other text preprocessing is performed to clean the text.

c.
Sentence Extractor
This module scores each sentence by summing the scores of all the chains passing through that sentence. It then selects the sentences for the summary with either a concept-based or a sentence-based approach. In the concept approach the highest-scoring sentences are selected from the top concepts, whereas in the sentence approach the top-scoring sentences are picked directly.

Signature: public string findscore(Int32 i, string path)
Description: A function to read the score of a given chain number from the ChainScores.txt file.

Signature: public void sel_con_or_sent(int num_sen, int num_sent, int fl)
Description: Used to select either the concept-based or the sentence-based approach. In the concept approach it selects the highest-scoring sentences from the top-scoring concepts; in the sentence approach it selects sentences on the basis of sentence scores.

Signature: public string selsent(String path, int no)
Description: Used to select sentences on the basis of the concept approach. It selects the highest-scoring sentences from the top-scoring concepts, maintaining a linked list to store the sentences and scores for each chain.

Signature: public void selsent(String path)
Description: Used to select sentences on the basis of the sentence approach. It simply selects the highest-scoring sentences, maintaining an array to store the sentence numbers and scores of the top-scoring sentences.

6. TESTING

6.1 INTRODUCTION
Functional Testing
Functional testing is performed on the entire system in the context of a Functional Requirement Specification (FRS) and/or a System Requirement Specification (SRS). System testing is an investigatory testing phase, where the focus is to have almost a destructive attitude and to test not only the design but also the behaviour, and even the believed expectations of the customer.
It is also intended to test up to and beyond the bounds defined in the software/hardware requirements specification(s).

TEST CASES
We have performed functional testing for our project to ensure that each functionality implemented above is correctly called by the user interface (UI elements) and correctly executed. A few of the test cases for our project are listed below; the columns are: test case no., test case description, UI element, layer function(s) called, input fields, expected output, actual output, remarks.

1 | Input file name | Button 5 | openFileDialog1.OpenFile() | .mat file | File path displayed in the textbox | File path displayed in the textbox | Test passed
2 | Selecting an approach | Radio buttons rdbtncon and rdbtnsent | btnprint.PerformClick() | Approach option | Datagrid is populated | Datagrid is populated | Test passed
3 | Choosing Advanced options | checkBox1 | none | Click check box | Textbox displaying default values of the parameters | Textbox displaying default values of the parameters | Test passed
4 | Generating summary | button 4 | sel_con_or_sent(num_sen, num_sent, 0) | none | MessageBox showing the generated summary txt file | MessageBox showing the generated summary txt file | Test passed
5 | Exporting to Excel | button 3 | (Excel.Worksheet)xlWorkBook.Worksheets.get_Item(1) | none | MessageBox showing the message "Excel file created" | MessageBox showing the message "Excel file created" | Test passed
6 | Inputting file name | Button 5 | openFileDialog1.OpenFile() | Other than .mat | MessageBox showing "Input string not in format" | MessageBox showing "Input string not in format" | Test passed
7 | Input parameters in advanced options | Textboxes Txtsenpercon, txtperc, txtnumsent, txtsencutoff | none | Non-integral value | MessageBox showing "Arguments not in correct format" | MessageBox showing "Arguments not in correct format" | Test passed

6.2 SNAPSHOTS OF THE TEST CASES

Test case 1 (Input file name): FIGURE 12a and FIGURE 12b, Input File Name
Test case 2 (Selecting an approach): FIGURE 13, Selecting an approach
Test case 3 (Choosing Advanced options): FIGURE 14, Choosing Advanced options
Test case 4 (Generating summary): FIGURE 15a and FIGURE 15b, Generating summary
Test case 5 (Exporting to Excel): FIGURE 16, Exporting to Excel
Test case 6 (Inputting file name): FIGURE 17, Inputting file name
Test case 7 (Input parameters in advanced options): FIGURE 18, Input parameters in advanced options

7.
DEPLOYMENT AND MAINTENANCE
The system can be deployed on any desktop/workstation with the following:
Windows XP, Vista or 7
WordNet 2.1 Dictionary

7.1 Introduction
Software deployment comprises all of the activities that make a software system available for use. These activities can occur at the producer site, at the consumer site, or both. Because every software system is unique, the precise processes or procedures within each activity can hardly be defined; "deployment" should therefore be interpreted as a general process that has to be customized according to specific requirements or characteristics. A brief description of each activity is presented later.

7.2 Installation Process
The installation package consists of an executable titled "Setup.exe" for the DocumentClustering module. On clicking the Install button, the package self-extracts into the Program Files folder on the root drive. After that, the Microsoft Office Interop Assemblies are installed in the Program Files folder.

FIGURE 19a: Welcome to the Sankshipt Setup Wizard
FIGURE 19b: Select Installation Folder
FIGURE 19c: Installing Sankshipt
FIGURE 19d: Installation Complete
FIGURE 19e: Microsoft Office Interop Assemblies
FIGURE 19f: Select the Installation folder for assemblies

7.3 Uninstallation
An uninstaller, also called a deinstaller, is application software designed to remove all or part of a specific application. The uninstaller reverses the changes recorded in the installation log. Going through Start -> All Programs -> Sankshipt -> Uninstall removes all parts of the software.

7.4 User Help
User help is a technical communication document intended to give assistance to people using a particular system.
A Readme file covering instructions for the software, its installation and its uninstallation is provided with the installer to assist people using the system.

8. CONCLUSION
We present a domain-independent summarization system which uses lexical chains for automatic text summarization of large documents. This system builds on previous research by implementing a lexical chain extraction algorithm in linear time. The system is reasonably domain independent and takes as input any text or HTML document. It outputs a short summary based on the most salient concepts from the original document; the length of the extracted summary can be controlled either automatically or manually, based on length or percentage of compression. The system provides useful summaries which compare well in information content to human-generated summaries. Additionally, it provides a robust test bed for future summary-generation research.

8.1 FUTURE SCOPE
Business analysts can use a text summarizer to derive user opinions on products and services as expressed through blogs; they can observe the impact of changes in business strategy on their sales and understand customers' needs better. News feeds from several websites on a particular topic can be summarized into a short paragraph, providing the viewer with the required information without having to go through all the content. Search-engine hit summarization, producing a summary of all the information returned by the hit list, could be an important application. Physicians could employ a text summarizer to summarize and compare the recommended treatments for a patient.

9.
REFERENCES
 Nicola Stokes, "Applications of Lexical Cohesion Analysis in the Topic Detection and Tracking Domain".
 Klaus Zechner, 1996, "A Literature Survey on Information Extraction and Text Summarization".
 Kedar Bellare, Anish Das Sarma, Atish Das Sarma, Navneet Loiwal, Vaibhav Mehta, Ganesh Ramakrishnan, Pushpak Bhattacharyya, "Generic Text Summarization Using WordNet".
 Rajdeep, 2008, "Auto Text Summarization".
 Yves Petinot, "Context-based URL Summarization".
 Eduard Hovy and Daniel Marcu, COLING/ACL '98, "Automated Text Summarization Tutorial".
 Meru Brunn, Yllias Chali, Christopher J. Pinchak, "Text Summarization Using Lexical Chains".
 Harold Davis, "Visual C# .NET Programming".