                              (IJCSIS) International Journal of Computer Science and Information Security,
                              Vol. 9, No. 1, January 2011
       DESIGN OF CONTENT ORIENTED INFORMATION RETRIEVAL BASED
                       ON SEMANTIC ANALYSIS

       S. Amudaria, PG Student, Dept of IT, SSN College of Engineering, Chennai, India, daria.amu@gmail.com
       S. Sasirekha, Asst. Professor, Dept of IT, SSN College of Engineering, Chennai, India, sasirekhas@ssn.edu.in

Abstract:
      The existing Information Retrieval (IR) systems, which are based entirely on syntactic (keyword-based) contents, have serious limitations such as irrelevant document retrieval, word-sense ambiguity, and low precision and recall ratios, since the complete semantics of the contents are not represented. To overcome these limitations, it is identified from the recent literature that it is necessary to analyze and determine the semantic features of both the content of the documents and the query. Hence in this paper it is proposed to first develop a semantic pattern that represents the semantic features of the contents of every document in the corpus in a Term Document Matrix (TDM) format, and then to develop a semantic pattern for the contents of the query by applying Natural Language Processing techniques along with WordNet Synsets for query refinement and expansion. The similarity between the semantic pattern of the query and the TDM is then calculated using Latent Semantic Analysis (LSA) and plotted in a semantic vector space. By matching against the vector space, contents associated with the query can be identified in the corresponding cluster. Various experiments are carried out, and their results show an increase in document retrieval recall and precision rates, thereby demonstrating the effectiveness of the model.

    Keywords: Information retrieval, Semantic extraction, Query extension, Query matching

                       I INTRODUCTION

        The existing information retrieval systems are mostly keyword-based and identify relevant documents or information by matching keywords. Keyword-based search, in spite of its merits of expedient querying and ease of use, fails to represent the complete semantics contained in the content (Oh et al., 2007) and has led to the following problems (Abdelali et al., 2007; Moreale et al., 2004): (1) keywords represent only fragmented meanings of the content, and the content identified through keywords does not always meet the querist's requirements; the querist has to screen retrieval results and correct keywords several times to obtain the required information. (2) Compared to a full text, a query usually comprises less content, which may lead to wrong retrieval results due to problems such as insufficient information being used in the search process, insufficient query topics, and difficulty in determining query features. (3) Due to synonymy and polysemy in human language, information retrieval through keywords covers only information containing the same keyword, while other information with similar semantics but different keywords is completely left out. The user normally goes to the search engine expecting exact and relevant results, but current search engines do not guarantee accurate results for the user.

        Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable data space, whether on the Web or within a closed system, to generate more relevant results. Rather than using ranking algorithms such as Google's PageRank to predict relevancy, semantic search uses semantics, the science of meaning in language, to produce highly relevant search results. In most cases, the goal is to deliver the information queried by a user rather than have the user sort through a list of loosely related keyword results. Here WordNet is used to get the semantics of the query.

        A brief literature survey of information retrieval techniques is given in Section II, the proposed system and its techniques are explained in Section III, and the next two sections deal with the implementation and test results.

                       II RELATED WORK

        Ming-Yen Chen et al. (2009) introduce a semantic-enabled information retrieval system in which a web corpus is taken and the related information is retrieved. The limitation of this work is that it does not deal with synonyms or Synsets. In our work we therefore rely on the WordNet ontology to collect more senses.

        Zongli Jiang et al. (2009) introduce the concept of the category attribute of a word. According to the category attribute of a word, useless results can be removed from the search results and the retrieval efficiency improved. Latent semantic analysis is a method that can discover the underlying semantic relation between words and documents; singular value decomposition is used within latent semantic analysis to analyze the words and documents and finally obtain the semantic relation.







        Hongwei Yang et al. (2010) enable users to find relevant documents more easily and also help them form an understanding of the different facets of the query provided to a web search engine. A popular technique for clustering is based on K-means, in which the data is partitioned into K clusters. In this method the groups are identified by a set of points called the cluster centers, and each data point belongs to the cluster whose center is closest. The algorithm used in the proposed system is the K-means clustering algorithm.

        Gang et al. (2009) proposed a method to enhance information retrieval recall and precision. To filter out documents that are only weakly related to the original query, the scores of the result documents are re-calculated using ontology-based semantic similarity. A new definition of the iterative query expansion parameters is put forward, which can reduce the number of expansions and further improve the efficiency of the query.

        Trong Hai et al. (2008) proposed a system which applies the relations between entities discovered from a text corpus to ontology integration tasks, in which the noun phrase (NP) is used to identify its head noun; this is useful for avoiding wrong relations between entities. It also proposes a collaborative acquisition algorithm combining WordNet-based and text-corpus approaches to provide general concepts and their relations for ontology integration tasks.

        Trong Hai Duong and Geun Sik Jo (2009) designed a new measure based on the semantic ontology database WordNet, which combines an information content-based measure with edge-counting techniques to measure semantic similarity. The influence of the "PART-OF" and "IS-A" hierarchical relations on semantic similarity is considered in this paper. Breadth-first search is used to find the shortest path between two concepts, and the similarities of hierarchy and superposition are calculated respectively.

                       III PROPOSED SYSTEM
        The proposed system uses the semantic analysis technique to retrieve content relevant to the user query. The user's query is analyzed in the semantic extraction and determination module to extract its semantic features, for the purpose of determining the contents of the query and representing them in a structured and materialized semantic pattern. In this component the semantic elements are identified and their semantic relations analyzed, followed by the integration and simplification of semantic relations with WordNet. The semantic extension module then identifies other potentially relevant semantic features based on the semantic features of the query and includes them in the query pattern; this increases the number of semantic features in the query as the basis for matching. The input query from the user is processed using preprocessing techniques such as stop-word removal, after which stemming is done. Each processed word is passed to WordNet to collect all the other senses that the corresponding word has. The Synsets related to the query are taken, and the latent semantic analysis process is done to index the documents.

        Singular value decomposition processes the documents in the corpus, and the term-document frequency matrix is generated. This term-document frequency matrix is plotted, and the terms most similar to the query are plotted in the semantic space. Finally, the relevant documents are obtained using k-means clustering. The block diagram of the proposed system is shown below.

        Fig 1: Block diagram of the proposed system

        Consider the word Java. The corresponding senses of the word java taken from WordNet are:

        Word: java — senses from WordNet:
          1. an island in Indonesia south of Borneo; one of the world's most densely populated regions
          2. coffee: a beverage consisting of an infusion of ground coffee beans; "he ordered a cup of coffee"
          3. a simple platform-independent object-oriented programming language used for writing applets that are downloaded from the World Wide Web by a client and run on the client's machine

        This means that the single word java has three senses. This type of word ambiguity is not handled by current search engines. Also consider another example: the query given by the user is Computer. The words PC and Computer refer to the same thing, but in current search engines only the documents containing the word computer will be indexed and retrieved to the user. So even though the word PC carries the same meaning, the pages relevant to the word PC are not retrieved; hence the precision and recall ratio is reduced.

        The proposed system is to design a content-based information retrieval system based on semantic analysis, where we use the WordNet ontology to perform a search based on Synsets and thus increase the precision and recall ratio. A sketch of this kind of sense lookup is given below.
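        As an illustration, the following is a minimal sketch of listing senses with NLTK's interface to WordNet (an assumption here: the system's own WordNet access may differ; this sketch requires NLTK plus its downloaded wordnet corpus):

            # Minimal sketch: list WordNet senses with NLTK
            # (assumes `pip install nltk` and nltk.download('wordnet')).
            from nltk.corpus import wordnet as wn

            for word in ("java", "computer"):
                print(word)
                for synset in wn.synsets(word):
                    # Each Synset is one sense; its lemmas are the synonymous words.
                    synonyms = [lemma.name() for lemma in synset.lemmas()]
                    print(" ", synset.name(), "-", synset.definition(), synonyms)

        Note that synonym coverage depends on WordNet's Synset structure: a related word such as PC may sit in a neighbouring Synset (a hyponym of computer) rather than in the same Synset.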
                                                                     Synsets and thus to increase the precision and recall ratio.







Precision: What fraction of the returned results is relevant to the information need?
Recall: What fraction of the relevant documents in the collection was returned by the system?
                       IV IMPLEMENTATION

        The system is implemented using a corpus of 250 documents. The query is given as input and the processing steps explained below are carried out; the final output obtained is the set of relevant documents that best matches the query.
          • Semantic Pattern Construction
          • Semantic Query Processing
          • Semantic Query Refinement & Expansion
          • Semantic Pattern Matching
1. Develop Semantic Pattern from the Content

        Developing a semantic pattern from the content requires the following steps. The given content is pre-processed by removing stop words and applying the Porter stemming algorithm to find the root of each word. The stop words, which carry no sense in the content or the query, are supplied manually.

    i. Content Preprocessing

        A content repository of 250 text documents is taken as the corpus. These documents are processed into tokens. A set of selected stop words is taken; these are the words that the search engine discards. All the text documents in the corpus are passed through this stop-word list, and any document word that matches a stop word is eliminated. This step reduces the number of tokens. Each remaining word is considered a keyword and is stored in a text file. Normally the stop words are pronouns, articles and prepositions.

    ii. Porter Stemming Algorithm

        After removing the stop words, the keywords are passed to a stemming algorithm; the algorithm used in this work is the Porter stemming algorithm. This component identifies semantic elements such as subject, object and predicate in the content and analyzes their semantic relations. A sketch of these two preprocessing steps is given below.
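        A minimal sketch of the two steps, assuming NLTK's PorterStemmer as the Porter implementation and a small hand-picked stop-word list (both stand in for the ones used in the system):

            # Minimal sketch: stop-word removal followed by Porter stemming.
            from nltk.stem import PorterStemmer

            STOP_WORDS = {"the", "is", "a", "an", "of", "in", "to", "and"}  # manual sample list
            stemmer = PorterStemmer()

            def preprocess(text):
                tokens = text.lower().split()                          # crude tokenization
                keywords = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
                return [stemmer.stem(t) for t in keywords]             # keep only root forms

            print(preprocess("The retrieval of relevant documents in the corpus"))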

    iii. Term Document Matrix

        A Term Document Matrix (TDM) is generated to record the occurrences of each keyword in each document. The term-document matrix is a large grid representing every document and content word in the collection. The TDM is generated by arranging the list of all content words along the vertical axis and a similar list of all documents along the horizontal axis. These need not be in any particular order, as long as it is kept track of which row and column correspond to which keyword and document, as sketched below.
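        A minimal sketch of building such a count matrix from preprocessed keyword lists (the two documents and their keywords are placeholders):

            # Minimal sketch: a term-document count matrix as nested lists.
            from collections import Counter

            docs = {
                "d1.txt": ["java", "program", "applet"],   # placeholder keywords per document
                "d2.txt": ["java", "island", "coffee"],
            }
            terms = sorted({w for words in docs.values() for w in words})   # vertical axis
            names = sorted(docs)                                            # horizontal axis

            counts = {d: Counter(ws) for d, ws in docs.items()}
            tdm = [[counts[d][t] for d in names] for t in terms]  # cell = occurrences of term in doc

            for term, row in zip(terms, tdm):
                print(term, row)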
                                                                            i. Term Frequency             –     Inverse        Document
2. Query Refinement and Expansion using WordNet

    i. Query Refinement

        The query entered by the user is passed through the stop-word list to remove the stop words, and stemming is then done to retrieve only the subject. The result is passed to WordNet to get more senses. For example, the word vomit has synonymous forms such as barf and puke. In keyword-based search only the word vomit itself is taken, not its senses, so different words expressing the same meaning are missed and the user is not satisfied with the results of the search engine. Hence each token of the query is passed to WordNet to collect more senses, as sketched below.
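        A minimal sketch of this refinement-plus-expansion step, again leaning on NLTK's WordNet interface (the stop-word list is a placeholder):

            # Minimal sketch: refine the query, then expand each token with
            # the members of its WordNet Synsets.
            from nltk.corpus import wordnet as wn
            from nltk.stem import PorterStemmer

            STOP_WORDS = {"the", "a", "of"}
            stemmer = PorterStemmer()

            def expand_query(query):
                expanded = set()
                for token in query.lower().split():
                    if token in STOP_WORDS:              # refinement: drop stop words
                        continue
                    expanded.add(stemmer.stem(token))
                    for synset in wn.synsets(token):     # expansion: every sense of the token
                        for lemma in synset.lemmas():
                            expanded.add(lemma.name().lower())
                return expanded

            print(expand_query("vomit"))  # includes Synset members such as 'barf' and 'puke'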
    ii. Query Vector Coordinates

        The query vector coordinates are generated by checking the keyword text file and counting the occurrences of each keyword; the senses are also counted, and the count is incremented accordingly. The goal of the WordNet project is the creation of a dictionary and thesaurus that can be used intuitively, with the further purpose of supporting automatic text analysis and artificial intelligence applications. WordNet is a lexical database for the English language. It groups English words into sets of synonyms called Synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. WordNet distinguishes between nouns, verbs, adjectives and adverbs because they follow different grammatical rules. Every Synset contains a group of synonymous words or collocations (a collocation is a sequence of words that go together to form a specific meaning, such as "car pool"); different senses of a word are in different Synsets.

        A query Q is represented as an n-dimensional vector q in the same vector space as the document vectors. There are several ways to search for relevant documents; generally, a matrix can be computed to represent the similarity of query and document vectors.

3. Perform SVD and LSA

    i. Term Frequency – Inverse Document Frequency

        After constructing the Term Document Matrix, a weight is applied to every token found in the count matrix. The TF-IDF (Term Frequency – Inverse Document Frequency) weight is calculated using the formula

            TFIDF(i,j) = ( N(i,j) / N(*,j) ) * log( D / D(i) )







 where
        N(i,j) = the number of times word i appears in document j (the original cell count);
        N(*,j) = the total number of words in document j (the sum of the counts in column j);
        D = the number of documents (the number of columns);
        D(i) = the number of documents in which word i appears (the number of non-zero columns in row i).

        The TF-IDF matrix obtained is used for the computation of the singular value decomposition.
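        A minimal sketch of this weighting, applied directly to a small placeholder count matrix:

            # Minimal sketch: TFIDF(i,j) = (N(i,j) / N(*,j)) * log(D / D(i)).
            import math

            count_matrix = [      # rows = words, columns = documents (placeholder counts)
                [2, 0, 1],
                [1, 1, 0],
            ]
            D = len(count_matrix[0])                                    # number of documents
            col_totals = [sum(r[j] for r in count_matrix) for j in range(D)]

            tfidf = []
            for row in count_matrix:
                Di = sum(1 for c in row if c > 0)   # documents containing word i
                tfidf.append([(c / col_totals[j]) * math.log(D / Di) if c else 0.0
                              for j, c in enumerate(row)])
            print(tfidf)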
                                                                                By reducing the term-document space to fewer
    ii. Singular Value Decomposition

        The matrix generated from the term frequency–inverse document frequency weighting is used for the computation of the singular value decomposition. A rank-reduced singular value decomposition is performed on the matrix to determine patterns in the relationships between the terms and concepts contained in the text.

        Fig.2: SVD Computation

        The SVD forms the foundation for LSI. It computes the term and document vector spaces by transforming the single term-frequency matrix, A, into three other matrices: a term-concept vector matrix, T, a singular values matrix, S, and a concept-document vector matrix, D, which satisfy the following relations:

            A = T S D^T
            T^T T = D^T D = I_r
            T T^T = I_m,  D D^T = I_n
            S(1,1) ≥ S(2,2) ≥ ... ≥ S(r,r) > 0,  S(i,j) = 0 for i ≠ j
                                                                         document can be represented by
        In these formulas, A is the supplied m-by-n weighted matrix of term frequencies in a collection of text, where m is the number of unique terms and n is the number of documents. T is a computed m-by-r matrix of term vectors, where r is the rank of A, a measure of its unique dimensions (r ≤ min(m, n)). S is a computed r-by-r diagonal matrix of decreasing singular values, and D is a computed n-by-r matrix of document vectors.

        The LSI modification to a standard SVD is to reduce the rank, truncating the singular value matrix S to size k << r, typically on the order of 100 to 300 dimensions, effectively reducing the term and document vector matrix sizes to m-by-k and n-by-k respectively. The SVD operation, along with this reduction, has the effect of preserving the most important semantic information in the text while reducing noise and other undesirable artifacts of the original space of A. This reduced set of matrices is often denoted with a modified formula such as:

            A ≈ A_k = T_k S_k D_k^T

        Efficient LSI algorithms compute only the first k singular values and term and document vectors, as opposed to computing a full SVD and then truncating it. A sketch of this truncation is given below.
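        A minimal sketch of the rank-k truncation with NumPy (the matrix and k are placeholders; note that numpy.linalg.svd returns the singular values as a vector):

            # Minimal sketch: rank-k truncated SVD, A ~ A_k = T_k S_k D_k^T.
            import numpy as np

            A = np.random.rand(8, 5)      # stand-in for the m x n TF-IDF matrix
            k = 2                         # reduced number of dimensions (k << r)

            T, s, Dt = np.linalg.svd(A, full_matrices=False)
            Tk, Sk, Dkt = T[:, :k], np.diag(s[:k]), Dt[:k, :]  # keep k largest singular values

            A_k = Tk @ Sk @ Dkt
            print(np.linalg.norm(A - A_k))  # approximation error; shrinks as k grows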




    iii. Latent Semantic Analysis (LSA)

        By reducing the term-document space to fewer dimensions, SVD reveals the underlying relationships between terms and documents in all possible combinations, and the similarity between terms and documents is shown within the reduced space. The technique uses a term-document matrix which describes the occurrences of terms in documents; it is a sparse matrix whose rows correspond to terms and whose columns correspond to documents. Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.

        A typical example of the weighting of the elements of the matrix is tf-idf (term frequency–inverse document frequency): each element of the matrix is proportional to the number of times the term appears in the corresponding document, with rare terms up-weighted to reflect their relative importance. The inverse-weighted term-document matrix thus accounts for the occurrence of each single word across all the documents.

4. Query Projection and Matching

        In the LSI model, queries are formed into pseudo-documents that specify the location of the query in the reduced term-document space. Given q, a vector whose non-zero elements contain the weighted term-frequency counts of the terms that appear in the query, the pseudo-document can be represented by

            q^ = q^T T_k S_k^(-1)

        The singular values are used to individually weight each dimension of the term-document space. Once the query is projected into the term-document space, one of several similarity measures can be applied to compare the position of the pseudo-document to the positions of the terms or documents in the reduced term-document space.

    i. Term-Term Similarity

        After obtaining the term matrix and the singular value matrix, multiply them to form the resultant matrix, whose rows are compared to find term-term similarity using the cosine relation:



            cos θ = (V · W) / (||V|| ||W||)

        If the angle is between 0 and 90 degrees then there exists a relation (some similarity) between the two coordinate vectors. The cosine relation is likewise computed between the query coordinate and every other coordinate: the smaller the angle, the greater the similarity between the terms. The cosine similarity measure is often used because, by considering only the angle between the pseudo-document and the terms or documents in the reduced space, the lengths of the documents, which can affect the distance between the pseudo-document and the documents in the space, are normalized. Once the similarities between the pseudo-document and all the terms and documents in the space have been computed, the terms or documents are ranked according to the results of the similarity measure, and the highest-ranking terms or documents, or all the terms and documents exceeding some threshold value, are returned to the user. A sketch of this measure is given below.
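        A minimal sketch of the cosine measure combined with the pseudo-document projection from the previous section (all matrices are random placeholders; the projection q^T T_k S_k^(-1) is the standard LSI form assumed here):

            # Minimal sketch: project a query into the reduced space and rank
            # documents by cosine similarity.
            import numpy as np

            def cosine(v, w):
                # cos(theta) = (v . w) / (||v|| ||w||)
                return float(v @ w) / (np.linalg.norm(v) * np.linalg.norm(w))

            m, n, k = 100, 250, 2
            Tk = np.random.rand(m, k)            # term vectors (m x k)
            Sk = np.diag([3.0, 1.5])             # singular values (k x k)
            Dk = np.random.rand(n, k)            # document vectors (n x k)
            q = np.random.rand(m)                # weighted term counts of the query

            q_hat = q @ Tk @ np.linalg.inv(Sk)   # pseudo-document coordinates

            scores = [cosine(q_hat, d) for d in Dk]   # one score per document
            print(int(np.argmax(scores)))             # best-matching document index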

    ii. K-means Clustering

        The vector coordinates whose cosine similarity value is greater than the threshold are retrieved and plotted in the semantic space to make the search more relevant. K-means clustering is then done so that the clustered documents are much more relevant to the query.

        The basic idea of the k-means algorithm is to perform a local optimization over a given number of clusters. Specifically, first pick k documents at random from the entire collection and make them the initial centroids of the desired k clusters. Then, for each document in the collection, find the nearest centroid and put the document into the corresponding cluster. After every document has been assigned to a cluster, recompute the centroids and repeat. The method iteratively optimizes the clusters until the computation converges, i.e. the clusters no longer change and the clustering quality has reached a local maximum. The advantage of k-means is that its complexity is very low and it is very easy to implement. Within a cluster, the similarity between the query and each content item is computed to sort the contents by similarity and offer the most approximate content to the querist. A sketch of this clustering loop is given below.
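        A minimal sketch of the loop over the plotted coordinates (the points are placeholders, and the sketch assumes no cluster empties out during iteration):

            # Minimal sketch: k-means over document coordinates in the semantic space.
            import numpy as np

            def kmeans(points, k, iters=100):
                rng = np.random.default_rng(0)
                centroids = points[rng.choice(len(points), k, replace=False)]  # random initial picks
                for _ in range(iters):
                    dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
                    labels = dists.argmin(axis=1)              # nearest centroid per document
                    new_centroids = np.array([points[labels == j].mean(axis=0)
                                              for j in range(k)])  # assumes no empty cluster
                    if np.allclose(new_centroids, centroids):      # converged: clusters stable
                        break
                    centroids = new_centroids
                return labels

            coords = np.random.rand(250, 2)   # stand-in semantic-space coordinates
            print(kmeans(coords, k=5)[:10])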

                       V TEST RESULTS

        The proposed system is implemented using a corpus of 250 documents. The tokens are separated from the corpus by means of keywords. The screenshots are shown below.

        Fig.3: Keywords Extraction

        The keywords are collected and the term document matrix is generated; finally the relevant documents are retrieved. The system also handles synonyms by passing the query through WordNet.



Precision and Recall Ratio

        In information retrieval contexts, precision and recall are defined in terms of a set of retrieved documents (e.g. the list of documents produced by a web search engine for a query) and a set of relevant documents (e.g. the list of all documents on the internet that are relevant to a certain topic). Precision is defined as the ratio of relevant retrieved documents to the number of retrieved documents:

            Precision = (number of relevant documents retrieved) / (number of documents retrieved)

        and recall is defined as the proportion of relevant documents that are retrieved:

            Recall = (number of relevant documents retrieved) / (total number of relevant documents)

        Consider 250 documents with 70 relevant documents retrieved. Here the precision and recall ratios are calculated as 0.28 and 0.7 respectively. Hence the average precision and recall ratio is 0.378, depicted in the graph below; a sketch of the computation follows the figure.

        Fig 4: precision-recall ratio curve
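        A minimal sketch of the two ratios using the figures above (the recall of 0.7 implies a pool of 100 relevant documents, which is assumed here):

            # Minimal sketch: precision and recall from raw counts.
            def precision(relevant_retrieved, retrieved):
                return relevant_retrieved / retrieved

            def recall(relevant_retrieved, relevant_total):
                return relevant_retrieved / relevant_total

            print(precision(70, 250))   # 0.28
            print(recall(70, 100))      # 0.7 (assuming 100 relevant documents exist)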







                      VI CONCLUSION

        In this study, the proposed approach can efficiently and precisely perform semantic-based information retrieval. In addition to semantic-based retrieval, the proposed system has two significant parts: a semantic extension model, which employs latent semantic analysis to generate more semantics for matching, thereby solving the problem of insufficient information in the query; and a semantic clustering model, which uses the k-means clustering algorithm based on neighbours and then performs content matching within the resulting category, thereby improving matching accuracy. Since the query is passed through WordNet, all the senses are taken into account and the accuracy of the relevant pages increases.
                     REFERENCES

[1] Ming-Yen Chen, Hui-Chuan Chu, Yuh-Min Chen (2009), "Developing a semantic-enable information retrieval mechanism", Expert Systems with Applications, Elsevier, May 2009.
[2] Zongli Jiang, Changdong Lu, "A latent semantic analysis based method of getting the category attribute of words", 2009 International Conference on Electronic Computer Technology.
[3] Hongwei Yang, "A document clustering algorithm for web search engine retrieval system", 2010 International Conference on e-Education, e-Business, e-Management and e-Learning.
[4] Jianpei Zhang, Zhongwei Li, Jing Yang, "A divisional incremental training algorithm of support vector machine", 2005 IEEE International Conference on Mechatronics and Automation.
[5] Gang Lv, Cheng Zheng, Li Zhang, "Text information retrieval based on concept semantic similarity", 2009 Fifth International Conference on Semantics, Knowledge and Grid.
[6] Trong Hai Duong, Geun Sik Jo, Ngoc Thanh Nguyen, "A method for integration across text corpus and WordNet-based ontologies", 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.
[7] Zhongcheng Zhao, "Measuring semantic similarity based on WordNet", 2009 Sixth Web Information Systems and Applications Conference.
[8] Trong Hai Duong, Geun Sik Jo, "Semantic similarity methods in WordNet and their application to information retrieval on the web", 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.
[9] Wei-Dong Fang, Ling Zhang, Yan-Xuan Wang, Shou-Bin Bong, "Toward a semantic search engine based on ontologies", Network Engineering and Research Center, South China University of Technology, Guangzhou 510640, China.
[10] Qinglin Guo, Ming Zhang (2007), "Multi-documents automatic abstracting based on text clustering and semantic analysis", Knowledge-Based Systems, 22, 482-485, Elsevier.
[11] Berry, M. W. (1992), "Large scale singular value computations", International Journal of Supercomputer Applications, 6(1), pp. 13-49.
[12] Jiuling Zhang, Beixing Deng, Xing Li, "Concept based query expansion using WordNet", 2009 International e-Conference on Advanced Science and Technology.
[13] Abdelali, A., Cowie, J., & Soliman, H. S. (2007), "Improving query precision using semantic expansion", Information Processing and Management, 43, 705-716.





				