Design of Content-Oriented Information Retrieval by Semantic Analysis
IJCSIS (http://sites.google.com/site/ijcsis/), ISSN 1947-5500, published monthly. [Copyright © 2011, IJCSIS, USA]

(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 1, January 2011
DESIGN OF CONTENT ORIENTED INFORMATION RETRIEVAL BASED ON SEMANTIC ANALYSIS
S. Amudaria, PG Student, Dept of IT, SSN College of Engineering, Chennai, India (daria.amu@gmail.com)
S. Sasirekha, Asst. Professor, Dept of IT, SSN College of Engineering, Chennai, India (sasirekhas@ssn.edu.in)
Abstract:
The existing Information Retrieval (IR) systems, which are based entirely on syntactic (keyword-based) content, have serious limitations such as irrelevant document retrieval, word sense ambiguity, and low precision and recall, since the complete semantics of the content are not represented. To overcome these limitations, the recent literature identifies the need to analyze and determine the semantic features of both the document content and the query. Hence this paper proposes first to develop a semantic pattern that represents the semantic features of the content of every document in the corpus in Term Document Matrix (TDM) format. A semantic pattern is then developed for the content of the query by applying Natural Language Processing techniques together with Synsets (WordNet) for query refinement and expansion. The similarity between the semantic pattern of the query and the TDM is calculated using Latent Semantic Analysis (LSA) and plotted in a semantic vector space. By matching against the vector space, content associated with the query can be identified in the corresponding cluster. Various experiments were carried out, which show an increase in document retrieval recall and precision rates, thereby demonstrating the effectiveness of the model.

Keywords: Information retrieval, Semantic extraction, Query extension, Query matching

I INTRODUCTION

The existing information retrieval systems are mostly keyword-based and identify relevant documents or information by matching keywords. Keyword-based search, in spite of its merits of expedient querying and ease of use, fails to represent the complete semantics contained in the content (Oh et al, 2007) and has led to the following problems (Abdelali et al, 2007; Moreale et al, 2004): (1) Keywords can represent only fragmented meanings of the content, and the content identified through keywords does not always meet the querist's requirements. The querist has to screen retrieval results and correct keywords several times to obtain the required information. (2) Compared to a text, a query usually comprises less content, which may lead to wrong retrieval results due to problems like insufficient information being used in the search process, insufficient query topics, and difficulty in determining query features. (3) Due to synonymy and polysemy in human language, information retrieval through keywords can only cover information containing the same keyword, while other information with similar semantics but different keywords is completely left out. The user normally goes to a search engine to get exact and relevant results, but current search engines cannot be relied upon to produce accurate results for the user.

Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable data space, whether on the Web or within a closed system, to generate more relevant results. Rather than using ranking algorithms such as Google's PageRank to predict relevancy, semantic search uses semantics, or the science of meaning in language, to produce highly relevant search results. In most cases, the goal is to deliver the information queried by a user rather than have the user sort through a list of loosely related keyword results. Here WordNet is used to obtain the semantics of the query.

A brief literature survey of information retrieval techniques is given in Section II, the proposed system is explained in Section III, and the next two sections deal with the implementation and the test results.

II RELATED WORK

Ming-Yen Chen et al. (2009) introduce a semantic-enabled information retrieval mechanism in which a web corpus is taken and the related information is retrieved. The limitation of that work is that it does not deal with synonyms or Synsets. In our work we concentrate on the WordNet ontology to collect more senses.

Zongli Jiang et al. (2009) introduce the concept of the category attribute of a word. According to the category attribute of a word, useless results can be removed from the search results and the retrieval efficiency is improved. Latent semantic analysis is a method that can discover the underlying semantic relations between words and documents. Singular value decomposition is used in latent semantic analysis to analyze the words and documents and finally obtain the semantic relations.

Hongwei Yang et al. (2010) describe a document clustering approach that enables users to find the relevant documents more easily and also helps users to form an understanding of the different facets of the query that have been provided to the web search engine. A popular
technique for clustering is based on K-means, in which the data is partitioned into K clusters. In this method, the groups are identified by a set of points called the cluster centers, and each data point belongs to the cluster whose center is closest. The algorithm used in the proposed system is the K-means clustering algorithm.

Gang et al. (2009) proposed a method to enhance information retrieval recall and precision. To filter out the documents that have a low degree of relatedness to the original query, the scores of the search result documents are recalculated using ontology semantic similarity. A new definition of the iterative query expansion parameters is put forward, which can reduce the number of expansions and further improve the efficiency of the query.

Trong Hai et al. (2008) proposed a system which applies the relations between entities discovered from a text corpus to ontology integration tasks, in which the noun phrase (NP) is used to identify its head noun; this is useful to avoid wrong relations between entities. It also proposes a collaborative acquisition algorithm combining WordNet-based ontologies and a text corpus to provide general concepts and their relations for ontology integration tasks.

Trong Hai Duong and Geun Sik Jo (2009) proposed a new measure based on the semantic ontology database WordNet, which combines information content-based measures and edge-counting techniques to measure semantic similarity. The influence of the "PART-OF" and "IS-A" hierarchical relations on semantic similarity is considered in that paper. Breadth-first search is used to find the shortest path between two concepts. The similarities of hierarchy and superposition are calculated separately.

III PROPOSED SYSTEM

The proposed system uses semantic analysis to retrieve the content which is relevant to the user query. The user's query is analyzed in the semantic extraction and determination module to extract its semantic features, for the purpose of determining the contents of the query and representing them in a structured and materialized semantic pattern. In this component the semantic elements are identified and their semantic relations are analyzed, followed by the integration and simplification of the semantic relations with WordNet. The semantic extension module then identifies other potentially relevant semantic features based on the semantic features of the query and includes them in the query patterns. This increases the number of semantic features in the query that serve as the basis for matching.

The input query from the user is processed using preprocessing techniques, namely stop word removal followed by stemming. Every processed word is passed to WordNet to collect all the other senses that the word has. The Synsets related to the query are taken, and latent semantic analysis is performed to index the documents.

The singular value decomposition processes the documents in the corpus, and the term document frequency matrix is generated. This term document frequency matrix is plotted, and the terms most similar to the query are plotted in the semantic space. Finally, the relevant documents are obtained by using k-means clustering. The block diagram of the proposed system is shown below.

Fig1: Block diagram of the proposed system

Consider the word Java. The corresponding senses of the word Java, taken from WordNet, are given below.

Word | Senses
Java | an island in Indonesia south of Borneo; one of the world's most densely populated regions
Java | coffee: a beverage consisting of an infusion of ground coffee beans; "he ordered a cup of coffee"
Java | a simple platform-independent object-oriented programming language used for writing applets that are downloaded from the World Wide Web by a client and run on the client's machine

This means that the single word Java has three senses. This type of word ambiguity is not handled by current search engines. Consider another example, in which the query given by the user is Computer. The words PC and Computer both refer to the same thing. But in current search engines, only the documents containing the word computer are indexed and retrieved for the user. So even though the word PC has the same meaning, the pages relevant to the word PC are not retrieved. Hence the precision and recall ratios are reduced.

The proposed system is a content-based information retrieval design based on semantic analysis, in which the WordNet ontology is used to perform a search based on Synsets and thus to increase the precision and recall ratios.
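The query-side processing described in this section (stop word removal followed by Synset-based expansion) can be sketched as follows. This is only an illustration under stated assumptions: the stop word list and the `SYNSETS` table are small hypothetical stand-ins for the manually supplied stop word file and for WordNet, the function names `refine` and `expand` are invented for the example, and stemming is omitted.

```python
# Hypothetical stop word list and synonym table (assumptions for illustration).
# A real system would load its own stop word file and query WordNet Synsets.
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "on", "to", "is"}
SYNSETS = {
    "computer": {"computer", "pc"},
    "pc": {"computer", "pc"},
}


def refine(query):
    """Lowercase the query, split on whitespace, and drop stop words."""
    return [token for token in query.lower().split() if token not in STOP_WORDS]


def expand(tokens):
    """Add the synonyms of each token, preserving first-seen order."""
    expanded = []
    for token in tokens:
        for term in [token] + sorted(SYNSETS.get(token, ())):
            if term not in expanded:
                expanded.append(term)
    return expanded


if __name__ == "__main__":
    print(expand(refine("the price of a computer")))
```

With this toy table, a query containing only the word computer is also matched against documents that mention PC, which is the behaviour the Computer/PC example calls for.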
Precision: What fraction of the returned results is relevant to the information need?
Recall: What fraction of the relevant documents in the collection was returned by the system?

IV IMPLEMENTATION

The system is implemented using a corpus of 250 documents. The query is given as input, and the processing steps are explained below. The final output is the set of relevant documents that match the query.

• Semantic Pattern Construction
• Semantic Query Processing
• Semantic Query Refinement & Expansion
• Semantic Pattern Matching

1. Develop Semantic Pattern from the content

Developing a semantic pattern from the content requires the following steps. The given content is pre-processed by removing the stop words and applying the Porter stemming algorithm to find the root words. The stop words, which do not carry any meaning in the content or the query, are specified manually.

i. Content Preprocessing

A content repository of 250 text documents is taken as the corpus. These documents are split into tokens. A selected list of stop words, which are discarded by the search engine, is prepared. All the text documents in the corpus are passed through this stop word list. Any document word that matches an entry in the list is considered a stop word and is eliminated. This step reduces the number of tokens. Each remaining word is considered a keyword and is stored in a text file. Normally the stop words are pronouns, articles and prepositions.

ii. Porter Stemming Algorithm

After the stop words are removed, the keywords are passed to a stemming algorithm. The stemming algorithm used in this work is the Porter stemming algorithm. This component identifies semantic elements such as subject, object, and predicate in the content and analyzes their semantic relations.

iii. Term Document Matrix

A Term Document Matrix is generated to record the occurrences of every keyword in each document. The term-document matrix is a large grid representing every document and content word in a collection. The TDM (Term Document Matrix) is generated by arranging the list of all content words along the vertical axis, and a similar list of all documents along the horizontal axis. These need not be in any particular order, as long as it is tracked which row and column correspond to which keyword and document.

2. Query Refinement and Expansion using WordNet

i. Query Refinement

The query entered by the user is passed through the stop word list to remove the stop words. Stemming is then applied to retain only the subject terms. The result is passed to WordNet to get more senses. For example, the word vomit has three related forms: vomits, barf and puke. In keyword-based search only the word vomit is considered, not its senses. Hence different words expressing the same meaning are not taken into account, and the user is not satisfied with the results of the search engine. Therefore every token of the query is passed to WordNet to get more senses.

ii. Query Vector Coordinates

The query vector coordinates are generated by checking the keyword text file and counting the occurrences of each query term. The senses are also counted, and the count is incremented accordingly. WordNet is a lexical database for the English language. It groups English words into sets of synonyms called Synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. WordNet distinguishes between nouns, verbs, adjectives and adverbs because they follow different grammatical rules. Every Synset contains a group of synonymous words or collocations (a collocation is a sequence of words that go together to form a specific meaning, such as "car pool"); different senses of a word are in different Synsets.

A query Q is represented as an n-dimensional vector q in the same vector space as the document vectors. There are several ways to search for relevant documents. Generally, a matrix can be computed to represent the similarity between the query and document vectors.

3. Perform SVD and LSA

i. Term Frequency – Inverse Document Frequency

After the Term Document Matrix is constructed, a weight is applied to every token found in countMatrix. The TFIDF (Term Frequency – Inverse Document Frequency) is calculated using the formula

$$\mathrm{TFIDF}_{i,j} = \frac{N_{i,j}}{N_{*,j}} \cdot \log\frac{D}{D_i}$$
where
$N_{i,j}$ = the number of times word i appears in document j (the original cell count),
$N_{*,j}$ = the total number of words in document j (the sum of the counts in column j),
$D$ = the number of documents (the number of columns),
$D_i$ = the number of documents in which word i appears (the number of non-zero columns in row i).

The TFIDF matrix obtained is used for the computation of the Singular Value Decomposition.

ii. Singular Value Decomposition

The matrix produced by the term frequency-inverse document frequency weighting is used for the computation of the singular value decomposition. A rank-reduced singular value decomposition is performed on the matrix to determine patterns in the relationships between the terms and concepts contained in the text.

Fig.2 SVD Computation

The SVD forms the foundation for LSI. It computes the term and document vector spaces by transforming the single term-frequency matrix, A, into three other matrices: a term-concept vector matrix, T, a singular values matrix, S, and a concept-document vector matrix, D, which satisfy the following relations:

$$A = T S D^{T}$$
$$T^{T} T = D^{T} D = I_r$$
$$S_{1,1} \ge S_{2,2} \ge \dots \ge S_{r,r} > 0, \qquad S_{i,j} = 0 \ \text{where}\ i \ne j$$

The additional relations $T T^{T} = I_m$ and $D D^{T} = I_n$ hold only when $r = m$ and $r = n$, respectively.

In the formula, A is the supplied m by n weighted matrix of term frequencies in a collection of text, where m is the number of unique terms and n is the number of documents. T is a computed m by r matrix of term vectors, where r is the rank of A, a measure of its unique dimensions, with $r \le \min(m, n)$. S is a computed r by r diagonal matrix of decreasing singular values, and D is a computed n by r matrix of document vectors.

The LSI modification to a standard SVD is to reduce the rank, or truncate the singular value matrix S, to size $k \ll r$, with k typically in the range of 100 to 300 dimensions, effectively reducing the term and document vector matrix sizes to m by k and n by k respectively. The SVD operation, along with this reduction, has the effect of preserving the most important semantic information in the text while reducing noise and other undesirable artifacts of the original space of A. This reduced set of matrices is often denoted with a modified formula such as:

$$A \approx A_k = T_k S_k D_k^{T}$$

Efficient LSI algorithms compute only the first k singular values and term and document vectors, as opposed to computing a full SVD and then truncating it.

iii. Latent Semantic Analysis (LSA)

Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, for analyzing the relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. The technique uses a term-document matrix which describes the occurrences of terms in documents; it is a sparse matrix whose rows correspond to terms and whose columns correspond to documents. By reducing the term-document space to fewer dimensions, SVD reveals the underlying relationships between terms and documents in all possible combinations, and the similarities between terms and documents are shown within the reduced space.

A typical example of the weighting of the elements of the matrix is tf-idf (term frequency–inverse document frequency): the element of the matrix is proportional to the number of times the term appears in each document, where rare terms are up-weighted to reflect their relative importance. The inverse weighted term document matrix accounts for the occurrence of a single word across all the documents.

4. Query Projection and Matching

In the LSI model, queries are formed into pseudo-documents that specify the location of the query in the reduced term-document space. Given q, a vector whose non-zero elements contain the weighted term-frequency counts of the terms that appear in the query, the pseudo-document can be represented by

$$\hat{q} = q^{T} T_k S_k^{-1}$$

The singular values are used to individually weight each dimension of the term-document space. Once the query is projected into the term-document space, one of several similarity measures can be applied to compare the position of the pseudo-document to the positions of the terms or documents in the reduced term-document space.

i. Term-Term Similarity

After the U and S matrices are obtained, they are multiplied, and the resultant matrix (say, the T matrix) is used to find the term-term similarity with the cosine relation.
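Before turning to the similarity measure, the weighting, decomposition and projection steps of Sections 3 and 4 can be sketched in plain Python. This is a minimal illustration, not the implementation used in the paper: the natural logarithm is assumed for the TF-IDF weight, the function names are invented for the example, and the leading singular triplets are approximated by power iteration with deflation, a technique swapped in here so that the example needs no numerical library. A production system would normally call a dedicated sparse SVD routine instead.

```python
import math


def tfidf(counts):
    """Weight counts[i][j] (term i, document j) as (N_ij / N_*j) * log(D / D_i)."""
    n_terms, n_docs = len(counts), len(counts[0])
    col_totals = [sum(counts[i][j] for i in range(n_terms)) for j in range(n_docs)]
    weighted = []
    for i in range(n_terms):
        doc_freq = sum(1 for c in counts[i] if c > 0)
        idf = math.log(n_docs / doc_freq) if doc_freq else 0.0  # natural log assumed
        weighted.append([
            counts[i][j] / col_totals[j] * idf if col_totals[j] else 0.0
            for j in range(n_docs)
        ])
    return weighted


def truncated_svd(a, k, iterations=100):
    """Approximate up to k (sigma, term_vector, doc_vector) triplets of matrix a."""
    m, n = len(a), len(a[0])
    residual = [row[:] for row in a]
    triplets = []
    for _ in range(k):
        v = [1.0 / math.sqrt(n)] * n  # fixed start vector, for repeatability
        for _ in range(iterations):
            u = [sum(residual[i][j] * v[j] for j in range(n)) for i in range(m)]
            w = [sum(residual[i][j] * u[i] for i in range(m)) for j in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        u = [sum(residual[i][j] * v[j] for j in range(n)) for i in range(m)]
        sigma = math.sqrt(sum(x * x for x in u))
        if sigma < 1e-12:
            break  # nothing left to extract
        u = [x / sigma for x in u]
        triplets.append((sigma, u, v))
        # Deflate: remove the component just found before searching for the next.
        for i in range(m):
            for j in range(n):
                residual[i][j] -= sigma * u[i] * v[j]
    return triplets


def fold_in(query, triplets):
    """Project a term-space query vector: q_hat = q^T T_k S_k^{-1}."""
    return [sum(q * t for q, t in zip(query, u)) / sigma for sigma, u, _ in triplets]


if __name__ == "__main__":
    counts = [[2, 0, 1], [1, 1, 0], [0, 3, 1]]  # toy 3-term, 3-document matrix
    triplets = truncated_svd(tfidf(counts), k=2)
    print(fold_in([1.0, 0.0, 0.0], triplets))
```

The fixed start vector keeps the sketch deterministic; power iteration converges slowly when two singular values are close, so the iteration count here is an illustrative choice rather than a tuned one.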
$$\cos A = \frac{V \cdot W}{\lVert V \rVert \, \lVert W \rVert}$$
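The cosine relation above can be sketched in a few lines of Python. The function names are hypothetical, returning 0.0 for a zero-length vector is a convention chosen for this example, and the `rank_by_similarity` helper simply illustrates one way of keeping the vectors whose score exceeds a threshold.

```python
import math


def cosine_similarity(v, w):
    """Return cos A = (V . W) / (||V|| ||W||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_v * norm_w)


def rank_by_similarity(query, vectors, threshold=0.0):
    """Return (name, score) pairs scoring above threshold, best match first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    kept = [pair for pair in scored if pair[1] > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
    print(rank_by_similarity([1.0, 0.0], docs))
```

Because only the angle is compared, a long document and a short document pointing in the same direction receive the same score.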
If the angle is between 0 and 90 degrees, there exists a relation (some similarity) between the two coordinate vectors. Similarly, the cosine relation is computed between the query coordinates and the other coordinates. The smaller the angle, the greater the similarity between the terms. The cosine similarity measure is often used because, by considering only the angle between the pseudo-document and the terms or documents in the reduced space, the lengths of the documents, which can affect the distance between the pseudo-document and the documents in the space, are normalized. Once the similarities between the pseudo-document and all the terms and documents in the space have been computed, the terms or documents are ranked according to the results of the similarity measure, and the highest-ranking terms or documents, or all the terms and documents exceeding some threshold value, are returned to the user.

ii. K-means Clustering

The vector coordinates whose cosine similarity value is greater than the threshold are retrieved and plotted in the semantic space to make the search more relevant. K-means clustering is then performed so that the clustered documents are more relevant to the query.

The basic idea of the k-means algorithm is to perform a local optimization on a given number of clusters. Specifically, k documents are first picked at random from the entire collection and taken as the initial centroids of the desired k clusters. Then, for each document in the collection, the nearest centroid is found and the document is put into the corresponding cluster. After every document has been assigned to one of the clusters, the centroids are recomputed and the computation is repeated. This method iteratively optimizes the clusters until the computation converges, when the clusters no longer change and the clustering quality has reached a local maximum. The advantage of k-means is that its complexity is very low and it is very easy to implement. Within a cluster, the similarity between the query and each content item in the cluster is computed to sort the contents in order of similarity and offer the closest content to the querist.

V TEST RESULTS

The proposed system is implemented using a corpus of 250 documents. The tokens are separated from the corpus by means of keywords. The screenshots are shown below.

Fig.3: Keywords Extraction

The keywords are collected and the term document matrix is generated. Finally the relevant document is retrieved. Synonyms are also handled by passing the query through WordNet.

Precision and Recall ratio

In information retrieval contexts, precision and recall are defined in terms of a set of retrieved documents (e.g. the list of documents produced by a web search engine for a query) and a set of relevant documents (e.g. the list of all documents on the internet that are relevant for a certain topic). Precision is defined as the ratio of the relevant documents retrieved to the number of retrieved documents:

$$\text{Precision} = \frac{\text{Number of relevant, retrieved documents}}{\text{Number of retrieved documents}}$$

and recall is defined as the proportion of relevant documents that are retrieved:

$$\text{Recall} = \frac{\text{Number of relevant, retrieved documents}}{\text{Total number of relevant documents}}$$

Consider 250 documents, with 70 relevant documents retrieved. Here the precision and recall are calculated as 0.28 and 0.7 respectively. The average precision and recall ratio is reported as 0.378 and is depicted in the graph below.

Fig 4: precision-recall ratio curve
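The two ratios can be checked with a short Python sketch. The document identifiers are made up for illustration, and two readings of the reported numbers are assumptions: that all 250 documents count as retrieved, and that the collection contains 100 relevant documents in total, a figure inferred from the reported recall of 0.7 rather than stated in the text.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document identifiers."""
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical identifiers: 250 retrieved documents, 70 of them relevant,
    # plus 30 relevant documents that were missed (assumed total of 100 relevant).
    retrieved = set(range(250))
    relevant = set(range(70)) | set(range(1000, 1030))
    print(precision_recall(retrieved, relevant))
```

Under these assumptions the sketch reproduces the precision of 0.28 and the recall of 0.7 quoted above.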
VI CONCLUSION

In this study, the proposed approaches can efficiently and precisely perform semantic-based information retrieval. In addition to semantic-based information retrieval, the proposed system has two significant parts: a semantic extension model, which employs latent semantic analysis to generate more semantics for matching, thereby solving the problem of insufficient information in the query; and a semantic clustering model, which uses the k-means clustering algorithm based on neighbours and then performs content matching in that category, thereby improving matching accuracy. Since the query is passed through WordNet, all the senses are taken into account and the accuracy of the relevant pages increases.

REFERENCES

[1] Ming-Yen Chen, Hui-Chuan Chu, Yuh-Min Chen (2009), "Developing a semantic-enable information retrieval mechanism", Elsevier Journal on Expert Systems with Applications, May 2009.

[2] Zongli Jiang and Changdong Lu, "A latent semantic analysis based method of getting the category attribute of words", 2009 International Conference on Electronic Computer Technology.

[3] Hongwei Yang, "A document clustering algorithm for web search engine retrieval system", 2010 International Conference on e-Education, e-Business, e-Management and e-Learning.

[4] Jianpei Zhang, Zhongwei Li, Jing Yang, "A divisional incremental training algorithm of support vector machine", Mechatronics and Automation, 2005 IEEE International Conference.

[5] Gang Lv, Cheng Zheng, Li Zhang, "Text information retrieval based on concept semantic similarity", 2009 Fifth International Conference on Semantics, Knowledge and Grid.

[6] Trong Hai Duong, Geun Sik Jo, Ngoc Thanh Nguyen, "A Method for Integration across Text Corpus and WordNet-based Ontologies", 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.

[7] Zhongcheng Zhao, "Measuring Semantic Similarity Based On WordNet", 2009 Sixth Web Information Systems and Applications Conference.

[8] Trong Hai Duong, Geun Sik Jo, "Semantic similarity methods in WordNet and their application to information retrieval on the web", 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.

[9] Wei-Dong Fang, Ling Zhang, Yan-Xuan Wang, Shou-Bin Bong, "Toward a Semantic Search Engine based on Ontologies", Network Engineering and Research Center, South China University of Technology, Guangzhou 510640, China.

[10] Qinglin Guo, Ming Zhang (2007), "Multi-documents Automatic Abstracting based on text clustering and semantic analysis", Elsevier Journal on Knowledge Based Systems, 22, 482-485.

[11] Berry, M.W. (1992), "Large scale singular value computations", International Journal of Supercomputer Applications, 6 (1), pp 13-49.

[12] Jiuling Zhang, Beixing Deng, Xing Li, "Concept Based Query Expansion Using WordNet", 2009 International e-Conference on Advanced Science and Technology.

[13] Abdelali, A., Cowie, J., & Soliman, H. S. (2007), "Improving query precision using semantic expansion", Information Processing and Management, 43, 705–716.