INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online)
Volume 4, Issue 1, January- February (2013), pp. 182-194
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2012): 3.9580 (Calculated by GISI)
www.jifactor.com




        A SURVEY ON VARIOUS ARCHITECTURES, MODELS AND
          METHODOLOGIES FOR INFORMATION RETRIEVAL
           Prakasha S                   Shashidhar HR                  Dr. G T Raju
    sprakashjpg@yahoo.co.in          shashi_dhara@yahoo.com         gtraju1990@yahoo.com
    RNSIT, Bengaluru 560098         RNSIT, Bengaluru 560098        RNSIT, Bengaluru 560098


  ABSTRACT

          The typical Information Retrieval (IR) model of the search process consists of three
  essentials: query, documents and search results. A user looking to fulfill an information need
  has to formulate a query, usually consisting of a small set of keywords summarizing that
  need. The goal of an IR system is to retrieve documents containing information
  which might be useful or relevant to the user. Throughout the search process there is a loss of
  focus, because keyword queries entered by users often do not suitably summarize their
  complex information needs, and IR systems do not sufficiently interpret the contents of
  documents, leading to result lists containing irrelevant and redundant information.
          The short keyword query used as input to the retrieval system can be supplemented
  with topic categories from structured Web resources. The topic categories can be used as
  query context to retrieve documents that are not only relevant to the query but also belong to
  a relevant topic category. Category information is especially useful for the task of entity
  ranking, where the user is searching for a certain type of entity such as companies or persons.
  Category information can help to improve the search results by promoting in the ranking
  pages belonging to relevant topic categories, or categories similar to the relevant categories.
  Users may raise various queries to describe the same information need. For example, to
  search for National Board of Accreditation, queries “National Board of Accreditation (NBA)”
  or “NB Accreditation” may be formulated. Directly using individual queries to describe
  context cannot capture contexts concisely and accurately. Queries may also arise where
  “NBA” can be expanded as either “National Basketball Association” or “National Board of
  Accreditation”. Hence it becomes extremely important to use context-based queries that draw on
  the user history and the present requirements of the user in that context.
          In this paper, we present an extensive survey of the different Architectures, Models
  and Methodologies that have been used in IR by various researchers, along with a
  comparison of results against various performance metrics, and highlight the need for
  context-based queries.



Keywords: Query Model, Ranking Model, feedback-model, Retrieval model, query context

1.   INTRODUCTION

        Given the constantly increasing information overflow of the digital age, the
importance of IR has become critical. Web search is one of the most challenging problems of
the Internet today, striving to provide users with search results most relevant to their
information needs. IR deals with the representation, storage, organization of, and access to
information items such as documents, Web pages, online catalogues, structured and semi-
structured records, and multimedia objects [Baeza-Yates and Ribeiro-Neto, 2011].
        Web search engines are by far the most popular and heavily used IR applications. To
use one, the user must translate the information need into a query which can be
easily processed by the search engine. The primary goal of an IR system is to retrieve all the
documents which are relevant to a user query while retrieving as few non-relevant documents
as possible. To achieve this goal, IR systems must somehow 'interpret' the contents of the
documents in a collection, and rank them according to a degree of relevance to the user
query. The 'interpretation' of a document involves extracting syntactic and semantic
information from the document and using this information to match the user information
need.
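
The retrieval goal stated above is commonly quantified by recall (the share of relevant documents that were retrieved) and precision (the share of retrieved documents that are relevant). The following is a minimal sketch only; the function name precision_recall and the document identifiers are illustrative assumptions and do not come from any of the surveyed systems.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments for one query.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
```

Retrieving more documents tends to raise recall at the cost of precision, which is why the two are usually reported together.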
        The notion of relevance is at the centre of IR. While for simple navigational
information needs the search process is straightforward, for more complex information needs
we need focused retrieval methods. The notion of 'focused retrieval' can be defined as
providing more direct access to relevant information by locating the relevant information
inside the retrieved documents [Trotman et al., 2007].
        The first element of the search process is the query. In an ideal situation this short
keyword query is a suitable summarization of the information need, and the user will only
have to inspect the first few search results to fulfill their information need. To overcome the
shallowness of the query, i.e., users entering only a few keywords that poorly summarize the
information need, we add context to the query to focus the search results on the relevant
context. We define context as all available information about the user's information need,
besides the query itself. Different forms of context can be considered to implicitly or
explicitly gather more information on the user's search request. Potential forms of query
context are document relevance and category information.
        The second element of search we examine is the documents. Documents on the
Web are rich in structure. Documents can contain HTML structure, link structure, different
types of classification schemes, etc. Most of the structural elements, however, are not used
consistently throughout the Web. A key question is how to deal with all this (semi-)structured
information, that is, how IR systems can 'interpret' these documents to reduce the shallowness
in the document representation.
        A problem in Web search is the large amount of redundant and duplicate information
on the Web. Web pages can have many duplicates or near-duplicates. Web pages containing
redundant information can be hard to recognize for a search engine, but users easily
recognize redundant information, and it will usually not help them in their search. Most
structured Web resources have organized their information in such a way that they do not
contain, or significantly reduce, redundant information [Anna Maria Kaptein, 2011].
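
To make the near-duplicate problem concrete, the sketch below compares two pages by the Jaccard similarity of their word shingles. This is a common technique that we swap in for illustration; it is not taken from the cited work, and the shingle size k, the threshold and the sample texts are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (word n-grams) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, threshold=0.5, k=3):
    """Flag two texts as near-duplicates when their shingle overlap is high."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold


# Hypothetical page texts that differ by a single word.
page_a = "information retrieval deals with the storage of documents"
page_b = "information retrieval deals with the storage of web documents"
similarity = jaccard(shingles(page_a), shingles(page_b))
```

A production search engine would typically hash or sample the shingles rather than compare full sets, but the similarity measure is the same.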
        Structured resources provide two interesting opportunities: 'Documents categorized
into a category structure' and 'Absence of redundant information'. Category information is of
vital importance to a special type of search, namely entity ranking. Entity ranking is the task
of finding documents representing entities of an appropriate entity type that are relevant to a
query. Entities can be almost anything, from broad categories such as persons, locations and
organizations to more specific types such as churches, science-fiction writers or CDs.
Searchers looking for entities are arguably better served by presenting a ranked list of
entities directly, rather than a list of Web pages with relevant but also potentially redundant
information about these entities. Category information can be used to favor pages belonging
to appropriate entity types [Anna Maria Kaptein, 2011].
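
One simple way to favor pages of the appropriate entity type is to mix the text retrieval score with a category-overlap score. The sketch below is an assumption made for illustration, not the method of the cited work; the function name rerank_by_category, the mixing weight and the sample pages are hypothetical.

```python
def rerank_by_category(results, target_categories, weight=0.5):
    """Re-rank (page, text_score, categories) triples by category overlap.

    The final score is a linear mix of the text score and the fraction of
    target categories the page belongs to (an illustrative assumption).
    """
    target = set(target_categories)
    rescored = []
    for page, text_score, categories in results:
        overlap = len(target & set(categories)) / len(target) if target else 0.0
        rescored.append((page, (1 - weight) * text_score + weight * overlap))
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Hypothetical results for a query seeking entities of type "companies".
results = [
    ("Web search engine", 0.9, ["software"]),
    ("Google", 0.8, ["companies", "technology"]),
]
ranked = rerank_by_category(results, ["companies"])
```

Here the page with the lower text score moves to the top because it belongs to the requested category.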
        Search intent and context are important criteria in catering to the user's query.
Suppose a user raises the query “apple”. It is hard to determine the user’s search intent, that is,
whether the user is interested in the history of Apple Inc. or the fruit apple. Without looking at
the context of search, the existing methods often suggest many queries for various possible
intents, and thus result in low accuracy in query suggestion. The query context, which
consists of the search intent expressed by the user's recent queries, can help to better
understand the user's search intent and make more meaningful suggestions.

2.   DIFFERENT MODELS USED IN IR

        To retrieve relevant documents effectively, IR strategies typically transform the
documents into a suitable representation. Each retrieval strategy incorporates a
specific model for its document representation purposes. Keke Cai et al. use a context-based
retrieval process in which a KL-divergence retrieval model performs the initial retrieval [9].
Similarly, Tangjian Deng et al. present a brain-memory-inspired, context-based information
re-finding framework, which enables users to re-find results accessed before by relevant
contexts [16]. Yunping Huang et al. propose a new query model refinement approach, a
random walk smoothing method which exploits the expanded terms and term relationships
based on the feedback documents [13]. Xiaohui Yan et al. address the problem of
context-aware query recommendation. Unlike the existing approaches which leverage query
sequence patterns in query sessions, they use the click-through of the given query as the
major clue of user search intents to provide context-aware recommendation [22]. Chang Liu
and Nicholas J. Belkin propose a personalized IR model based on implicit acquisition of task
type and document preferences as search context by observing and analyzing user behaviors,
and then use implicit relevance feedback to re-rank or reformulate user queries to help users
search effectively and efficiently [4]. Huanhuan Cao et al. propose modeling search context
by CRF [31]. Ji-Rong Wen et al. propose four models for contextual retrieval [20]. Protima
Banerjee et al. use the Aspect Model, which forms the foundation of the Probabilistic Latent
Semantic Analysis (PLSA) method. They also put forward a technique that estimates a
relevance model from the query alone without the need for training data. Yan Qi et al.
propose query-driven, feedback-based conflict resolution. They have developed data
structures and algorithms to enable feedback-based conflict resolution during query
processing on imperfectly aligned data [25].
        The various models listed above are used for query expansion with the help of various
feedback techniques. Expanding the query adds context to it. These models are also
used for query ranking. A comparison of these models is presented
in Table 1.
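
As an illustration of how feedback-based query expansion interacts with a KL-divergence retrieval model, the sketch below interpolates an initial query model with a feedback model and ranks documents by negative KL divergence. It is a simplified sketch and not the implementation of any surveyed paper: the smoothing scheme, the value of mu, the sample documents and all function names are assumptions. Following Table 1, lam is taken to be the weight of the initial query model.

```python
import math
from collections import Counter


def language_model(tokens):
    """Maximum-likelihood unigram model of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def interpolate(query_model, feedback_model, lam):
    """Mix the models; lam is the weight of the initial query model."""
    terms = set(query_model) | set(feedback_model)
    return {t: lam * query_model.get(t, 0.0) + (1 - lam) * feedback_model.get(t, 0.0)
            for t in terms}


def neg_kl_score(query_model, doc_model, collection_model, mu=0.5):
    """Negative KL divergence from the query model to a smoothed document model."""
    score = 0.0
    for term, p_q in query_model.items():
        if p_q == 0.0:
            continue
        # Linear smoothing with the collection model avoids zero probabilities.
        p_d = (1 - mu) * doc_model.get(term, 0.0) + mu * collection_model.get(term, 1e-9)
        score -= p_q * math.log(p_q / p_d)
    return score


# Hypothetical two-document collection and feedback terms.
docs = {
    "d1": "national board of accreditation approves engineering programmes".split(),
    "d2": "national basketball association season schedule".split(),
}
collection_model = language_model([t for tokens in docs.values() for t in tokens])
query_model = language_model(["national"])
feedback_model = language_model(["board", "accreditation"])
expanded = interpolate(query_model, feedback_model, lam=0.2)
ranking = sorted(
    docs,
    key=lambda d: neg_kl_score(expanded, language_model(docs[d]), collection_model),
    reverse=True,
)
```

With the expanded query model, the document about accreditation is ranked first even though the initial one-word query matches both documents.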






Model | Author | Approach | Parameters | Inference | Inputs
--------------------------------------------------------------------------------
KL_Divergence Retrieval Model [9] | Keke Cai | Markov Random Field (MRF); MRF based sentence retrieval; Bayesian network | Top ranked document list and ranked list average | MRR are respectively improved by 19.7%, 25.5% and 24.1% | Top-ranked documents
Query model refinement approach [13] | Yunping Huang | Random walk smoothing method | Score of each vertex | - | Query
Probabilistic model [22] | Xiaohui Yan | High-order method | λ controls the weight of the initial query model | λ set to 0.1 or 0.2 usually yields the best retrieval performance | Feedback documents
Modeling Search Context by CRF [20] | Huanhuan Cao | Intuitive Model; Query and Context Model; Eliminate Noisy Elements Model | Document, dmax | 51.1% of the query occurrences & 51.7% of the URL clicks remained | Queries
Aspect Model [25] | Protima Banerjee | PLSA method | Smoothing parameter λ | Improvement in precision & recall (no % specified) | Documents with probability p(d)
Query driven Feedback based Conflict resolution [15] | Yan Qi et al | Concept matching; Quest; the FICSR pre-processing module; constraint analysis & system feedback; user's feedback | k-simple paths | The stabbed version was 60% faster | User query
Query Model and Ranking Model [33] | Liang Jeff Chen | - | vk-aggregate document parameter sc | Mean precision 10.2 for 30 query | Set of keywords (Qk)

              Table 1. Comparison of Various Models used by different authors for IR

3.      THE VARIOUS ARCHITECTURES OF IR

        Various architectures for query context have been proposed because the existing systems
do not rank query patterns according to context. Some of these architectures are
described below. Giorgio Orsi et al. have proposed the SAFE architecture,
which receives a sequence of keywords as input and produces, as output, a ranking over a set of
query patterns, possibly with a suggested assignment for their parameters [19]. They also
propose the Context Model, an instantiation of the context vocabulary that defines the
context model for the given application. In particular, the context model specifies the
(possibly hierarchical) context dimensions for the specific application, along with their
possible values. A K Sharma et al. propose the Query Semantic Search System (QUESEM,
/’Qu-sem/) to improve the search quality. QUESEM maintains a database of definitions
(referred to as the Definition Repository) as the core of the system to accomplish its desired
task [26]. Haizhou Fu et al. propose the CoSi system architecture, which consists of three core
components: an indexer, a context-sensitive cost model and a query interpreter [23].
Christian Sengstock and Michael Gertz propose the architecture of the CONQUER system, which is
composed of a model generation component, a model index, and a suggestion service
[37]. Reiner Kraft et al. propose the overall Y!Q system design and architecture. The Y!Q
back-end comprises three major system components for processing contextual search
queries: Content Analysis (CA), Query Planning and Rewriting Framework (QPW), and
Contextual Ranking (CR) [29]. Liang Jeff Chen et al. propose a Query Model and a
Ranking Model. In the Query Model a document, denoted by d, is modeled as a tuple of
fields, each consisting of a bag of words [33].

       The various architectures mentioned above aim to improve the retrieval
process by enhancing the context of the query. A comparison of these architectures is
presented in Table 2.

Architecture | Authors | Inputs | Models / Methods | Inference
--------------------------------------------------------------------------------
SAFE architecture [19] | Giorgio Orsi | Keyword search | The Context Model | 65% of queries were found on top of the ranked list; in 25% of cases, users found the query in the second position
CoSi system architecture [23] | Haizhou Fu | Keyword queries | Indexer; a context-sensitive cost model; query interpreter | CoSi will learn what the user is asking for & rank the intended interpretation higher such that the end users can find them more easily
Architecture of the CONQUER System [37] | Christian Sengstock | Patterns and their synopses | Model Generator; Model Index; Suggestion Service | Space-complexity of O(1) per node in the FP-tree & O(1) runtime-complexity overhead for each node update operation
Y!Q System Design and Architecture [29] | Reiner Kraft | - | CA component; QPW; CR | Y!Q is superior to Yahoo! WS in 32.3% of the context and query pairs, while Yahoo! WS is better in only 8.3% of them (with 59.4% tied)

     Table 2: Comparison of various Architectures proposed by different authors for IR





4.   METHODOLOGIES PROPOSED BY DIFFERENT AUTHORS

        A K Sharma et al. propose two algorithms, Local Site Search for Query and
Definition Generation & Annotation. As the response pages are retrieved from dictionary
based sites, it is assumed that they will contain the direct thesaurus and synonyms of the
query terms [26]. Lidong Bing et al. propose a scoring algorithm and a Latent Topic
Analysis and Training Algorithm [32].
        Wenwei Xue et al. propose algorithms for context attribute matching and context
schema matching [27]. Reiner Kraft et al. propose two algorithms for ranking and
filtering of documents: rank averaging and MC4 [29]. Liang Jeff Chen et al.
propose a Data-Mining-based Selection algorithm and a graph decomposition algorithm [33].
Huanhuan Cao et al. propose an algorithm for clustering queries. In their method, a cluster
C is a set of queries [36]. Ziming Zhuang and Silviu Cucerzan propose a re-ranking
algorithm, Q-Rank. Q-Rank is based on a straightforward yet very effective rationale: that the
most frequently seen query extensions of a target query (terms extracted from queries that
contain the target query as an affix) and adjacent queries (queries that immediately
precede or follow a query in a user search session) provide important hints about users’
search intents [35]. Zhen Liao et al. propose Query Stream Clustering with Iterative
Scanning (QSC-IS), Query Stream Clustering with Master-Slave Model (QSC-MS) and a
query suggestion algorithm [1]. Mariam Daoud et al. propose a session-based
personalized search algorithm that describes the overall process of their
session-based personalized search [30]. Minmin
Chen et al. propose an adaptive self training algorithm [31]. Self training is a very
commonly used algorithm to wrap complex models for semi-supervised learning [30].
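
Of the algorithms above, rank averaging is the easiest to illustrate: each document receives the mean of its positions in the k ranked lists returned for k sub-queries. The sketch below is our own simplified reading, not the implementation from [29]; in particular, placing a missing document one position past the end of a list, and breaking ties alphabetically, are assumptions.

```python
def rank_average(ranked_lists):
    """Fuse k ranked lists by the mean position of each document."""
    docs = set().union(*ranked_lists)

    def mean_rank(doc):
        # A document absent from a list is placed one position past its end
        # (an assumption made for this sketch).
        positions = [
            lst.index(doc) + 1 if doc in lst else len(lst) + 1
            for lst in ranked_lists
        ]
        return sum(positions) / len(positions)

    # Ties are broken alphabetically to keep the output deterministic.
    return sorted(docs, key=lambda d: (mean_rank(d), d))


# Hypothetical top results of three sub-queries.
fused = rank_average([["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]])
```

In this example document b has the lowest mean position (2, 1 and 1) and is therefore placed first in the fused list.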

        The various algorithms used in IR range from query clustering and query ranking to
query suggestion and query expansion. Query clustering usually groups similar queries
that lead to the same or similar documents being viewed by the user. In query ranking algorithms,
the queries are ranked according to the frequency with which users raise them. The
algorithms that use the concept of query expansion use some kind of feedback or probability
technique to expand the query. A comparison of these methodologies is presented
in Table 3.

5. APPLICATIONS OF IR

        The applications of IR are mainly classified into general applications and domain
specific applications. The general applications include digital libraries, search engines,
etc. Domain specific applications include expert search finding, Genomic IR, Geographic
IR, etc.

5.1 General applications of IR

Digital libraries: A digital library is a library in which collections are stored in digital
formats (as opposed to print, microform, or other media) and accessible by computers.




Author | Technique / Methodology | Parameters considered | Outcome / Results performance
--------------------------------------------------------------------------------
A K Sharma et al [26] | Defination_Generator_Annotator(D); Local_Site_searching | Keywords; Query | From 0.6 lakhs to 1.6 lakhs relevance results achieved from 2.5 lakhs results
Lidong Bing et al [32] | Scoring algorithm; Latent topic analysis and training algorithm | Query; Ranking | The differences between the performances of the proposed method and CTA are significant with significance level 0.05
Wenwei Xue et al [27] | Context attribute matching; Context schema matching | A pair of context attributes; schema matcher integrates a local schema into the current set of global schemas | CAMSUBSYN achieved as high as 100% precision and 64% recall upon the authors' dataset
Reiner Kraft et al [29] | Rank averaging algorithm; MC4 algorithm | Assigning a score to every position in a rank list; the input is k ranked lists which are the top few results of k sub queries | (95% confidence interval is [2.873, 2.972]), compared to an average of 2.54 ([2.45, 2.66]) based on ComScore (which includes MSN, Google, and Yahoo)
Liang Jeff Chen et al [33] | Data-mining based selection algorithm; Graph decomposition algorithm | For two keyword combinations P1; P2, keyword combinations | The average number of MeSH terms in a citation after the inheritance is 44; better ranking in 21 out of 30 queries
Huanhuan Cao et al [36] | Algorithm for clustering queries | Diameter parameter Dmax | The average overall precision of CRF-B, CRF-B-C and CRF-B-C-T is improved across different K by 50%, 52% and 57%, respectively
Ziming Zhuang et al [35] | Re-ranking algorithm | Adjacent queries | Interpolation parameter (γ). When varying γ, on average, Q-Rank improved the rankings for 75.8% of the re-ranked queries
Zhen Liao et al [1] | Query stream clustering with iterative scanning; Query stream clustering with master-slave model; Query suggestion | The M1-th query; x mod M = ω; preceding queries | Total response time is still small, that is, about 0.3 millisecond
Mariam Daoud [30] | Session personalized search algorithm | Query | The setting (r = 0,3) produces the best improvement in personalized search since it produces higher precision improvement at P@5 (11,63%)
Minmin Chen [31] | Adaptive self training with conditional random fields | Unlabeled queries | 51.38% precision with only 10% of the training data labeled

                     Table 3: Comparison of different methodologies for IR


The digital content may be stored locally, or accessed remotely via computer networks. A
digital library is a type of IR system.

Search engines:
   - Desktop search: Desktop search is the name for the field of search tools which search the
      contents of a user's own computer files, rather than searching the Internet. These tools are
      designed to find information on the user's PC, including web browser histories, e-mail
      archives, text documents, sound files, images and video.
   - Enterprise search: Enterprise search is the practice of making content from multiple
      enterprise-type sources, such as databases and intranets, searchable to a defined
      audience.
   - Federated search: Federated search is an IR technology that allows the simultaneous
      search of multiple searchable resources. A user makes a single query request which is
      distributed to the search engines participating in the federation. The federated search
      then aggregates the results that are received from the search engines for presentation
      to the user.
   - Mobile search: Mobile search is an evolving branch of IR services that is centered
      on the convergence of mobile platforms and mobile phones and other mobile devices.
      Web search engine ability in a mobile form allows users to find mobile content on
      websites which are available to mobile devices on mobile networks.
   - Social search: Social search or a social search engine is a type of web search that
      takes into account the Social Graph of the person initiating the search query. When
      applied to web search this Social-Graph approach to relevance is in contrast to
      established algorithmic or machine-based approaches where relevance is determined
      by analyzing the text of each document or the link structure of the documents.
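
The federated search flow described in the list above (one query, several participating engines, one aggregated result list) can be sketched as follows. This is a minimal illustration under stated assumptions: the engines are hypothetical callables returning (document, score) pairs, and dividing each engine's scores by its top score before merging is just one simple normalization choice.

```python
def federated_search(query, engines, top_k=5):
    """Send one query to every engine and merge the normalized results."""
    merged = {}
    for search in engines.values():
        results = search(query)
        if not results:
            continue
        top = max(score for _, score in results)
        for doc, score in results:
            # Scale scores per engine so that engines become comparable.
            normalized = score / top if top > 0 else 0.0
            # A document returned by several engines keeps its best score.
            merged[doc] = max(merged.get(doc, 0.0), normalized)
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))[:top_k]


# Hypothetical participating engines with incompatible score ranges.
engines = {
    "library": lambda q: [("lib:doc1", 12.0), ("lib:doc2", 6.0)],
    "intranet": lambda q: [("wiki:page9", 4.0), ("lib:doc2", 3.0)],
}
hits = federated_search("information retrieval", engines)
```

Real federated systems also have to handle engine timeouts and duplicate detection across sources, which this sketch leaves out.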

Web search : A web search engine is designed to search for information on the World Wide
Web. The search results are generally presented in a list of results often referred to as
Search Engine Results Pages (SERPs). The results may consist of web pages, images and
other types of files. Some search engines also mine data available in databases or open
directories.

5.2 Domain Specific applications of IR

        In domain-specific IR, the information and its classification are based on a
particular domain, such as the legal or geographic domain.
Expert search : Expert search is a task of growing importance in enterprise settings.
An expert search system predicts and ranks the expertise of a set of candidate persons with
respect to the user’s query.
Genomic IR: The in-silico revolution has changed how biologists characterise DNA and
protein sequences. As a first step to exploring the structure and function of an unknown
sequence, biologists search large genomic databases for similar sequences. This process of
Genomic IR has allowed significant advances in biology and led to advancements in critical
areas such as cancer research.




Geographic IR : Geographic IR (GIR) is the augmentation of IR with geographic metadata.
GIR involves extracting and resolving the meaning of locations in unstructured text. This is
known as geo-parsing. After identifying location references in text, a GIR system must index
this information for search and retrieval.
Legal IR : Legal IR is the science of IR applied to legal text, including legislation, case law,
and scholarly works. Accurate legal IR is important to provide access to the law to laymen
and legal professionals.
Vertical search : A vertical search engine, as distinct from a general web search engine,
focuses on a specific segment of online content. The vertical content area may be based on
topicality, media type, or genre of content. Common verticals include shopping, the
automotive industry, legal information, medical information, and travel.
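
The two GIR steps described above (geo-parsing, then indexing the location references) can be illustrated with a toy sketch. The gazetteer, its coordinates and the sample documents are hypothetical, and the lookup is a plain word match; it does not attempt the disambiguation that real geo-parsing requires (for example, telling place names apart from person names).

```python
import re
from collections import defaultdict

# Hypothetical toy gazetteer: place name -> (latitude, longitude).
GAZETTEER = {
    "bengaluru": (12.97, 77.59),
    "toronto": (43.65, -79.38),
}

def geoparse(text):
    """Return the gazetteer places mentioned in the text, sorted by name."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(tokens & set(GAZETTEER))

def build_geo_index(documents):
    """Map each recognised place name to the ids of documents mentioning it."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for place in geoparse(text):
            index[place].append(doc_id)
    return dict(index)

documents = {
    "d1": "The conference was held in Toronto.",
    "d2": "RNSIT is a college in Bengaluru.",
    "d3": "A flight from Bengaluru to Toronto.",
}
geo_index = build_geo_index(documents)
print(geo_index)
```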

5.3 Other Applications of IR

        IR has also been applied in other areas, such as adversarial IR, automatic
summarization and question answering.
Adversarial IR : Adversarial IR is a topic in IR related to strategies for working with a data
source where some portion of it has been manipulated maliciously. Tasks can include
gathering, indexing, filtering, retrieving and ranking information from such a data source.
Adversarial IR includes the study of methods to detect, isolate, and defeat such manipulation.
Automatic summarization : Automatic summarization is the creation of a shortened version
of a text by a computer program. The phenomenon of information overload has meant that
access to coherent and correctly-developed summaries is vital. As access to data has
increased so has interest in automatic summarization. Summarization technology is
employed, for example, in the Google search engine.
Multi-document summarization : Multi-document summarization is an automatic
procedure aimed at extraction of information from multiple texts written about the same topic.
Compound term processing : Compound term processing is the name that is used
for a category of techniques in IR applications that perform matching on the basis of
compound terms. Compound terms are built by combining two (or more) simple
terms, for example "triple" is a single word term but "triple heart bypass" is a
compound term.
Cross-lingual retrieval : Cross-Language IR (CLIR) is a subfield of IR dealing with
retrieving information written in a language different from the language of the user's query.
Document classification : The task of document classification is to assign a
document to one or more classes or categories. This may be done "manually" (or
"intellectually") or algorithmically. The intellectual classification of documents has
mostly been the province of library science, while the algorithmic classification of
documents is used mainly in information science and computer science.
Spam filtering : Bayesian spam filtering is a statistical technique of e-mail filtering. It
makes use of a naive Bayes classifier to identify spam e-mail.
Question answering : Question Answering (QA) is a computer science discipline within the
fields of IR and Natural Language Processing (NLP) which is concerned with building
systems that automatically answer questions posed by humans in a natural language. A QA
implementation, usually a computer program, may construct its answers by querying a
structured database of knowledge or information, usually a knowledge base.
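
As an illustration of the naive Bayes classifier mentioned under spam filtering above, the following is a minimal sketch. The four training messages are made up, whitespace tokenization and add-one smoothing are simplifying assumptions, and the function names are hypothetical; it is not the implementation of any particular mail filter.

```python
import math
from collections import Counter

def train(messages):
    """Count words per label. messages: list of (text, label), label in {'spam', 'ham'}."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocabulary

def classify(text, model):
    """Return the label with the higher log posterior under naive Bayes."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Log prior of the label.
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data, for illustration only.
model = train([
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
])
print(classify("cheap money", model))
print(classify("meeting tomorrow", model))
```

The "naive" part is the assumption that words occur independently given the label, which is why the per-word log probabilities are simply added.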




6. OPEN ISSUES/CHALLENGES

         Although the discussed models implement the stated objectives efficiently,
they still lack an efficient retrieval process when context is to be considered. When a user
submits a query for the first time, the search engine is unable to find the context of the
query. However, if some events of web pages can be captured, this problem can be
resolved. Some of the open challenges in this area are:

       - Reducing the volume of the documents for effective retrieval, i.e., improving the
         quality of documents to be considered for retrieval through filtering of irrelevant
         and redundant documents.
       - Ranking of structured and unstructured documents for better accuracy in retrieval.
       - Context awareness in both modeling and scaling up of query suggestion.
       - Visualization and presentation of search results with in-depth summarized
         analysis.

To address the above challenges, we propose a novel retrieval technique in which the query
is based on context along with concept. It enhances the retrieval operation through
exploitation of unstructured documents and can increase the focused retrieval of
documents, especially from the web, by capturing recent browsing sessions of the user.

The snippets used in modern Web search are query based and have been shown to be better
than static document summaries. Word clouds, for instance, can be examined with respect
to the following:

Depth on the query side: The lack of depth on the query side is a bottleneck for delivering
more accurate retrieval results. Users provide only 2 to 3 keywords on average to search
the complete Web.
Depth in the document representation: Documents on the Web are rich in structure.
Most of the structural elements, however, are not used consistently throughout the Web. A
key question is how to deal with semi-structured information.
Depth on the result side: While a query can have thousands of relevant results, only the
first 10 or 20 results will get any attention in a Web search interface. Often these first n
results will still contain redundant information.
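
The redundancy noted for the result side can be illustrated with a small sketch that drops a result when it is too similar to one already kept. This is not the technique proposed in this paper; the Jaccard word-overlap measure, the 0.6 threshold and the sample result texts are all assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def drop_redundant(ranked_results, threshold=0.6):
    """Walk the ranked list and keep a result only if it is not too
    similar to any result that was already kept."""
    kept = []
    for text in ranked_results:
        words = set(text.lower().split())
        if all(jaccard(words, set(k.lower().split())) < threshold for k in kept):
            kept.append(text)
    return kept

# Hypothetical ranked result snippets.
ranked = [
    "context aware query suggestion from search logs",
    "query suggestion from search logs context aware",
    "geographic information retrieval with gazetteers",
]
filtered = drop_redundant(ranked)
print(filtered)
```

Because the list is processed in rank order, the higher-ranked member of each near-duplicate group is the one that survives.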

       Our main objective is to exploit query context and document structure to address
the following challenges:

       - Ambiguity in the query from the user.
       - Appropriate feedback from the user search logs.
       - Effective use and exploitation of structured and unstructured documents for better
         query formulation and search results.





7. CONCLUSION
        In this paper, we have discussed and analyzed, with respect to their performance,
various models, algorithms and architectures that have been used by researchers in IR. The
models discussed are used for query ranking and query expansion with the help of
various feedback techniques that add context to the query. The architectures
discussed are either completely new architectures or variations of existing
architecture models that improve the retrieval process by enhancing the context of the query.
The algorithms used in IR range from query clustering and query ranking to query
suggestion and query expansion. Query clustering usually groups similar queries that
lead to a similar set of documents viewed by the user. In query ranking algorithms, the
queries are ranked according to the frequency with which users submit them. The
algorithms that use the concept of query expansion use some kind of feedback or probability
technique to expand the query. Although the discussed models implement the stated
objectives efficiently, they still lack an efficient retrieval process when context is to be
considered. Hence, exploitation of structured and unstructured documents to increase
the focused retrieval of documents from the web remains a challenge.
REFERENCES
[1]   Zhen Liao, Nankai University, Daxin Jiang, Microsoft Research Asia, Enhong Chen,
      University of Science and Technology of China, Jian Pei, Simon Fraser University,
      Huanhuan Cao, University of Science and Technology of China, Hang Li, Microsoft
      Research Asia “Mining Concept Sequences from Large-Scale Search Logs for Context-Aware
      Query Suggestion” ACM Transactions, October 2011.
[2]   Mario Cataldi, Università di Torino, Claudio Schifanella, Università di Torino, K. Selçuk
      Candan, Arizona State University, Maria Luisa Sapino, Università di Torino, Luigi Di Caro,
      Università di Torino “CoSeNa: a Context-based Search and Navigation System” 2009 October ACM.
[3]   Michal Kajaba and Pavol Navrat, “Personalized Web Search Using Context Enhanced
      Query”. International Conference on Computer Systems and Technologies - CompSysTech’09
[4]   Chang Liu and Nicholas J. Belkin “Implicit Acquisition of Context for Personalization
      of Information Retrieval Systems” CaRR 2011, February 13, 2011, Stanford, CA, USA.
[5]   Ziv Bar-Yossef, Google Inc. MATAM, Bldg 30 Israel and Naama Kraus, Computer Science
      Department Technion, Israel “Context-Sensitive Query Auto-Completion” CIKM’10, October
      26–30, 2010, Toronto, Ontario, Canada. Copyright 2010 ACM.
[6]   Rianne Kaptein, University of Amsterdam, “Effective Focused Retrieval by Exploiting Query
      Context and Document Structure” ACM October 6, 2011.
[7]   Zheng Ye1;2, Xiangji Huang2 and Hongfei Lin1 1Department of Computer Science and
      Engineering, Dalian University of Technology Dalian China 2 School of Information
      Technology York University, Toronto, Ontario, M3J 1P3, Canada “A Bayesian Network
      Approach to Context Sensitive Query Expansion” SAC’11 March 21-25, 2011, TaiChung,
      Taiwan. Copyright 2011 ACM.
[8]   Minmin Chen1, Jian-Tao Sun2, Xiaochuan Ni2, Yixin Chen1 1Department of Computer Science
      and Engineering Washington University in Saint Louis, Saint Louis, MO, USA 2Microsoft
      Research Asia, Beijing, P.R. China “Improving Context-Aware Query Classification
      via Adaptive Self-training” October 24–28, 2011, Glasgow, Scotland, UK. Copyright 2011
      ACM.
[9]   Keke Cai, Chun Chen*, Jiajun Bu, Peng Huang, Zhiming Kang College of Computer Science,
      University Hangzhou, China “Exploration of Query Context for Information Retrieval” May 8–
      12, 2007, Banff, Alberta, Canada. ACM.



[10] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman,
     and Eytan Ruppin, Zapper Technologies, Inc. “Placing Search in Context: The Concept
     Revisited” ACM Transactions on Information Systems, Vol. 20, No. 1, January 2002.
[11] Raymond Y.K. Lau, Centre for Information Technology Innovation, Queensland University of
     Technology and Peter D. Bruza and Dawei Song, Distributed Systems Technology Centre, The
     University of Queensland, Australia “Belief Revision for Adaptive Information Retrieval” July
     25–29, 2004, Sheffield, South Yorkshire, UK. Copyright 2004 ACM.
[12] Jiang Bian, College of Computing, Georgia Institute of Technology, Tie-Yan Liu, Tao Qin,
     Microsoft Research Asia, Hongyuan Zha, College of Computing, Georgia Institute of Technology
     “Ranking with Query-Dependent Loss for Web Search” February 4–6, 2010, New York City,
     New York, USA. Copyright 2010 ACM.
[13] Yunping Huang, Le Sun Institute of Software, Chinese Academy of Sciences, Beijing, China
     and Jian-Yun Nie ,Department of Computer Science and Operations Research, University of
     Montreal, Canada “Query Model Refinement Using Word Graphs” October 26–30, 2010,
     Toronto, Ontario, Canada. Copyright 2010 ACM.
[14] Jing Bai 1, Jian-Yun Nie 1,Hugues Bouchard 2, and Guihong Cao 1 1 Department IRO,
     University of Montreal Canada 2 Yahoo! Inc. Montreal, Quebec, Canada “Using Query
     Contexts in Information Retrieval” July 23–27, 2007, Ámsterdam, The Netherlands. Copyright
     2007 ACM.
[15] Yan Qi, Arizona State University, Tempe, USA, K. Selçuk Candan, Arizona State University,
     Tempe, AZ 85287, USA and Maria Luisa Sapino, Università di Torino, Italy “FICSR:
     Feedback-based InConSistency Resolution and Query Processing on Misaligned Data Sources”
     June 12–14, 2007, Beijing, China. Copyright 2007 ACM.
[16] Tangjian Deng, Liang Zhao, Ling Feng, Tsinghua National Laboratory for Information Science
     and Technology, Tsinghua University, Beijing, China and Wenwei Xue, Nokia Research Center,
     Beijing, China “Information Re-finding by Context: A Brain Memory Inspired Approach”
     October 24–28, 2011, Glasgow, Scotland, UK. Copyright 2011 ACM.
[17] Xing Wei, Fuchun Peng, Huihsin Tseng, Yumao Lu, Benoit Dumoulin, Yahoo! Labs, California,
     USA, “Context Sensitive Synonym Discovery for Web Search Queries” November 2–6, 2009,
     Hong Kong, China. Copyright 2009 ACM.
[18] Ivan T. Bowman, School of Computer Science, University of Waterloo and Kenneth Salem,
     School of Computer Science, University of Waterloo “Optimization of Query Streams Using
     Semantic Prefetching” June 13–18, 2004, Paris, France. Copyright 2004 ACM.
[19] Giorgio Orsi, Politecnico di Milano, Italy, Letizia Tanca, Politecnico di Milano, Italy, Eugenio
     Zimeo, Università del Sannio, Italy “Keyword-based, Context-aware Selection of Natural
     Language Query Patterns” March 22–24, 2011, Uppsala, Sweden. Copyright 2011 ACM.
[20] Huanhuan Cao1, Daxin Jiang2, Jian Pei3, Enhong Chen1, Hang Li2, 1University of Science and
     Technology of China 2Microsoft Research Asia 3Simon Fraser University “Towards Context-
     Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search
     Logs” April 20–24, 2009, Madrid, Spain. ACM.
[21] Carla Teixeira Lopes, Departamento de Engenharia Informática, Faculdade de Engenharia,
     Universidade do Porto, Rua Dr. Roberto Frias, Portugal, Cristina Ribeiro, Departamento de
     Engenharia Informática, Faculdade de Engenharia, Universidade do “Context Effect on Query
     Formulation and Subjective Relevance in Health Searches” August 18–21, 2010, New
     Brunswick, New Jersey, USA. Copyright 2010 ACM.
[22] Xiaohui Yan, Jiafeng Guo, Xueqi Cheng, Institute of Computing Technology, CAS, Beijing,
     China “Context-Aware Query Recommendation by Learning High-Order Relation in Query
     Logs” October 24–28, 2011, Glasgow, Scotland, UK. Copyright 2011 ACM.
[23] Haizhou Fu, North Carolina State University, Raleigh, NC, Sidan Gao, North Carolina State
     University, Raleigh, NC, Kemafor Anyanwu, North Carolina State University, Raleigh, NC
     “CoSi: Context-Sensitive Keyword Query Interpretation on RDF Databases” March 28–
     April 1, 2011, Hyderabad, India. ACM.


[24] Ying-Hsang Liu Nicholas J. Belkin, Rutgers University, USA “Query Reformulation, Search
     Performance, and Term Suggestion Devices in Question-Answering Tasks” Information
     Interaction in Context, 2008, London, UK Copyright 2008 ACM.
[25] Protima Banerjee, College of Information Science and Technology, Drexel University
     Philadelphia, and Hyoil Han ,College of Information Science and Technology, Drexel
     University Philadelphia, USA “Incorporation of Corpus-Specific Semantic Information into
     Question Answering Context” October 30, 2008, Napa Valley, California, USA. Copyright
     2008 ACM.
[26] A. K. Sharma, Computer Engg. Department, YMCA Univ. of Sc. & Technology, Faridabad,
     India, Neelam Duhan, Computer Engg. Department, YMCA Univ. of Sc. & Technology,
     Faridabad, India and Bharti Sharma, Computer Engg. Department, MVN Instt. of Engg &
     Technology, Palwal, India “A Semantic Search System using Query Definitions” December 28-
     30, 2010, Allahabad, UP, India. Copyright 2010 ACM.
[27] Wenwei Xue, Hungkeng Pung, Paulito P. Palmes, School of Computing, National University of
     Singapore, Singapore 117543 and Tao Gu, Institute for Infocomm Research, Terrace, Singapore
     “Schema Matching for Context-Aware Computing” September 21-24, 2008, Seoul, Korea.
     Copyright 2008 ACM.
[28] Huanhuan Cao1 Derek Hao Hu2 Dou Shen3 Daxin Jiang4 ,Jian-Tao Sun4 ,Enhong Chen and
     Qiang Yang2 ,1University of Science and Technology of China 2Hong Kong University of
     Science and Technology 3Microsoft Corporation 4Microsoft Research Asia “Context-Aware
     Query Classification” July 19–23, 2009, Boston, Massachusetts, USA. Copyright 2009 ACM.
[29] Reiner Kraft, Chi Chao Chang, Farzin Maghoul, Ravi Kumar, Yahoo!, Inc. Sunnyvale, USA
     “Searching with Context”.
[30] Mariam Daoud, Lynda Tamine-Lechani and Mohand Boughanem, Institut de Recherche
     en Informatique de Toulouse, France “Learning user interests for a session-based personalized
     search” Information Interaction in Context, 2008, London, UK. Copyright 2008 ACM.

[31] Ji-Rong Wen, Microsoft Research Asia Beijing, China,Ni Lao, Tsinghua University Beijing,
     China and Wei-Ying Ma Microsoft Research Asia Beijing, China “Probabilistic Model for
     Contextual Retrieval” July 25-29, 2004, Sheffield, South Yorkshire, UK. Copyright 2004 ACM.
[32] Lidong Bing Wai Lam ,Department of Systems Engineering and Engineering Management, The
     Chinese University of Hong Kong Shatin, Hong Kong and Tak-Lam Wong Department of
     Mathematics and Information Technology The Hong Kong Institute of Education “Using Query
     Log and Social Tagging to Refine Queries Based on Latent Topics” October 24–28, 2011,
     Glasgow, Scotland, UK. Copyright 2011 ACM.
[33] Liang Jeff Chen, UC San Diego, La Jolla, CA, US and Yannis Papakonstantinou, UC San Diego
     “Context-sensitive Ranking for Document Retrieval” June 12–16, 2011, Athens, Greece.
     Copyright 2011 ACM.
[34] Reiner Kraft, Farzin Maghoul and Chi Chao Chang, Yahoo!, Inc., 701 First Avenue, Sunnyvale, CA
     94089 “Y!Q: Contextual Search at the Point of Inspiration” October 31–November 5, 2005,
     Bremen, Germany. Copyright 2005 ACM.
[35] Ziming Zhuang, The Pennsylvania State University, University Park, USA and Silviu Cucerzan,
     Microsoft Research, Redmond, USA “Re-Ranking Search Results Using Query Logs”
     November 5–11, 2006, Arlington, Virginia, USA. ACM.
[36] Huanhuan Cao1, Daxin Jiang2, Jian Pei3, Qi He4, Zhen Liao5, Enhong Chen1, Hang Li2,
     1University of Science and Technology of China, 2Microsoft Research Asia, 3Simon Fraser
     University, 4Nanyang Technological University, 5Nankai University “Context-Aware Query
     Suggestion by Mining Click-Through and Session Data” August 24–27, 2008, Las Vegas,
     Nevada, USA. Copyright 2008 ACM.
[37] Christian Sengstock and Michael Gertz, Institute of Computer Science, University of
     Heidelberg, Germany “CONQUER: A System for Efficient Context-aware Query Suggestions”
     March 28–April 1, 2011, Hyderabad, India, ACM.


				