									                                                               (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                 Vol. 9, No. 8, August 2011

     A Study About Developing Trends In Information
           Retrieval From Google To Swoogle

S.Kalarani, Ph.D scholar
Department of Information Technology
St Joseph's Institute of Technology
Chennai - 119

Dr.G.V.Uma
Prof/ Department of IST
Anna University
Chennai - 26


Abstract: Information retrieval technology has been central to the success of the Web. Web 2.0 is defined as the innovative use of the World Wide Web to expand social and business growth and to explore collective intelligence from the community. The features of Web 2.0 include user behavior and software design perspectives, together with a high-level technical architecture. Google Web Search is a web search engine owned by Google Inc. and is the most-used search engine on the Web. Google receives several hundred million queries each day through its various services. The main purpose of Google Search is to hunt for text in web pages. The functionality of the Google search engine in retrieving information is based on three principles. The first is keyword search, where the search engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. The index is built from the information stored with the data and the method by which the information is indexed. The second is Page Rank, a link analysis algorithm used by the Google Internet search engine that assigns a numerical weighting to each element of a hyperlinked set of documents. In simple words, this means that results are ordered by the relative importance of the pages that match the search terms. The third is indexing: Google keeps an index of all pages it has crawled, based on the terms in each page. The inverted index can be reduced by stemming, so that terms are stored as stems. Finally, we point out the limitations of the current technologies in order to analyze the new technology development in the Web 3.0 model. The core of the Semantic Web is "ontology". The requirements of ontology in the context of the Web are outlined. Advantages of using ontology in both knowledge-base-style and database-style applications are demonstrated using one real-world application.

Keywords: Keyword search, knowledge base, indexing, inverted index, ontology, page ranks, stemming, Web 2.0, Web 3.0 based on ontology.

I. INTRODUCTION

Search engines have given everyone the easiest way to link to and fetch information from a large collection of web pages. This has been made feasible through web information retrieval. Web 1.0 was merely a read-only Web, where the user had no control over modifying or manipulating the data. Web 2.0 is a read-and-write Web. The Google search engine can be cited as an example of Web 2.0. The google.com site is the most-visited website in the world. Its features include a definition link for most searches (including dictionary words), the number of results found for a search, and links to other searches (for example, for words that Google believes to be misspelled, a link using the proposed spelling is provided with the search results). The following sections elaborate on information retrieval in Google and on the future Web 3.0.

The Google search engine mainly concentrates upon the following:
     a) Keyword search and indexing
     b) Page ranking
     c) Web crawlers

Section II-A explains keyword search and indexing in Google. Section II-B deals with the ranking of pages in Google and its advantage over other search engines. Sections II-C and II-D illustrate the architecture of Web crawling, its effectiveness in information retrieval as measured through precision and recall, and inverted indexing. The final sections deal with Web 3.0, which is mainly concerned with ontology; the concepts relating to an intelligent information retrieval system model based on domain ontology are highlighted, and a model of a relation-based search engine is proposed.

II. TRADITIONAL METHODOLOGY OF INFORMATION RETRIEVAL

A. Keyword search and indexing

Google uses keyword-based search for queries. A keyword can be regarded as a phrase specified in the web page by its author. For example, when



                                                                      135                               http://sites.google.com/site/ijcsis/
                                                                                                        ISSN 1947-5500
you go to Google and type "Google" into the search box, Google checks its index of all terms available on the Internet, finds the entry for the term "Google", and with it the list of all pages that reference that term. The index is mainly concerned with the subject and its content. Such words are gathered by Web crawlers and organized in a table with a pointer to a database consisting of URLs.

The URL database consists of a set of "inlist" and "outlist" entries, which act as references to other web pages containing the same information. All these references are identified by an index value.

Apart from indexed keywords, there is a considerable amount of data available in online databases that is accessible by means of queries but not by links. This is called the invisible or deep Web [1]. The deep Web contains library catalogs, official legislative documents of governments, and other content that is dynamically prepared in response to a query.

Figure 1. URL database with pointers to related information

In the case of keyword-based search, Google guides its users with a set of related keywords to make searching easier. Consider, for example, a search for "founder of facebook": while typing in Google, users are offered other related keywords from its database. Other search engines such as Yahoo do not provide this type of assistance.

Figure 2. Comparison based on keyword search between Google and Yahoo

When users provide more than a single keyword as a search query, Google automatically splits the query at the spaces into separate keywords joined by a plus (+) symbol, and tries to match the query against its database.

B. Page rank

Page ranking is a central part of existing information retrieval [5]; this is achieved through polling (constant monitoring). Google uses the concept of page rank to rank pages based on the idea that information on the web can be ordered in a hierarchy by "link popularity": a page is ranked higher when there are more links to it. Page Rank is calculated as a probability distribution representing how likely it is that a person following links arrives at a given page, and it can be computed for collections of documents of any size.

The Page Rank computation involves a series of iterations, through which each page is ranked. The page most likely to be reached is ranked higher than other pages. Apart from page ranks, Google offers users a way to customize the search engine by setting a default language, using the Safe Search filtering technology, and setting the number of results shown on each page [3]. Google has enabled this customization by placing long-term cookies on users' machines to store these preferences, a tactic which also enables it to track a user's search terms and retain the data for more than a year.

Figure 3. Example of page ranking

A recent enhancement in Google is its instant search, which reduces the search time of users by 2 to 5 seconds in every search, thereby reducing it collectively by approximately 11 million seconds per hour. For any query, the first 1000 results can be shown, with a maximum of 100 displayed per page when "Instant Search" is not enabled. If "Instant Search" is enabled, only 10 results are displayed, regardless of this setting.
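The keyword lookup and the iterative Page Rank computation described in this section can be illustrated with a minimal sketch. It is not Google's implementation: the toy pages and links are hypothetical, the damping factor of 0.85 is an assumed value, queries are tokenized on whitespace only, and a page is assumed to match only if it contains every keyword.

```python
from collections import defaultdict


def build_index(pages):
    """Map every term to the set of page names whose text contains it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for term in text.lower().split():
            index[term].add(name)
    return index


def page_rank(links, damping=0.85, iterations=50):
    """Iteratively compute a rank for every page in a small link graph.

    links maps each page to the list of pages it links to.
    damping is an assumed parameter, not a value taken from the paper.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes an equal share of its rank to each link target.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def search(query, index, ranks):
    """Split the query on spaces, keep pages containing every keyword,
    and order them by rank, highest first."""
    keywords = query.lower().split()
    if not keywords:
        return []
    matches = set.intersection(*(index.get(k, set()) for k in keywords))
    return sorted(matches, key=lambda name: (-ranks[name], name))


# Hypothetical four-page web used only for illustration.
TOY_PAGES = {
    "home": "google search engine",
    "about": "about the search team",
    "blog": "search tips and news",
    "contact": "contact the team",
}
TOY_LINKS = {
    "home": ["about"],
    "about": ["home"],
    "blog": ["home", "about"],
    "contact": ["home"],
}

index = build_index(TOY_PAGES)
ranks = page_rank(TOY_LINKS)
print(search("search", index, ranks))
print(search("search team", index, ranks))
```

In this toy graph "home" receives the most incoming links, so it is listed first among the pages that contain the keyword "search".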




Figure 4. Page ranking architecture

The exact percentage of all web pages that Google indexes is not known, as it is very difficult to calculate accurately. Google not only indexes and caches web pages, but also takes "snapshots" of other file types, which include PDF, Word documents, Excel spreadsheets, Flash SWF, plain text files, and so on. Except in the case of text and SWF files, the cached version is a conversion to XHTML, allowing those without the corresponding viewer application to read the file.

C. Web crawler architecture

Web search engines work by storing information about many web pages, which they retrieve from the HTML of the pages themselves. These pages are retrieved by a Web crawler (sometimes also known as a spider), an automated Web browser which follows every link on the site [2]. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries.

Figure 5. Web crawler architecture

Effectiveness is related to the relevance of the retrieved items. Relevance has subjective, situational, cognitive, and dynamic aspects. Effectiveness is measured through two parameters:
     • Precision: the ability to retrieve top-ranked documents that are mostly relevant.
     • Recall: the ability of the search to find all of the relevant items in the corpus.

Problems with both precision and recall are:
     • The number of irrelevant documents in the collection is not taken into account.
     • Precision is undefined when no document is retrieved.
     • Recall is undefined when there is no relevant document in the collection.

D. Inverted index

An inverted index data structure is used to support most IR techniques. Each word in the table has an index entry, which specifies the documents that include that word. A search for documents that contain a word then becomes a simple lookup in the index. The list of words may be reduced by using a "stop word list" or stemming [5]. The key terms of a query or document may be represented by stems rather than by the original words. Thus "computation" might be stemmed to "comput", provided that different words with the same base meaning are reduced to the same form and words with distinct meanings are kept separate. Stemming provides several advantages:
     • A query can find a document containing different morphological variants of the search term (improved recall).
     • The reduction in the number of distinct terms needed to represent the corpus reduces computer processing requirements.

The inverted index is largely responsible for the impressive speed and scalability of current search engines. However, for large collections of documents the compilation of an inverted index can be a time-consuming task, and fast searching of an inverted index may require large amounts of computer memory. A large amount of research has been conducted on the problem of compressing inverted indexes to reduce these memory requirements.

E. Drawbacks

The major drawbacks observed in traditional information retrieval systems are:
     • Traditional information retrieval technology mainly makes use of approaches such as category, index, and keyword. However, these methods cannot reflect the deeper meaning of a word. Hence the retrieval result is very large and fails to meet the user's requirement.
     • Coverage of retrieval is limited. As one word could have more than one meaning and several words can
       have the same meaning, the coverage of retrieval is negatively affected if the user cannot provide all possible words to be searched. This holds especially for the retrieval of specialized articles: as there are several different definitions for some specialized terms and knowledge, it is difficult for researchers to quickly and correctly find all articles related to a certain scientific research topic among a huge number of technical articles.

One way to solve this problem is to move information retrieval from the traditional keyword-based approach to a knowledge-based or concept-based approach, that is, to combine information retrieval with artificial intelligence technology and natural language technology.

III. INTRODUCTION TO ONTOLOGY

A. The Concept of Ontology

The term ontology comes from philosophy, where it refers to the study of the nature and composition of objective things. Some scholars regard an ontology as a category system for a certain domain of the world that does not depend on any particular language. In knowledge engineering, an ontology is built from the knowledge concepts and terms of a system and their mutual relationships.

Ontologies generally fall into four categories:
     a) Top level ontology
     b) Domain ontology
     c) Task ontology
     d) Application ontology

Among them, a domain ontology covers the concepts of an ordinary domain, such as medicine or automobiles, and their mutual relationships [7]. The goal of constructing an ontology is to normalize the concepts and terms in one or more domains and to make practical applications in that domain, or across several domains, more convenient.

B. The Function of Ontology

The function of ontology can be summarized as communication, interoperability, and system engineering.
1) Communication: Ontology provides common words for communication between people or between organizations, which forms the basis for communication.
2) Interoperability: Ontology builds translation and mapping mechanisms among different model construction methods, equations, languages, and software tools, in order to integrate different systems.
3) System engineering: Ontology analysis can provide the following benefits to system engineering [4]:
     a) Reusability: An ontology is the basis of a formalized description of important entities, properties, processes, and their mutual relationships. This formalized description can become a reusable and shared component in software systems.
     b) Knowledge acquisition: When a knowledge-based system is constructed, speed and reliability can be improved if an existing ontology can be used as a starting point and as a basis to guide the knowledge acquisition.
     c) Reliability: As the description of an ontology is formalized, automatic consistency checking becomes possible, so the reliability of the software system is improved.
     d) Specification: Ontology analysis helps to determine the requirements and criteria of a system (such as a knowledge database).

An ontology determines the exact meaning of concepts by strictly defining them and understanding the relationships among them, in order to express commonly recognizable and sharable knowledge [8]. It is therefore much more convenient for a computer to process information in a domain if the ontology of this domain can be built by abstracting or summarizing a group of concepts and relating them to each other [5-10].

IV. INTELLIGENT INFORMATION RETRIEVAL SYSTEM MODEL BASED ON DOMAIN ONTOLOGY

A. Ontology knowledge module

The first step in building an intelligent information retrieval system based on ontology is to develop the related domain ontology with the help of a specialist in that domain, and then to gather the required data sources and store them in a database (relational database and knowledge database).

B. Units retrieval module

As shown in Figure 6, given the retrieval condition entered by the user in the retrieval user interface, the retrieving transformer uses the entered words, together with words of similar or identical meaning, to build a retrieving set.

Figure 6. Ontology based retrieval module.
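The retrieving transformer described above can be sketched as a small query-expansion step. This is only an illustration under stated assumptions: the ontology fragment, its "synonyms" and "narrower" fields, the sample documents, and the function names are all hypothetical and are not taken from the system in Figure 6.

```python
import re

# Hypothetical fragment of a domain ontology knowledge base: each concept
# lists terms with the same or similar meaning and its narrower concepts.
ONTOLOGY = {
    "computer": {
        "synonyms": ["pc"],
        "narrower": ["personal computer", "micro computer", "laptop"],
    },
}

# Hypothetical document collection keyed by document id.
DOCUMENTS = {
    1: "Buying a new laptop",
    2: "PC repair guide",
    3: "Apple pie recipe",
    4: "History of the personal computer",
}


def build_retrieving_set(term, ontology):
    """Expand one user term into the set of terms sent to the search."""
    term = term.strip().lower()
    expanded = {term}
    entry = ontology.get(term)
    if entry is None:
        # Unknown concept: fall back to the plain keyword.
        return expanded
    expanded.update(entry["synonyms"])
    expanded.update(entry["narrower"])
    return expanded


def retrieve(term, ontology, documents):
    """Return ids of documents that mention any term of the retrieving set."""
    terms = build_retrieving_set(term, ontology)
    hits = []
    for doc_id, text in documents.items():
        lowered = text.lower()
        # Whole-word matching so that "pc" does not match inside other words.
        if any(re.search(r"\b" + re.escape(t) + r"\b", lowered) for t in terms):
            hits.append(doc_id)
    return sorted(hits)


print(retrieve("computer", ONTOLOGY, DOCUMENTS))
```

With plain keyword matching only document 4 contains the word "computer"; the expanded retrieving set also reaches the documents that use "laptop" or "PC".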




For example, when a user searches for "computer", the engine retrieves all the information related to computers, such as "personal computer" and "micro computer", from the ontological system built for this domain. This enhanced approach makes the search query more precise. Moreover, the retrieving transformer helps to improve the retrieval hit rate. For example, when a user enters "apple", does the user mean the fruit or the Apple brand of PC?

All such ambiguous information can be clarified by the concept descriptions in the ontology knowledge database.

V. SYSTEM EVALUATION

With the support of the knowledge database, this model solves the following retrieval problems.

A. Retrieval hit rate is high

By adding an ontology knowledge database between the user and the database, the ontological model limits the possible interpretations of a term to the single reasonable one, which solves the problem of one word having multiple meanings. The model can also adapt to dynamic changes of users and information sources, so that users are only given retrieval results within the domain.

B. Retrieval coverage is broad

In this model, the system can infer from the word entered by the user a set of words with the same or a similar meaning and pass them to the retrieval system as the actual search words, because concept descriptions of equivalence relations, such as words with the same or a similar meaning and abbreviations, are added to the ontology knowledge database. This method reduces the user's burden and improves retrieval coverage.

C. Retrieval inference

An ontology is a conceptual explanation of a certain domain. It makes the terms of this domain form a knowledge system that can express the corresponding logic of meaning and can be used for inference, so that the information most important to the user is returned efficiently and correctly.

VI. IMPLEMENTATION OF RELATION BASED SEARCH ENGINE

This model processes not only the keywords but also the relationships between the entities, as offered by the architecture of the semantic web. A page is returned to users only when it includes the relationship between the keywords; pages that are related only to the keywords, without the proper relationship, are discarded.

The sample screenshots illustrate ontology-based retrieval of information. The user selects the domain of information retrieval. Here, train is chosen as the domain, and the related information that was supplied during annotation (through the sign-in option) is provided. The related sites are then listed in a table, ordered by usage (page ranking), and the user selects the site that is relevant to them. Screenshot 2 provides the detailed information that is required.
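The difference between keyword matching and the relation-based filtering described in this section can be sketched as follows. The sketch assumes that each page has been annotated with (subject, relation, object) triples; the page names, the triples, and the function names are hypothetical examples in the train domain and do not reproduce the implemented system.

```python
# Hypothetical annotations: every page is described by a set of
# (subject, relation, object) triples.
PAGES = {
    "page_a": {
        ("chennai express", "departs_from", "chennai"),
        ("chennai express", "is_a", "train"),
    },
    "page_b": {
        ("chennai", "is_a", "city"),
        ("train", "is_a", "vehicle"),
    },
    "page_c": {
        ("train", "serves", "chennai"),
    },
}


def mentions(triples, keyword):
    """True if the keyword appears as subject or object of any triple."""
    return any(keyword in (subject, obj) for subject, _, obj in triples)


def related(triples, first, second):
    """True if a single triple directly links the two keywords."""
    return any({subject, obj} == {first, second} for subject, _, obj in triples)


def keyword_search(pages, first, second):
    """Pages that merely mention both keywords."""
    return sorted(
        name
        for name, triples in pages.items()
        if mentions(triples, first) and mentions(triples, second)
    )


def relation_search(pages, first, second):
    """Pages that contain a relationship between the two keywords."""
    return sorted(
        name for name, triples in pages.items() if related(triples, first, second)
    )


print(keyword_search(PAGES, "train", "chennai"))
print(relation_search(PAGES, "train", "chennai"))
```

All three pages mention both keywords, but only "page_c" states a relationship between them, so it is the only page kept by the relation-based search.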




VII. CONCLUSION

We have presented a framework for an information retrieval system model based on ontology. First, information retrieval in Google is analyzed and its drawbacks are pointed out. Then, the concept of ontology and its application in the intelligent retrieval domain are elaborated, together with an information retrieval system structure model based on domain ontology. The idea of combining ontologies and knowledge is the key tool for developing the proposed system. The final result is presented in the form of a user-friendly direct URL, so the searching time is considerably reduced. Efficient methods for using the domain knowledge of the ontology to realize concept-based queries during information retrieval are developed. The performance of the system based on this model is also evaluated.

REFERENCES

[1] En.wikipedia.org

[2] www.semanticfocus.com

[3] J. Farrugia, "Model-Theoretic semantics for the web", Budapest, Hungary, May 2003.

[4] Semantic Web Services Ontology, www.daml.org

[5] Zaihisma Che Cob, Rusli Abdullah, "Ontology-based Semantic Web Services Framework for Knowledge Management System", 978-1-4244-2328-6/08/$25.00 © 2008 IEEE.

[6] L. Page, S. Brin, R. Motwani, and T. Winograd, "The page rank citation ranking: Bringing order to the web", Technical report, Stanford Database group, 1998.

[7] N.J. Nilsson, Artificial intelligence: A new synthesis. Beijing: China Machine Press and Morgan Kaufmann Publishers, Inc, 1999, pp. 215-316.

[8] Dong Hui, Yang Ning, Yu Chuanming, "Research on the Ontology based Retrieval Model of Digital Library (I)", Journal of the China Society for Scientific and Technical Information, 2006(3), pp. 269-275.

[9] Dr. Muhammad Shahbaz, Dr. Syed Muhammad Ahsen, Farzeen Abbas, Muhammad Shaheen Syed Athar Masood, "An efficient method to improve Information Recovery on Web", Journal of American Science, Vol. 7, No. 7, 2011.

[10] Shivani Agarwal and Michael Collins, "Maximum Margin Ranking Algorithms for Information Retrieval", Springer, 2010.

[11] Jianguo Jiang, Zhongxu Wang, Chunyan Liu, Zhiwen Tan, Xiaoze Chen, Min Li, "The Technology of Intelligent Information Retrieval Based on the Semantic Web", IEEE 2nd International Conference on Signal Processing Systems, 2010.

[12] A. Abusujhon, M. Tamib, "Improving Load Balance and Query Throughput of Distributed IR Systems", International Journal of Computing and ICT Research, Vol. 4, No. 1, June 2010.

[13] Peter D. Turney and Patrick Pantel, "From Frequency to Meaning: Vector Space Models of Semantics", Journal of Artificial Intelligence Research, No. 37, pp. 141-188, 2010.

[14] Abdelkrim Bouramoul, Mohamed-Khireddine Kholladi and Bich-Lien Doan, "Using content to improve the evaluation of the information retrieval system", International Journal of Database Management Systems (IJDMS), Vol. 3, No. 2, May 2011.

[15] Guo Chengxia and Huang Dongmei, "Research on Domain Ontology Based Information Retrieval Model", International Symposium on Intelligent Ubiquitous Computing and Education, 2009.

AUTHORS PROFILE

1. S.Kalarani M.E., (Ph.D)
   Associate Professor
   Department of Information Technology
   St Joseph's Institute of Technology
   Chennai - 119.

2. Dr.G.V.Uma
   Professor / IST
   Anna University
   Chennai - 26.



