

									                                                              (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                  Vol. 9, No. 5, May 2011
                                     Accurate and efficient crawling
                                  The deep web: Surfacing hidden value

                    SUNEET KUMAR                                                             ANUJ KUMAR YADAV
       Associate Professor; Computer Science Dept.                                  Assistant Professor; Computer Science Dept.
            Dehradun Institute of Technology,                                             Dehradun Institute of Technology,
                      Dehradun, India                                                               Dehradun, India

                   RAKESH BHARATI                                                              RANI CHOUDHARY
       Assistant Professor; Computer Science Dept.                                      Sr. Lecturer; Computer Science Dept.
            Dehradun Institute of Technology,                                          Babu Banarasi Das Institute of Technology,
                      Dehradun, India                                                              Ghaziabad, India

Abstract -- Focused web crawlers have recently emerged as an alternative to the well-established web search engines. While the well-known focused crawlers retrieve relevant web-pages, there are various applications which target whole websites instead of single web-pages. For example, companies are represented by websites, not by individual web-pages. To answer queries targeted at websites, web directories are an established solution. In this paper, we introduce a novel focused website crawler to employ the paradigm of focused crawling for the search of relevant websites. The proposed crawler is based on a two-level architecture and corresponding crawl strategies with an explicit concept of websites. The external crawler views the web as a graph of linked websites, selects the websites to be examined next and invokes internal crawlers. Each internal crawler views the web-pages of a single given website and performs focused (page) crawling within that website. Our experimental evaluation demonstrates that the proposed focused website crawler clearly outperforms previous methods of focused crawling which were adapted to retrieve websites instead of single web-pages.

   Keywords-Deep Web; Quality Documents; Surface Web; Topical Database.

                       I.    INTRODUCTION

A. The Deep Web
Internet content is considerably more diverse and the volume certainly much larger than commonly understood. First, though sometimes used synonymously, the World Wide Web is but a subset of Internet content. Other Internet protocols besides the Web include FTP (file transfer protocol), e-mail, news, Telnet, and Gopher (most prominent among pre-Web protocols). This paper does not further consider these non-Web protocols [1]. Second, even within the strict context of the Web, most users are aware only of the content presented to them via search engines such as Excite, Google, AltaVista, or Northern Light, or search directories such as Yahoo! or LookSmart. Eighty-five percent of Web users use search engines to find needed information, but nearly as high a percentage cite the inability to find desired information as one of their biggest frustrations.[2] According to a recent survey of search-engine satisfaction by market-researcher NPD, search failure rates have increased steadily since 1997.[3] The importance of information gathering on the Web and the central and unquestioned role of search engines -- plus the frustrations expressed by users about the adequacy of these engines -- make them an obvious focus of investigation.

                       II. GENERAL DEEP WEB CHARACTERISTICS

Deep Web content has some significant differences from surface Web content. Deep Web documents (13.7 KB mean size; 19.7 KB median size) are on average 27% smaller than surface Web documents. Though individual deep Web sites have tremendous diversity in their number of records, ranging from tens or hundreds to hundreds of millions (a mean of 5.43 million records per site but with a median of only 4,950 records), these sites are on average much larger than surface sites. The rest of this paper will serve to amplify these findings. The mean deep Web site has a Web-expressed (HTML-included basis) database size of 74.4 MB (median of 169 KB). Actual record counts and size estimates can be derived from one-in-seven deep Web sites. On average, deep Web sites receive about half again as much monthly traffic as surface sites. The median deep Web site receives somewhat more than two times the traffic of a random surface Web site (843,000 monthly page views vs. 365,000). Deep Web sites on average are more highly linked to than surface sites by nearly a factor of two (6,200 links vs. 3,700 links), though the median deep Web site is less so (66 vs. 83 links). This suggests that well-known deep Web sites are highly popular, but that the typical deep Web site is not well known to the Internet search public. One of the more counter-intuitive results is that 97.4% of deep Web sites are publicly available without restriction; a further 1.6% are mixed (limited results publicly available with greater results requiring subscription and/or paid fees); only 1.1% of results are totally subscription or fee limited. This result is counter-intuitive because of the visible prominence of subscriber-limited sites such as Dialog, Lexis-Nexis, Wall Street Journal Interactive, etc. (We got the document counts from the sites themselves or from other published sources.) However, once the broader pool of deep Web sites is looked at beyond the large, visible, fee-based ones, public availability dominates.

The sixty largest deep Web sites contain data of about 750 terabytes (HTML-included basis), or roughly forty times the size of the known surface Web. These sites appear in a broad array of domains from science to law to images and commerce. We estimate the total number of records or documents within this group to be about eighty-five billion. Roughly two-thirds of these sites are public ones, representing about 90% of the content available within this group of sixty. The absolutely massive size of the largest sites shown also illustrates the universal power function distribution of sites within the deep Web, not dissimilar to Web site popularity [3] or surface Web sites. One implication of this type of distribution is that there is no real upper size boundary to which sites may grow.

The undercount due to lack of randomness and what we believe to be the best estimate above, namely the Lycos-Infomine pair, indicate to us that the ultimate number of deep Web sites today is on the order of 200,000.

Figure 1. Inferred Distribution of Deep Web Sites, Total Record Size

Plotting the fully characterized random 100 deep Web sites against total record counts produces Figure 1. Plotting these same sites against database size (HTML-included basis) produces Figure 2. Multiplying the mean size of 74.4 MB per deep Web site times a total of 200,000 deep Web sites results in a total deep Web size projection of 7.44 petabytes, or 7,440 terabytes. [5][6a] Compared to the current surface Web content estimate of 18.7 TB, this suggests a deep Web size about 400 times larger than the surface Web. At the lowest end of the estimates, the deep Web size calculates as 120 times larger than the surface Web. At the highest end of the estimates, the deep Web is about 620 times the size of the surface Web. Alternately, multiplying the mean document/record count per deep Web site of 5.43 million times 200,000 total deep Web sites results in a total record count across the deep Web of 543 billion documents.[6b] Compared to the estimate of one billion surface Web documents, this implies a deep Web 550 times larger than the surface Web. At the low end of the deep Web size estimate this factor is 170 times; at the high end, 840 times. Clearly, the scale of the deep Web is massive, though uncertain. Since 60 deep Web sites alone are nearly 40 times the size of the entire surface Web, we believe that the 200,000 deep Web site basis is the most reasonable one. Thus, across database and record sizes, we estimate the deep Web to be about 500 times the size of the surface Web.

Figure 2. Inferred Distribution of Deep Web Sites, Total Database Size (MBs)

                       IV. DEEP WEB COVERAGE IS BROAD, RELEVANT

Table 1 shows the subject coverage across all 17,000 deep Web sites used in this study. These subject areas correspond to the top-level subject structure of the CompletePlanet site. The table shows a surprisingly uniform distribution of content across all areas, with no category lacking significant representation of content. Actual inspection of the CompletePlanet site by node shows some subjects are deeper and broader than others. However, it is clear that deep Web content also has relevance to every information need and market.

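The size multiples quoted in the estimates above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' own calculation: it takes the totals exactly as stated in the text (7,440 TB and 543 billion records for the deep Web; 18.7 TB and one billion documents for the surface Web) as inputs rather than re-deriving them, and the function and constant names are illustrative assumptions.

```python
# Totals exactly as stated in the text; they are inputs here, not re-derived.
DEEP_WEB_SIZE_TB = 7_440          # projected deep Web size (terabytes)
SURFACE_WEB_SIZE_TB = 18.7        # surface Web content estimate (terabytes)
DEEP_WEB_RECORDS_BILLION = 543    # projected deep Web record count (billions)
SURFACE_WEB_DOCS_BILLION = 1      # surface Web document estimate (billions)


def size_multiple(deep: float, surface: float) -> float:
    """Return how many times larger the deep figure is than the surface figure."""
    if surface <= 0:
        raise ValueError("surface figure must be positive")
    return deep / surface


# Hypothetical variable names; one multiple per measurement basis.
storage_multiple = size_multiple(DEEP_WEB_SIZE_TB, SURFACE_WEB_SIZE_TB)
record_multiple = size_multiple(DEEP_WEB_RECORDS_BILLION, SURFACE_WEB_DOCS_BILLION)

print(f"By storage: deep Web is about {storage_multiple:.0f} times the surface Web")
print(f"By records: deep Web is about {record_multiple:.0f} times the surface Web")
```

Under these inputs the storage multiple works out to roughly 398 and the record multiple to 543, in line with the "about 400 times" and "550 times" figures quoted in the text.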
                                                                                                         ISSN 1947-5500
        Table 1. Distribution of Deep Sites by Subject Area

               Subject area               Deep Web coverage

               Agriculture                       2.7%
               Arts                              6.6%
               Business                          5.9%
               Computing/Web                     6.9%
               Education                         4.3%
               Employment                        4.1%
               Engineering                       3.1%
               Government                        3.9%
               Health                            5.5%
               Humanities                       13.5%
               Law/Policies                      3.9%
               Lifestyle                         4.0%
               News/Media                       12.2%
               People, companies                 4.9%
               Recreation, Sports                3.5%
               References                        4.5%
               Science, Math                     4.0%
               Travel                            3.4%
               Shopping                          3.2%

More than half of all deep Web sites feature topical databases. Topical databases plus large internal site documents and archived publications make up nearly 80% of all deep Web sites. Purchase-transaction sites -- including true shopping sites with auctions and classifieds -- account for another 10% or so of sites. The other eight categories collectively account for the remaining 10% or so of sites.

Figure 3. Distribution of Deep Web Sites by Content Type

A. Deep Web Growing Faster than Surface Web
Lacking time-series analysis, we used the proxy of domain registration date to measure the growth rates for each of 100 randomly chosen deep and surface Web sites. These results are presented as a scattergram with superimposed growth trend lines in Figure 4.

Figure 4. Comparative Deep and Surface Web Site Growth

Use of site domain registration as a proxy for growth has a number of limitations. First, sites are frequently registered well in advance of going "live." Second, the domain registration is at the root or domain level. The search function and page -- whether for surface or deep sites -- often is introduced after the site is initially unveiled and may itself reside on a subsidiary form not discoverable by the Whois analysis.

                       V. ORIGINAL DEEP CONTENT NOW EXCEEDS ALL PRINTED GLOBAL CONTENT

International Data Corporation predicts that the number of surface Web documents will grow from the current two billion or so to 13 billion within three years, a factor increase of 6.5 times;[7] deep Web growth should exceed this rate, perhaps increasing about nine-fold over the same period. Figure 5 compares this growth with trends in the cumulative global content of print information drawn from a recent UC Berkeley study.[8a]

Figure 5. 15-yr Growth Trends in Cumulative Original Information Content (log scale)

The total volume of printed works (books, journals, newspapers, newsletters, office documents) has held steady at about 390 terabytes (TBs).[8b] By about 2000, deep Web original information content equaled all print content produced through history up until that time. By 2005, original deep Web content is estimated to have exceeded print by a factor of seven and is projected to exceed print content by a factor of sixty-three by 2011. Other indicators point to the deep Web as the fastest growing component of the Web and suggest that it will continue to dominate it. [9] Even today, at least 240 major libraries have their catalogs online; UMI, a former subsidiary of Bell & Howell, has plans to put more than 5.5 billion document images online;[11] and major astronomy data initiatives are moving toward putting petabytes of data online. [12] These trends are being fueled by the phenomenal growth and cost reductions in digital, magnetic storage.[8c][13] International Data Corporation estimates that the amount of disk storage capacity sold annually grew from 10,000 terabytes in 1994 to 116,000 terabytes in 2000, and it is expected to increase to 1,400,000 terabytes in 2011.[14] Deep Web content accounted for about 1/338th of magnetic storage devoted to original content in 2010; it is projected to increase to 1/200th by 2011. As the Internet is expected to continue as the universal medium for publishing and disseminating information, these trends are sure to continue.

                       VI. DEEP VS. SURFACE WEB QUALITY

The issue of quality has been raised throughout this study. A quality search result is not a long list of hits, but the right list. Searchers want answers. Providing those answers has always been a problem for the surface Web, and without appropriate technology will be a problem for the deep Web as well. Effective searches should both identify the relevant information desired and present it in order of potential relevance -- quality. Sometimes what is most important is comprehensive discovery -- everything referring to a commercial product, for instance. Other times the most authoritative result is needed -- the complete description of a chemical compound, as an example. The searches may be the same for the two sets of requirements, but the answers will have to be different. Meeting those requirements is daunting, and knowing that the deep Web exists only complicates the solution because it often contains useful information for either kind of search. If useful information is obtainable but excluded from a search, the requirements of either user cannot be met. We have attempted to bring together some of the metrics included in this paper,[15] defining quality as both actual quality of the search results and the ability to cover the subject (Table 2).

        Table 2. Total "Quality" Potential, Deep vs. Surface Web

        Search Type                   Total Docs (million)   Quality Docs (million)

        Surface Web
          Single-site search                   160                      7
          Meta-site search                     840                     38
          Total surface possible             1,000                     45

        Deep Web
          Mega deep search                 110,000                 14,850
          Total deep possible              550,000                 74,250

        Deep vs. Surface Web Improvement Ratio
          Single-site search                 688:1                2,063:1
          Meta-site search                   131:1                  393:1
          Total possible                     655:1                2,094:1

Deep Web sites may improve discovery by 600 fold or more. Surface Web sites are fraught with quality problems. For example, a study in 1999 indicated that 44% of 1998 Web sites were no longer available in 1999 and that 45% of existing sites were half-finished, meaningless, or trivial.[16] Lawrence and Giles' NEC studies suggest that individual major search engine coverage dropped from a maximum of 32% in 1998 to 16% in 1999.[4] Peer-reviewed journals and services such as Science Citation Index have evolved to provide the authority necessary for users to judge the quality of information. The Internet lacks such authority. An intriguing possibility with the deep Web is that individual sites can themselves establish that authority. For example, an archived publication listing from a peer-reviewed journal such as Nature or Science, or from user-accepted sources such as the Wall Street Journal or The Economist, carries with it authority based on the publisher's editorial and content efforts. The owner of the site vets what content is made available. Professional content suppliers typically have the kinds of database-based sites that make up the deep Web; the static HTML pages that typically make up the surface Web are less likely to be from professional content suppliers. By directing queries to deep Web sources, users can choose authoritative sites. Search engines, because of their indiscriminate harvesting, do not direct queries. By careful selection of searchable sites, users can make their own determinations about quality, even though a solid metric for that value is difficult or impossible to assign universally.

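The document counts in Table 2 follow from the assumptions listed in reference [15]. The sketch below recomputes them under those stated percentages (16% and 84% surface coverage, 4.5% surface quality retrieval, 20% deep coverage, 13.5% deep quality retrieval) and the table's totals of 1,000 million surface and 550,000 million deep documents. It is an illustrative reconstruction only: the function names `potential` and `improvement_ratio` and the constant names are assumptions, not part of the original study.

```python
# Assumptions taken from reference [15], expressed in percent.
SINGLE_SITE_COVERAGE_PCT = 16.0   # surface Web reached by a single search site
META_SITE_COVERAGE_PCT = 84.0     # surface Web reached by a meta-search
SURFACE_QUALITY_PCT = 4.5         # quality retrieval from surface searches
MEGA_DEEP_COVERAGE_PCT = 20.0     # deep Web sites covered by a "mega" deep search
DEEP_QUALITY_PCT = 13.5           # quality retrieval from deep Web searches

# Totals from Table 2, in millions of documents.
SURFACE_TOTAL_M = 1_000
DEEP_TOTAL_M = 550_000


def potential(total_m: float, coverage_pct: float, quality_pct: float):
    """Return (documents reached, quality documents), both in millions."""
    reached = total_m * coverage_pct / 100
    quality = reached * quality_pct / 100
    return reached, quality


def improvement_ratio(deep_value: float, surface_value: float) -> float:
    """Return the deep-to-surface ratio for one column of the table."""
    return deep_value / surface_value


# Hypothetical row labels mirroring Table 2.
rows = {
    "Single-site search": potential(SURFACE_TOTAL_M, SINGLE_SITE_COVERAGE_PCT, SURFACE_QUALITY_PCT),
    "Meta-site search": potential(SURFACE_TOTAL_M, META_SITE_COVERAGE_PCT, SURFACE_QUALITY_PCT),
    "Total surface possible": potential(SURFACE_TOTAL_M, 100.0, SURFACE_QUALITY_PCT),
    "Mega deep search": potential(DEEP_TOTAL_M, MEGA_DEEP_COVERAGE_PCT, DEEP_QUALITY_PCT),
    "Total deep possible": potential(DEEP_TOTAL_M, 100.0, DEEP_QUALITY_PCT),
}

for label, (total, quality) in rows.items():
    print(f"{label:<24}{total:>12,.1f}{quality:>12,.1f}")

# Example ratio: mega deep search versus meta-site surface search.
mega_total, mega_quality = rows["Mega deep search"]
meta_total, meta_quality = rows["Meta-site search"]
print(f"Meta-site total ratio:   {improvement_ratio(mega_total, meta_total):.0f}:1")
print(f"Meta-site quality ratio: {improvement_ratio(mega_quality, meta_quality):.0f}:1")
```

Under these assumptions the sketch gives 160, 840, and 1,000 million surface documents (7.2, 37.8, and 45 million of quality) and 110,000 and 550,000 million deep documents (14,850 and 74,250 million of quality), which match the table after rounding. Keeping the percentages as named constants makes it easy to see how sensitive the ratios are to the coverage and quality assumptions.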
                       VII. CONCLUSION

Serious information seekers can no longer avoid the importance or quality of deep Web information. But deep Web information is only a component of total information available. Searching must evolve to encompass the complete Web. Directed query technology is the only means to integrate deep and surface Web information. The information retrieval answer has to involve both "mega" searching of appropriate deep Web sites and "meta" searching of surface Web search engines to overcome their coverage problem. Client-side tools are not universally acceptable because of the need to download the tool and issue effective queries to it.[17] Pre-assembled storehouses for selected content are also possible, but will not be satisfactory for all information requests and needs. Specific vertical market services are already evolving to partially address these challenges.[18] These will likely need to be supplemented with a persistent query system customizable by the user that would set the queries, search sites, filters, and schedules for repeated queries.

These observations suggest a splitting within the Internet information search market: search directories that offer hand-picked information chosen from the surface Web to meet popular search needs; search engines for more robust surface-level searches; and server-side content-aggregation vertical "info-hubs" for deep Web information to provide answers where comprehensiveness and quality are imperative.

                       REFERENCES

[1]. A couple of good starting references on various Internet protocols can be found at and ernet/Internet_Protocols/.
[2]. Tenth edition of GVU's (graphics, visualization and usability) WWW User Survey, May 14, 1999. [formerly 10/tenthreport.html.]
[3]. "4th Q NPD Search and Portal Site Study," as reported by Search Engine Watch [formerly ]. NPD's Web site is at
[4]. S. Lawrence and C.L. Giles, "Accessibility of Information on the Web," Nature 400:107-109, July 8, 1999.
[5]. 1024 bytes = 1 kilobyte (KB); 1000 KB = 1 megabyte (MB); 1000 MB = 1 gigabyte (GB); 1000 GB = 1 terabyte (TB); 1000 TB = 1 petabyte (PB). In other words, 1 PB = 1,024,000,000,000,000 bytes, or approximately 10^15.
[6]. 6a, 6b. Our original paper published on July 26, 2000, used estimates of one billion surface Web documents and about 100,000 deep Web searchable databases. Since publication, new information suggests a total of about 200,000 deep Web searchable databases. Since surface Web document growth is now on the order of 2 billion documents, the ratios of surface to deep Web documents (400 to 550 times greater in the deep Web) still approximately hold. These trends would also suggest roughly double the amount of deep Web data storage, to fifteen petabytes, than is indicated in the main body of the report.
[7]. As reported in Sequoia Software's IPO filing to the SEC, March 23, 2000; see & ipage=1117423 & doc=1 & total=266 & back=2 & g=.
[8]. 8a, 8b, 8c. P. Lyman and H.R. Varian, "How Much Information," published by the UC Berkeley School of Information Management and Systems, October 18, 2000. See info/index.html. The comparisons here are limited to archivable and retrievable public information, exclusive of entertainment and communications content such as chat or e-mail.
[9]. As this analysis has shown, in numerical terms the deep Web already dominates. However, from a general user perspective, it is unknown.
[10]. See
[11]. See
[12]. A. Hall, "Drowning in Data," Scientific American, Oct. 1999; [formerly ].
[13]. As reported in Sequoia Software's IPO filing to the SEC, March 23, 2000; see ipage=1117423 & doc=1 & total=266 & back=2 & g=.
[14]. From Advanced Digital Information Corp., Sept. 1, 1999, SEC filing; [formerly & exp=terabytes%20and%20online & g=].
[15]. Assumptions: SURFACE WEB: for single surface site searches - 16% coverage; for meta search surface searchers - 84% coverage [higher than NEC estimates in reference 4; based on empirical BrightPlanet searches relevant to specific topics]; 4.5% quality retrieval from all surface searches. DEEP WEB: 20% of potential deep Web sites in initial CompletePlanet release; 200,000 potential deep Web sources; 13.5% quality retrieval from all deep Web searches.
[16]. Online Computer Library Center, Inc., "June 1999 Web Statistics," Web Characterization Project, OCLC, July 1999. See the Statistics section in
[17]. Most surveys suggest the majority of users are not familiar or comfortable with Boolean constructs or queries. Also, most studies suggest users issue on average 1.5 keywords per query; even professional information scientists issue 2 or 3 keywords per search. See further BrightPlanet's search tutorial at
[18]. See, as one example among many, at [formerly]. L.
