Modeling and Analyze the Deep Web: Surfacing Hidden Value
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 6, June 2011
Modeling and Analyze the Deep Web: Surfacing
Hidden Value
SUNEET KUMAR
Associate Professor, Computer Science Dept.
Dehradun Institute of Technology, Dehradun, India
suneetcit81@gmail.com

ANUJ KUMAR YADAV
Assistant Professor, Computer Science Dept.
Dehradun Institute of Technology, Dehradun, India
anujbit@gmail.com

RAKESH BHARATI
Assistant Professor, Computer Science Dept.
Dehradun Institute of Technology, Dehradun, India
goswami.rakesh@gmail.com

RANI CHOUDHARY
Sr. Lecturer, Computer Science Dept.
BBDIT, Ghaziabad, India
ranichoudhary04@gmail.com
Abstract—Focused web crawlers have recently emerged as an alternative to the well-established web search engines. While the well-known focused crawlers retrieve relevant web-pages, there are various applications which target whole websites instead of single web-pages. For example, companies are represented by websites, not by individual web-pages. To answer queries targeted at websites, web directories are an established solution. In this paper, we introduce a novel focused website crawler to employ the paradigm of focused crawling for the search of relevant websites. The proposed crawler is based on a two-level architecture and corresponding crawl strategies with an explicit concept of websites. The external crawler views the web as a graph of linked websites, selects the websites to be examined next, and invokes internal crawlers. Each internal crawler views the web-pages of a single given website and performs focused (page) crawling within that website. Our experimental evaluation demonstrates that the proposed focused website crawler clearly outperforms previous methods of focused crawling which were adapted to retrieve websites instead of single web-pages.

Keywords- Deep Web; Link references; Searchable Databases; Site page-views.

I. INTRODUCTION
A. The Deep Web
Internet content is considerably more diverse and the volume certainly much larger than commonly understood. First, though sometimes used synonymously, the World Wide Web (HTTP protocol) is but a subset of Internet content. Other Internet protocols besides the Web include FTP (file transfer protocol), e-mail, news, Telnet, and Gopher (most prominent among pre-Web protocols). This paper does not consider these non-Web protocols further [1]. Second, even within the strict context of the Web, most users are aware only of the content presented to them via search engines such as Excite, Google, AltaVista, or Northern Light, or search directories such as Yahoo!, About.com, or LookSmart. Eighty-five percent of Web users use search engines to find needed information, but nearly as high a percentage cite the inability to find desired information as one of their biggest frustrations.[2] According to a recent survey of search-engine satisfaction by market-researcher NPD, search failure rates have increased steadily since 1997.[3] The importance of information gathering on the Web and the central and unquestioned role of search engines -- plus the frustrations expressed by users about the adequacy of these engines -- make them an obvious focus of investigation. Our key findings include:
• Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.
• The deep Web contains 7,500 terabytes of information compared to 19 terabytes of information in the surface Web.
• The deep Web contains nearly 550 billion individual documents compared to the one billion of the surface Web.
• More than 200,000 deep Web sites presently exist.
• Sixty of the largest deep-Web sites collectively contain about 750 terabytes of information -- sufficient by themselves to exceed the size of the surface Web forty times.
• On average, deep Web sites receive 50% greater monthly traffic than surface sites and are more highly linked to than surface sites; however, the typical (median) deep Web site is not well known to the Internet-searching public.
• The deep Web is the fastest-growing category of new information on the Internet.
• Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.
• Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web.
119 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
• Deep Web content is highly relevant to every information need, market, and domain.
• More than half of the deep Web content resides in topic-specific databases.

A full ninety-five per cent of the deep Web is publicly accessible information -- not subject to fees or subscriptions.

B. How Search Engines Work
Search engines obtain their listings in two ways: authors may submit their own Web pages, or the search engines "crawl" or "spider" documents by following one hypertext link to another. The latter returns the bulk of the listings. Crawlers work by recording every hypertext link in every page they index while crawling. Like ripples propagating across a pond, search-engine crawlers are able to extend their indices further and further from their starting points. The surface Web contains an estimated 2.5 billion documents, growing at a rate of 7.5 million documents per day.[4a] The largest search engines have done an impressive job in extending their reach, though Web growth itself has exceeded the crawling ability of search engines.[5][6] Today, the three largest search engines in terms of internally reported documents indexed are Google, with 1.35 billion documents (500 million available to most searches),[7] Fast, with 575 million documents,[8] and Northern Light, with 327 million documents.[9]

Moreover, return to the premise of how a search engine obtains its listings in the first place, whether adjusted for popularity or not. Without a link from another Web document, a page will never be discovered. The main failing of search engines is that they depend on the Web's linkages to identify what is on the Web. Figure 1 is a graphical representation of the limitations of the typical search engine. The content identified is only what appears on the surface, and the harvest is fairly indiscriminate. There is tremendous value that resides deeper than this surface content. The information is there, but it is hiding beneath the surface of the Web.

FIGURE 1. SEARCH ENGINES: DRAGGING A NET ACROSS THE WEB'S SURFACE

II. SEARCHABLE DATABASES: HIDDEN VALUE ON THE WEB
How does information appear and get presented on the Web? In the earliest days of the Web, there were relatively few documents and sites. It was a manageable task to post all documents as static pages. Because all pages were persistent and constantly available, they could be crawled easily by conventional search engines. In July 1994, the Lycos search engine went public with a catalog of 54,000 documents.[10] Since then, the compound growth rate in Web documents has been on the order of more than 200% annually.[11] Sites that were required to manage tens to hundreds of documents could easily do so by posting fixed HTML pages within a static directory structure. However, beginning about 1996, three phenomena took place. First, database technology was introduced to the Internet through such vendors as Bluestone's Sapphire/Web (Bluestone has since been bought by HP) and later Oracle. Second, the Web became commercialized, initially via directories and search engines, but rapidly evolved to include e-commerce. And, third, Web servers were adapted to allow the "dynamic" serving of Web pages (for example, Microsoft's ASP and the Unix PHP technologies). Figure 2 represents, in a non-scientific way, the improved results that can be obtained by Bright-Planet technology. By first identifying where the proper searchable databases reside, a directed query can then be placed to each of these sources simultaneously to harvest only the results desired -- with pinpoint accuracy.

FIGURE 2. HARVESTING THE DEEP AND SURFACE WEB WITH A DIRECTED QUERY ENGINE

Additional aspects of this representation will be discussed throughout this study. For the moment, however, the key points are that content in the deep Web is massive -- approximately 500 times greater than that visible to conventional search engines -- with much higher quality throughout.

III. STUDY OBJECTIVES
To perform the study discussed, we used our technology in an iterative process. Our goal was to:
• Quantify the size and importance of the deep Web.
• Characterize the deep Web's content, quality, and relevance to information seekers.
• Discover automated means for identifying deep Web search sites and directing queries to them.
• Begin the process of educating the Internet-searching public about this heretofore hidden and valuable information storehouse.

A. What Has Not Been Analyzed or Included in Results
This paper does not investigate non-Web sources of Internet content. This study also purposely ignores private intranet information hidden behind firewalls. Many large companies have internal document stores that exceed terabytes of information. Since access to this information is restricted, its scale cannot be defined, nor can it be characterized. Also, while on average 44% of the "contents" of a typical Web document reside in HTML and other coded information (for example, XML or JavaScript),[12] this study does not evaluate specific information within that code. We do, however, include those codes in our quantification of total content. Finally, the estimates for the size of the deep Web include neither specialized search engine sources -- which may be partially "hidden" to the major traditional search engines -- nor the contents of major search engines themselves. This latter category is significant. Simply accounting for the three largest search engines and average Web document sizes suggests search-engine contents alone may equal 25 terabytes or more,[13] or somewhat larger than the known size of the surface Web.

B. A Common Denominator for Size Comparisons
All deep-Web and surface-Web size figures use both total number of documents (or database records in the case of the deep Web) and total data storage. Data storage is based on "HTML included" Web-document size estimates.[11] This basis includes all HTML and related code information plus standard text content, exclusive of embedded images and standard HTTP "header" information. Use of this standard convention allows apples-to-apples size comparisons between the surface and deep Web. The HTML-included convention was chosen because:
• Most standard search engines that report document sizes do so on this same basis.
• When saving documents or Web pages directly from a browser, the file size byte count uses this convention.

All document sizes used in the comparisons use actual byte counts (1024 bytes per kilobyte).

IV. ANALYSIS OF LARGEST DEEP WEB SITES
Site characterization required three steps:
• Estimating the total number of records or documents contained on that site.
• Retrieving a random sample of a minimum of ten results from each site and then computing the expressed HTML-included mean document size in bytes. This figure, times the number of total site records, produces the total site size estimate in bytes.
• Indexing and characterizing the search-page form on the site to determine subject coverage.

Estimating total record count per site was often not straightforward. A series of tests was applied to each site; the tests are listed in descending order of importance and confidence in deriving the total document count:
• E-mail messages were sent to the webmasters or contacts listed for all sites identified, requesting verification of total record counts and storage sizes (uncompressed basis); about 13% of the sites provided direct documentation in response to this request.
• Total record counts as reported by the site itself. This involved inspecting related pages on the site, including help sections, site FAQs, etc.
• Documented site sizes presented at conferences, estimated by others, etc. This step involved comprehensive Web searching to identify reference sources.
• Record counts as provided by the site's own search function. Some site searches provide total record counts for all queries submitted. For others that use the NOT operator and allow its stand-alone use, a query term known not to occur on the site, such as "NOT ddfhrwxxct", was issued. This approach returns an absolute total record count. Failing these two options, a broad query was issued that would capture the general site content; this number was then corrected for an empirically determined "coverage factor," generally in the 1.2 to 1.4 range [14].
• A site that failed all of these tests could not be measured and was dropped from the results listing.

V. ANALYSIS OF STANDARD DEEP WEB SITES
Analysis and characterization of the entire deep Web involved a number of discrete tasks:
• Estimation of total number of deep Web sites.
• Deep Web size analysis.
• Content and coverage analysis.
• Site page views and link references.
• Growth analysis.
• Quality analysis.

A. Estimation of Total Number of Sites
The basic technique for estimating total deep Web sites uses "overlap" analysis, the accepted technique chosen for two
of the more prominent surface Web size analyses.[5b][15] We used overlap analysis based on search engine coverage and the deep Web compilation sites noted above. The technique is illustrated in the diagram below:

FIGURE 3. SCHEMATIC REPRESENTATION OF "OVERLAP" ANALYSIS

Overlap analysis involves pair-wise comparisons of the number of listings individually within two sources, na and nb, and the degree of shared listings or overlap, n0, between them. Assuming random listings for both na and nb, the total size of the population, N, can be estimated. The estimate of the fraction of the total population covered by na is n0/nb; an estimate for the total population size can then be derived by dividing this fraction into the total size of na, that is, N = na / (n0/nb). These pair-wise estimates are repeated for all of the individual sources used in the analysis.

To illustrate this technique, assume, for example, we know our total population is 100. Then if two sources, A and B, each contain 50 items, we could predict on average that 25 of those items would be shared by the two sources and 25 items would not be listed by either. According to the formula above, this can be represented as: 100 = 50 / (25/50).

There are two keys to overlap analysis. First, it is important to have a relatively accurate estimate for total listing size for at least one of the two sources in the pair-wise comparison. Second, both sources should obtain their listings randomly and independently from one another. This second premise is in fact violated for our deep Web source analysis. Compilation sites are purposeful in collecting their listings, so their sampling is directed. And, for search engine listings, searchable databases are more frequently linked to because of their information value, which increases their relative prevalence within the engine listings.[4b] Thus, the overlap analysis represents a lower bound on the size of the deep Web, since both of these factors will tend to increase the degree of overlap, n0, reported between the pair-wise sources.

B. Deep Web Size Analysis
In order to analyze the total size of the deep Web, we need an average site size in documents and data storage to use as a multiplier applied to the entire population estimate. We randomized our listing of 17,000 search site candidates. We then proceeded to work through this list until 100 sites were fully characterized. We followed a less intensive process than in the large-sites analysis for determining total record or document count for the site. Exactly 700 sites were inspected in their randomized order to obtain the 100 fully characterized sites. All sites inspected received characterization as to site type and coverage; this information was used in other parts of the analysis.

C. Content Coverage and Type Analysis
Content coverage was analyzed across all 17,000 search sites in the qualified deep Web pool (results shown in Table 1); the type of deep Web site was determined from the 700 hand-characterized sites. Broad content coverage for the entire pool was determined by issuing queries for twenty top-level domains against the entire pool. Because of topic overlaps, total occurrences exceeded the number of sites in the pool; this total was used to adjust all categories back to a 100% basis.

TABLE 1. DISTRIBUTION OF DEEP SITES BY SUBJECT AREA
Subject area          Deep Web coverage
Agriculture           2.7%
Arts                  6.6%
Business              5.9%
Computing/Web         6.9%
Education             4.3%
Employment            4.1%
Engineering           3.1%
Government            3.9%
Health                5.5%
Humanities            13.5%
Law/Politics          3.9%
Lifestyle             4.0%
News/Media            12.2%
People, companies     4.9%
Recreation, Sports    3.5%
References            4.5%
Science, Math         4.0%
Travel                3.4%
Shopping              3.2%

Hand characterization by search-database type resulted in assigning each site to one of twelve arbitrary categories that captured the diversity of database types. These twelve categories are:
• Topic Databases -- subject-specific aggregations of information, such as SEC corporate filings, medical databases, patent records, etc.
• Internal site -- searchable databases for the internal pages of large sites that are dynamically created, such as the knowledge base on the Microsoft site.
• Publications -- searchable databases for current and archived articles.
• Shopping/Auction.
• Classifieds.
• Portals -- broader sites that included more than one of these other categories in searchable databases.
• Library -- searchable internal holdings, mostly for university libraries.
• Yellow and White Pages -- people and business finders.
• Calculators -- while not strictly databases, many do include an internal data component for calculating results. Mortgage calculators, dictionary look-ups, and translators between languages are examples.
• Jobs -- job and resume postings.
• Message or Chat.
• General Search -- searchable databases most often relevant to Internet search topics and information.

D. Site Page-views and Link References
Netscape's "What's Related" browser option, a service from Alexa, provides site popularity rankings and link reference counts for a given URL.[17] About 71% of deep Web sites have such rankings. The universal power function (a logarithmic growth rate or logarithmic distribution) allows page-views per month to be extrapolated from the Alexa popularity rankings.[18] The "What's Related" report also shows external link counts to the given URL. A random sample of 100 deep Web sites and 100 surface Web sites for which complete "What's Related" reports could be obtained was used for the comparisons.

E. Growth Analysis
The best method for measuring growth is with time-series analysis. However, since the discovery of the deep Web is so new, a different gauge was necessary. Whois [19] searches associated with domain-registration services [16] return records listing the domain owner, as well as the date the domain was first obtained (and other information). Using a random sample of 100 deep Web sites [17b] and another sample of 100 surface Web sites,[20] we issued the domain names to a Whois search and retrieved the date the site was first established. These results were then combined and plotted for the deep vs. surface Web samples.

F. Quality Analysis
Quality comparisons between the deep and surface Web content were based on five diverse subject areas: agriculture, medicine, finance/business, science, and law. The queries were specifically designed to limit total results returned from any of the six sources to a maximum of 200 to ensure complete retrieval from each source.[21] The specific technology configuration settings are documented in the endnotes.[22] The "quality" determination was based on an average of our technology's VSM and mEBIR computational linguistic scoring methods.[23][24] The "quality" threshold was set at our score of 82, empirically determined as roughly accurate from millions of previous scores of surface Web documents.

VI. CONCLUSION
This study is the first known quantification and characterization of the deep Web. Very little has been written or known of the deep Web. Estimates of size and importance have been anecdotal at best and certainly underestimate scale. For example, Intelliseek's "invisible Web" says that, "In our best estimates today, the valuable content housed within these databases and searchable sources is far bigger than the 800 million plus pages of the 'Visible Web.'" They also estimate total deep Web sources at about 50,000 or so.[25] A mid-1999 survey by About.com's Web search guide concluded the size of the deep Web was "big and getting bigger."[26] A paper at a recent library science meeting suggested that only "a relatively small fraction of the Web is accessible through search engines."[27] The deep Web is about 500 times larger than the surface Web, with, on average, about three times higher quality based on our document scoring methods on a per-document basis. On an absolute basis, total deep Web quality exceeds that of the surface Web by thousands of times. Total number of deep Web sites likely exceeds 200,000 today and is growing rapidly.[28] Content on the deep Web has meaning and importance for every information seeker and market. More than 95% of deep Web information is publicly available without restriction. The deep Web also appears to be the fastest growing information component of the Web.

REFERENCES
[1] A couple of good starting references on various Internet protocols can be found at http://wdvl.com/Internet/Protocols/ and http://www.webopedia.com/Internet_and_Online_Services/Internet/Internet_Protocols/.
[2] Tenth edition of GVU's (graphics, visualization and usability) WWW User Survey, May 14, 1999. [formerly http://www.gvu.gatech.edu/user_surveys/survey-1998-10/tenthreport.html]
[3] 3a, 3b. "4th Q NPD Search and Portal Site Study," as reported by Search Engine Watch [formerly http://searchenginewatch.com/reports/npd.html]. NPD's Web site is at http://www.npd.com/.
[4] 4a, 4b. "Sizing the Internet," Cyveillance [formerly http://www.cyveillance.com/web/us/downloads/Sizing_the_Internet.pdf].
[5] 5a, 5b. S. Lawrence and C.L. Giles, "Searching the World Wide Web," Science 280:98-100, April 3, 1998.
[6] S. Lawrence and C.L. Giles, "Accessibility of Information on the Web," Nature 400:107-109, July 8, 1999.
[7] See http://www.google.com.
[8] See http://www.alltheweb.com and quoted numbers on entry page.
[9] Northern Light is one of the engines that allows a "NOT meaningless" query to be issued to get an actual document count from its data stores. See http://www.northernlight.com. NL searches used in this article exclude its "Special Collections" listing.
[10] See http://www.wiley.com/compbooks/sonnenreich/history.html.
[11] 11a, 11b. This analysis assumes there were 1 million documents on the Web as of mid-1994.
[12] Empirical Bright-Planet results from processing millions of documents provide an actual mean value of 43.5% for HTML and related content. Using a different metric, NEC researchers found HTML and related content with white space removed to account for 61% of total page content (see 7). Both measures ignore images and so-called HTML header content.
[13] Rough estimate based on 700 million total documents indexed by AltaVista, Fast, and Northern Light, at an average document size of 18.7 KB (see reference 7) and a 50% combined representation by these three sources for all major search engines. Estimates are on an "HTML included" basis.
[14] For example, the query issued for an agriculture-related database might be "agriculture." Then, by issuing the same query to Northern Light and comparing it with a comprehensive query that does not mention the term "agriculture" [such as "(crops OR livestock OR farm OR corn OR rice OR wheat OR vegetables OR fruit OR cattle OR pigs OR poultry OR sheep OR horses) AND NOT agriculture"], an empirical coverage factor is calculated.
[15] K. Bharat and A. Broder, "A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines," paper presented at the Seventh International World Wide Web Conference, Brisbane, Australia, April 14-18, 1998. The full paper is available at http://www7.scu.edu.au/1937/com1937.htm.
[16] See, for example, http://www.surveysystem.com/sscalc.htm, for a sample size calculator.
[17] 17a, 17b. See [formerly http://cgi.netscape.com/cgi-bin/rlcgi.cgi?URL=www.mainsite.com./dev-scripts/dpd].
[18] See reference 38. Known page-views for the logarithmic popularity rankings of selected sites tracked by Alexa are used to fit a growth function for estimating monthly page-views based on the Alexa ranking for a given URL.
[19] See, for example among many, BetterWhois at http://betterwhois.com.
[20] The surface Web domain sample was obtained by first issuing a meaningless query to Northern Light, 'the AND NOT ddsalsrasve', and obtaining 1,000 URLs. This 1,000 was randomized to remove (partially) ranking prejudice in the order Northern Light lists results.
[21] An example specific query for the "agriculture" subject area is "agriculture* AND (swine OR pig) AND 'artificial insemination' AND genetics."
[22] The Bright-Planet technology configuration settings were: max. Web page size, 1 MB; min. page size, 1 KB; no date range filters; no site filters; 10 threads; 3 retries allowed; 60 sec. Web page timeout; 180 minute max. download time; 200 pages per engine.
[23] The vector space model, or VSM, is a statistical model that represents documents and queries as term sets, and computes the similarities between them. Scoring is a simple sum-of-products computation, based on linear algebra. See further: Salton, Gerard, Automatic Information Organization and Retrieval, McGraw-Hill, New York, N.Y., 1968; and Salton, Gerard, Automatic Text Processing, Addison-Wesley, Reading, MA, 1989.
[24] See, as one example among many, CareData.com, at [formerly http://www.citeline.com/pro_info.html].
[25] See the Help and then FAQ pages at [formerly http://www.invisibleweb.com].
[26] C. Sherman, "The Invisible Web," [formerly http://websearch.about.com/library/weekly/aa061199.htm].
[27] I. Zachery, "Beyond Search Engines," presented at the Computers in Libraries 2000 Conference, March 15-17, 2000, Washington, DC; [formerly http://www.pgcollege.org/library/zac/beyond/index.htm].
[28] The initial July 26, 2000, version of this paper stated an estimate of 100,000 potential deep Web search sites. Subsequent customer projects have allowed us to update this analysis, again using overlap analysis, to 200,000 sites. This site number is updated in this paper, but overall deep Web size estimates have not. In fact, still more recent work with foreign language deep Web sites strongly suggests the 200,000 estimate is itself low.
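
As a worked illustration of the "overlap" analysis described in Section V.A, the pair-wise estimate N = na / (n0/nb) can be sketched in a few lines of Python. This is a minimal sketch only: the function name and the argument checks are assumptions made for illustration and are not part of the study's tooling.

```python
def overlap_population_estimate(n_a: int, n_b: int, n_0: int) -> float:
    """Estimate the total population N from two sources and their overlap.

    n_a, n_b: number of listings in source A and in source B.
    n_0:      number of listings shared by both sources.

    Following the text, the fraction of the population covered by A is
    estimated as n_0 / n_b, and N is obtained by dividing n_a by it.
    (Function name and validation are hypothetical, for illustration.)
    """
    if n_0 <= 0:
        raise ValueError("overlap must be positive to form an estimate")
    if n_0 > min(n_a, n_b):
        raise ValueError("overlap cannot exceed either source size")
    coverage_of_a = n_0 / n_b
    return n_a / coverage_of_a


if __name__ == "__main__":
    # The example from the text: two sources of 50 items sharing 25.
    print(overlap_population_estimate(50, 50, 25))  # 100.0
    # A larger overlap (non-independent sources) gives a smaller estimate.
    print(overlap_population_estimate(50, 50, 40))  # 62.5
```

The second call shows why the text treats the result as a lower bound: anything that inflates the overlap n0, such as directed sampling by compilation sites, pushes the estimate of N downward.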
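
The per-site size estimate described in Section IV (mean HTML-included document size from a sample of at least ten results, times the total record count) can be sketched as follows. The function names are hypothetical, and treating the "coverage factor" as a simple multiplier on the broad-query hit count is an assumption: the text only says the count was "corrected" by a factor in the 1.2 to 1.4 range.

```python
from statistics import mean

MIN_SAMPLE_SIZE = 10  # the text requires a minimum of ten sampled results


def estimate_site_size_bytes(sample_doc_sizes, total_records):
    """Total site size = mean sampled document size (bytes) x record count."""
    if len(sample_doc_sizes) < MIN_SAMPLE_SIZE:
        raise ValueError("need at least ten sampled documents")
    return mean(sample_doc_sizes) * total_records


def corrected_record_count(broad_query_hits, coverage_factor):
    """Scale a broad-query hit count by an empirical coverage factor.

    Assumption: the correction is multiplicative, since a broad query
    is expected to miss part of the site's records.
    """
    return broad_query_hits * coverage_factor


if __name__ == "__main__":
    sizes = [1000] * 10          # ten sampled documents of 1,000 bytes each
    records = corrected_record_count(1000, 1.25)
    print(estimate_site_size_bytes(sizes, records))
```

The sample values above are invented purely to exercise the functions; they are not measurements from the study.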
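
The rough search-engine content estimate in reference [13] (700 million documents at an average of 18.7 KB, with the three engines assumed to represent 50% of all major search engines) can be reproduced with the arithmetic below. This is a sketch under those stated figures; the function name is hypothetical, and whether a terabyte is taken as 1024^4 or 10^12 bytes is not specified in the text, so both are shown.

```python
BYTES_PER_KB = 1024  # the paper uses 1024 bytes per kilobyte


def search_engine_content_bytes(docs_indexed, mean_doc_kb, share_of_all_engines):
    """Scale the indexed content of a few engines up to all major engines."""
    indexed_bytes = docs_indexed * mean_doc_kb * BYTES_PER_KB
    return indexed_bytes / share_of_all_engines


if __name__ == "__main__":
    total = search_engine_content_bytes(700_000_000, 18.7, 0.5)
    print(f"binary terabytes:  {total / 1024**4:.1f}")
    print(f"decimal terabytes: {total / 10**12:.1f}")
```

Either convention lands in the neighborhood of the "25 terabytes or more" quoted in the text.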