Where the #@*! is That?: The Art of the Text Query

Reprinted with permission of PC AI Online Magazine V. 14 #1
For more information about PC AI Online Magazine, visit www.pcai.com
dtSearch Sales & Support: Website: www.primeinfotech.biz/dtsearch Email: sales@primeinfotech.biz

Let’s say law enforcement needs to locate all references to “incendiary devices” in a computerized stack of documents.

Or MegaHugeCorp is planning a merger with GigaHugeCorp, and the Federal Trade Commission has asked MegaHugeCorp for all materials on file relating to “marketplace competitive analysis.”

Or let’s say that during a high-speed chase through the streets of Prague, a CIA operative is able to wrest a single typed paragraph out of the hands of a suspected member of a terrorist organization. The CIA needs to find other possible matches to this text from millions of satellite and wire intercepts.

Or let’s say the whole family is coming to dinner, and you need the blueberry pie recipe you typed into your laptop last year.

In the first scenario, the obvious text search is for incendiary devices, while in the CIA case a search for a few keywords from the stolen paragraph might work. In the MegaHugeCorp case, marketplace competitive analysis might be the search. And to locate the pie, the obvious search string is blueberry.

And these searches would yield ... NOTHING! No “incendiary devices” in the first scenario. Nothing in the merger case: “I’m sorry Federal Trade Commission, but we’ve remarkably gotten to dominate our marketplace without any apparent thought to ‘marketplace competitive analysis.’” Nothing in the blueberry pie example: “I knew my dog was eating computer files!” And in the CIA case, 42
billion documents, which is far too many retrieved documents for even a pack of trained chimps to thumb through.

This is where advanced query techniques, available in dtSearch®, for example, are helpful. In the “incendiary devices” case, concept searching and stemming provide the solution. In the merger case, Boolean and proximity searching do the trick. In the CIA case, the solution is relevancy-ranked natural language searching and variable term weighting. And in the blueberry pie case, the answer is fuzzy searching because you misspelled blueberry.

Concept Searching and Stemming

Concept searching, also known as synonym or thesaurus searching, expands a single search request into multiple conceptual dimensions. For example, with concept searching, a search for incendiary automatically expands (using the search program’s built-in thesaurus) to include such synonyms as arsonist and inflammatory. Going broader into “related words” provides combustible and bomb.

An additional option is to find out the names of specific incendiary devices and enter them as synonyms in a user-defined addendum to the built-in thesaurus. The combined built-in thesaurus and user-defined thesaurus allows automatic synonym expansion covering a wide range of search terms, all with a simple search.

Now, thanks to concept searching, it is easy to find every document that includes any synonym of incendiary. But what about derivatives of these words, such as inflammatories in addition to inflammatory? This type of expansion requires stemming.

Stemming uses a built-in algorithm that is familiar with the native language, in this case English, to expand a search request to include word derivatives automatically. A search for apply that includes stemming finds applies, applied, applying, but not [...]

This brings up an important point of search request formation. When not highly familiar with the target documents (which is the case in all of the examples above, with the possible exception of the blueberry pie recipe), then cast the net broadly. But not too broadly. Choosing all words related to incendiary, or even all related words of related words, retrieves documents with terms no more relevant than felon or outlaw.

The fundamental principle of text queries is: when searching a very large and diverse database, as search completeness or retrieving all the potentially relevant information increases, so do “false hits” or the retrieval of irrelevant documents. Maximizing the chance of retrieving “the smoking gun” in a search, while minimizing the retrieval of irrelevant documents, requires casting the net just right.

Boolean, Phrase, Proximity, Wildcards, Field, Numeric Range

If concept searching and stemming were the only search tools, then it would be tough casting the net just right. Boolean operators such as and, or and not help refine the search request. For example, in the merger scenario, a search for market analysis or competitive analysis is a basic Boolean search in combination with a phrase search. Such a search retrieves all documents that contain either the phrase market analysis or the phrase competitive analysis or both.

But what if you also need to find documents containing text such as: the market that would be relevant for the analysis. Finding that type of text requires market near analysis or competitive near analysis. How near? Let’s say within 25 words, giving the resulting query: (market or competitive) w/25 analysis. This query finds all documents that contain either the term market or the term competitive relatively near to analysis.

[Figure: Subset, Superset, and “Anything But” Search Requests]

Wildcards, which complement a Boolean search, include the question mark (?) for replacing a single letter in a word, and the asterisk (*) for replacing any number of letters in a word. Suppose MegaHugeCorp’s previous names were MegaMediumCorp and MegaSmallCorp. A search for Mega*Corp retrieves all three. (Almost all text searches should be case insensitive. With the possible exception of source code searching, case-sensitive searches are usually a bad idea since they miss too many relevant words.)
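The wildcard behavior described above can be imitated in a few lines of Python. This is only a sketch of the idea, not dtSearch’s implementation: the function name wildcard_search and the sample word list are assumptions made for illustration, and Python’s standard fnmatch module stands in for the search engine’s own pattern matcher.

```python
import fnmatch
import re

def wildcard_search(pattern, words):
    """Return the words matching a ?/* wildcard pattern, ignoring case."""
    # fnmatch.translate turns a shell-style pattern into a regular expression:
    # '?' matches exactly one character, '*' matches any run of characters.
    regex = re.compile(fnmatch.translate(pattern), re.IGNORECASE)
    return [w for w in words if regex.match(w)]

# Hypothetical word list; note the lowercase entry still matches.
names = ["MegaHugeCorp", "MegaMediumCorp", "megasmallcorp", "GigaHugeCorp"]
print(wildcard_search("Mega*Corp", names))
print(wildcard_search("appl?", ["apply", "apple", "applied"]))
```

Compiling the pattern with re.IGNORECASE mirrors the advice that almost all text searches should be case insensitive.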
Another element that works well with Boolean searches is numeric range searching, such as searching for any number between 11 and 127. Field searching, or limiting a search to a specific document field, also works well with Boolean searching.

Combining Boolean searches with stemming and concept searching is also powerful. For example, with stemming on, the search request (market or competitive) w/25 analysis retrieves not only market, but also markets and marketing.

Search requests numbered in the figure “Subset, Superset, and ‘Anything But’ Search Requests”:
1 (market or competitive) w/25 analysis
2 ((market or competitive) w/25 analysis) and not supermarket
3 ((market or competitive) w/25 analysis) not w/75 supermarket
4 (((market or competitive) w/25 analysis) and not supermarket) or exclusionary
5 monopoly and not ((((market or competitive) w/25 analysis) and not supermarket) or exclusionary)
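To make the combination of proximity and stemming concrete, here is a minimal Python sketch of how a request like (market or competitive) w/25 analysis could be evaluated against one document. Everything in it is an assumption for illustration: the stem function is a deliberately crude suffix stripper rather than a real English stemmer, and the names tokenize and within are hypothetical, not part of any dtSearch interface.

```python
import re

def stem(word):
    """Crude suffix stripper for illustration; real stemmers know far more rules."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase the text, split it into words, and stem each word."""
    return [stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())]

def within(text, left_terms, right_term, distance):
    """True if any left term occurs within `distance` words of right_term."""
    tokens = tokenize(text)
    left = {stem(t) for t in left_terms}
    right = stem(right_term)
    left_pos = [i for i, t in enumerate(tokens) if t in left]
    right_pos = [i for i, t in enumerate(tokens) if t == right]
    return any(abs(i - j) <= distance for i in left_pos for j in right_pos)

# Hypothetical memo text: "Marketing" matches "market" once both are stemmed.
memo = "Marketing asked for the analysis that would be relevant to the merger."
print(within(memo, ["market", "competitive"], "analysis", 25))
```

Because the comparison runs on stemmed tokens, marketing and markets both count as hits for market, while supermarket does not.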
Narrow, Broaden and

Although it is possible to arrive at a query such as (market or competitive) w/25 analysis through logical deduction alone, a dose of trial and error is often necessary. This is particularly true if a high degree of familiarity with the target database is lacking. In the merger scenario, it is unlikely that anyone has previously seen every single document relating to “marketplace competitive analysis.”

Suppose the query (market or competitive) w/25 analysis finds a slew of documents pertaining to the analysis of supermarket shoppers. The documents date to a time when MegaHugeCorp considered offering its non-food wares in supermarkets but then rejected the idea. Because neither merging company presently sells through supermarkets, these documents fall outside of the Federal Trade Commission’s document request.

Trial and error results in a narrowed search request: ((market or competitive) w/25 analysis) and not supermarket. This finds the same document set as the previous search request, excluding all documents that contain the word supermarket. The search results represent a subset of the previous search. Alternatively, a search request that creates a slightly larger subset, by excluding only documents that contain the word supermarket within 75 words of the market/competitive/analysis cluster is: ((market or competitive) w/25 analysis) not w/75 supermarket.

Suppose the search request ((market or competitive) w/25 analysis) and not supermarket requires expansion to include the search term exclusionary. Broadening the search request creates a superset of the previous search. Effectively, this takes the previous search request and adds an “or” element: (((market or competitive) w/25 analysis) and not supermarket) or exclusionary.

Finally, after painstaking review of every retrieved document in the search request (((market or competitive) w/25 analysis) and not supermarket) or exclusionary, the legal department suggests adding the term monopoly. An “anything but” search such as monopoly and not ((((market or competitive) w/25 analysis) and not supermarket) or exclusionary) ensures retrieving only new files excluded from the previous search.

Natural Language Searching

Until now, all search requests have been structured or Boolean, with keywords such as market, competitive, analysis, supermarket and exclusionary, and structural connectors such as or, and, w/25 and not. Boolean is great for searches involving a clear idea of what meets the terms of a search request. But what if the CIA, for example, has only a general sense of looking for some type of document match? In that case, relevancy-ranked natural language searching, also known as query-by-example, is a possible solution. Suppose the CIA operative retrieved the following block of text representing, remarkably, a terrorist limited warranty:

    any and all other representations and warranties, express or implied, including but not limited to implied warranties of merchantability, fitness for a particular purpose, including without limitation, whether blue bird succeeds in flying over the orange house, are expressly excluded and disclaimed.

Finding a document that contains the closest match to this text (perhaps a document containing draft negotiations involving this paragraph) with natural language searching requires simply cutting and pasting this entire paragraph into a search request. Natural language searching then retrieves matching documents according to their relevancy, with the document having the highest relevancy ranking first.

The natural language search, using a query in raw format like the above paragraph, singles out keywords: representations, warranties, express, implied, etc. The search ignores connectors and other “noise” words: any, and, all, other, etc. The search then finds the documents containing the closest match to the keywords, taking into account the density and the rarity of hits.

For example, if express appears in 2 million documents and warranties appears in only two, then warranties would have a much higher relevancy weighting. Natural language searching is also combinable with stemming and concept searching. These options yield warranty and warranties as well as guarantee and guarantees.

Besides its status as one of the most advanced search types, natural language searching is also the easiest.

Under the Hood of Natural Language Searching

The key to processing natural language searches in dtSearch is the vector space model, which compares a natural language search request to documents with matching search terms. This model views the search request as a series of “n” dimensions in space, with “n” corresponding to the number of words in the search request. The formula looks for the smallest vector angle between the search request and other documents with matching search terms, also viewed as “n” dimensions in space. Because vector space natural language searching weighs search request terms against the density and frequency of search terms in a document collection, this feature is only available in indexed searching. (See Indexed vs. Unindexed Searching.)
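The vector space idea can be illustrated with a small Python sketch. It is not dtSearch’s formula, which the article does not give: the noise-word list, the rarity weight (a smoothed inverse document frequency) and the names keywords and rank are all assumptions. What it does share with the description above is the shape of the computation: the request and each document become vectors with one dimension per request keyword, and documents are ranked by the cosine of the angle between them, so the smallest angle ranks first.

```python
import math
import re
from collections import Counter

# Assumed noise-word list, for illustration only.
NOISE = {"any", "and", "all", "other", "or", "of", "the", "a", "to", "for", "in"}

def keywords(text):
    """Lowercase words with the noise words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in NOISE]

def rank(query, documents):
    """Return (score, document index) pairs, highest cosine similarity first."""
    doc_terms = [Counter(keywords(d)) for d in documents]
    q_terms = Counter(keywords(query))
    n = len(documents)
    # Rarity weight: a term found in fewer documents counts for more.
    idf = {t: math.log((n + 1) / (1 + sum(1 for d in doc_terms if t in d))) + 1
           for t in q_terms}
    q_vec = {t: q_terms[t] * idf[t] for t in q_terms}
    q_norm = math.sqrt(sum(v * v for v in q_vec.values()))
    results = []
    for i, terms in enumerate(doc_terms):
        # One dimension per request keyword, as in the sidebar's description.
        d_vec = {t: terms[t] * idf[t] for t in q_terms}
        dot = sum(q_vec[t] * d_vec[t] for t in q_terms)
        norm = math.sqrt(sum(v * v for v in d_vec.values()))
        score = dot / (q_norm * norm) if norm and q_norm else 0.0
        results.append((score, i))
    return sorted(results, key=lambda pair: -pair[0])

# Hypothetical three-document collection.
docs = [
    "express shipping rates and express delivery",
    "express mail service",
    "implied warranties of merchantability are excluded",
]
for score, index in rank("express warranties", docs):
    print(round(score, 3), docs[index])
```

In this toy collection express appears in two documents and warranties in one, so the document containing warranties ranks first, which is the rarity effect the article describes. The rarity weights need counts from the whole collection, which is consistent with the sidebar’s point that this kind of ranking depends on an index.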

Indexed vs. Unindexed Searching

How can dtSearch, for example, search over terabytes of text in less than a second? It does so by building an index that stores the location of each word within a document. Once an index is complete, search time is generally less than a second, even through millions of files.

dtSearch also allows unindexed and combination indexed/unindexed searching. These search options are useful for a single pass through material to discover if there is any relevant information. For example, unindexed searching might be useful to a law enforcement agency that, after confiscating a stack of hard drives, wants to know if any data on them is pertinent to a criminal investigation. Although unindexed searching is much slower than indexed searching, it is faster to do a single unindexed search than to build a search index and then do an indexed search.

dtSearch Search Type                        Indexed                      Unindexed

Speed                                       Usually instantaneous,       Much slower than indexed
                                            even across millions of      search; but building an
                                            documents                    index and doing a single
                                                                         search is slower than doing
                                                                         an unindexed search
Concept / Synonym / Thesaurus               Yes                          Yes
Fielded Data                                Yes                          Yes
Phrase                                      Yes                          Yes
Boolean                                     Yes                          Yes
Proximity and directed proximity            Yes                          Yes
Wildcard                                    Yes                          Yes
Numeric range                               Yes                          Yes
Macro                                       Yes                          Yes
Stemming (finds variations on endings,      Yes                          Yes
like applies, applied, applying in a
search for apply)
Phonic                                      Yes                          Yes
Fuzziness (adjusts from 0 to 10 for         Yes (fuzziness is not        Yes
fine-tuning fuzziness to the level of       “hardwired” into the
OCR or typographical errors in files;       index, making it
a search for alphabet with a fuzziness      adjustable at the time
of 1 would find alphaqet; with a            of search)
fuzziness of 3, it would find both
alphaqet and alpkaqet)
Natural language searching, with            Yes                          No
vector-space relevancy ranking
Variable term weighting                     Yes                          Yes
Unicode                                     Yes                          Yes
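The index described in this sidebar, one that stores the location of each word within a document, is commonly called a positional inverted index. The Python sketch below shows the general idea under simple assumptions; it says nothing about dtSearch’s actual index format, and the function names and the two sample files are hypothetical.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each word to {document id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word][doc_id].append(position)
    return index

def indexed_search(index, word):
    """Look the word up directly; no document text is read at search time."""
    return sorted(index.get(word.lower(), {}))

def unindexed_search(documents, word):
    """Scan every document from scratch; slower per search, but no index to build."""
    word = word.lower()
    return sorted(doc_id for doc_id, text in documents.items()
                  if word in re.findall(r"[a-z0-9]+", text.lower()))

# Hypothetical two-file collection.
files = {"a.txt": "Blueberry pie recipe", "b.txt": "Apple pie"}
index = build_index(files)
print(indexed_search(index, "pie"))
print(index["blueberry"]["a.txt"])
print(unindexed_search(files, "pie"))
```

Building the index reads every document once, after which each lookup is a dictionary access. That is the trade-off the sidebar describes: for a single pass over confiscated drives the scan is cheaper overall, while repeated searches favor the index.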

Using natural language searching, a completely unstructured query such as “Get me the memo by Sam Smith on the weather forecast for hurricanes in 1996” leads right to the most relevant document.

Variable Term Weighting

Let’s assume a natural language query for the confiscated paragraph comes up with millions of warranties for, of all things, birdhouses. Filtering out actual birdhouse warranties would greatly speed up the search effort. Words such as seed and nest are typical in birdhouse warranties, but are probably irrelevant to the quest for the terrorists. A search with variable term weighting could give these words a negative weight, with the resulting natural language query: seed:-7 nest:-7 any and all other representations and warranties ....

Adding the negative ratings overrides the default of having all keywords in a natural language search request rank positively by search term density and rarity. Instead, the natural language search request proceeds as before, with additional negative scoring of retrieved documents for seed and nest.

Adding greater positive weight to certain keywords further deviates from a pure natural language search request. For example, if flying is a key term in the paragraph, it might justify a very positive rating such as flying:9.

It is also possible to add variable term weights to structured or Boolean search requests. With ((market or competitive) w/25 analysis) and not supermarket, an alternative to using and not supermarket to exclude all documents that contain the word supermarket would be to search for ((market or competitive) w/25 analysis) and supermarket:-10. This downweights supermarket without excluding it entirely.

Fuzzy and Phonic

And now for the missing blueberry pie recipe. After all of these complex Boolean and natural language search requests, a simple search for blueberry should be a piece of well ... pie. A search for blueberry, with stemming on, finds the entry whether it is blueberry or blueberries. But suppose blueberry is misspelled “bluegerry.” Boolean searches alone could not easily come up with the correct document.

The answer is fuzzy searching. Turning on search fuzziness to a low level finds words that match one or two deviations in letters: bluegerry, bluugerry, etc. Turning on fuzziness to a higher level finds words with even more deviations in letters: blubber and bluster.

A dtSearch search for blueberry with a fuzzy level of 1 would retrieve bluegerry as well as blueberry.

Once again, there is a direct correspondence between retrieving all possible word variations and generating “false hits.” For this reason, it’s usually best to do the search first with a low level of fuzziness, and only if that doesn’t work, to increase to a higher level of fuzziness. Note that with fuzzy searching, a misspelled search term can also find a search term that is spelled correctly in the original document.

Fuzzy searching is useful for text with spelling errors, such as typographical and OCR errors. For sound-alike errors, phonic searching can also come in handy. For example, a search for Smith finds Smythe with phonic searching.

Both fuzzy and phonic searching are combinable with Boolean, natural language and other search features. Just in case the rest of the world also can’t spell, all search requests in the previous sections are combinable with fuzzy searching.

Please visit dtSearch online at www.dtsearch.com
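The fuzzy and phonic matching described in the last section can be approximated with two textbook techniques, sketched here in Python. Both are stand-ins chosen for illustration, not dtSearch’s algorithms: Levenshtein edit distance plays the role of fuzziness (the max_distance parameter is an assumption and does not correspond to the 0 to 10 fuzziness scale), and a simplified Soundex code plays the role of phonic matching.

```python
def edit_distance(a, b):
    """Levenshtein distance: single-letter insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_search(term, words, max_distance):
    """Words within max_distance letter changes of the search term."""
    term = term.lower()
    return [w for w in words if edit_distance(term, w.lower()) <= max_distance]

# Soundex letter groups: letters in the same group sound alike.
CODES = {c: d for d, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                                 "4": "l", "5": "mn", "6": "r"}.items()
         for c in letters}

def soundex(word):
    """Simplified Soundex: first letter plus up to three sound codes."""
    word = word.lower()
    result = word[0].upper()
    last = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != last:
            result += code
        if ch not in "hw":  # h and w do not reset the previous code
            last = code
    return (result + "000")[:4]

recipes = ["bluegerry", "bluugerry", "blubber", "bluster"]
print(fuzzy_search("blueberry", recipes, 1))
print(fuzzy_search("blueberry", recipes, 2))
print(soundex("Smith"), soundex("Smythe"))
```

Raising max_distance pulls in more variants, and eventually unrelated words such as blubber, which is the false-hit trade-off the article warns about. The distance is symmetric, so a misspelled search term finds the correctly spelled word just as well as the reverse.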
