Docstoc

Queries in Oracle 9i Text

Document Sample
Queries in Oracle 9i Text Powered By Docstoc
					Oracle Text Queries                                                                   Nina Bergli
Draft                                                                                 30.10.03 / 1



                            Queries in Oracle 9i Text
                                            Draft
                                         Nina Bergli
                                         Oct. 30, 2003


The context index in Oracle Text supports a wide range of query operators with the
CONTAINS operator that is defined through the index. CONTAINS must be used when the
index is of type CONTEXT. Contains queries can be used in combinations with all of the
operators discussed below.


ABOUT

An ABOUT query is a query on a document theme, This implies that text documents can be
queried on their theme and not just the words it contain. The ABOUT operator is applied in
order to make Oracle retrieve documents that contain concepts that are related to a specific
query word or query phrase. Concepts in the knowledge-base are both concrete, for example,
issuing a query for whales might return documents about orcas or other specific instances of
the species, and abstract, for example documents about lions and happiness. The document
need not contain the term whale to be retrieved. It follows that the term or phrase specified in
the query need not exactly match the themes stored in the index. Oracle normalizes the word
or phrase before performing lookup in the index. The normalization process of a query is
based on the knowledge base supplied by Oracle, and which is case-sensitive. Thus, ABOUT
queries yield the best results if queries are formulated in proper case. This must not be
interpreted as a definite requirement since Oracle will look for other case combinations if no
results are found for the case as in the query. For example, a query for Blue Whale might
return documents about blue whales

Themes are extracted from documents during indexing and enabled by a hierarchical list of
categories and concepts that represents a view of the world. This hierarchical list is called a
knowledge-base or more commonly a thesaurus. Oracle supplies a knowledge-base/thesaurus
for English and French. This knowledge base is enabled through the theme component of a
CONTEXT index. The system is then able to look up concept information in the index.

The knowledge-base supplied is basically a hierarchy of, terms within different categories for
example science and business. Since the ABOUT operator uses the supplied knowledge base
in English and French to interpret the phrase that is entered, the ABOUT query is limited to
knowing and interpreting the concepts in the knowledge base. By adding a domain specific
knowledge base, the results of an ABOUT query can be improved.

The about query increases the number of relevant documents retrieved with respect to a
specific query. Oracle scores results for an ABOUT query with the most relevant documents
retrieving the highest score.
An ABOUT query returns a score, which is a relevance measure. Thus, high scores (scores
that are close to or equal 1) represent documents that are likely to be highly relevant, while
Oracle Text Queries                                                                     Nina Bergli
Draft                                                                                   30.10.03 / 2

low scores represent documents not that likely to be relevant. A threshold for score values to
be returned from queries can be defined in the index.1


.



ACCUMULATE (,)
The ACCUMULATE operator is used to search for documents that contain at least one
occurrence of any of the query terms. The ACCUMULATE operator ranks documents
according to the total term weight of a document. Term weight refers to the weight that is
assigned to a query term. A query where the query terms are not explicitly assigned weights
will automatically be assigned the weight 1 for each term.
Accumulate scoring guarantees that if a document A matches p terms with a total term weight
of m, and document B matches q terms with a total term weight of m+1, document B is
guaranteed to have a higher relevance score than document A, regardless of the numbers p
and q.
If two documents have the same weight M, the higher relevance score goes to the document
with the higher weighted average term score.


Other operators
       -   And (&): The score returned is the score of the lowest query term.
       -   Equivalence (=): Specifies an acceptable substitution for a word in a query. The
           following example returns all documents that contain either the phrase “bluewhales
           are social animals” or “minkewhales are social animals”: ‘bluewhales=minkewhales
           are social animals’.
       -   Minus (-): The operator is used to search for documents that contain one query term
           and the presence of another term should cause the document to be ranked lower.
       -   Near (;): The near operator returns a score based on the proximity of two or more
           query terms. Oracle returns higher scores for terms closer together and lower scores
           for terms further apart in a document. The scoring for the NEAR operator combines
           frequency of the terms with proximity of the terms. A score between 1 and 100 is
           generated for each document that satisfies the query condition. Near can not be used in
           ABOUT queries. The operator takes the following arguments:
               o Word 1-n: The terms in the query can be either single words or phrases and
                   must be separated by commas.
               o Max_span: Can be used to specify the maximum distance allowed between
                   two query terms. Generally, it is used to specify the size of the biggest clump.
                   Clump is defined as the smallest group of words in which all query terms
                   occur. All clumps begin and end with a query term.
               o Order: Can be specified to true if the documents retrieved should contain the
                   query terms in the same order as in the query formulation. Max_span must be
                   specified when using the order argument.
       -   Not ( ): The Not operator is used to search for documents containing one query term
           but not another.


1
    The Threshold operator is discussed further in the section”Other operators”.
Oracle Text Queries                                                                   Nina Bergli
Draft                                                                                 30.10.03 / 3

   -   Or (|): The Or operator is used to search for documents that contains at least one
       occurrence of one of the query terms.
   -   Soundex (!): This operator is used to expand queries with words that have similar
       sounds, that is, words that sound like each other. This function is only available in
       English.
   -   Stem ($): The stem operator is used to search for terms that have the same linguistic
       root as the query term.
   -   Fuzzy: The operator is used to expand queries to include words that are spelled
       similarly to the specified query term. This type of expansion is helpful for finding
       more accurate results when there are frequent misspellings in the document set. The
       syntax enables ranking of the result set so that documents that contain words with high
       similarity to the query word are scored higher than documents with lower similarity.
       The number of words generated by a fuzzy expansion depends on what is in the index.
       Thus, results can vary significantly.
       The parameters for fuzzy operator are
           o Query term
           o Score: specifies a similarity score between 1 and 80. Default is 60.
           o Numresults: Specifies the maximum number of terms to use. Valid range is
               from 1 – 5000. Default is 100.
           o Weight: Specifies whether or not the results are to be weighted according to
               their similarity scores.

   -   Broader term operators (BTO): The BTO’s are used to expand a query to include
       terms that has been defined in a thesaurus as the broader or higher-level term for a
       specific query term. Expansion of a broader term is also possible until the top of the
       thesaurus hierarchy is reached. A level argument can be optionally specified for a
       BTO query. The level then specifies how many levels of the thesaurus hierarchy that is
       to be traversed. The default level value is one. If the name of the thesaurus to be used
       is specified, the levels to be traversed must also be specified. The default value for the
       thesaurus argument is DEFAULT. A thesaurus called DEFAULT must therefore exist
       in the thesaurus tables.
           o BT: Expands a query to a broader term as defined in the thesaurus.
           o BTG: Expands a query to include all broader generic terms as defined in the
               thesaurus.
           o BTP: Expands a query to include all terms defined as broader partitive terms
               for the specific term in the thesaurus.
           o BTI: Expands a query to include all terms defined as broader instance terms for
               the specific term in the thesaurus.

   -   Narrower term operators (NTO): The use of NTOs are equal to the use of BTOs but
       it works opposite. The NTOs can as the BTOs be further divided into:
           o NT
           o NTG
           o NTP
           o NTI
       All these operators equals the operators under BTO so they will not be commented
       further.

   -   Preferred term (PT): Used to replace a query term with the preferred term as defined
       in the thesaurus for the query term.
Oracle Text Queries                                                                 Nina Bergli
Draft                                                                               30.10.03 / 4

   -   Related term (RT): Expands a query to include all related terms that have been
       defined in the thesaurus.
   -   Synonym (SYN): Expands a query to include all terms that have been defined as
       synonyms for the query term in the thesaurus.
   -   Translation term (TR): Expands a query to include all defined foreign language
       terms. The following are arguments to the operator:
           o Term
           o Language
           o Thesaurus
   -   Translation Term Synonym (TRSYN): Combination of TR and SYN.
   -   Top Term (TT): Replaces a term in the query with the top term that has been defined
       for the specific term in the standard hierarchy (BT, NT) in the thesaurus. Top terms in
       the generic (BTG, NTG), partitive (BTP, NTP), and instancr (BTI, NTI) hierarchies
       are not returned.
   -   Threshold (>): The operator can be used at two levels, the expression and the query
       term level. At the expression level documents that score below the defined threshold
       are eliminated. In the case of the query term level, the threshold operator selects a
       document based on how a term scores in the document.
   -   Weight (*): This operator is used to mark the more important terms in a query given
       that the query has at least two terms. Terms that are considered to be negative for the
       overall result can be given weights below 1 so that documents containing this term
       obtain a reduced score. Weight multiplies the score by the supplied factor and the
       maximum sum that can be achieved is 100, thus implying that documents that score
       above 100 will be set to a score of 100. The weight operator is useful in accumulate,
       OR and AND queries when the expression has more than one query term. When there
       is no weighting on individual terms, the score will not tell which of the terms that
       appears more often in the document.
   -   Wildcards (%, _): Can be used to expand a search to a pattern search. The wildcard
       characters are % and _. % specifies that any characters can appear in multiple
       positions represented by the wildcard. _ specifies a single position in which any
       character can occur. Wildcard query performance is improved by adding a substring
       and a prefix index.
Oracle Text Queries                                                                   Nina Bergli
Draft                                                                                 30.10.03 / 5




Example queries in Oracle Text

The text table text_tab defined in Appendix A is used in order to illustrate the query operators
in Oracle Text. The definition is as follows:

create type keyword_nt as object(
keyword varchar2(20));

create or replace type keyword_tab as table of keyword_nt;

create or replace type text_type as object
(
identifier number,
dummy char(1), --dummy column for indexing
title varchar2(200),
topic varchar2(200),
format varchar2(10),
published_date date,
media_type varchar2(200),
description varchar2(4000),
source varchar2(200),
keyword_list keyword_tab,
language varchar2(100),
text clob,

static procedure addKeyword(k varchar2, i in number));


create table text_tab of text_type
(
primary key (identifier)
)
nested table keyword_list store as text_ntab(
(primary key (nested_table_id, keyword))
organization index compress)
return as locator;
Oracle Text Queries                                                                   Nina Bergli
Draft                                                                                 30.10.03 / 6


Operator       Example query                                                     Result
About          ‘about(Whales)’ -- Documents about whales                         ok
               ‘about(whales and reproduction)’ – Documents about whales
               and reproduction
               In terms of scoring, documents that contain both themes will
               score higher than documents that have only one of the themes in
               the query.

               The ABOUT operator can be combined with other operators for
               example AND or NOT.
               about (minke whales) AND reproduction
               about(blue whales) not about (reproduction)

               Select score (1),identifier,title
               From text_tab
               Where CONTAINS(text,’about(whales)’,1)>0
               Order by score(1) desc;

               select identifier,title,score(1)
               from text_tab
               where contains ( text, 'about(whales)',1 ) > 0;
Accumulate     SELECT SCORE(1), title,source                                     SCORE(1)
                FROM text_tab                                                    TITLE
                       WHERE CONTAINS(text, '                                    ---------------
               bluewhale,orcas,beluga,killer-whale', 1) > 0                      -------
                       ORDER BY SCORE(1) DESC;                                   28 Minke
                                                                                 whales
                                                                                        1
                                                                                 1Communic
                                                                                 ation -Music
                                                                                 to swim by
And            SELECT SCORE(1), title,description                                no rows
               FROM text_tab                                                     selected
                     WHERE CONTAINS(text, ' bluewhale AND killer-whale',
               1) >0

Equivalence SELECT SCORE(1), title from text_tab                                 no rows
                   WHERE CONTAINS(text, ' bluewhales=minkewhales                 selected
            are social animals ', 1) > 0
                   ORDER BY SCORE(1) DESC;

Minus          SELECT SCORE(1), title,description                                ok
               FROM text_tab
                   WHERE CONTAINS(text, ' bluewhale - killer-whale', 1)
               >0

Near           SELECT SCORE(1), title, published_date from text_tab              ok
                      WHERE CONTAINS(text, 'whales;reproduction;death',
               1) > 0
Oracle Text Queries                                                                    Nina Bergli
Draft                                                                                  30.10.03 / 7

                      AND published_date >= ('01-01-2000')
                      ORDER BY SCORE(1) DESC;

Not            SELECT SCORE(1), title,description,source from text_tab            ok
                    WHERE CONTAINS(text, ' whales~minkewhales', 1) >
               0
                    ORDER BY SCORE(1) DESC;

Or             SELECT SCORE(1), title,description,source from text_tab            ok
                    WHERE CONTAINS(text, ' bluewhales OR
               minkewhales', 1>0
                    ORDER BY SCORE(1) DESC;

Soundex        select title from text_tab where contains ( text, '!whale',1 ) > 0 ok
Stem           select identifier,title,score(1) from text_tab where contains (    ok
               text, '$moving',1 ) > 0;
Fuzzy          SELECT identifier, title                                           ok
               FROM text_tab
               WHERE CONTAINS(TEXT, ‘fuzzy(whale, 70, 6, weight)’ ,
               1)>0;
               This query expands to the first 6 variations of whale in the index
               that have a similarity score over 70. Documents in the result set
               are also weighted so that documents that have the more similar
               words compared to whale are ranked highest.

               Using the appropriate number of commas can skip unnecessary
               parameters. For example:
               SELECT identifier, title
               FROM text_tab
               WHERE CONTAINS(TEXT, ‘fuzzy(whale, , , weight)’ ,
               1)>0;
               In this case, documents that score above 60 are retrieved, since
               60 is the default value for score. The parameter numresults will
               have the value 100.

               select * from fuzzy_tbl where contains (text,'?hales')>0;

Treshold       select identifier,title,score(1)                                   ok
               from text_tab
               where contains ( text, '(pilot-whale > 4) and communication',1 )
               >0
               order by score(1) desc;
Weight         select identifier,title,score(1)                                   ok
               from text_tab
               where contains ( text, 'pilot-whale or communication*3',1 ) > 0
               order by score(1) desc;
Wildcards      select identifier,title,score(1)                                   ok
               from text_tab
               where contains ( text, 'pilot% or communication*3',1 ) > 0
               order by score(1) desc;
Oracle Text Queries                                                                       Nina Bergli
Draft                                                                                     30.10.03 / 8




Comments about thesaurus queries in Oracle
Oracle supplies a test thesaurus in English. The following path locates the thesaurus:
C:\oracle\ora92\ctx\sample\thes. The thesaurus is by default called ‘drOthsus’. If the above
queries were to be defined without specifying the name of the thesaurus, Oracle would not
find it unless it is called ‘default’. Before using the thesaurus in queries, it has to be loaded.
The CTXLOAD procedure is described in McGregor (2002).

The examples are syntactically correct, but some of them do not provide semantic meaning.
This is because they are only intended as examples at the present point of time.


BT                                ‘BT(Bluewhale)’
                                  ‘BT(Bluewhale, 2, drOthsus)’
                                  When homographs are under
                                  consideration a qualifier can
                                  be specified in order to avoid
                                  misconception.
                                  BT(crane{(waterfowl)}) --
                                  crane or bird

                                  select text from text_tab
                                  where contains ( text,
                                  'BT ( orcas, 1, dr0thsus )' ) >
                                  0;


BTG                               select text from text_tab
                                  where contains ( text,
                                   'BTG ( Whale, 1, dr0thsus)' )
                                  > 0;

BTP                               select text from text_tab
                                  where contains ( text,
                                   'BTP ( Pilot-Whale, 1,
                                  dr0thsus )' ) > 0;

BTI                               select text from text_tab
                                  where contains ( text,
                                   'BTI ( Orcas, 1, dr0thsus )' )
                                  > 0;

NT                                select text from text_tab
                                  where contains ( text,
                                   'NT ( Whale, 1, dr0thsus )' )
                                  > 0;

NTG                               select text from text_tab
                                  where contains ( text,
Oracle Text Queries                                     Nina Bergli
Draft                                                   30.10.03 / 9

                       'NTG ( Cats, 1, dr0thsus)' ) >
                      0;

NTP                   select text from text_tab
                      where contains ( text,
                        'NTP ( Animal, 1, dr0thsus )'
                      ) > 0;

NTI                   select text from text_tab
                      where contains ( text,
                        'NTI ( Animal, 1, dr0thsus )'
                      ) > 0;

PT                    select text from text_tab
                      where contains ( text,
                       'PT ( Africa, dr0thsus )' ) >
                      0;

RT                    select text from text_tab
                      where contains ( text,
                       'Wolf ( Term, dr0thsus )' ) >
                      0;

SYN                   select text from text_tab
                      where contains ( text,
                       'SYN ( African lion,
                      dr0thsus )' ) > 0;

TR                    select text from text_tab
                      where contains ( text,
                       'TR ( Whale,baleine ,
                      dr0thsus )' ) > 0;
                      -- Term
                      --
                      Baleine_French_For_Whale

TRSYN                 select text from text_tab
                      where contains ( text,
                       'TRSYN ( Killer-Whale, all,
                      dr0thsus )' ) > 0;
Oracle Text Queries                                                      Nina Bergli
Draft                                                                   30.10.03 / 10


Combining structural predicates with the CONTAINS operator:

SELECT SCORE(1), title, published_date from text_tab
     WHERE CONTAINS(text, 'lions', 1) > 0
     AND published_date >= ('01-01-2000')
     ORDER BY SCORE(1) DESC;




REFERENCES
McGregor (2002). Oracle Text - Reference, Oracle. 2002. available at:
http://www.csis.gvsu.edu/GeneralInfo/Oracle/text.920/

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:22
posted:12/5/2011
language:English
pages:10