
                                Bilkent University
                       Department of Computer Engineering

                                  Senior Project

                                   “BilMedia”
                         Bilkent Media Tracking System

                                    Seher Acer
                                   Elif Demirli
                              Özlem Başak İskender
                               Şadiye Kaptanoğlu

                                    Supervisor
                             Prof. Dr. Özgür Ulusoy

                                   Jury Members
                             Vis. Prof. Dr. Fazlı Can
                         Assoc. Prof. Dr. Uğur Güdükbay

                             High Level Design Report

                                  Jan 7, 2009

This report is submitted to the Department of Computer Engineering of Bilkent University in
partial fulfillment of the requirements of the Senior Projects course CS491.
Table of Contents

1  Introduction
   1.1  Purpose of the System
   1.2  Design Goals
   1.3  Definitions and Acronyms
   1.4  References
   1.5  Overview
2  Current Software Architecture
3  Proposed Software Architecture
   3.1  Overview
   3.2  Subsystem Decomposition
   3.3  Hardware/Software Mapping
   3.4  Persistent Data Management
   3.5  Access Control and Security
   3.6  Global Software Control
   3.7  Boundary Conditions
4  Subsystem Services
5  Glossary
1 Introduction

Computers store large amounts of information in unstructured form. Books, news, reports and
many other kinds of documentation are stored as large volumes of natural language text.
Managing such unstructured data is a major problem for the IT industry. [1]
Information Extraction is a technique that offers a solution to this problem. In Natural
Language Processing (NLP), information extraction is a type of information retrieval whose
goal is to automatically extract structured information from unstructured machine-readable
documents. [2] Once “pieces” of information have been extracted from large volumes of text,
the data becomes meaningful and can be manipulated automatically.

As the amount of information on the World Wide Web grows, it becomes increasingly
difficult to find exactly what we want. Although general-purpose search engines such as
AltaVista and Google offer broad coverage, it is often difficult to obtain high precision
from them. When we know that we want information of a certain type, or on a certain topic, a
domain-specific Internet portal can be a powerful tool. A portal is an information gateway
that often includes a search engine plus additional organization and content; portals usually
offer powerful methods for finding domain-specific information.

The BilMedia Portal will be a media tracking system designed to retrieve and display news
about “Bilkent University” published in online newspapers. The system will retrieve news
from RSS feeds as well as from crawled sites, and will extract target information using
various information extraction techniques. The selected news will be published on the
portal. The site will provide an interface for performing sophisticated search operations
and for displaying the results.

Throughout the report, we will present the high-level design artifacts for the BilMedia
Portal. The introductory sections will present the purpose of the system, our design goals,
some definitions and acronyms that will be useful while reading the report, the references,
and an overview. After the introductory sections, the current state of existing similar
systems will be described in the Current Software Architecture section. The two main parts of
the report are Proposed Software Architecture and Subsystem Services. The Proposed Software
Architecture section includes the following subsections: Overview, Subsystem Decomposition,
Hardware/Software Mapping, Persistent Data Management, Access Control and Security,
Global Software Control and Boundary Conditions. In these sections, design decisions and
detailed information about the core of the system will be given. After this, we will describe
the services provided by the different subsystems in the Subsystem Services section. The
report ends with the Glossary.


1.1 Purpose of the System

This project aims to build a generic system that filters Web pages about a given institution
(a company, university, politician, etc.) and extracts related information from them. The
research component involves adapting and implementing information extraction techniques. The
prototype we will implement will perform the following tasks:

   • Getting RSS feeds from various news sites
   • Crawling the news content
   • Filtering news about Bilkent
   • Extracting required information (who, where, what)
   • Storing the extracted data in a database
   • Providing an interface for performing sophisticated search operations and displaying
     the results


1.2 Design Goals

The design goals of BilMedia are divided into three groups: the functional, non-functional
and pseudo requirements that need to be fulfilled during the development of the project. As a
general design goal, although we are currently working on the “Bilkent Media Tracking
System” project, we want to develop it in a modular way. In this way, our system will consist
of independent modules, and we will be able to turn it into a general media tracking system
with little effort.

There will be two user groups for the system: the authorized user (the administrator) and any
ordinary user who accesses the system through the Web. Our system will provide the following
functionality for these user groups:

   • Anyone: View all news about Bilkent, view department-specific news, search news,
     subscribe to news, and perform sophisticated query operations.
   • Administrator: Eliminate improper news, adjust the news for publication on the site,
     and configure crawling options.
The non-functional design goals of our system are classified as follows.

 • Response Time: Since our system will extract the relevant data, according to the user's
   search criteria, from large numbers of documents, it is important to keep the response
   time small. The user should not wait long for the search results, so the waiting time
   should be shortened by choosing proper algorithms.
 • Usability: The Bilkent Media Tracking System is intended for public use. Since it will
   be used by a wide range of people, ease of use is significant. Our system will provide a
   clean, efficient and easily learnable interface.
 • Reliability: Our system will have the ability to perform and maintain its functions in
   both routine and unexpected circumstances. It is also important to provide reliable data
   to the user; for this purpose, we will select reliable news sources for the RSS feeds.
 • Availability: The Bilkent Media Tracking System will be a web-oriented application, so
   access to the system is an important concern for us. Our server will be available to
   respond to users at least 90% of the time.
 • Reusability: Our design will follow systematic methodologies in order to allow others to
   reuse and improve our design and implementation.

The pseudo requirements of the system are the following.

 • We plan to use the Informa library in order to perform operations on the RSS feeds of
   various newspapers. The goal of the Informa project is to provide a news aggregation
   library based on the Java platform. [3]
 • Since our system will be a web-based application, it will be accessible from all
   operating systems.
 • We will use the GATE tool and the SRV algorithm for information extraction and
   learning purposes.
 • We plan to implement our prototype system in the NetBeans IDE using JSF and JSP.
 • We will use MySQL for the database part of the prototype.


1.3 Definitions and Acronyms

In this section, expansions of some acronyms and abbreviations used in the report are given.
Definitions are included in the Glossary and are not repeated here.

ACRONYM / ABBREVIATION       EXPANSION
ANNIE                        A Nearly-New Information Extraction System
BilMedia                     Bilkent Media Tracking System
GATE                         General Architecture for Text Engineering
IE                           Information Extraction
IT                           Information Technology
JAPE                         Java Annotation Patterns Engine
JSF                          JavaServer Faces
JSP                          JavaServer Pages
NLP                          Natural Language Processing
SRV                          Stochastic Real Value Units Algorithm
SVM                          Support Vector Machines


1.4 References

[1] Robert Blumberg and Shaku Atre. The Problem with Unstructured Data. DM Review
Magazine, February 2003.

[2] http://en.wikipedia.org/wiki/

[3] http://informa.sourceforge.net/index.html

[4] Dayne Freitag and Andrew McCallum. Information Extraction with HMM Structures Learned
by Stochastic Optimization. In Proceedings of the American Association for Artificial
Intelligence (AAAI), 2001.

[5] http://gate.ac.uk/

[6] http://en.wikipedia.org/wiki/General_Architecture_for_Text_Engineering

[7] http://gate.ac.uk/sale/tao/splitch8.html#x10-1970008

[8] http://gate.ac.uk/sale/tao/splitch11.html#chapt:mlapi

[9] Dayne Freitag. Information Extraction from HTML: Application of a General Machine
Learning Approach. In Proceedings of the Fifteenth National Conference on Artificial
Intelligence (AAAI-98), 1998.

[10] Dayne Freitag. Machine Learning for Information Extraction in Informal Domains.
Machine Learning, 39:169–202, 2000.

[11] Mary Elaine Califf. Relational Learning Techniques for Natural Language Information
Extraction. PhD thesis, University of Texas at Austin, August 1998.

1.5 Overview

The Internet makes available a tremendous amount of text that has been generated for human
consumption; unfortunately, this information is not easily manipulated or analyzed by
computers. [4] Information extraction (IE) is a process that takes unseen documents and
produces fixed-format, unambiguous data as output. [5] In general, this data may be displayed
directly to users, stored in a database for later analysis, or used for indexing purposes in
information retrieval applications. In this project, we will use a selected IE method to
extract information from various news sites and use this information to select the news
related to Bilkent University, so the output data will be displayed directly to users.
Furthermore, we will store some information, such as the keywords extracted from a web site
and the URL of that web site, in a database for further search purposes.
2 Current Software Architecture

There is no existing software system at Bilkent that accomplishes tasks similar to those of
our system. The Bilkent News site, which is the web site of Bilnews (the local newspaper of
Bilkent), provides news about Bilkent; however, it is a very primitive web site. The news
items are retrieved and published manually, and they are presented as free text in
unstructured form. No search option is provided to readers: a user has to scan the archive
manually in order to retrieve any past news item he is looking for. The system is clearly
insufficient.

When the popular Turkish news websites are considered, it is seen that although they are
better than the Bilnews website, they do not benefit from information extraction technology.
As far as we could observe, they present the news as free text. They do provide an advanced
search feature, but the search is quite simplistic: searching for an acronym and for its
expanded form returns different results, which shows that they do not have a complete and
accurate search mechanism.
3 Proposed Software Architecture


3.1 Overview

Our project aims to filter news about a specific institution from various news sites, extract
predefined information (i.e., who, where, what) from the news, and provide an interface that
will enable users to perform sophisticated search operations. This section provides the main
artifacts of our system's architecture.

The next section will describe the decomposition of our system into smaller subsystems. The
subsystems are Data Gathering Module, Filtering Module, Information Extraction Module,
and Web Portal Module. Each module will be described in the corresponding section.

Our system relies on dependable hardware resources and software components, as well as an
effective allocation of responsibilities among them. The hardware/software mapping of our
system is provided in the Hardware/Software Mapping section.

Data storage and management is also a crucial issue for our system. We need to design and
implement a database in order to store both retrieved and filtered news and also the extracted
structured information efficiently. Persistent Data Management section will describe our
database design.

Access Control and Security section will provide the user groups of our system and
authorization levels of them.

As mentioned before, our system divides naturally into modules. These modules work as a
pipeline in which each module receives another module's output as its input. There must be a
control mechanism that manages all modules and the information flow between them; the Global
Software Control section describes that mechanism.
3.2 Subsystem Decomposition

The subsystem decomposition of our project can be seen in Figure 1. The first step is data
gathering, by which the data collection of our system is obtained. The results of this
operation are the input for filtering, which strips the HTML tags from the data and
eliminates the news that is not related to Bilkent. The next step is information extraction,
which will be performed in two separate ways and whose results are the input of the web
portal.




Figure 1: The general picture of the proposed software system, showing the subsystem decomposition
Data Gathering Module

During project development we will gather news from various sources; this process is referred
to as data gathering. First of all, we will use the database of “Bilkent News” for the first
prototype of the project. We will also download web pages that contain news, with the help of
a well-known and efficient crawler, Larbin, in order to form the data collection of our
portal. Moreover, we will use the RSS feeds of various news sites to learn about newly
arrived news as well as to obtain general information, including the URL of each news item.

In our case, a crawler is an agent that traverses the Web, looking for documents to add to the
data collection of the portal. When aiming to populate a domain-specific collection, the
crawler should explore the Web in a directed fashion in order to find domain-relevant
documents efficiently. So it should try to avoid hyperlinks that lead to off-topic areas, and
concentrate on links that lead to documents of interest.

We plan to use Larbin as our crawler. Larbin is a freely available web crawler. It is
intended to fetch a large number of web pages to fill the database of a search engine, and it
can be used as a crawler for general search engines as well as for a specialized search
engine. [12]
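To make the RSS side of this step concrete, the sketch below pulls the items of a single feed
using only the standard Java XML APIs; it is a minimal stand-in for what the Informa library
would do for us, and the feed URL is a placeholder rather than a news source we have actually
selected.

import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RssGatherer {

    /** Prints the title and link of every <item> element in the given RSS feed. */
    public static void fetchFeed(String feedUrl) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new URL(feedUrl).openStream());

        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String title = item.getElementsByTagName("title").item(0).getTextContent();
            String link  = item.getElementsByTagName("link").item(0).getTextContent();
            // In the real system the link would be handed to the downloader/crawler.
            System.out.println(title + " -> " + link);
        }
    }

    public static void main(String[] args) throws Exception {
        fetchFeed("http://example.com/news/rss.xml"); // placeholder feed URL
    }
}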

Filtering Module

After the data gathering process, we will have a collection of news in our file system. The
next step is cleaning up this collected data. We will have two kinds of filtering processes
in this module, referred to as HTML parsing and Bilkent filtering.

The HTML parsing process will be done with the help of a library written in Java, HTML
Parser. HTML Parser is a Java library used to parse HTML in either a linear or nested
fashion. Primarily used for transformation or extraction, it features filters, visitors,
custom tags and easy-to-use JavaBeans; in summary, it is a fast, robust and well-tested
package. We will use this library to get rid of the HTML tags in the downloaded web pages.
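As a rough illustration of what this step produces, the sketch below strips tags with simple
regular expressions. The actual HTML Parser library is far more robust (it copes with
scripts, comments and malformed markup), so this is only a stand-in for the idea of reducing
a page to its sole text.

import java.util.regex.Pattern;

public class TagStripper {

    // Remove <script>/<style> blocks first, then every remaining tag.
    // A real HTML parser handles nesting and broken markup much better;
    // this regex version only sketches the "keep the sole text" idea.
    private static final Pattern SCRIPT_STYLE =
            Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1>");
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");

    public static String strip(String html) {
        String withoutScripts = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        String text = TAGS.matcher(withoutScripts).replaceAll(" ");
        return text.replaceAll("\\s+", " ").trim();
    }
}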

After obtaining the sole text of the news, we will apply another filtering operation,
referred to as Bilkent filtering. This is the process of selecting, from the downloaded web
pages, the news that is related to Bilkent University. We will use keywords such as Bilkent,
Bilkent University, Bilkent Univ., etc. With this approach, only the news related to Bilkent
will be kept in our database, and information extraction will be performed on this news.
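A minimal sketch of this keyword test follows; the keyword list is illustrative, and a
production version would also need to handle Turkish-specific casing and inflected forms.

import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class BilkentFilter {

    // Illustrative keyword list; the real list would be tuned over time.
    private static final List<String> KEYWORDS =
            Arrays.asList("bilkent university", "bilkent univ.", "bilkent");

    /** Returns true if the plain news text mentions any Bilkent keyword. */
    public static boolean isBilkentRelated(String newsText) {
        String lower = newsText.toLowerCase(Locale.ENGLISH);
        for (String keyword : KEYWORDS) {
            if (lower.contains(keyword)) {
                return true;
            }
        }
        return false;
    }
}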

Information Extraction Module

Extracting characteristic pieces of information from the documents of a domain-specific
collection allows the user to search over these features in a way that general search engines
cannot. Information extraction is well suited to this task, since it is concerned with
identifying phrases of interest in textual data. For many applications, extracting items such
as names, places, events, dates, and prices is a powerful way to summarize the information
relevant to a user's needs. In the case of our project, the automatic identification of
important information can increase the accuracy and efficiency of a directed query. Moreover,
our portal will offer new features, such as letting the user see the extracted fields of a
news item and submit queries over this information, which returns complicated query results
in a simpler way.

In our project we will use two machine-learning-based information extraction tools: the GATE
tool and the SRV algorithm, which are discussed in detail in the following sections.

GATE Tool

GATE (General Architecture for Text Engineering) is a Java software toolkit, which is used
for all sorts of natural language processing tasks, including information extraction in many
languages. Languages currently handled in GATE include English, Spanish, Chinese, Arabic,
French, German, Hindi, Cebuano, Romanian, and Russian. [6]

GATE includes an information extraction system called ANNIE (A Nearly-New Information
Extraction System) which is a set of modules comprising a tokenizer, a gazetteer, a sentence
splitter, a part of speech tagger, a named entities transducer and a coreference tagger. It also
uses the JAPE (Java Annotation Patterns Engine) language for building rules in order to
annotate documents with tags. [6]

In brief, ANNIE analyses texts and, by means of its modules, presents the specific
information the user is interested in. In the first step, the tokenizer module splits the
text into very simple tokens such as numbers, punctuation and words of different types. After
the text has been split into tokens, ANNIE consults the gazetteer lists, which represent sets
of names (such as names of cities, organizations, days of the week, etc.), to find all
occurrences of matching words in the text. Finally, the grammar rules specified in JAPE are
used to recognize and annotate the entity types. [7]

GATE also includes a Machine Learning layer. The current implementation of GATE mainly
supports machine learning for chunk recognition (entity recognition), text classification and
relation extraction. It also provides facilities for active learning based on the Support
Vector Machines (SVM) learning algorithm, mainly by ranking the unlabelled documents
according to the confidence scores of the current SVM models for those documents. [8]

In the project, the GATE tool will be used for two main purposes: information extraction and
learning.

GATE will be used to extract specified information from the texts. Extraction will be
performed in both Turkish and English. Since there is no support for Turkish in GATE, a
Turkish plug-in will be developed to extend GATE so that it can extract information in
Turkish. For this purpose, new gazetteers and rules specific to the Turkish language will be
introduced. The gazetteers will be Bilkent-specific in order to capture more information
about Bilkent from the texts.

GATE will also be used for machine learning purposes. We will first use GATE for eager
learning and then try the tool's active learning algorithms. After using both kinds of
learning algorithms, we will be able to compare the results of the two learning types more
clearly.
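To indicate how we expect to drive GATE from our own Java code, the sketch below loads the
default ANNIE pipeline and reads back the annotations it produced for one document. It
follows the standard GATE Embedded usage pattern, but the application file path and the
annotation type queried are assumptions that must be checked against the GATE version we
deploy.

import java.io.File;
import gate.Annotation;
import gate.AnnotationSet;
import gate.Corpus;
import gate.CorpusController;
import gate.Document;
import gate.Factory;
import gate.Gate;
import gate.util.persistence.PersistenceManager;

public class AnnieRunner {

    public static void main(String[] args) throws Exception {
        Gate.init(); // assumes the GATE home directory is configured

        // Load the saved ANNIE application shipped with GATE; the path is an
        // assumption and depends on the local installation.
        CorpusController annie = (CorpusController) PersistenceManager
                .loadObjectFromFile(new File(Gate.getPluginsHome(),
                        "ANNIE/ANNIE_with_defaults.gapp"));

        Corpus corpus = Factory.newCorpus("BilMediaCorpus");
        Document doc = Factory.newDocument(
                "Prof. Ali Veli visited Bilkent University in Ankara.");
        corpus.add(doc);

        annie.setCorpus(corpus);
        annie.execute();

        // Read back one entity type that ANNIE annotated.
        String content = doc.getContent().toString();
        AnnotationSet persons = doc.getAnnotations().get("Person");
        for (Annotation a : persons) {
            int start = a.getStartNode().getOffset().intValue();
            int end = a.getEndNode().getOffset().intValue();
            System.out.println("Person: " + content.substring(start, end));
        }
    }
}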

SRV Algorithm

SRV (Stochastic Real Value Units Algorithm) is an implementation of a general-purpose
relational learner for information extraction. [9] SRV makes no assumptions about document
structure or the kinds of information available for use in learning extraction patterns;
instead, structural and other information is supplied as input in the form of an extensible
token-oriented feature set. [9] It takes two kinds of input: a representation language and a
set of examples to be used in training. [10] The given examples are divided into n classes,
and the goal of the learner is to produce a set of logical rules that classifies each novel
example into one of these classes.
SRV's induction procedure is based on the notion of features, which are functions over
individual tokens. [10] A simple feature is a function that maps a token to an arbitrary
categorical (usually Boolean) value. An example of a simple feature is capitalized, which
takes the value true for any token beginning with a capital letter and false otherwise. A
relational feature, on the other hand, is a function that maps a token to another token in
the same document. [10] An example is next token, which returns the token immediately
following its argument.
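For concreteness, the sketch below expresses these two kinds of features in Java. The
interface names and representation are our own illustration of the idea, not SRV's actual
programming interface.

import java.util.List;

public class SrvFeatureSketch {

    /** A simple feature maps one token to a categorical (here Boolean) value. */
    interface SimpleFeature {
        boolean apply(String token);
    }

    /** A relational feature maps a token position to another token of the same document. */
    interface RelationalFeature {
        String apply(List<String> tokens, int position);
    }

    /** "capitalized": true for any token beginning with a capital letter. */
    static final SimpleFeature CAPITALIZED = new SimpleFeature() {
        public boolean apply(String token) {
            return token.length() > 0 && Character.isUpperCase(token.charAt(0));
        }
    };

    /** "next token": the token immediately following the argument, or null at the end. */
    static final RelationalFeature NEXT_TOKEN = new RelationalFeature() {
        public String apply(List<String> tokens, int position) {
            return position + 1 < tokens.size() ? tokens.get(position + 1) : null;
        }
    };
}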

We will use the SRV algorithm to perform experiments on different data collections, and we
will evaluate the results. First, we will perform tests with the eager learning method, which
is shown in Figure 2. Eager learning is a learning method in which the system tries to
construct a general, input-independent target function during the training of the system. [2]
In this method, the system basically tries to generalize the training data before receiving
queries. Next, we want to try lazy learning as our learning method, in which generalization
beyond the training data is delayed until a query is made to the system. [2] The general view
of the lazy learning method can be seen in Figure 3.




Figure 2: The eager learning method for machine learning

Figure 3: The lazy learning method for machine learning

Web Portal Module

This module implements the interface through which users will interact with the system. The
data gathering, filtering and information extraction modules do their jobs completely hidden
from the end users; the Web Portal Module provides the interface for user–system
communication. It receives commands from the end user over the graphical user interface,
interprets and executes those commands in communication with the database, and gives
appropriate feedback to the user.

The portal will provide advanced search and display mechanisms to the user. Users will be
able to search news in two ways. The first is the classical way: when the user wants to
perform a search operation, he enters some keywords into the appropriate fields of the
interface and presses the search button, and the search results are displayed as a list on
the screen. In addition to this classical search method, we will implement another search
mechanism that benefits from the structured information extracted earlier. The extracted
information pieces that appear in the news content are displayed as hyperlinks; when the user
clicks on one of the links, all the news in which the same extracted information appears will
be listed. Example screenshots for this second type of search mechanism and its search
results are shown in Figures 4 and 5, respectively.
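A minimal sketch of how such a link click could be answered from the database follows. The
table and column names (news, news_entity, and so on) are hypothetical placeholders; the
actual schema is the one outlined in the Persistent Data Management section.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class EntitySearch {

    /** Lists the titles of all news items mentioning the clicked entity,
        e.g. "Cevdet Aykanat". */
    public static List<String> newsMentioning(String entityName) throws Exception {
        // Connection details are placeholders for the prototype's MySQL database.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost/bilmedia", "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT n.title FROM news n "
                   + "JOIN news_entity ne ON ne.news_id = n.id "
                   + "WHERE ne.entity_name = ?")) {
            ps.setString(1, entityName);
            List<String> titles = new ArrayList<String>();
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                titles.add(rs.getString("title"));
            }
            return titles;
        }
    }
}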

While designing and implementing the BilMedia Web Portal, we will try to benefit from the
extracted information pieces as much as possible. We will provide an advanced and fast
search mechanism based on the extracted information, and we will also consider designing and
executing other efficient queries. Finally, we will design a highly usable user interface; it
will be minimalist, consistent, flexible and easy to use.
Figure 4: The basic user interface for a specific news item

Figure 5: The user interface showing the retrieved results after the link “Cevdet Aykanat” is clicked
3.3 Hardware/Software Mapping

Our proposed system is composed of the web crawling, information extraction, database and web
portal modules. We can represent the system in two higher-level abstraction layers: the
System Layer and the Presentation Layer.

Both layers of the system will run on a UNIX mainframe. The System Layer will be responsible
for the execution and integration of the core of the system. The mainframe will require a
fast Internet connection for getting RSS feeds and crawling news. The web crawler,
information extraction and database modules will be implemented in the System Layer. The
mainframe must also have a reliable data storage mechanism, because it will keep all of the
data required by our system.

The Presentation Layer will be responsible for implementing the interface between clients and
the system. The portal will run on an Apache server. Clients can connect directly to the
portal using their own computers; an average PC will be sufficient as a client computer. A
visual representation of the hardware/software mapping of our system is given in Figure 6.




Figure 6: The hardware/software mapping of the proposed system

We will use the Java Development Kit (JDK) 1.6.0 to implement our web portal. We will use
JavaServer Pages (JSP) technology to develop platform-independent dynamic web content in a
simplified way.
3.4 Persistent Data Management

Since the aim of this project is to extract and retrieve specific news from web pages, there
is a great deal of information to store and manage. We will use a relational database to
store our web pages, news, keywords, named entities in the news, etc., so that retrieval of
the information a user requests will be fast and the retrieved information will be permanent.
When the program gets a web page as input (a Bilkent-related news item), it stores the web
page in a simple format in the database after the page is crawled. After information
extraction, the information extracted from the web page (the news) and the named entities in
that news item are also stored in the database, preserving the relations between the news and
the named entities. The ER diagram of our database can be seen in Figure 7.




Figure 7: The ER diagram of the proposed system's database
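As one plausible mapping of that ER diagram to MySQL tables, the statements below sketch the
three relations described above: stored pages, extracted news, and named entities linked to
the news they occur in. All table and column names are illustrative assumptions, not the
final schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SchemaSetup {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost/bilmedia", "user", "password");
             Statement st = con.createStatement()) {

            // Raw crawled pages, kept in a simple format after download.
            st.executeUpdate("CREATE TABLE IF NOT EXISTS web_page ("
                    + "id INT AUTO_INCREMENT PRIMARY KEY, "
                    + "url VARCHAR(1024) NOT NULL, "
                    + "content MEDIUMTEXT, "
                    + "crawled_at DATETIME)");

            // News extracted from a page: title, source, description, keywords.
            st.executeUpdate("CREATE TABLE IF NOT EXISTS news ("
                    + "id INT AUTO_INCREMENT PRIMARY KEY, "
                    + "page_id INT, "
                    + "title VARCHAR(512), "
                    + "source VARCHAR(255), "
                    + "description TEXT, "
                    + "keywords TEXT)");

            // Named entities, kept linked to the news items they occur in.
            st.executeUpdate("CREATE TABLE IF NOT EXISTS news_entity ("
                    + "news_id INT, "
                    + "entity_name VARCHAR(255), "
                    + "entity_type VARCHAR(64))");
        }
    }
}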
3.5 Access Control and Security

The Bilkent Media Tracking System, as a final product, will be used on the Bilkent University
site or the Bilkent News Portal as a sophisticated tracking system for Bilkent-related news.
As a result, our system will be accessed and controlled via the Web. The development of the
system will be done on a Bilkent CS Department server; therefore, access to the server will
be restricted by an authentication system, and only the authenticated user, referred to as
the administrator, will be able to configure the system. The administrator is responsible for
maintaining database consistency by checking newly arrived news, eliminating duplicate or
otherwise unnecessary news, configuring crawler options, etc.

All other users will access the system through the Internet. There is no user-specific
information or privilege in our system other than that of the system administrator, so users
will not be required to have a username or password. Moreover, normal use poses no security
risk, since general users can only perform viewing, searching and various querying
operations; there is no way for them to modify the system or the database.

3.6 Global Software Control

The software control mechanism of BilMedia is divided into four parts: the data gathering
part, the filtering part, the information extraction part and the user interface part of the
system.

Data Gathering

The purpose of the data gathering part is basically to download domain-related pages and to
update the previously crawled pages at short time intervals.

The data gathering part cannot be used publicly; it can only be used by the developers and
the administrators of the system. It runs until an objective is achieved or it is stopped by
the administrators.

Filtering

The filtering part consists of two sub-parts: HTML tag filtering and Bilkent filtering. Both
take the previously gathered web pages as input and produce filtered output.

The HTML tag filtering part basically cleans the gathered web pages of HTML tags, and the
Bilkent filtering part then keeps only the Bilkent-related pages.

Both parts of the filtering are non-public as well and can only be used by the developers and
administrators.

Information Extraction

The information extraction part extracts, from the previously crawled and filtered web pages,
the fields that the user specifies using the simple interface provided by the system. After
extracting these specified fields, it stores the related fields in the table that contains
the title, source, brief description and keywords of the news, together with a link to the
related news item.

This part is non-public as well and can only be used by developers and administrators. When
the system runs, this part takes the output of the filtering part as its input. It runs until
an objective is achieved or it is stopped by the administrators.

User Interface

The user interface part is the visual part of the BilMedia system. It is open to public use,
and users perform their desired tasks through the interface. This part is event-driven and
responds to the users' actions: the user specifies the fields of a search and requests the
result, then the system displays the requested result and waits for another action from the
user.
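The sketch below illustrates this event-driven cycle with a small servlet: the portal waits
for a user action (an HTTP request), serves the result, and waits again. The request
parameter name and the rendering are illustrative assumptions; in the prototype the pages
themselves will be written in JSP.

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SearchServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String keyword = req.getParameter("q"); // search field filled in by the user
        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        out.println("<html><body><h1>Results for: " + keyword + "</h1>");
        // A real implementation would query the news database here
        // (e.g. via the EntitySearch sketch above) and render each hit as a row.
        out.println("</body></html>");
    }
}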

3.7 Boundary Conditions

The four main modules of our system, namely the Data Gathering, Filtering, Information
Extraction and Web Portal modules, run consecutively, each feeding its output to the next
module as input.

   • The Data Gathering, Filtering and Information Extraction modules are initialized by the
     developers, and there is no time constraint on their running time: they may run
     continuously or periodically. They are terminated when a predefined objective is
     achieved, or at the will of the developers or administrators.
   • The Web Portal Module is instantiated when it is accessed via a web browser by a user.
     It terminates when the user closes his web browser or moves to another page. There is no
     time constraint on the running time of this module; it stays alive as long as the user
     stays on our Web portal's pages.
4 Subsystem Services

In Table 1 below, the services of the subsystems in our project are shown:

Service Name        | Service Input          | Service Output                  | Service Description
--------------------+------------------------+---------------------------------+---------------------------------
Data Gathering      | Web addresses which    | Downloaded web pages which      | A web crawler or RSS feeds
Module              | will be searched for   | will be searched for news       |
                    | news                   |                                 |
Filtering Module    | Downloaded web pages   | Sole texts of news that are     | An HTML parser and a content
                    |                        | related to a specified content  | filterer (e.g. Bilkent-related
                    |                        |                                 | news)
Information         | Text of news           | Keywords from news text,        | GATE (with ANNIE, JAPE),
Extraction Module   |                        | together with relations among   | modified GATE with machine
                    |                        | each other (e.g. persons,       | learning, SRV algorithm with
                    |                        | emails, ...)                    | machine learning
Web Portal Module   | Database consisting of | Query results from the          | GUI interacting with the user
                    | previously extracted   | database, e.g. display of the   | and the database
                    | information            | news that Bilkent received      |
                    |                        | some award                      |

Table 1: The services provided by each subsystem
5 Glossary

In this section, words related to the concepts used in this report are explained.

        WORD                                            DEFINITION
Active Learning               A learning algorithm that can actively choose the data to be
                              annotated by querying the user for the annotations instead of
                              randomly picking the data to be manually annotated for training
                              set.
Annotation                    A type, a pair of nodes pointing to positions inside the
                              document content, and a set of attribute–value pairs that
                              encode further linguistic information.
Bilnews                       The local newspaper of Bilkent.
Crawler                       A piece of software that roams the Web, collecting URLs and
                              other information from Web pages. Search engines are built
                              around the information that crawlers retrieve.
Data Gathering                The process of collecting domain-related pages and updating
                              the previously crawled pages in short time intervals.
Eager Learning                A learning method in which the system tries to construct a
                              general, input independent target function during training of the
                              system
Filtering                     The process that cleans the web pages from unnecessary
                              information.
HTML Parser                   A Java library used to parse HTML in either a linear or nested
                              fashion.
Informa Library               A news aggregation library based on the Java Platform.
Information Extraction        A type of information retrieval whose goal is to automatically
                              extract structured information from unstructured machine-
                              readable documents.
Information Retrieval         An area of data processing which is concerned with the swift
                              and accurate finding of information in large bodies of data.
Larbin                        Free web crawler. It is intended to fetch a large number of web
                              pages to fill the database of a search engine.
Lazy Learning                 A learning method in which generalization beyond the training
                              data is delayed until a query is made to the system.
Machine Learning              Subspecialty of artificial intelligence concerned with
                              developing methods for software to learn from experience or
                              extract knowledge from examples in a database.
Natural Language              A range of computational techniques for analyzing and
Processing                    representing naturally occurring text (free text) at one or more
                              levels of linguistic analysis (e.g., morphological, syntactic,
                              semantic, pragmatic) for the purpose of achieving human-like
                              language processing for knowledge-intensive applications.
Portal                        An information gateway that often includes a search engine plus
                              additional organization and content. It is a powerful method for
                              finding domain-specific information.
Presentation Layer            The layer that is responsible for implementing the interface
                              between clients and the system.
Query                         A search request submitted to a database or search engine,
                              used to find specific content and files.
RSS               Family of web feed formats that is used to publish frequently
                  updated works such as news, blog entries etc.
SRV               An implementation of a general purpose relational learner for
                  information extraction.
Structured Data   Documentation of discrete data using controlled vocabulary
                  rather than narrative text.
System Layer      The layer that is responsible for execution and integration of
                  core of the system.
