					          INTELLIGENT AGENT FOR FILTERING INFORMATION
                    IN INTERNET NEWSGROUPS

                                Hisham S. Katoah
       Business Administration Dept., Faculty of Economics & Administration
                            King Abdulaziz University
                        Jeddah, Kingdom of Saudi Arabia

Abstract: Intelligent software agents are a rapidly developing area of research in
such fields as psychology, sociology and computer science. This paper provides an
overview of intelligent agent concepts and how they can be applied in practice. As an
application, this paper uses intelligent agents to filter information supplied by
Internet newsgroups. As an implementation, the intelligent agent was built as a virtual
machine on top of the Java virtual machine. The results showed that the system
behaved intelligently in filtering news articles, compared with traditional search
engines that use non-intelligent search techniques.

Keywords: Intelligent Software Agents, Information Filtering, and Search Engines.

1. Introduction

    What is an agent? Software processes that act on behalf of the user are
known as agents. All agents exhibit similar characteristics, regardless of
whether they are people or pieces of software. Campbell [1] describes an agent
as a process that performs tasks on behalf of the user by applying its
specialized knowledge. It makes decisions about how to complete the tasks, and
it can learn the preferences of the user to improve its performance in the
future.

    An Intelligent Agent is an agent that uses stored knowledge, related to its
tasks and user preferences, to aid in its performance of tasks and the
achievement of its goals. An Intelligent Information Agent is an Intelligent
Agent that locates, collates and manipulates information contained in stored
resources on a distributed information network [3]. Intelligent Information
Agents communicate with the user via their interface. The user also views
results and information provided by the Intelligent Information Agent through
the agent interface.

    Negroponte [4] believes that agents are useful not because they can perform
tasks a user could not perform on their own using other tools, but because they
perform tasks the user finds trivial or mundane. By delegating the task of
information retrieval to the agent, the user is able to direct their attention
to tasks that are more enjoyable or make better use of their time. The main
function of an Intelligent Information Agent is locating information resources
for the user. Different agents use different methods for managing information.
An agent's competence ultimately depends on its ability to satisfy the
information needs of its user. As such, its ability to retrieve the right
amount of quality information quickly is important.

    Meeting information demand has become easier on one hand, but has also
become more complicated and difficult on the other, because of the emergence of
information sources such as the Internet (the source of information this paper
will focus on primarily). The sheer endlessness of the information available
through the Internet, which at first glance looks like its major strength, is
at the same
time one of its major weaknesses. The current, conventional search methods do
not seem to be able to tackle these problems. These methods are based on the
principle that it is known which information is available (and which is not)
and where exactly it can be found. To make this possible, large information
systems such as databases are supplied with (large) indexes to provide the user
with this information. With the aid of such an index one can, at all times,
look up whether certain information can or cannot be found in the database,
and, if available, where it can be found.

    On the Internet this strategy fails completely, for the following reasons:
      • The dynamic nature of the Internet itself
      • The dynamic nature of the information on the Internet
      • The information on the Internet is very heterogeneous

    An alternative solution is provided by intelligent software agents, which
provide us with intelligent filtering techniques that can tackle the
above-mentioned problems.

    The remainder of the paper is organized as follows. Section 2 describes
Internet newsgroups, UseNet and the Network News Transfer Protocol (NNTP) used
to communicate with newsgroups. Section 3 describes the concept of information
filtering and the different approaches to it. The concept of agent learning is
described in section 4. A full description of our newsgroup filter application
is provided in section 5. Finally, we present some conclusions in section 6.

2. Internet Newsgroups and NNTP Protocol

    Internet newsgroups are online discussions (via posted messages) on
thousands of different topics. Newsgroups can be compared to bulletin boards
with messages tacked all over them [5]. Each newsgroup is devoted to a
particular topic, and there is a newsgroup for almost every topic on this
planet. UseNet (which is short for users' network) is made up of all the
machines (servers) that receive network newsgroups. The network news (commonly
referred to as Netnews) is the mechanism that sends the individual messages
from the local computer to all the computers that participate in UseNet. The
basic idea with UseNet is that when the user posts an article from his personal
computer to his news server, the article is sent to other servers that have
agreed to exchange Netnews with that server. These servers, in turn, send the
article to other machines, which send it to others; this continues until the
user's article has reached every computer that participates in UseNet. Because
each machine can send articles to many other machines, the user's article can
reach the majority of UseNet computers within a few hours. Messages are
commonly referred to as news articles. A news article is very similar to an
e-mail message. It has some information at the top of the article in the header
lines and the content of the article in the message body. An article can appear
in more than one group at the same time; this is called cross-posting of the
article. The message body of the article contains the information that the
sender of the article wrote. In many cases, the article ends with a signature,
which is a comment or some information about the author. To get an idea of how
discussion happens in newsgroups, one might think of a news server as a large
building in which each newsgroup is a room. Each room has a
name on the door, and a brief description of the topic of discussion in that
room. In some of these rooms, one might find a small number of people politely
discussing a serious topic. In other rooms one may find a loud, raucous group
of people discussing a heated topic. Newsreader software is used to connect to
the specified news servers, request the articles from a specific newsgroup, and
download them to the personal computer. Newsreaders also allow posting articles
from the user's personal computer to the news server in the specified
newsgroup.

    NNTP (Network News Transfer Protocol) specifies a protocol for the
distribution, inquiry, retrieval, and posting of news articles using a reliable
stream-based transmission of news among the UseNet community. NNTP is designed
so that news articles are stored in a central database, allowing a subscriber
to select only those items he wishes to read. Indexing, cross-referencing, and
expiration of aged messages are also provided. The news server specified by
this protocol uses a stream connection such as TCP and SMTP-like commands and
responses. It is designed to accept connections from hosts, and to provide a
simple interface to the news database. This server is only an interface between
programs and the news database. It does not perform any user interaction or
presentation-level functions. These user-friendly functions are better left to
the client programs, which have a better understanding of the environment in
which they are operating.

    Commands and replies are composed of characters from the ASCII character
set. When the transport service provides an 8-bit transmission channel, each
7-bit character is transmitted right-justified in an octet with the high-order
bit cleared to zero. Commands consist of a command word, which in some cases
may be followed by a parameter. Commands with parameters must separate the
parameters from each other and from the command by one or more space or tab
characters. Command lines must be complete with all required parameters, and
may not contain more than one command. Commands and command parameters are not
case sensitive. That is, a command or parameter word may be upper case, lower
case, or any mixture of upper and lower case. Each command line must be
terminated by a CR-LF (carriage return, line feed) pair.

    Responses are of two kinds, textual and status. Textual responses are sent
only after a numeric status response line has been sent to indicate that text
will follow. Text is sent as a series of successive lines of textual matter,
each terminated with a CR-LF pair. A single line containing only a period (.)
is sent to indicate the end of the text, i.e., the server will send a CR-LF
pair at the end of the last line of text, a period, and another CR-LF pair. If
the text contained a period as the first character of a text line in the
original, that first period is doubled. Therefore, the client must examine the
first character of each line received and, for those beginning with a period,
determine either that this is the end of the text or that the doubled period
must be collapsed to a single one. The intention is that text messages will
usually be displayed on the user's terminal, whereas command/status responses
will be interpreted by the client program before any possible display is done.

    Status responses are status reports from the server and indicate the
response to the last command received from the client. Status response lines
begin with a 3-digit numeric code that is sufficient to distinguish all
responses. Some of these may herald
the subsequent transmission of text. The first digit of the response broadly
indicates the success, failure, or progress of the previous command. Certain
status responses contain parameters such as numbers and names. The number and
type of such parameters is fixed for each response code to simplify
interpretation of the response. Parameters are separated from the numeric
response code and from each other by a single space. All numeric parameters are
decimal, and may have leading zeros. All string parameters begin after the
separating space, and end before the following separating space or the CR-LF
pair at the end of the line. (String parameters may not, therefore, contain
spaces.) All text, if any, in the response which is not a parameter of the
response must follow and be separated from the last parameter by a space. Also,
note that the text following a response number may vary in different
implementations of the server. The 3-digit numeric code should be used to
determine what response was sent.

3. Information Filtering

    The recent development of communication networks and multimedia systems
provides potential users with a huge amount of information [7], making the
problem of information overload worse and worse. This situation has favored the
development of systems capable of automatically identifying the subset of the
available information that is potentially relevant to the user's information
needs. More specifically, filtering systems have been proposed which interface
the information source to the user, and which aim to automatically evaluate the
potential relevance of incoming information on the basis of an explicit
description of the user's information interests (user profile). However, while
the need for these systems has been widely recognized and adequate techniques
for their implementation have emerged (we mainly refer to intelligent agent
technology), two basic problems still remain open and need further
investigation:
      • The mechanisms for learning, representing and updating the user's
        information preferences,
      • The processing algorithms to be adopted to extract the information
        content of the incoming documents, and the matching algorithms to be
        exploited to assess their potential relevance.

    Technological advances have made wide-area information sharing
commonplace. A suite of tools has emerged for network information finding and
discovery, e.g., the World Wide Web, Archie, and Gopher. These tools provide a
means to search for existing information, but the exploding volume of digital
information makes it difficult for the user to keep up with the fast pace of
information generation. Instead of making the user go after information, it is
desirable to have information flow selectively to the user. In particular,
there is a need for tools to capture profiles of users' information needs, and
to find documents relevant to these needs as they change over time. An
intriguing thought for the future involves the different uses to which we can
put such a user profile. Over a period of time, it will become an increasingly
accurate predictor of the user's interests. One can imagine using the profile
for many tasks: filtering incoming e-mail, picking out interesting UseNet news
articles, creating personalized newspapers, automatic selection of goods to
browse in an on-line shop, and so on. A very simple kind of tool to capture
profiles of
users' information needs is already available on the Internet: mailing lists.
Hundreds of mailing lists exist, covering a wide variety of topics. The user
subscribes to lists of interest to him and receives messages on the topic via
e-mail. He may also send messages to the lists to reach other subscribers. A
problem with mailing lists is that a user whose information needs do not
exactly match certain lists will receive either too many irrelevant messages or
too few relevant ones. Most people using Netscape, Internet Explorer or other
Web browsers maintain a bookmark (favorites) list that contains pointers
(references) to favorite Web pages. Often the pointers are grouped together
into categories. If it is possible to discover salient features of the already
established categories, one can create a software agent that can use existing
Web search tools on behalf of an individual user to automatically seek new Web
documents that are potentially interesting. Recent work at the intersection of
information retrieval and software agents offers novel solutions to this
problem. Information retrieval is a well-established field of information
science that addresses issues of retrieval from a large collection of documents
in response to user queries. By comparison, agent research is a relatively new
field of study, which has grown out of artificial intelligence. Agent research
is concerned with issues of designing intelligent and autonomous software for a
variety of tasks.

    The following table describes some of the current systems that assist a
user in finding information over the Internet. Most of these systems can be
classified into one of three categories: search engines, information filtering
systems, and assisted browsing systems [8].

4. Agent Learning

    One central element in intelligent behavior is the ability to learn from
experience. For all the sophisticated knowledge representation and reasoning
algorithms we develop, there is no way that we can know a priori all of the
situations that our intelligent agent will encounter. Thus, being able to adapt
to changes in the environment or to get better at tasks through experience
becomes a significant differentiator for any software system. Any agent that
can learn has an advantage over one that cannot. Adding learning or adaptive
behavior to an intelligent agent elevates it to a higher level of ability. A
learning agent can adapt to the user's likes and dislikes. It can learn which
agents to trust and cooperate with, and which ones to avoid. A learning agent
can recognize situations it has been in before and improve its performance
based on prior experience. There are many forms of learning. First, there is
rote learning, where an example is given and the student (intelligent agent)
copies the example and exactly reproduces the behavior. While rote learning is
a simple form of learning, it can still be powerful. For example, we could run
a simulation to generate new situations and the desired behavior, and then
present these examples to our agent. Our agent would be better able to respond
to situations one week after we started the training, and would be even better
one month later (assuming the additional knowledge does not slow down its
response time). Another form of learning is parameter or weight adjustment. In
this case, we may know a priori what factors are important in
some decision, but we don't know how to weight their contribution to the
answer. In this case, we can adjust the weighting factors over time so that we
improve the likelihood of the correct decision or output. This technique is the
basis for neural network learning. Induction is a process of learning by
example where we try to extract the important characteristics of the problem,
thereby allowing us to generalize to novel situations or inputs. Decision trees
and neural networks both perform induction and can be used for classification
or regression (prediction) problems. The key aspect of inductive methods is
that the examples are processed and automatically transformed into an internal
form (knowledge representation), which captures the essence of the problem.
Another type of learning is called clustering, chunking or abstraction of
knowledge. While people learn from very specific examples or situations, the
ability to detect common patterns and generalize to new situations is a type of
learning. By chunking ten cases into one more general case, we cut down the
amount of storage we need and also the search or processing time. By thinking
at higher or more abstract levels, we can think "great thoughts" without
getting caught in the muddle of a million little details. Clustering is another
learning algorithm, which is a type of chunking. Clustering algorithms look at
high-dimensional data (data with many attributes) and score them for similarity
based on some criterion. The result is that each sample is assigned to a
cluster or group with other examples deemed to be "similar". This similarity
could be used as a way of assigning meaning to that group of samples. An
example would be clustering documents we found particularly useful, which could
provide useful information to improve the performance of a document search
engine the next time we make a query. There are several major paradigms, or
approaches, to machine learning. These include supervised, unsupervised, and
reinforcement learning. In addition, many researchers and application
developers combine two or more of these learning approaches into one system.
How the training data is processed is a major aspect of these learning
paradigms. Figure 1 gives an overview of the general model of learning agents.

Figure 1: A general model of learning agents

5. Project Description

    In this section, we describe how a basic Internet newsreader application is
designed, implemented, and augmented with an intelligent agent that assists a
user by filtering information. Several alternative methods are provided for
filtering the news articles based on the user feedback. The "News Filter"
application, described in this section, provides the basic functionality of a
network newsreader. The "News Filter" can:
    1. Connect to a specified news server.
    2. Request the articles from a specific newsgroup.
    3. Download articles to the personal computer using the Network News
       Transfer Protocol (NNTP).
    4. Accept keywords of interest from the user.
    5. Accept user feedback on the downloaded articles.
    6. Use the specified keywords to score the downloaded articles.
    7. Use the keyword match scores and the user's feedback, for each article,
       in the agent learning process using neural networks in order to build a
       filter model.
    8. Filter any further articles using the selected filter model.

    The goal of this application is to help the user deal with all the
electronic noise generated in the newsgroups. Wouldn't it be great if, when you
download a newsgroup, all you saw were the articles that genuinely interested
you? All of the posts from that jerk on the West Coast would disappear. All of
the spam postings offering great cellular phone service or the greatest
software since sliced bread would never again waste your time. On the other
hand, you would never miss a news article or post that discussed the topics
that interest you. That is the motivation behind the "News Filter" application.
We will also show how to apply intelligent agents to score and filter out the
unwanted news articles. The basic mechanism for reading Internet newsgroups
using NNTP is explored as well. Figure 2 gives an overview of the full project
design.

Figure 2: Full project description

5.1 Agent Sensors
    Our agent takes the role of a personal secretary, as it reads all the
articles before presenting them to the user (the manager). The agent goes
through each article, counting the number of occurrences of each keyword in the
article. The agent then gets the user's feedback about the article (useless,
not very useful, neutral, mildly interesting, interesting). These data are then
inserted into the profile database and are considered the percepts of the
intelligent agent. The percepts take the form [10]:

    Article1(KW1_score, KW2_score, ..., KWn_score, sum, feedback)
    Article2(KW1_score, KW2_score, ..., KWn_score, sum, feedback)
    ...
    Articlem(KW1_score, KW2_score, ..., KWn_score, sum, feedback)

    Here m articles are scored against n keywords. The agent sensors are
designed to take the above percepts and form the agent percept sequence, which
is used in the agent learning process.
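
    As an illustration of the percept format described in Section 5.1, the
following Java sketch counts keyword occurrences in one article and assembles
the vector (KW1_score, ..., KWn_score, sum, feedback). It is not the paper's
implementation. The class name PerceptSketch, the case-insensitive substring
matching, and the encoding of the five feedback labels as integers from 0
(useless) to 4 (interesting) are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical helper: turns one article into a percept vector. */
final class PerceptSketch {

    /** Counts non-overlapping, case-insensitive occurrences of keyword in text. */
    static int countOccurrences(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0; // an empty keyword would otherwise match at every position
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = keyword.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) >= 0) {
            count++;
            from += needle.length(); // continue searching after this match
        }
        return count;
    }

    /**
     * Builds [KW1_score, ..., KWn_score, sum, feedback] for one article.
     * The feedback value is assumed to be 0 (useless) .. 4 (interesting).
     */
    static int[] buildPercept(String article, String[] keywords, int feedback) {
        int[] percept = new int[keywords.length + 2];
        int sum = 0;
        for (int i = 0; i < keywords.length; i++) {
            percept[i] = countOccurrences(article, keywords[i]);
            sum += percept[i];
        }
        percept[keywords.length] = sum;          // total keyword hits
        percept[keywords.length + 1] = feedback; // user's rating of the article
        return percept;
    }

    public static void main(String[] args) {
        String article = "Java agents filter news. Agents learn from feedback.";
        String[] keywords = {"agents", "news", "neural"};
        System.out.println(Arrays.toString(buildPercept(article, keywords, 3)));
    }
}
```

    Running main on the sample article prints [2, 1, 0, 3, 3]: two hits for
"agents", one for "news", none for "neural", a sum of 3, and the assumed
feedback value 3. Because the sketch matches substrings, "news" would also be
counted inside "newsgroup"; a word-boundary match may be preferable in practice.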




5.2 Agent Brain
    Given the agent percept sequence at any time, the agent uses this
information, collected from its environment, in its learning process. Learning
is performed by a neural network [9], which is considered the brain of the
agent. The neural network is trained with the profile data as its training set.
The network inputs are the normalized keyword match scores and the normalized
sum, and the desired output is the user feedback. Training proceeds for 2500
epochs to minimize the error. After the neural network has been trained, its
final weights can be used for prediction and generalization on new percepts, so
it is considered an intelligent brain for the agent.

5.3 Agent Communication
    Communication is handled by using the agent as an event source and
listener. The "News Filter" agent communicates through formal languages with
two other agents in its environment in order to construct its knowledge base:
    1. With the server agent: communication takes place through the NNTP formal
       language.
    2. With the Microsoft Access database engine: communication takes place
       through the JDBC formal language.

5.4 Agent Autonomy
    Agent autonomy is achieved by using the threading facilities of Java to
keep the agent resident in memory, alternating periodically between sleeping
and running [2]. The agent is designed so that it runs in a separate thread
from the application's main thread. At startup, the "News Filter" application
instantiates and configures the agent and then starts it in a separate thread.
The application could yield to the agent when necessary, and the agent would
yield when it was done processing so that the application could continue. A
Boolean flag is set by the "News Filter" application when the user chooses to
build a neural network filter model. When this flag is set, the filter agent
will notice that fact when it wakes up from its periodic sleep() in the run()
method. This approach allows the agent to train the neural network autonomously
and asynchronously. When the training is complete, the filter agent signals the
completion to the "News Filter" application.

5.5 Agent Effectors
    The agent's output is the list of articles sorted according to the user's
interest. The more the agent is trained, the more acceptable its performance
becomes. When the agent is first put into the system it acts randomly,
producing undesirable results, so we have included a basic (dummy) filter that
uses only the keyword counts.

6. Conclusion

    The development of agents for information retrieval in the World Wide Web
is at this moment an active area of research, in which existing systems are
evolving towards more sophisticated and useful tools to explore and extract the
gold from this mine of information. Tasks such as browsing, searching,
filtering, and manipulating information from the Web may soon be delegated to
electronic assistants which are able to automatically adjust to, understand and
manipulate the WWW environment, even if it is not the most friendly environment
for them. Additionally, the Web can be the ground where agents can demonstrate
their potential real utility. The goal is that these
agents will not be just pieces of software, but instead real representatives of
their users in the electronic world. They will be the workers in this new
information universe, and it is our task to create the best possible workers
that our imagination can produce.

References

[1] A. Kjertsi, "A survey on Personalized Information System for the WWW",
1997.

[2] Cay S. Horstmann, "Core Java", 1998.

[3] Horton, "Standard for Interchange of USENET Messages", USENET project,
1998.

[4] Francis Chan, "Research on OMG/CORBA", EECS Department, Berkeley, February
1996. URL: http://www.ic.eecs.berleley.edu/~fchan/caddis/corba.html

[5] Daniel Dreilinger, "SavvySearch Home Page". URL:
http://www.cs.colostate.edu/~dreiling/smartform.html

[6] Susan Gauch, Guijun Wang, and Mario Gomez, "ProFusion: Intelligent Fusion
from Multiple, Distributed Search Engines", Journal of Universal Computer
Science, Volume 2, Number 9, Sept. 1996. URL:
http://www.designlab.ukans.edu/profusion/

[7] Donna Harman, "Relevance Feedback Revisited", in Proceedings of the
Fifteenth Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval, Copenhagen, Denmark, June 1992, pp. 1-10.

[8] Ken Lang, "Newsweeder: Learning to filter netnews", Technical report,
School of Computer Science, Carnegie Mellon University, 1995. URL:
http://anther.learning.sc.cmu.edu/ml95.ps

[9] Henry Lieberman, "Letizia: An Agent That Assists Web Browsing", in
Proceedings of the 1995 International Joint Conference on Artificial
Intelligence, Montreal, Canada, August 1995. URL:
http://lieber.www.media.mit.edu/people/~lieber/Lieberary/Letizia/Letizia.html

[10] Alexandros Moukas, "Amalthaea: Information Discovery and Filtering using a
Multi-agent Evolving Ecosystem", Proceedings of the Conference on the Practical
Application of Intelligent Agents and Multi-Agent Technology, London, UK, 1996.
URL: http://moux.www.media.mit.edu/people/moux/papers/PAAM96/

				