% Section for evaluation
\section{Evaluation}

% Subsection for Overview
\subsection{Overview}

The evaluation of ChronoSearch centered on two commonly measured Information
Retrieval performance characteristics: precision and recall. For ChronoSearch, precision
is the percentage of retrieved results that are relevant to the query term and of interest to the user.
Recall is the percentage of the events existing on the Web that the system retrieves.

To measure the precision and recall of the system, tests were performed using three input
entities, and the results from ChronoSearch were compared to existing manually built timelines on the
Web. The three input entities were Bill Gates, Steve Jobs, and Jim
Tressel. For each entity, a truth set was constructed by manually merging timelines that existed
on the Web for that entity. For example, truth sets were constructed from timelines and
information found on the following websites: CNN, The Telegraph, NPR, and Wikipedia.
To measure recall, the output from ChronoSearch was evaluated against the truth set.
ChronoSearch's recall percentage was then compared to the recall percentage of the manually
generated timelines found on the Web. The precision of the system was measured by
comparing the output of the system as currently implemented against that of a baseline solution.
The baseline consisted of the current implementation of ChronoSearch minus the
duplicate removal methods and the bad sentence removal method. In other words, the baseline
did not remove bad sentences with irregular average word lengths, and it did not
remove duplicates via the cosine similarity method or the verb similarity method.

% Subsection for Recall Evaluation
\subsection{Recall Evaluation}

To measure the recall percentage attained by ChronoSearch, the timelines produced by
ChronoSearch were compared to a manually generated truth set. To select objectively which
events belonged in the truth set, it was constructed by manually merging existing
timelines already present on the Web. The recall percentage was then calculated as the
percentage of events in the truth set that were also found in the timeline being evaluated. The
manually built timelines from the Web and the ChronoSearch results were both evaluated using
this methodology, and the results were then compared.

The results of this comparison are displayed in Figure~\ref{fig:RecallEvaluation}. There are three sets of
recall measurements, one for each of the person entities searched for in the evaluation:
Steve Jobs on the left, Bill Gates in the middle, and Jim Tressel on the right. The recall
percentages for ChronoSearch are in blue. For Steve Jobs, ChronoSearch
achieved a recall percentage of nearly 75\%. In comparison, the manual timeline from
CNET achieved a recall percentage of only around 60\%, while the manual timeline from The Telegraph
outperformed ChronoSearch with a recall percentage of nearly 82\%. The data for
Bill Gates are very similar: ChronoSearch had a recall percentage of 73\%, whereas the manual
timeline on NPR had a 76\% recall and a personal history report had a 58\% recall. The
results for Jim Tressel show the recall percentage of ChronoSearch against a truth set
generated from Wikipedia only, since no existing timelines were available for Jim Tressel.
This data point shows that even when a timeline does not already exist for an entity,
ChronoSearch can still construct one.
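
The recall computation described above can be written compactly. The notation is ours and
is introduced only for illustration: $T_1, \dots, T_k$ are the manually built source timelines
for an entity, $T$ is the merged truth set, and $L$ is the timeline being evaluated (either the
ChronoSearch output or one of the $T_i$).
\[
T = \bigcup_{i=1}^{k} T_i, \qquad
\mathrm{recall}(L) = \frac{|L \cap T|}{|T|} \times 100\%.
\]
Because every source timeline is a subset of $T$, its recall reduces to $|T_i| / |T|$, so a
manual timeline scores higher simply by covering more of the merged events. Deciding whether
two event descriptions refer to the same event is assumed here to be a manual judgment, since
no automatic matching procedure is described.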

Overall, the recall percentage attained by ChronoSearch exceeded the average recall
percentage of the manually generated timelines present on the Web. Figure~\ref{fig:RecallEvaluation}
also shows that ChronoSearch was able to dynamically generate timelines of the same level
of quality in the absence of manually generated timelines on the Web.

% Chart for the Recall Evaluation
\begin{figure}
% Chart image is missing from the source.
\caption{Recall Evaluation}
\label{fig:RecallEvaluation}
\end{figure}

% Subsection for Precision Evaluation
\subsection{Precision Evaluation}

As mentioned earlier, the precision of our system was measured by comparing the output of
ChronoSearch as currently implemented against the baseline output. Our system
improves the precision of its output by removing candidate events that are
duplicates and sentences that do not belong in the output. Three removal
methods were used. The first removed bad sentences whose
average word length fell outside the range [3.2, 7.2] characters per word. The
second removed duplicate sentences whose cosine
similarity to another sentence was greater than 50\%. The third removed events that occurred on the
same day and had a verb similarity greater than 0.5.
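
Written out, the first two filters amount to the following tests. This is a sketch under
assumptions: the symbols are ours, and the text does not specify how sentences are tokenized
or how term vectors are weighted, so raw term counts are assumed. For a sentence with words
$w_1, \dots, w_n$, where $|w_i|$ is the number of characters in $w_i$, the sentence is kept
only if
\[
3.2 \le \frac{1}{n} \sum_{i=1}^{n} |w_i| \le 7.2 .
\]
For two sentences with term-count vectors $\mathbf{a}$ and $\mathbf{b}$, one of the pair is
removed as a duplicate if
\[
\cos(\mathbf{a}, \mathbf{b}) =
\frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} > 0.5 .
\]
For instance, a five-word sentence whose words all appear in an eight-word sentence (with no
repeated words in either) has cosine similarity $5 / \sqrt{5 \cdot 8} \approx 0.79$ and would
be flagged. The verb similarity measure is not defined in this section, so it is not
reconstructed here.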

The overall precision improvement is shown in Figure~\ref{fig:PrecisionImprovement}. It
was calculated as the percentage of results that were removed from the output, excluding results
present in the truth set and removals that were later manually classified as false positives. For
example, as shown in Figure~\ref{fig:PrecisionImprovement}, a single run of the baseline for
Steve Jobs produced a total of 187 output events, of which 48 were removed by our
removal techniques. However, 10 of the 48 removals were false positives, yielding an
improvement of 38/187, or about 20\%. The average precision improvement of the system was nearly

% Chart for the Precision Improvement
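\begin{figure}
% Chart image is missing from the source.
\caption{Precision Improvement}
\label{fig:PrecisionImprovement}
\end{figure}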

The number of output events removed by each of the three removal methods is shown in
Figure~\ref{fig:EventRemovalStatistics}. For example, 48 candidate events were removed from the
output for Bill Gates. Of those 48 events, 7 were removed for having irregular average word
lengths, 38 were removed by the cosine similarity mechanism as duplicate events, and 3 were
removed for having similar verbs in events that occurred on the same day.

% Chart for the Event Removal Statistics
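\begin{figure}
% Chart image is missing from the source.
\caption{Event Removal Statistics}
\label{fig:EventRemovalStatistics}
\end{figure}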

% Subsection for Correctness Statistics
\subsection{Correctness Measurements}

It is also useful to measure the accuracy of the removal methods used by ChronoSearch. To
do this, false positives were identified for each removal technique.
A false positive is a sentence that was removed by one of the three mechanisms
described but should not have been removed. The false
positives incurred by the removal methods are shown in Figure~\ref{fig:FalsePositives}. As an
example, recall that 7 event descriptions were removed in the Bill Gates run
because their average word lengths were irregular. Of
these 7, 4 were actual events that should not have been removed. For the
same run, there were no false positives for the cosine similarity removal
technique and 1 false positive for the verb similarity removal mechanism. Overall,
ChronoSearch demonstrated a relatively low average false positive rate of under 15\%.
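
As a worked check on the Bill Gates figures, the per-method removal counts (7, 38, and 3) can
be combined with the per-method false positives (4, 0, and 1). Pooling them into a single
overall rate is our illustration; it covers this run only and is not the cross-entity average.
\[
\mbox{word length: } \frac{4}{7} \approx 57\%, \quad
\mbox{cosine similarity: } \frac{0}{38} = 0\%, \quad
\mbox{verb similarity: } \frac{1}{3} \approx 33\%,
\]
\[
\mbox{overall: } \frac{4 + 0 + 1}{7 + 38 + 3} = \frac{5}{48} \approx 10.4\%.
\]
The pooled rate for this run is in line with the reported average of under 15\%, although 4
of the 5 false positives come from the word-length filter.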

% Chart for the False Positives Removed
\begin{figure}
% Chart image is missing from the source.
\caption{Removal Methods False Positives}
\label{fig:FalsePositives}
\end{figure}

To measure the effectiveness of the duplicate detection mechanisms, we also counted the
duplicate events that were not detected and removed, that is, events
left in the output of the system that should have been removed as duplicates.
As is to be expected, the removal methods used by ChronoSearch did not catch every
duplicate. The number of undetected duplicate events for each person entity
run is shown in Figure~\ref{fig:DuplicatesNotDetected}. For the Bill Gates
run, 10 of the 51 total duplicate events evaded the removal
mechanisms and remained in the output,
which means that nearly 20\% of the duplicate events pertaining to Bill Gates were missed.
On average, however, the two duplicate detection mechanisms, cosine similarity and verb
similarity, found and removed nearly 85\% of the duplicate events, so
the duplicate detection methods were effective overall.
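
For the Bill Gates run, the miss rate and the corresponding detection rate follow directly
from the two counts given above (the rate names are our labels):
\[
\mbox{miss rate} = \frac{10}{51} \approx 19.6\%, \qquad
\mbox{detection rate} = \frac{51 - 10}{51} = \frac{41}{51} \approx 80.4\%.
\]
This run therefore sits slightly below the cross-entity average detection rate of nearly 85\%.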

% Chart for the Duplicate Events Not Detected
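\begin{figure}
% Chart image is missing from the source.
\caption{Duplicate Events Not Detected}
\label{fig:DuplicatesNotDetected}
\end{figure}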
