Project Status
Daniel Bevis
William King
Villanova University
Spring 2006
CS9010
Project Overview
Complete a subset of the Ontology Project
(Project Archive)
Generate ontology from existing
documentation
Determine if it is possible to generate an
Ontological classification (categories) from
raw data characteristics
Support flexibility to define a process that
allows the ontology to be naturally extended
as raw data is incorporated
Development Plan review
Select subset of subject areas
Initially select limited subject area
Important to support reasonably quick review and
analysis of results
Expand subject area iteratively if time permits
Define characteristics associated with a
subset of the raw data from the web site
Consider Processing of subject documentation
Natural Language
Indexing search with cross references
Consider simple keyword searches
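The keyword search with cross references considered above could be prototyped with a small inverted index. This is only a sketch in plain Python 3 (an assumption; the project itself targets Python 2.4.2 and NLTK Lite), and the sample documents and the function names build_index and search are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, *keywords):
    """Return ids of documents that contain every keyword (the cross reference)."""
    hits = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*hits) if hits else set()


# Hypothetical sample documents, not taken from the project archive.
docs = {
    "d1": "Branch prediction in superscalar processors",
    "d2": "Cache design for superscalar processors",
    "d3": "Branch prediction with neural methods",
}
index = build_index(docs)
```

Searching for several keywords at once returns only the documents they share, which is one simple way to cross-reference subjects.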
Development Plan review (continued)
Build categories from the characteristics
Consider generating a tool that allows you to describe a
different subset from the rest of the raw data
Create higher level categories based upon common
subsets of characteristics
Repeat process until top level categories or characteristics
conform to existing high level classifications or prove
alternate categories
Place subjects into categories
Review categorization
Manually analyze results
Test existing categorization on remaining subjects of the
initially selected subset
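The last step, testing the existing categorization on the remaining subjects, might be approximated as below. This is a sketch under stated assumptions: categories and subjects are both represented as sets of characteristic terms, a subject joins a category when they share at least min_overlap terms, and all names and sample data are hypothetical.

```python
def assign_categories(characteristics, categories, min_overlap=1):
    """Return the names of categories sharing at least min_overlap characteristics."""
    return sorted(
        name
        for name, traits in categories.items()
        if len(characteristics & traits) >= min_overlap
    )


def coverage(held_out, categories, min_overlap=1):
    """Fraction of held-out subjects that land in at least one category."""
    if not held_out:
        return 0.0
    placed = sum(
        1
        for traits in held_out.values()
        if assign_categories(traits, categories, min_overlap)
    )
    return placed / len(held_out)


# Hypothetical categories built from the initial subset.
categories = {
    "prediction": {"branch", "predictor"},
    "memory": {"cache", "prefetch"},
}
# Hypothetical remaining subjects, held back for testing.
held_out = {
    "paper_a": {"branch", "pipeline"},
    "paper_b": {"cache", "branch"},
    "paper_c": {"compiler"},
}
```

A low coverage value would suggest that the categories derived from the initial subset do not generalize, which is a cue for the manual review step.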
Development Tools
Natural Language Recognition via NLTK is
the basis for initial research
Slow but well documented and supported
Installation details (Win32 API)
NLTK Lite 0.6.3 w/ Corpora package
Python 2.4.2
PyWordNet
WordNet 2.1
Numarray 1.5
Ontology Subset
Take the SIGMICRO category as a single subject
set
Break data into subsets
Initial subset allows for simpler manual verification
and validation
International Symposium on Microarchitecture
Initially a small subset of the available archive material
will be used
Remaining subsets provide for further testing and
validation of technique
Additional subsets from the ACM
documentation will be added as time permits
Defined Process
Take a subset of the raw data elements and
define the elements' characteristics
Read text in for processing
Tokenize text
Perform Probabilistic Parsing
via ViterbiParse
Consider other parsing techniques if time permits
Consider training parsing process
Select Tokens for analysis
Supposition: Nouns will provide adequate tokens to define
characteristics
Potential Goal: identify a ‘reasonable’ subset of tokens for
use as characteristics
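The tokenize, parse, and select steps above can be illustrated with a standalone sketch. It is not NLTK's ViterbiParse; it is a minimal CKY-style Viterbi search over a hypothetical toy grammar in Chomsky normal form, written in plain Python 3 (an assumption), that keeps the most probable subtree for each span and then collects the nouns from the winning tree.

```python
import re

# Hypothetical toy grammar in Chomsky normal form (not the project's grammar).
LEXICON = {  # word -> list of (tag, probability)
    "the": [("Det", 1.0)],
    "processor": [("N", 0.5)],
    "cache": [("N", 0.5)],
    "uses": [("V", 1.0)],
}
RULES = [  # (parent, left child, right child, probability)
    ("S", "NP", "VP", 1.0),
    ("NP", "Det", "N", 1.0),
    ("VP", "V", "NP", 1.0),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def viterbi_parse(tokens, start="S"):
    """Return (probability, tree) for the most probable parse, or None."""
    n = len(tokens)
    best = {}  # best[(i, j)][label] = (probability, tree) for tokens[i:j]
    for i, word in enumerate(tokens):
        cell = best.setdefault((i, i + 1), {})
        for tag, p in LEXICON.get(word, []):
            cell[tag] = (p, (tag, word))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = best.setdefault((i, j), {})
            for k in range(i + 1, j):  # try every split point of the span
                left, right = best.get((i, k), {}), best.get((k, j), {})
                for parent, lc, rc, p in RULES:
                    if lc in left and rc in right:
                        prob = p * left[lc][0] * right[rc][0]
                        # Viterbi step: keep only the best derivation per label.
                        if prob > cell.get(parent, (0.0, None))[0]:
                            cell[parent] = (prob, (parent, left[lc][1], right[rc][1]))
    return best.get((0, n), {}).get(start)


def nouns(tree):
    """Collect the words tagged N from a parse tree, left to right."""
    if len(tree) == 2:  # leaf node: (tag, word)
        return [tree[1]] if tree[0] == "N" else []
    return nouns(tree[1]) + nouns(tree[2])
```

For the sentence "The processor uses the cache", the sketch yields a single parse whose nouns are "processor" and "cache"; a sentence the toy grammar cannot derive returns None.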
Defined Process (continued)
Select Tokens for analysis (continued)
May be reasonable to use only a subset of nouns
Proper nouns are likely to have little impact if removed
Redundant terms and synonyms should likely be consolidated
What impact would the use of other types (e.g. verbs) have
in generating characteristics?
Limiting to Nouns will greatly reduce the amount of
information to be processed
Reduce processing time, thereby allowing for faster generation of
results in a time-consuming process
Defines a bound on what constitutes a characteristic and thereby
reduces volume of data to be manually reviewed during
development
Will initially require additional testing to verify concept
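The token selection rules above (nouns only, proper nouns removed, synonyms consolidated) might look like the following sketch. It assumes the tokens are already tagged with Penn Treebank style tags (NN, NNS, NNP, and so on) and uses a hypothetical hand-written synonym table; in the project this table would more plausibly come from WordNet.

```python
# Hypothetical synonym table mapping variants to one canonical term.
SYNONYMS = {"cpu": "processor", "processors": "processor", "caches": "cache"}


def select_characteristics(tagged_tokens, synonyms=SYNONYMS):
    """Keep common nouns, drop proper nouns, and merge synonyms."""
    selected = []
    for word, tag in tagged_tokens:
        # Only common nouns pass; NNP/NNPS (proper nouns) and other tags are skipped.
        if tag not in ("NN", "NNS"):
            continue
        term = synonyms.get(word.lower(), word.lower())
        if term not in selected:  # consolidate redundant terms
            selected.append(term)
    return selected


# Hypothetical tagged input, as (word, tag) pairs.
tagged = [
    ("Intel", "NNP"), ("CPU", "NN"), ("uses", "VBZ"),
    ("caches", "NNS"), ("and", "CC"), ("processors", "NNS"),
]
```

Here six tokens reduce to two characteristics, which illustrates how much manual review effort the noun-only restriction could save.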
Defined Process (continued)
Based on common characteristics develop
categories
Analyze each individual document’s parse tree
Use statistical analysis of parse trees between documents
Supposition: Higher frequency of terms relative to all
documents implies higher level characteristic
Potential Goal: Identify a ‘reasonable’ subset of term
interrelations for use as characteristics
Assume that some raw data values will cross categories
Group elements into those categories
Identify common characteristics associated with other
characteristics
Identify higher level characteristics and categories from
the categories generated from the raw data
Recursive categorization approach
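The frequency supposition and the grouping step above can be sketched as follows. This is an illustration only: it substitutes a plain document-frequency count for the statistical analysis of parse trees, assumes each document has already been reduced to a list of characteristic terms, and uses hypothetical names and sample data.

```python
from collections import Counter, defaultdict


def document_frequency(doc_terms):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for terms in doc_terms.values():
        df.update(set(terms))
    return df


def rank_levels(doc_terms):
    """Order terms from most to least widespread (ties broken alphabetically)."""
    df = document_frequency(doc_terms)
    return sorted(df, key=lambda t: (-df[t], t))


def group_by_characteristic(doc_terms, min_docs=2):
    """Map each term shared by at least min_docs documents to those documents.

    A document may appear under several terms, so categories can overlap.
    """
    groups = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in set(terms):
            groups[term].add(doc_id)
    return {t: d for t, d in groups.items() if len(d) >= min_docs}


# Hypothetical characteristic terms per document.
doc_terms = {
    "d1": ["processor", "cache", "latency"],
    "d2": ["processor", "branch", "predictor"],
    "d3": ["processor", "cache", "prefetch"],
}
```

In this sample, "processor" appears in every document and would be read as the higher level characteristic, while "cache" forms a narrower category beneath it. Applying the same grouping to the resulting categories is one possible reading of the recursive approach.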
Current Development Focus
Automating retrieval of documents
Obtain documents from web sources
automatically
Convert documents for use in NLTK
environment
Automate Execution of the analysis of
documents
Python based code to handle processing in
batch style execution
Use Existing NLTK tools where available
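A batch-style driver for the retrieval, conversion, and analysis steps above might be organized like this. The sketch uses only the Python 3 standard library (an assumption; the project lists Python 2.4.2 and NLTK Lite), and the names html_to_text, fetch, and run_batch are hypothetical. The analyze argument stands in for whatever NLTK-based processing is eventually used.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Convert an HTML page to plain text suitable for tokenizing."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fetch(url, timeout=10):
    """Download a page and decode it as UTF-8, replacing undecodable bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def run_batch(urls, analyze, fetcher=fetch):
    """Fetch, convert, and analyze each URL; record network errors per URL."""
    results = {}
    for url in urls:
        try:
            results[url] = analyze(html_to_text(fetcher(url)))
        except OSError as err:  # urllib's URLError is a subclass of OSError
            results[url] = err
    return results
```

Passing the fetcher as a parameter lets the conversion and analysis be exercised on saved pages without touching the network.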