Project Status


Daniel Bevis

William King

Villanova University

Spring 2006

CS9010

Project Overview

- Complete a subset of the Ontology Project (Project Archive)
- Generate ontology from existing documentation
- Determine if it is possible to generate an ontological classification (categories) from raw data characteristics
- Support flexibility to define a process that allows the ontology to be naturally extended as raw data is incorporated

Development Plan review

- Select subset of subject areas
  - Initially select limited subject area
  - Important to support reasonably quick review and analysis of results
  - Expand subject area iteratively if time permits
- Define characteristics associated with a subset of the raw data from the web site
  - Consider processing of subject documentation
    - Natural language
    - Indexing search with cross references
  - Consider simple keyword searches
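
The keyword-search and indexing options above can be illustrated with a small inverted index that maps each term to the documents containing it. This is only a sketch in current Python 3 using the standard library; the function names `build_index` and `keyword_search` and the sample documents are hypothetical and do not come from the project.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def keyword_search(index, *keywords):
    """Return the sorted ids of documents that contain every keyword."""
    if not keywords:
        return []
    hits = [index.get(word.lower(), set()) for word in keywords]
    return sorted(set.intersection(*hits))

# Hypothetical sample documents, used only to exercise the functions.
docs = {
    "a": "Branch prediction in superscalar processors",
    "b": "Cache prediction",
}
index = build_index(docs)
print(keyword_search(index, "branch", "prediction"))
```

The same index doubles as a cross reference: looking up a term lists every document that mentions it.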

Development Plan review (continued)

- Build categories from the characteristics
  - Consider generating a tool that allows you to describe a different subset from the rest of the raw data
  - Create higher level categories based upon common subsets of characteristics
  - Repeat process until top level categories or characteristics conform to existing high level classifications or prove alternate categories
- Place subjects into categories
- Review categorization
  - Manually analyze results
  - Test existing categorization on remaining subjects of the initially selected subset
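
One simple way to place a held-out subject into existing categories is to count how many characteristics it shares with each category. The sketch below assumes categories are represented as sets of characteristic terms; `place_subject`, the `min_shared` threshold, and the sample categories are hypothetical and not part of the project plan.

```python
def place_subject(subject_terms, categories, min_shared=1):
    """Return category names ranked by the number of shared characteristics.

    A subject may land in several categories; categories sharing fewer
    than `min_shared` characteristics with the subject are left out.
    """
    subject = set(subject_terms)
    scores = {name: len(subject & set(terms)) for name, terms in categories.items()}
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return [name for name in ranked if scores[name] >= min_shared]

# Hypothetical categories, each defined by a set of characteristic terms.
categories = {
    "memory": {"cache", "latency", "dram"},
    "pipeline": {"branch", "pipeline", "latency"},
}
print(place_subject({"cache", "latency"}, categories))
```

Running this over the remaining subjects of the initial subset and comparing the result with a manual assignment gives a basic check of the categorization.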

Development Tools

- Natural Language Recognition via NLTK is the basis for initial research
  - Slow but well documented and supported
- Installation details (Win32 API)
  - NLTK Lite 0.6.3 w/ Corpora package
  - Python 2.4.2
  - PyWordNet
  - WordNet 2.1
  - Numarray 1.5

Ontology Subset

- Take SIGMICRO category as a single subject set
- Break data into subsets
  - Initial subset allows for simpler manual verification and validation
    - International Symposium on Microarchitecture
    - Initially a small subset of the available archive material will be used
  - Remaining subsets provide for further testing and validation of technique
- Additional subsets from the ACM documentation will be added as time permits

Defined Process

- Take a subset of the raw data elements and define the elements' characteristics
  - Read text in for processing
  - Tokenize text
  - Perform probabilistic parsing
    - via ViterbiParse
    - Consider other parsing techniques if time permits
    - Consider training the parsing process
  - Select tokens for analysis
    - Supposition: Nouns will provide adequate tokens to define characteristics
    - Potential Goal: identify a ‘reasonable’ subset of tokens for use as characteristics
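
The read, tokenize, parse, and select steps above can be sketched end to end. The example below does not use NLTK's ViterbiParse; it is a small hand-written Viterbi (probabilistic CKY) parser over a toy grammar, written in current Python 3 rather than the Python 2.4.2 listed under Development Tools. The grammar, its probabilities, and all function names are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Toy grammar in Chomsky normal form. All rules and probabilities are
# made up for illustration; a real run would estimate them from a treebank.
BINARY = {  # (left child, right child) -> [(parent, rule probability)]
    ("NP", "VP"): [("S", 1.0)],
    ("DT", "NN"): [("NP", 1.0)],
    ("VB", "NP"): [("VP", 1.0)],
}
LEXICAL = {  # word -> [(part-of-speech tag, probability of tag -> word)]
    "the": [("DT", 1.0)],
    "cache": [("NN", 0.5), ("VB", 0.2)],
    "latency": [("NN", 0.5)],
    "reduces": [("VB", 0.8)],
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_tree(best, i, j, label):
    """Follow backpointers to rebuild the best subtree for label over (i, j)."""
    _, back = best[(i, j)][label]
    if isinstance(back, str):  # lexical entry: the backpointer is the word
        return (label, back)
    k, left, right = back
    return (label, build_tree(best, i, k, left), build_tree(best, k, j, right))

def viterbi_parse(tokens):
    """Return (tree, probability) of the most probable S parse, or None."""
    n = len(tokens)
    best = defaultdict(dict)  # (start, end) -> {label: (prob, backpointer)}
    for i, word in enumerate(tokens):
        for tag, prob in LEXICAL.get(word, []):
            best[(i, i + 1)][tag] = (prob, word)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = best[(i, j)]
            for k in range(i + 1, j):
                for left, (p_left, _) in best[(i, k)].items():
                    for right, (p_right, _) in best[(k, j)].items():
                        for parent, p_rule in BINARY.get((left, right), []):
                            prob = p_rule * p_left * p_right
                            if prob > cell.get(parent, (0.0, None))[0]:
                                cell[parent] = (prob, (k, left, right))
    if "S" not in best[(0, n)]:
        return None
    return build_tree(best, 0, n, "S"), best[(0, n)]["S"][0]

def nouns(tree):
    """Collect the words tagged NN, left to right, from a parse tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]] if label == "NN" else []
    return [word for child in children for word in nouns(child)]

tree, probability = viterbi_parse(tokenize("The cache reduces the latency."))
print(nouns(tree))
```

Selecting only the NN leaves of the best parse corresponds to the supposition that nouns are adequate tokens for defining characteristics.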

Defined Process (continued)

- Select tokens for analysis (continued)
  - May be reasonable to use only a subset of nouns
    - Proper nouns are likely to have little impact if removed
    - Redundant terms and synonyms should likely be consolidated
  - What impact would the use of other types (e.g. verbs) have in generating characteristics?
  - Limiting to nouns will greatly reduce the amount of information to be processed
    - Reduces processing time, thereby allowing faster generation of results in a time-consuming process
    - Defines a bound on what constitutes a characteristic and thereby reduces the volume of data to be manually reviewed during development
  - Will initially require additional testing to verify concept
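
The noun filtering described above can be sketched as a pass over part-of-speech tagged tokens. The example assumes Penn Treebank style tags (NN, NNS for common nouns, NNP for proper nouns) and uses a small hand-written synonym table as a stand-in for a WordNet lookup; the table and the name `characteristic_terms` are hypothetical.

```python
from collections import Counter

# Hypothetical synonym table mapping variants to one canonical term.
# In the project this role would presumably fall to WordNet.
SYNONYMS = {
    "processor": "cpu",
    "processors": "cpu",
    "cpus": "cpu",
    "caches": "cache",
}

def characteristic_terms(tagged_tokens):
    """Count common nouns, dropping proper nouns and merging synonyms."""
    counts = Counter()
    for word, tag in tagged_tokens:
        if tag not in ("NN", "NNS"):  # skips NNP/NNPS, verbs, and the rest
            continue
        term = word.lower()
        counts[SYNONYMS.get(term, term)] += 1
    return counts

# Hypothetical tagged tokens, as a tagger or parser might produce them.
tagged = [
    ("Intel", "NNP"), ("processor", "NN"), ("uses", "VBZ"),
    ("caches", "NNS"), ("CPU", "NN"), ("cache", "NN"),
]
print(characteristic_terms(tagged))
```

Adding verb tags to the accepted set is a one-line change, which makes it easy to test the impact of other token types.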

Defined Process (continued)

- Based on common characteristics, develop categories
  - Analyze each individual document’s parse tree
  - Use statistical analysis of parse trees between documents
    - Supposition: Higher frequency of terms relative to all documents implies higher level characteristic
    - Potential Goal: Identify a ‘reasonable’ subset of term interrelations for use as characteristics
  - Assume that some raw data values will cross categories
- Group elements into those categories
  - Identify common characteristics associated with other characteristics
  - Identify higher level characteristics and categories from the generated categories associated with the raw data
  - Recursive categorization approach
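
The frequency supposition and the recursive approach above can be sketched together: terms that occur in more documents become higher level categories, and the documents under each category are categorized again on their remaining terms. This is one possible reading of the slides, not the project's method; `build_hierarchy`, the `min_docs` threshold, and the sample data are assumptions, and the naive recursion is only suitable for small inputs.

```python
from collections import Counter

def build_hierarchy(doc_terms, min_docs=2):
    """Recursively group documents under their most widely shared terms.

    doc_terms maps a document id to its set of characteristic terms.
    Returns {term: {"docs": [...], "children": {...}}}. A document can
    appear under several categories, so values may cross categories.
    """
    # Document frequency: in how many documents does each term occur?
    frequency = Counter(term for terms in doc_terms.values() for term in set(terms))
    hierarchy = {}
    for term, count in frequency.most_common():
        if count < min_docs:
            break  # remaining terms are too rare to form a category
        members = {
            doc: set(terms) - {term}
            for doc, terms in doc_terms.items()
            if term in terms
        }
        hierarchy[term] = {
            "docs": sorted(members),
            "children": build_hierarchy(members, min_docs),
        }
    return hierarchy

# Hypothetical documents with already extracted characteristic terms.
docs = {
    "d1": {"cache", "pipeline"},
    "d2": {"cache", "branch"},
    "d3": {"cache", "pipeline"},
    "d4": {"compiler"},
}
hierarchy = build_hierarchy(docs)
print(list(hierarchy))
```

Each level removes the term that defined it, so the recursion stops once no remaining term is shared by at least `min_docs` documents.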

Current Development Focus

- Automate retrieval of documents
  - Obtain documents from web sources automatically
  - Convert documents for use in the NLTK environment
- Automate execution of the analysis of documents
  - Python-based code to handle processing in batch-style execution
  - Use existing NLTK tools where available
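
A batch driver for the retrieval and conversion steps above might look like the following. It is a sketch in current Python 3 using only the standard library (`urllib.request` and `html.parser`), not the project's code; `html_to_text`, `fetch`, `run_batch`, and the `analyze` callback are hypothetical names, and the sources are assumed to be HTML pages.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Convert an HTML string to plain text with normalized whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

def fetch(url):
    """Download a page and decode it as UTF-8 (an assumed encoding)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")

def run_batch(urls, analyze, fetch=fetch):
    """Fetch, convert, and analyze each URL; return {url: analysis result}."""
    return {url: analyze(html_to_text(fetch(url))) for url in urls}
```

Passing `fetch` as a parameter lets the batch run against locally saved files or canned strings while the analysis step is still under development.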

