nlp.ipipan.waw.plNLP-SEMINAR0310

SProUT

Shallow Processing with Unification and Typed Feature Structures





Jakub Piskorski

Language Technology Lab

DFKI GmbH









Jakub Piskorski, Warszawa, 6.10.2003

Information Extraction



Munich, February 18, 1997, Siemens AG and The General Electric Company (GEC), London, have merged their UK private communication systems and networks activities to form a new company, Siemens GEC Communication Systems Limited.












JOINT-VENTURE FOUNDATION EVENT

VENTURE: Siemens GEC Communication Systems Limited
PARTNERS: Siemens AG, The General Electric Company (GEC)
TIME: February 18, 1997
PRODUCT/SERVICE: communication systems, networks activities
LOCATION: Munich




Finite-State based approaches





• SPPC – purely finite-state based STP (shallow text processing), small number of basic predicates
• SMES – predicates inspect arbitrary properties of the input tokens/fragments
• FASTUS – uses CPSL (Common Pattern Specification Language)
• GATE – uses JAPE (Java Annotation Patterns Engine)










Motivation for SProUT





• One system for multilingual and domain-adaptive shallow text processing
• Trade-off between efficiency and expressiveness
• Modularity
• Flexible integration of different processing modules
• Portability
• Industrial standards










Credits





SProUT is joint work by:
Witold Drożdżyński, Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, Feiyu Xu










SProUT Architecture



[Figure: SProUT architecture. Grammar development environment: XTDL grammar, regular compiler, extended optimized finite-state network. Online processing: input data, linguistic processing resources, lexical resources, stream of text items, XTDL interpreter, structured output data. Further components shown: JTFS, finite-state machine toolkit.]










Core Components – FSM Toolkit





• Finite-state machine toolkit for building, combining, and optimizing finite-state devices
• Finite-state machine models: FSA, WFSA, FST, WFST
• Arbitrary real-valued semirings
• Some new crucial STP-relevant operations (e.g., incremental construction of minimal deterministic FSAs)
• Various memory models
• Functionality similar to AT&T tools










Core Components – Regular Compiler



• Definition and configuration via XML
• Unicode compatible
• Extensible set of circa 20 operations
• Scanner definitions vs. general regular expressions
• Biasing the optimization process
• Various ways of handling ambiguities
• Direct database connection for flexible pattern-based transformation of linguistic resources into an optimized FS representation
• Regular expressions over TFSs (SProUT) with restrictions






Core Components – Typed Feature Structure Package





• Java implementation of TFSs
• Efficient unification operations
• Dynamic extension of the type hierarchy
• Other operations: subsumption checking, deep copying, path selection, feature iteration, and various printers










XTDL Formalism





• Combines typed feature structures (TFS) and regular expressions, including coreferences and functional application
• XTDL grammar rules – production part on the LHS, output description on the RHS
• TDL is used to establish a type hierarchy of linguistic entities



morph := sign & [POS atom,
                 STEM atom,
                 INFL infl]

[Figure: type hierarchy rooted at *top*; the types shown are atom, *avm*, *rule*, tense, sign, infl, index-avm, present, token, morph, lang, tokentype, de, en, separator, url]










XTDL Formalism





• A couple of standard regular operators:
    - concatenation
    - optionality ?
    - disjunction |
    - Kleene star *
    - Kleene plus +
    - n-fold repetition {n}
    - m-n span repetition {m,n}
• Unidirectional coreference under Kleene star (and restricted iteration):

    [POS Det, ...] ([POS Adj, ..., RELN %LIST])* [POS Noun, ...] -> [..., RELN %LIST]










XTDL Formalism



loc-pp :>
    morph & [POS Prep & #preposition,
             INFL [CASE #1, NUMBER #2, GENDER #3]]
    morph & [POS Determiner,
             INFL [CASE #1, NUMBER #2, GENDER #3]] ?
    morph & [POS Adjective,
             INFL [CASE #1, NUMBER #2, GENDER #3]] *
    gazetteer & [TYPE general-location,
                 SURFACE #location]
    -> [CAT location-pp,
        PREP #preposition,
        LOCATION #location].






XTDL Interpreter





1. Matching of regular patterns using unifiability (LHS)
2. LHS pattern instance creation
3. Unification of the rule instance and the matched input

• Longest-match strategy
• Ambiguities allowed
• Interpreter generates TFSs as output (cascaded architecture)
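
The matching and unification steps above can be illustrated with a minimal Java sketch. It assumes a drastically simplified model: feature structures are flat maps from attribute names to atomic values, with no type hierarchy, nesting, or coreferences. The class `FlatUnifier` and its methods are hypothetical and are not part of the SProUT TFS package.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified stand-in for typed feature structures:
// a flat map from attribute names to atomic values.
class FlatUnifier {

    // Step 1: two structures are unifiable if every attribute
    // they share carries the same value.
    static boolean unifiable(Map<String, String> a, Map<String, String> b) {
        for (Map.Entry<String, String> e : a.entrySet()) {
            String other = b.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) {
                return false;
            }
        }
        return true;
    }

    // Step 3: unification merges the information of both structures,
    // or fails (empty Optional) if some shared attribute clashes.
    static Optional<Map<String, String>> unify(Map<String, String> a,
                                               Map<String, String> b) {
        if (!unifiable(a, b)) {
            return Optional.empty();
        }
        Map<String, String> result = new LinkedHashMap<>(a);
        result.putAll(b);
        return Optional.of(result);
    }

    public static void main(String[] args) {
        Map<String, String> pattern = Map.of("POS", "Prep");
        Map<String, String> input =
                Map.of("SURFACE", "im", "STEM", "im", "POS", "Prep");
        // The pattern is compatible with the input token, so unification succeeds.
        System.out.println(unify(pattern, input).isPresent());
        // A clashing POS value makes unification fail.
        System.out.println(unify(Map.of("POS", "Adjective"), input).isPresent());
    }
}
```

Checking unifiability first and unifying only the selected candidate mirrors the order of the steps on the slide; the real interpreter works on full typed feature structures.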










XTDL Interpreter





• Matched input sequence “im sonnigen Rom” (in sunny Rome)

rule [IN < morph [SURFACE im,
                  STEM im,
                  POS Prep,
                  INFL infl [CASE nom, NUMBER plural, GENDER fem]],
           morph [SURFACE sonnigen,
                  STEM sonnig,
                  POS Adjective,
                  INFL infl [CASE case, NUMBER number, GENDER gender]],
           gazetteer [SURFACE Rom,
                      TYPE general-location] >]














XTDL Interpreter





• Rule with an instantiated pattern on the LHS

rule [IN < morph [POS #5 Prep,
                  INFL infl [CASE #1, NUMBER #2, GENDER #3]],
           morph [POS Adjective,
                  INFL infl [CASE #1, NUMBER #2, GENDER #3]],
           gazetteer [SURFACE #4,
                      TYPE general-location] >,
      OUT phrase [CAT location-np,
                  PREP #5,
                  LOCATION #4]]










XTDL formalism





• Unified result

rule [IN < morph [SURFACE im,
                  STEM im,
                  POS #5 Prep,
                  INFL infl [CASE #1 dat, NUMBER #2 sing, GENDER #3 neut]],
           morph [SURFACE sonnigen,
                  STEM sonnig,
                  POS Adjective,
                  INFL infl [CASE #1, NUMBER #2, GENDER #3]],
           gazetteer [SURFACE #4 Rom,
                      TYPE general-location] >,
      OUT phrase [CAT location-np,
                  PREP #5,
                  LOCATION #4]]










Linguistic Processing Resources





• Tokenization
• Gazetteer
• Extended Gazetteer
• Morphology
• Sentence Splitter
• Reference Matcher










Tokenization



• Text segmentation into tokens
• Fine-grained token classification (ca. 30 types)
    complex_compound_first_capital: AT&T-Chief
• Token postsegmentation
• Token subclassification
    contains_position_sufix: AT&T-Chief

    Information
    [START: 25,
     END: 34,
     MAIN: first_capital,
     SUB: < [LANG: german, DOM: any, SPEC: has_noun_ending],
            [LANG: english, DOM: any, SPEC: has_noun_ending] >]
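
Fine-grained token classification can be pictured as an ordered cascade of patterns. The sketch below is only an illustration: the type names are taken from these slides, but the regular expressions, the fallback label `other`, and the class `TokenClassifier` are assumptions and do not reproduce the SProUT tokenizer's actual definitions.

```java
import java.util.regex.Pattern;

// Hypothetical token classifier: the first matching pattern determines the type.
class TokenClassifier {
    // Uppercase letter followed by lowercase letters only, e.g. "Information".
    private static final Pattern FIRST_CAPITAL =
            Pattern.compile("\\p{Lu}\\p{Ll}+");
    // Uppercase letter followed by letters of mixed case, e.g. "McDonald".
    private static final Pattern MIXED =
            Pattern.compile("\\p{Lu}\\p{L}+");
    // Letter sequences joined by hyphens, e.g. "Baden-Baden".
    private static final Pattern HYPHENATED =
            Pattern.compile("\\p{Lu}\\p{L}*(-\\p{L}+)+");
    // Hyphenated compound whose first part contains a non-letter, e.g. "AT&T-Chief".
    private static final Pattern COMPLEX =
            Pattern.compile("\\p{Lu}\\S*[^\\p{L}-]\\S*-\\S+");

    static String classify(String token) {
        if (FIRST_CAPITAL.matcher(token).matches()) return "first_capital_word";
        if (MIXED.matcher(token).matches()) return "mixed_word_first_capital";
        if (HYPHENATED.matcher(token).matches()) return "word_with_hyphen_first_capital";
        if (COMPLEX.matcher(token).matches()) return "complex_compound_first_capital";
        return "other"; // assumed fallback label, not a type from the slides
    }

    public static void main(String[] args) {
        for (String token : new String[] {"Information", "McDonald", "Baden-Baden", "AT&T-Chief"}) {
            System.out.println(token + " : " + classify(token));
        }
    }
}
```

Ordering the patterns from most specific character class to least specific keeps each token in exactly one main class; subclassification would then add further attributes to the result.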








Gazetteer/Extended Gazetteer



• For storing static named entities (e.g. locations) or keywords (e.g. company designators, month names, etc.)
• Extended Gazetteer allows for associating entries with a list of arbitrary attribute-value pairs (and uses path compression)

    ...
    Warsaw | gaz_type:city | concept:Warsaw
    Warszawa | gaz_type:city | concept:Warsaw
    Varsovie | gaz_type:city | concept:Warsaw
    ...

• Case-sensitive/insensitive mode
• Unicode compatibility
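
The entry format shown above can be read into a lookup table as follows. This is a minimal sketch under stated assumptions: a plain `HashMap` stands in for the compressed finite-state representation, and the class `ExtendedGazetteer` with its methods `addLine` and `lookup` is hypothetical, not SProUT's API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical extended gazetteer: surface form -> attribute-value pairs.
class ExtendedGazetteer {
    private final Map<String, Map<String, String>> entries = new HashMap<>();
    private final boolean caseSensitive;

    ExtendedGazetteer(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    // In case-insensitive mode all keys are normalized to lower case.
    private String key(String surface) {
        return caseSensitive ? surface : surface.toLowerCase(Locale.ROOT);
    }

    // Parses a line such as "Warszawa | gaz_type:city | concept:Warsaw".
    void addLine(String line) {
        String[] fields = line.split("\\|");
        Map<String, String> attributes = new HashMap<>();
        for (int i = 1; i < fields.length; i++) {
            String[] pair = fields[i].trim().split(":", 2);
            if (pair.length == 2) {
                attributes.put(pair[0], pair[1]);
            }
        }
        entries.put(key(fields[0].trim()), attributes);
    }

    // Returns the attributes of an entry, or null if the surface form is unknown.
    Map<String, String> lookup(String surface) {
        return entries.get(key(surface));
    }

    public static void main(String[] args) {
        ExtendedGazetteer gaz = new ExtendedGazetteer(false);
        for (String line : List.of(
                "Warsaw | gaz_type:city | concept:Warsaw",
                "Warszawa | gaz_type:city | concept:Warsaw",
                "Varsovie | gaz_type:city | concept:Warsaw")) {
            gaz.addLine(line);
        }
        // All three surface variants map to the same concept.
        System.out.println(gaz.lookup("WARSZAWA").get("concept"));
    }
}
```

Mapping several surface variants to one `concept` value is what lets a grammar rule refer to the city independently of the language of the text.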








Morphology



• Compactification of available full-form lexica
• External components implemented as servers

Full-form lexica obtained from ‘compactified’ MMORPH:
    English – 200,000 entries
    German – 830,000 entries + shallow compound recognition
    French – 225,000 entries
    Spanish – 570,000 entries
    Italian – 330,000 entries
    Dutch – ? entries (under development)
Asian languages:
    Chinese – Shanxi
    Japanese – Chasen
Other:
    Czech – 600,000 entries + HMM-based part-of-speech tagging
    Polish – 120,000 lexemes (Morfeusz)
    Lithuanian – Lemouklis
    Russian – under acquisition






Morphology



• Compound recognition and segmentation for German
    - Bracketing of “Biergartenfest”: [Bier [garten fest]] vs. [[Bier garten] fest]
    - Segmentation: “Wein” + “sorten” (wine types) vs. “Wein” + “s” + “orten” (wine places)

    [Figure: segmentation alternatives for “Autoradiozubehör” (car radio equipment)]

• Next: adaptation for processing Dutch compounds
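
The segmentation ambiguity in the “Wein” + “sorten” example can be reproduced with a small exhaustive search. The sketch assumes a toy lexicon of lower-cased stems and a single linking element “s”; the class `CompoundSegmenter` is hypothetical and says nothing about how SProUT's shallow compound recognition is actually implemented. It returns flat segmentations only and does not compute bracketings such as [Bier [garten fest]].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical compound segmenter: enumerates every way to split a word
// into lexicon entries, optionally separated by a linking "s".
class CompoundSegmenter {

    static List<List<String>> segment(String word, Set<String> lexicon) {
        List<List<String>> results = new ArrayList<>();
        collect(word, 0, lexicon, new ArrayList<>(), results);
        return results;
    }

    private static void collect(String word, int start, Set<String> lexicon,
                                List<String> prefix, List<List<String>> results) {
        if (start == word.length()) {
            results.add(new ArrayList<>(prefix)); // whole word consumed
            return;
        }
        for (int end = start + 1; end <= word.length(); end++) {
            String part = word.substring(start, end);
            if (!lexicon.contains(part)) {
                continue;
            }
            prefix.add(part);
            // Continue directly after the recognized part.
            collect(word, end, lexicon, prefix, results);
            // Alternatively consume a linking "s", if more material follows it.
            if (word.startsWith("s", end) && end + 1 < word.length()) {
                prefix.add("s");
                collect(word, end + 1, lexicon, prefix, results);
                prefix.remove(prefix.size() - 1);
            }
            prefix.remove(prefix.size() - 1);
        }
    }

    public static void main(String[] args) {
        Set<String> lexicon = Set.of("wein", "sorten", "orten");
        // Yields both readings: wein+sorten and wein+s+orten.
        System.out.println(segment("weinsorten", lexicon));
    }
}
```

A real component would additionally rank the alternatives, since the exhaustive search happily returns implausible readings such as “wine places”.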










System Description Language





• Construction of a concrete system instance via definition of a regular expression of module specifications

    M1 + M2    output of M1 serves as the input to M2
    M*         fixpoint computation
    M1 | M2    quasi-parallel computation of independent modules

• All linguistic modules must implement a specific Java interface
• Automatic compilation of the system description into a single Java class










System Description Language



(M1 + M2)(input)
    M1.clearState();
    M1.setInput(input);
    M1.setOutput(M1.computeOutput(M1.getInput()));
    M2.clearState();
    M2.setInput(mediateSeq(M1, M2));    // hand the output of M1 over to M2
    M2.setOutput(M2.computeOutput(M2.getInput()));
    return M2.getOutput();

(M*)(input)
    M.clearState();
    M.setInput(input);
    M.setOutput(mediateFix(M));         // iterate M until a fixpoint is reached
    return M.getOutput();
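
The sequence and fixpoint operators can also be modelled as ordinary higher-order functions. The following sketch assumes that a module is simply a function from input to output, which ignores the state handling (`clearState`, `setInput`, ...) shown above; the class `SdlCombinators` and the methods `seq` and `fix` are hypothetical names, and the parallel operator is left out.

```java
import java.util.function.UnaryOperator;

// Hypothetical model of two system description operators as combinators.
class SdlCombinators {

    // Sequence: the output of m1 serves as the input to m2.
    static <T> UnaryOperator<T> seq(UnaryOperator<T> m1, UnaryOperator<T> m2) {
        return input -> m2.apply(m1.apply(input));
    }

    // Fixpoint: apply m repeatedly until the output no longer changes.
    // This does not terminate if m never reaches a fixpoint.
    static <T> UnaryOperator<T> fix(UnaryOperator<T> m) {
        return input -> {
            T current = input;
            T next = m.apply(current);
            while (!next.equals(current)) {
                current = next;
                next = m.apply(current);
            }
            return current;
        };
    }

    public static void main(String[] args) {
        UnaryOperator<String> trim = String::trim;
        UnaryOperator<String> squeeze = s -> s.replace("  ", " ");
        // Corresponds to the system description "trim followed by squeeze*".
        UnaryOperator<String> system = seq(trim, fix(squeeze));
        System.out.println(system.apply("  in   sunny    Rome "));
    }
}
```

Because a system description is itself a regular expression over modules, combinators of this kind compose freely, which is the property the automatic compilation into a single Java class relies on.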










Optimization of Grammar processing





• Problem: TFSs are treated as symbolic values by the FSM Toolkit
• Sorting outgoing transitions from selected states (transition hierarchy under subsumption)
    - flat trees for bad-style grammars
• Extending the transition hierarchy via additional nodes

    [Figure: transition hierarchy with the nodes [TOP], [TOKEN], [MORPH stem: "Prof."], [GAZETTEER type: X]]
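
The benefit of ordering transitions under subsumption is that one failed test on a general label rules out every more specific label below it. The sketch below illustrates this pruning under strong assumptions: labels are flat attribute-value maps, every child repeats all constraints of its parent, and the class `TransitionNode` is hypothetical rather than part of the FSM Toolkit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical node of a transition hierarchy ordered by subsumption.
class TransitionNode {
    final Map<String, String> label;
    final List<TransitionNode> children = new ArrayList<>();

    TransitionNode(Map<String, String> label) {
        this.label = label;
    }

    // Adds a more specific transition below this one; returns this for chaining.
    TransitionNode add(TransitionNode child) {
        children.add(child);
        return this;
    }

    // A label is compatible with the input if no shared attribute clashes.
    static boolean unifiable(Map<String, String> label, Map<String, String> input) {
        for (Map.Entry<String, String> e : label.entrySet()) {
            String value = input.get(e.getKey());
            if (value != null && !value.equals(e.getValue())) {
                return false;
            }
        }
        return true;
    }

    // Collects all compatible labels. A failing node prunes its whole subtree,
    // because every label below it contains the same clashing constraint.
    void candidates(Map<String, String> input, List<Map<String, String>> out) {
        if (!unifiable(label, input)) {
            return;
        }
        out.add(label);
        for (TransitionNode child : children) {
            child.candidates(input, out);
        }
    }

    public static void main(String[] args) {
        TransitionNode top = new TransitionNode(Map.of())
                .add(new TransitionNode(Map.of("TYPE", "token")))
                .add(new TransitionNode(Map.of("TYPE", "morph"))
                        .add(new TransitionNode(Map.of("TYPE", "morph", "STEM", "Prof."))))
                .add(new TransitionNode(Map.of("TYPE", "gazetteer")));
        List<Map<String, String>> out = new ArrayList<>();
        top.candidates(Map.of("TYPE", "morph", "STEM", "Dr."), out);
        System.out.println(out.size() + " candidate transitions");
    }
}
```

With a flat hierarchy every transition hangs directly below the top node and nothing is pruned, which is why additional intermediate nodes pay off.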










Optimization of Grammar processing





• Input text consisting of 32 520 words, 157 080 characters, 22 pages + English grammar for NE (circa 700 transitions from the initial state)
• Run-time behaviour with Tokenizer/Gazetteer/Morphology:

    before:  overall: 17.7 seconds    candidate pattern selection: 11.6 seconds
    now:     overall: 13.2 seconds    candidate pattern selection: 6.9 seconds










Optimization of Grammar processing





• Using restrictions during compilation of XTDL grammars into FS format
• “Determinization under subsumption” -> approximation
• “Expansion” techniques for highly recursive grammars










Adapting SProUT to processing Polish





• Tokenization – trivial
• Morphology – integration of Morfeusz (Marcin Woliński)
• Part-of-speech disambiguation – ?
• Gazetteer – several strategies:
    - list all inflectional variants with additional morphological information
    - interplay between gazetteer and morphology
    - component for guessing morphological information of unknown words
• Grammar adaptation
    - provide additional information to control inflection by using the STEM attribute instead of SURFACE






Future Work





• Further work on the optimization of grammar processing
• Various search strategies
• Additional linguistic processing resources
• Adapting to the processing of new languages
• Real data testing: large grammars and real-world texts
• Utilization in research and industrial projects










Examples – Simple grammar for person names



;; dummy rule for title
title :/ gazetteer & [SURFACE #title, GTYPE gaz_title] -> #title.

;; dummy rule for position
position :/ gazetteer & [SURFACE #position, GTYPE gaz_position] -> #position.

;; dummy rule for complex position, e.g. "Direktor und CEO"
complex_position :/
    (gazetteer & [GTYPE gaz_position, SURFACE #pos1]
     token & [SURFACE "und"]
     gazetteer & [GTYPE gaz_position, SURFACE #pos2])
    -> #position, where #position = Append(#pos1," ","und"," ",#pos2).










Examples – Simple grammar for person names



;; dummy rule for given name
given_name :/ gazetteer & [SURFACE #name, GTYPE gaz_given_name] -> #name.

;; dummy rule for name suffix such as "Jr."
name_suffix :/
    (token & [SURFACE ","] ?)
    (token & [SURFACE "Jr" & #suffix] | token & [SURFACE "jr" & #suffix])
    (token & [SURFACE "."] ?)
    -> #suffix.

;; dummy rule for initial "M." and middle name
initial :/
    (gazetteer & [GTYPE gaz_initial, SURFACE #initial]
     token & [SURFACE "."] ?)
    -> #middle, where #middle = Append(#initial, ".").








Examples – Simple grammar for person names



;; dummy rule for infix like "van", "van der"
infix :/ gazetteer & [GTYPE gaz_name_infix, SURFACE #infix] -> #infix.

;; dummy rule for last name
last_name :/
      token & [TYPE first_capital_word, SURFACE #name]
    | token & [TYPE mixed_word_first_capital, SURFACE #name]
    | token & [TYPE word_with_hyphen_first_capital, SURFACE #name]
    | token & [TYPE word_with_apostrophee_first_capital, SURFACE #name]
    -> #name.

;; dummy rule for last name with infix
last_name_with_infix :/
    @seek(infix) & #infix
    @seek(last_name) & #last_name
    -> #last, where #last = Append(#infix," ",#last_name).




Examples – Simple grammar for person names



;; rule for person names, example: Direktor und CTO Prof. Dr. hab. Witold P. van der Berg, Jr.
person :>
    ((@seek(position) & #pos | @seek(complex_position) & #pos) token & [TYPE comma] ?)?
    @seek(title) & #title ?
    (@seek(given_name) & #given_name (@seek(given_name) & #given_name_extra ?)
     | (@seek(initial) & #given_name))
    @seek(initial) & #middle1 ?
    @seek(initial) & #middle2 ?
    (@seek(last_name) & #last_name | @seek(last_name_with_infix) & #last_name)
    @seek(name_suffix) & #suffix ?
    -> ne-person & [GIVEN_NAME #first_name,
                    TITLE #title,
                    SURNAME #last_name,
                    P-POSITION #pos,
                    NAME-SUFFIX #suffix],
       where #first_name = ConcWithBlanks(#given_name,#given_name_extra,#middle1,#middle2).
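
The functional operator used in the `where` clause can be pictured as a small string function. The sketch below is a guess at its behaviour based on its name only: it assumes that variables of unmatched optional elements arrive as `null` and are skipped. The class `FunctionalOperators` and the method `concWithBlanks` are hypothetical and not the SProUT implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical re-implementation of a ConcWithBlanks-style operator.
class FunctionalOperators {

    // Joins all bound, non-empty arguments with single blanks.
    // Unbound arguments are assumed to be passed as null.
    static String concWithBlanks(String... parts) {
        List<String> bound = new ArrayList<>();
        for (String part : parts) {
            if (part != null && !part.isEmpty()) {
                bound.add(part);
            }
        }
        return String.join(" ", bound);
    }

    public static void main(String[] args) {
        // #given_name = "Witold", #given_name_extra unbound,
        // #middle1 = "P.", #middle2 unbound.
        System.out.println(concWithBlanks("Witold", null, "P.", null));
    }
}
```

Skipping unbound arguments avoids doubled blanks in the GIVEN_NAME value when the optional second given name or the middle initials are absent.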






Examples – Embedding rules



simple_noun_phrase :> .................
    -> phrase & [CAT np,
                 SURFACE #info,
                 AGR [N #n,
                      C #c,
                      G #g]], where #info=..........

simple_event :>
    @seek(person) & #person
    morph & [POS verb, STEM #action]
    @seek(simple_noun_phrase) & [SURFACE #info]
    -> [PERSON #person, ACTION #action, OBJECT #info].











