LING 581 Advanced Computational Linguistics

Lecture Notes
January 26th
Penn Treebank
• Bracketing guidelines
Ungraded Homework Exercise
• Search for NP trace relative clauses as defined below:
• Be ready to compare your search pattern and the number found next time in class
Ungraded Homework Exercise
• Search term: @NP < @NP < @SBAR
• Frequency: 12038
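A minimal sketch, in Python rather than Tregex itself, of what the search term above asks for: a node whose basic category is NP that immediately dominates both an NP child and an SBAR child. The helper names (parse_tree, basic_category, count_matches) are made up for illustration, and the basic-category rule here is a simplification of what @ does. The sketch counts each matching NP node once, which may differ from a tool that reports every way a node can match.

```python
import re

def parse_tree(text):
    """Parse one bracketed Penn Treebank tree into (label, children) tuples.
    Words (leaves) are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok                      # a word
        label = ""                          # the outermost PTB bracket has no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1                            # skip the closing bracket
        return (label, children)

    return read()

def basic_category(label):
    """Simplified stand-in for @: NP-SBJ-1 -> NP; labels like -NONE- are kept whole."""
    if label.startswith("-"):
        return label
    return re.split(r"[-=]", label)[0]

def count_matches(tree):
    """Count nodes that satisfy @NP < @NP < @SBAR (both relations hold of the same parent)."""
    if isinstance(tree, str):
        return 0
    label, children = tree
    cats = [basic_category(c[0]) for c in children if not isinstance(c, str)]
    here = int(basic_category(label) == "NP" and "NP" in cats and "SBAR" in cats)
    return here + sum(count_matches(c) for c in children)

example = ("(NP (NP (DT the) (NN man)) "
           "(SBAR (WHNP-1 (WP who)) (S (NP-SBJ (-NONE- *T*-1)) (VP (VBD left)))))")
print(count_matches(parse_tree(example)))   # prints 1
```

Note that both < relations attach to the first NP, so the pattern does not require the inner NP to dominate the SBAR.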
Ungraded Homework Exercise
• Search term: @NP < @NP < @SBAR plus WH indices
• Frequency: 10956, down from 12038
Ungraded Homework Exercise
• Search term: @NP < @NP < (@SBAR < /^-NONE-/)
• Frequency: 529
• Note: -NONE- < *ICH*
Ungraded Homework Exercise
• Not all matches of @NP < @NP < (@SBAR < /^-NONE-/) are relative clauses
Ungraded Homework Exercise
• Search term: @NP < @NP < (@SBAR < /^-NONE-/) plus *ICH*
• Count drops from 529 to 166
Ungraded Homework Exercise
• Search term: @NP < @NP < (@SBAR < /^-NONE-/) plus *ICH*
• Is 166 too low?
• How about other -NONE- nodes?
Ungraded Homework Exercise
• Final tally
  Search term                                     Frequency
  @NP < @NP < @SBAR plus WH indices               10956
  @NP < @NP < (@SBAR < /^-NONE-/) plus *ICH*      166
  @NP < @NP < (@SBAR < /^-NONE-/) plus *RNR*      8
  TOTAL                                           11130
Homework Exercise
• Use the bracketing guides and choose three “interesting” constructions
• Find all the occurrences in WSJ PTB
           Homework Exercise
• 581 Homework rules
  – Due next lecture
  – Present your findings in class (slides)
Parsing
… from Treebank search to stochastic parsers trained on the WSJ Penn Treebank
                Bikel Collins
• Java re-implementation of Collins’ parser
• Paper
  – Daniel M. Bikel. 2004. Intricacies of Collins’ Parsing Model. Computational Linguistics, 30(4), pp. 479-511.
  – http://www.cis.upenn.edu/~dbikel/papers/collins-intricacies.pdf
• Software
  – http://www.cis.upenn.edu/~dbikel/
                     Bikel Collins
• Download and install Dan Bikel’s parser

• File: install.sh
   – The parser itself is Java code
   – The installer is a shell script (.sh), so at this point I think it won’t work on Windows
   – It might work on Windows once the files are extracted
                      Bikel Collins
• Download and install the POS tagger MXPOST
• Note: the parser doesn’t actually need a separate tagger…
                      Bikel Collins
• Training the parser with the WSJ PTB
• See guide
   – http://www.cis.upenn.edu/~dbikel/download/dbparser/guide.pdf

    directory:         TREEBANK_3/parsed/mrg/wsj
    sections 02-21:    create one single .mrg file
    events:            wsj-02-21.obj.gz
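One way to create the single .mrg file is sketched below in Python. It assumes the usual layout of one two-digit subdirectory per section under the wsj directory; the function name merge_sections and the output filename are chosen for illustration only and are not part of the parser distribution.

```python
from pathlib import Path

def merge_sections(wsj_dir, out_path, first=2, last=21):
    """Concatenate every .mrg file in sections first..last into one file.
    Returns the number of files merged."""
    wsj_dir = Path(wsj_dir)
    count = 0
    with open(out_path, "w", encoding="latin-1") as out:
        for section in range(first, last + 1):
            section_dir = wsj_dir / f"{section:02d}"   # e.g. wsj/02
            if not section_dir.is_dir():
                continue                               # skip sections that are missing
            for mrg in sorted(section_dir.glob("*.mrg")):
                out.write(mrg.read_text(encoding="latin-1"))
                out.write("\n")                        # keep trees from adjacent files apart
                count += 1
    return count

# Hypothetical usage:
# merge_sections("TREEBANK_3/parsed/mrg/wsj", "wsj-02-21.mrg")
```

Sorting the filenames keeps the merged file in corpus order, which makes it easier to compare against someone else's copy.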
              Bikel Collins
• Settings:
                   Bikel Collins
• Parsing
   – Command
   – Input file format (sentences)
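If you do run MXPOST first, its output has to be turned into parser input. The Python sketch below assumes two things that should be checked against the user guide: that MXPOST writes one sentence per line as word_TAG tokens, and that the parser accepts pre-tagged sentences in the form ((word (TAG)) (word (TAG)) ...). The function name is made up for illustration, and words containing parentheses are not escaped here.

```python
def mxpost_to_parser_input(tagged_line):
    """Convert one line of word_TAG tokens into a bracketed, pre-tagged sentence."""
    items = []
    for token in tagged_line.split():
        # Split on the last underscore so words that contain "_" survive intact.
        word, _, tag = token.rpartition("_")
        items.append(f"({word} ({tag}))")
    return "(" + " ".join(items) + ")"

print(mxpost_to_parser_input("The_DT dog_NN barks_VBZ ._."))
# prints ((The (DT)) (dog (NN)) (barks (VBZ)) (. (.)))
```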
                Bikel Collins
• Verify the trainer and parser work on your
  machine
                  Bikel Collins
• File: bin/parse is a shell script that sets up
  program parameters and calls java
                  Bikel Collins
• File: bin/train is another shell script
Bikel Collins
• Relevant WSJ PTB files
                  Bikel Collins
• If you have tcl/tk installed, I use a wrapper to call Dan Bikel’s code
   – makes it easy to work the parser without memorizing the command line options
                 Bikel Collins
• For tree viewing, you can use tregex
• For demos, I use my own viewer
Bikel Collins
Unix file descriptors
   0   Standard input    (stdin)
   1   Standard output   (stdout)
   2   Standard error    (stderr)

GUI components
   frame .input
   text .input.t -height 4 -yscrollcommand {.input.s set}
   scrollbar .input.s -command {.input.t yview}
   frame .tagged
   text .tagged.t -height 9 -yscrollcommand {.tagged.s set}
   scrollbar .tagged.s -command {.tagged.t yview}

Code
   proc tagger_input {} {
     set lines [.input.t get 1.0 end]
     set infile [open "/tmp/test.txt" w]
     puts -nonewline $infile [string trimright $lines]
     close $infile
   }

   proc parser_input {} {
     set lines [.tagged.t get 1.0 end]
     set infile [open "/tmp/test2.txt" w]
     puts -nonewline $infile [string trimright $lines]
     close $infile
   }

• POS tagging (MXPOST, in directory jmx)
   – tagger_input
   – $prefix/jmx/mxpost $prefix/jmx/tagger.project < /tmp/test.txt 2> /tmp/err.txt
• Parsing
   – set ddf "wsj-02-21.obj.gz"
   – set properties "collins.properties"
   – parser_input
   – $dbprefix/bin/parse 400 $dbprefix/settings/$properties $dbprefix/bin/$ddf /tmp/test2.txt 2>@ stdout
• Training
   – set mrg "wsj-02-21.mrg"
   – set properties "collins.properties"
   – $dbprefix/bin/train 800 $dbprefix/settings/$properties $dbprefix/bin/$mrg 2>@ stdout
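If you don't have tcl/tk, the same parse and train invocations can be assembled from a short Python script. This is only a sketch of the command lines on this slide, not part of Dan Bikel's distribution: the function names are hypothetical, the argument order is copied from the slide, and the first argument (400 or 800) is assumed to be the Java heap size in megabytes.

```python
import subprocess

def parse_command(dbprefix, input_file="/tmp/test2.txt",
                  properties="collins.properties",
                  ddf="wsj-02-21.obj.gz", heap_mb=400):
    """Build the argument list for bin/parse, in the order used on the slide."""
    return [f"{dbprefix}/bin/parse", str(heap_mb),
            f"{dbprefix}/settings/{properties}",
            f"{dbprefix}/bin/{ddf}", input_file]

def train_command(dbprefix, properties="collins.properties",
                  mrg="wsj-02-21.mrg", heap_mb=800):
    """Build the argument list for bin/train, in the order used on the slide."""
    return [f"{dbprefix}/bin/train", str(heap_mb),
            f"{dbprefix}/settings/{properties}",
            f"{dbprefix}/bin/{mrg}"]

def run_merged(cmd):
    """Run a command with stderr folded into stdout (like Tcl's 2>@ stdout)
    and return the combined text."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True, check=True)
    return result.stdout

# Hypothetical usage, with dbprefix pointing at your install directory:
# print(run_merged(parse_command("/path/to/dbparser")))
```

Passing the command as a list avoids shell quoting problems if the install path contains spaces.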

				