Lexicon as Data Structure

[Figure: a sorted word list (a, abdykacja, abdykacyjny, ..., zyskowny, zza), an index vector of first letters (a, b, c, ..., z), and the same words split into per-letter sublists (a, abdykacja, ...; b, baba, badanie, ...; z, za, zabawa, ...).]

Memory: same as + vector                             Memory: n bytes less
Time: O(|w| · log n)                                 Time: O(|w| · log(n/|Σ|))
Repeating the procedure (indexing each sublist by its next letter) brings
smaller and smaller savings, and eventually costs memory instead. The access
time becomes proportional only to the length of the word being searched for.
Result: a trie, which is an automaton!
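
The first indexing step can be sketched as follows. This is only an illustration under stated assumptions: the word selection, the function names and the use of a Python dict as the index vector are not from the slides, and one character is assumed to cost one byte.

```python
from bisect import bisect_left

# A few words taken from the slide's list; the selection is illustrative.
WORDS = sorted(["a", "abdykacja", "abonent", "baba", "badanie",
                "zabawa", "zysk", "zza"])

def in_sorted_list(words, w):
    """Binary search: O(log n) comparisons, each costing up to O(|w|)."""
    i = bisect_left(words, w)
    return i < len(words) and words[i] == w

def build_index(words):
    """Group words by first letter, storing only the rest of each word."""
    index = {}
    for w in words:  # input is sorted, so every bucket stays sorted
        index.setdefault(w[0], []).append(w[1:])
    return index

def in_indexed(index, w):
    """Select the bucket by first letter, then binary-search the remainder."""
    if not w:
        return False
    return in_sorted_list(index.get(w[0], []), w[1:])

INDEX = build_index(WORDS)
# Characters saved by dropping the first letter of every word (one per word).
saved = sum(map(len, WORDS)) - sum(len(t) for b in INDEX.values() for t in b)
print(in_indexed(INDEX, "badanie"), in_indexed(INDEX, "babka"), saved)
```

Applying `build_index` again to every bucket, and again to the buckets it produces, leads to the trie mentioned above.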
Jan Daciuk, DKE, ETI, GUT   Natural Language Processing                      4. Lexicon – Data Structure   (1 / 12)
Morphological Analysis and Synthesis Using Transducers



Transducers can cope with both morphological analysis and synthesis.
A path in the transducer ensures proper analysis/synthesis.
[Figure: a transducer path pairing the string kot+Nasv with the string kocie.]

Even though the transducer can be inverted, determinization makes it
usable in one direction only.
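
A single transducer path can be sketched as a list of symbol pairs. The alignment of kot+Nasv with kocie used here is an assumption made for illustration (the slide does not show which symbols are paired), and the names `PATH`, `invert` and `translate` are hypothetical.

```python
EPS = ""  # empty output symbol

# One path: (symbol on the lexical side, symbol on the surface side).
# The pairing below is an assumed alignment, not the one on the slide.
PATH = [("k", "k"), ("o", "o"), ("t", "c"), ("+", "i"),
        ("N", "e"), ("a", EPS), ("s", EPS), ("v", EPS)]

def side(path, i):
    """Spell out one side of a path (0 = input side, 1 = output side)."""
    return "".join(pair[i] for pair in path)

def invert(path):
    """Swap input and output: synthesis becomes analysis and vice versa."""
    return [(out, inp) for inp, out in path]

def translate(paths, word):
    """Return the output side of every path whose input side spells word."""
    return [side(p, 1) for p in paths if side(p, 0) == word]

print(translate([PATH], "kot+Nasv"))      # synthesis
print(translate([invert(PATH)], "kocie")) # analysis
```

A real transducer shares states between paths instead of storing them one by one; the list of paths only shows why inversion is trivial before determinization.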




Morphological Analysis and Synthesis Using Recognizers



Additional information is appended after the inflected form:
[Figure: a single automaton path spelling kocie+Nasv.]



What about the base form? It cannot be stored as it is, because that would
inflate the automaton. Instead, a coding is used:
kocie+Dt+Nasv
where +Dt codes the base form. D means “delete three characters
from the end of the inflected form” (A: zero, B: one, ...); the rest of the
code, here t, is then appended, so kocie gives ko + t = kot.
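
The coding can be sketched in a few lines. The function names are hypothetical, and the sketch assumes that at most 25 characters ever need to be deleted, so that a single letter A..Z suffices.

```python
def decode_base(inflected, code):
    """Rebuild the base form: the first letter of the code says how many
    characters to cut from the end (A = 0, B = 1, ...), the rest is appended."""
    cut = ord(code[0]) - ord("A")
    return inflected[:len(inflected) - cut] + code[1:]

def encode_base(inflected, base):
    """Produce the code that turns the inflected form into the base form."""
    shared = 0  # length of the common prefix
    while (shared < min(len(inflected), len(base))
           and inflected[shared] == base[shared]):
        shared += 1
    cut = len(inflected) - shared
    return chr(ord("A") + cut) + base[shared:]

print(encode_base("kocie", "kot"))   # Dt
print(decode_base("kocie", "Dt"))    # kot
```

Because many inflected forms share the same code and the same tags, their endings coincide in the automaton, which is what keeps it small.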




It Looks Like Our Dictionary Is Too Small...



        biegaj+Bć+V+imp+imper+sg+p2
        ślizgaj+Bć+V+imp+imper+sg+p2
        czołgaj+Bć+V+imp+imper+sg+p2
        łgaj+Bć+V+imp+imper+sg+p2
        błagaj+Bć+V+imp+imper+sg+p2
        ...
        pelgaj?




It Looks Like Our Dictionary Is Too Small...



The same entries with the inflected forms reversed, so that shared endings
become shared prefixes:

        jageib+Bć+V+imp+imper+sg+p2
        jagzilś+Bć+V+imp+imper+sg+p2
        jagłozc+Bć+V+imp+imper+sg+p2
        jagł+Bć+V+imp+imper+sg+p2
        jagałb+Bć+V+imp+imper+sg+p2
        ...
        jaglep?
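
One way to use such entries for an unknown word is to copy the annotation of the known words that share the longest ending with it. The slides do not spell out this procedure, so the heuristic below is an assumption, as are the names `shared_ending` and `guess`.

```python
# The five entries from the slide: inflected form -> base-form code and tags.
LEXICON = {
    "biegaj": "Bć+V+imp+imper+sg+p2",
    "ślizgaj": "Bć+V+imp+imper+sg+p2",
    "czołgaj": "Bć+V+imp+imper+sg+p2",
    "łgaj": "Bć+V+imp+imper+sg+p2",
    "błagaj": "Bć+V+imp+imper+sg+p2",
}

def shared_ending(a, b):
    """Length of the longest common suffix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def guess(word, lexicon):
    """Annotations of the known words sharing the longest ending with word."""
    best, annotations = 0, set()
    for known, annotation in lexicon.items():
        n = shared_ending(word, known)
        if n > best:
            best, annotations = n, {annotation}
        elif n == best and n > 0:
            annotations.add(annotation)
    return best, sorted(annotations)

def decode_base(inflected, code):
    """A = delete 0 characters, B = 1, ...; then append the rest of the code."""
    cut = ord(code[0]) - ord("A")
    return inflected[:len(inflected) - cut] + code[1:]

length, candidates = guess("pelgaj", LEXICON)
for annotation in candidates:
    code = annotation.split("+")[0]
    print(length, decode_base("pelgaj", code), annotation)
```

With reversed words stored in an automaton, the same lookup becomes a walk from the start state along the reversed unknown word for as long as transitions exist.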




Traditional Way of Constructing a Minimal Dictionary

        Constructing a trie
        [Figure: a trie of inflected forms that all begin with a-i-m.]

        Minimization
        [Figure: the minimal automaton accepting the same words.]

The drawback: a huge trie has to be stored before it can be minimized. It can be done better!
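
The two traditional steps can be sketched as below. The word list is illustrative (the figure's full word list is not recoverable from the text), and merging states through a register keyed on finality and outgoing transitions is one common way to minimize an acyclic automaton, not necessarily the method used in the lecture.

```python
class State:
    def __init__(self):
        self.final = False
        self.edges = {}  # letter -> State

def build_trie(words):
    """Step 1: one path per word, sharing common prefixes only."""
    root = State()
    for word in words:
        state = root
        for letter in word:
            state = state.edges.setdefault(letter, State())
        state.final = True
    return root

def minimize(state, register):
    """Step 2: post-order pass replacing each state by the registered
    representative with the same finality and the same transitions."""
    for letter, target in list(state.edges.items()):
        state.edges[letter] = minimize(target, register)
    key = (state.final,
           tuple(sorted((a, id(t)) for a, t in state.edges.items())))
    return register.setdefault(key, state)

def count_states(root):
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.edges.values())
    return len(seen)

def accepts(root, word):
    state = root
    for letter in word:
        state = state.edges.get(letter)
        if state is None:
            return False
    return state.final

WORDS = ["aimer", "aimes", "aimez", "aimiez"]  # illustrative word list
trie = build_trie(WORDS)
trie_size = count_states(trie)        # counted before merging
minimal = minimize(trie, {})          # merges states in place
minimal_size = count_states(minimal)
print(trie_size, minimal_size)        # 11 8
```

Note that the whole trie exists in memory before any state is merged, which is exactly the cost the slide objects to.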

Constructing a Trie

[Figure, step 1: the trie consists of the single path a-i-m-i-e-z.]

[Figure, step 2: a second branch, e-r-i-e-z, leaves the state reached by a-i-m.]

[Figure, step 3: a further branch, a-i-e-n-t, leaves the state reached by a-i-m-e-r.]

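What Is Minimization?



        $M = (Q, \Sigma, \delta, q_0, F)$, $|M| = |Q|$
        $M$ is minimal iff $\forall_{M' : L(M') = L(M)} \; |M| \le |M'|$
        $\overrightarrow{L}(q) = \{w : \delta^*(q, w) \in F\}$, $L(M) = \overrightarrow{L}(q_0)$
        $p \equiv q$ iff $\overrightarrow{L}(p) = \overrightarrow{L}(q)$.
        $M$ is minimal iff $\forall_{p, q \in Q} \; p \equiv q \Leftrightarrow p = q$
        $\overrightarrow{L}(q) = \bigcup_{a : \delta(q, a) \ne \bot} a \cdot \overrightarrow{L}(\delta(q, a)) \;\cup\; \begin{cases} \emptyset & q \notin F \\ \{\varepsilon\} & q \in F \end{cases}$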
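
The recursive definition of the right language can be evaluated directly for an acyclic automaton. The automaton below is a made-up example, not one from the slides, and the recursion would not terminate on an automaton with cycles.

```python
def right_language(q, delta, finals):
    """Right language of q: the empty word if q is final, plus a followed by
    every word of the right language of delta(q, a), for each defined a."""
    language = {""} if q in finals else set()
    for (source, a), target in delta.items():
        if source == q:
            language |= {a + w for w in right_language(target, delta, finals)}
    return language

# Hypothetical automaton accepting {"ab", "cb"}, deliberately not minimal.
DELTA = {(0, "a"): 1, (0, "c"): 2, (1, "b"): 3, (2, "b"): 4}
FINALS = {3, 4}
STATES = [0, 1, 2, 3, 4]

def equivalent(p, q):
    return right_language(p, DELTA, FINALS) == right_language(q, DELTA, FINALS)

# Pairs of distinct but equivalent states; a minimal automaton has none.
redundant = [(p, q) for p in STATES for q in STATES if p < q and equivalent(p, q)]
print(sorted(right_language(0, DELTA, FINALS)), redundant)
```

Here states 1 and 2 share the right language {b} and states 3 and 4 share {ε}, so merging each pair gives an automaton with three states that accepts the same language.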



