					Natural Language Processing

              김유섭
 (http://home.ewha.ac.kr/~yskim01)
                 Contents
   Introduction
   Components in NLP
   State of the Art in NLP
   Application of NLP
   Future of NLP
                   Introduction
   What is NLP???
       The field of study that applies computers and algorithms to the
        processing of natural language
       Linguistics, Computer Science, Cognitive Science,
        Psycholinguistics, etc…
   What is Natural Language???
       ⇔ Formal Language
   Why difficult???
       Many “words”, many “phenomena” → many “rules”
       Ambiguity
   Rationalism vs. Empiricism
       Statistical NLP
       Language Learning
   Natural language vs. artificial language
       Natural Language
            Newspaper text
            The speech we are using right now
            The words and sentences we see and read
            …
       Artificial Language
            Computer programs
            Musical scores
            Diagrams, mathematical formulas
            …

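   Characteristics of natural language
     Ambiguity
     Can be used imprecisely
     Can be understood even when the explanation is incomplete
     Incorrect expressions can be used, by mistake or on purpose
     Requires complex computation
   Advantages of natural language
     No training required
     Cognitively easy to handle
     Flexible
     Portable
     Easy to remember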
   OED: 400k words; Finnish lexicon (of forms):
    ~2 × 10^7
   sentences, clauses, phrases, constituents,
    coordination, negation, imperatives/questions,
    inflections, parts of speech, pronunciation,
    topic/focus, and much more!
   irregularity (exceptions, exceptions to the
    exceptions, ...)
   potato -> potatoes (tomato, hero,...); photo
    -> photos, and even: both mango ->
    mangos or -> mangoes
   Adjective / Noun order: new book, electrical
    engineering, general regulations, flower
    garden, garden flower, ...: but Governor
    General
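
The potato / photo / mango irregularity above can be sketched as a default rule plus exception lists. This is only a minimal illustration: the set names and the function `plural_forms` are hypothetical, and the exception lists contain only the words named on the slide.

```python
# Hypothetical exception lists, limited to the nouns mentioned on the slide.
TAKES_ES = {"potato", "tomato", "hero"}   # plural in -es
TAKES_EITHER = {"mango"}                  # both -s and -es are accepted


def plural_forms(noun):
    """Return the set of accepted plural forms for a noun.

    The default rule (add -s) is overridden by the exception lists,
    which is how "exceptions to the rule" end up multiplying the rules.
    """
    if noun in TAKES_EITHER:
        return {noun + "s", noun + "es"}
    if noun in TAKES_ES:
        return {noun + "es"}
    return {noun + "s"}


for word in ("potato", "photo", "mango"):
    print(word, "->", sorted(plural_forms(word)))
```

Each new irregular noun needs its own entry, which is one reason a purely rule-based description of a language grows so quickly.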
   ambiguity
        books: NOUN or VERB?
             you need many books vs. she books her flights
              online
        No left turn weekdays 4-6 pm / except
         transit vehicles (Charles Street at Cold Spring)
             when may transit vehicles turn: Always? Never?
        Thank you for not smoking, drinking, eating
         or playing radios without earphones. (MTA
         bus)
             Thank you for not eating without earphones??
             or even: Thank you for not drinking without
              earphones!?
        My neighbor's hat was taken by wind. He
         tried to catch it.
             ...catch the wind or ...catch the hat ?
         Components in NLP
   Morphological Analysis
   Part-of-Speech Tagging
   Syntactic Analysis
   Word Sense Disambiguation






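 Process of Morphological Analysis
   Stages
       Preprocessing
       Candidate generation
       Candidate selection
       Post-processing
   Word recognition rules
       V+s → 3s
       N + particle
       V + ending
   Examples (input → morphological analyzer → output)
       tries → Try, V, 3s
       나는  → 나/N+는/particle ("I" + topic particle)
               날/V+는/ending ("fly" + ending)
               나/V+는/ending ("arise" + ending)
   Dictionary
       Try: v
       나: n, v
       날: v
       는: particle, ending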
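
The candidate generation stage can be sketched with the dictionary and the three word recognition rules from this slide. This is a minimal sketch under stated assumptions, not a full analyzer: the function names are hypothetical, the dictionary holds only the four slide entries (with "try" in lower case), and the handling of the dropped final ㄹ (날 + 는 → 나는) is an assumption added so that the 날/V reading can be recovered.

```python
# Dictionary entries taken from the slide.
DICTIONARY = {
    "try": {"V"},
    "나": {"N", "V"},
    "날": {"V"},
    "는": {"particle", "ending"},
}


def analyze_english(word):
    """Rule V+s -> 3s: strip the suffix and look the stem up as a verb."""
    w = word.lower()
    stems = []
    if w.endswith("ies"):
        stems.append(w[:-3] + "y")      # tries -> try
    if w.endswith("s"):
        stems.append(w[:-1])
    return [(s, "V", "3s") for s in stems if "V" in DICTIONARY.get(s, ())]


def restore_final_l(stem):
    """Assumed helper: add a final ㄹ to an open last syllable (나 -> 날)."""
    offset = ord(stem[-1]) - 0xAC00
    if 0 <= offset <= 11171 and offset % 28 == 0:   # Hangul syllable, no final
        return stem[:-1] + chr(ord(stem[-1]) + 8)   # 8 = index of final ㄹ
    return None


def analyze_korean(word):
    """Rules N+particle and V+ending: try every stem/suffix split."""
    candidates = []
    for i in range(1, len(word)):
        stem, suffix = word[:i], word[i:]
        suffix_tags = DICTIONARY.get(suffix, set())
        if "particle" in suffix_tags and "N" in DICTIONARY.get(stem, ()):
            candidates.append((stem, "N", suffix, "particle"))
        if "ending" in suffix_tags:
            for verb in (stem, restore_final_l(stem)):
                if verb and "V" in DICTIONARY.get(verb, ()):
                    candidates.append((verb, "V", suffix, "ending"))
    return candidates


print(analyze_english("tries"))
print(analyze_korean("나는"))
```

The sketch stops at candidate generation; choosing among the three readings of 나는 is the job of the candidate selection stage.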
                    Tagging Examples
   Word form: A+ → 2^(L,C1,C2,...,Cn) → T
       He always books the violin concert tickets early.
            MA: books → {(book-1,Noun,Pl,-,-),(book-2,Verb,Sg,Pres,3)}
            tagging (disambiguation): ... → (Verb,Sg,Pres,3)
       ...was pretty good. However, she did not realize...
            MA: However → {(however-1,Conj/coord,-,-,-),(however-2,Adv,-,-,-)}
            tagging: ... → (Conj/coord,-,-,-)
       [æ n d] [g i v] [i t] [t u:] [j u:] (“and give it to you”)
            MA: [t u:] → {(to-1,Prep),(two,Num),(to-2,Part/inf),(too,Adv)}
            tagging: ... → (Prep)
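
The two steps in these examples, morphological analysis (MA) followed by tagging, can be sketched as a lookup that returns every candidate analysis and a selector that picks one of them from context. The candidate tuples are copied from the slide; the preference scores in `CONTEXT_SCORE` and the function names are made up for illustration and are not estimated from any corpus.

```python
# Candidate analyses from the slide: (lemma, POS, number, tense, person).
LEXICON = {
    "books": [("book-1", "Noun", "Pl", "-", "-"),
              ("book-2", "Verb", "Sg", "Pres", "3")],
    "however": [("however-1", "Conj/coord", "-", "-", "-"),
                ("however-2", "Adv", "-", "-", "-")],
}

# Hypothetical scores for (previous POS, candidate POS); illustrative only.
CONTEXT_SCORE = {
    ("Adv", "Verb"): 0.6,
    ("Adv", "Noun"): 0.1,
    ("<s>", "Conj/coord"): 0.5,   # "<s>" marks the start of a sentence
    ("<s>", "Adv"): 0.3,
}


def morphological_analysis(word):
    """MA step: return every analysis the lexicon allows for the word form."""
    return LEXICON.get(word.lower(), [])


def tag(word, previous_pos):
    """Tagging step: keep the candidate that best fits the previous POS."""
    candidates = morphological_analysis(word)
    if not candidates:
        return None
    return max(candidates,
               key=lambda c: CONTEXT_SCORE.get((previous_pos, c[1]), 0.0))


# "He always books ...": the previous word "always" is an adverb.
print(tag("books", "Adv"))
# "However, she did not realize...": sentence-initial position.
print(tag("However", "<s>"))
```

A real tagger would learn such scores from an annotated corpus and search over the whole sentence instead of deciding one word at a time.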


              Phrase Structure Tree
   Example:




      ((DaimlerChrysler's shares)NP (rose (three eighths)NUMP (to 22)PP-NUM)VP)S
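
The bracketed notation above, where each closing parenthesis carries the label of the constituent it closes, can be read into a nested structure with a small stack-based parser. This is a sketch only: `tokenize` and `parse` are hypothetical helper names, and the (label, children) tuple layout is one possible representation among many.

```python
import re

EXAMPLE = ("((DaimlerChrysler's shares)NP "
           "(rose (three eighths)NUMP (to 22)PP-NUM)VP)S")


def tokenize(text):
    """Split into '(', labeled closers such as ')NP', and plain words."""
    return re.findall(r"\(|\)[A-Z-]*|[^\s()]+", text)


def parse(tokens):
    """Build (label, children) tuples; children are words or subtrees."""
    stack = [[]]
    for token in tokens:
        if token == "(":
            stack.append([])                 # open a new constituent
        elif token.startswith(")"):
            children = stack.pop()           # close it and attach its label
            stack[-1].append((token[1:], children))
        else:
            stack[-1].append(token)          # a word of the current constituent
    return stack[0][0]


def show(node, depth=0):
    """Print the tree with one level of indentation per depth."""
    if isinstance(node, str):
        print("  " * depth + node)
    else:
        label, children = node
        print("  " * depth + label)
        for child in children:
            show(child, depth + 1)


show(parse(tokenize(EXAMPLE)))
```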
   Problem
       words often have more than one meaning,
        sometimes fairly similar and sometimes
        completely different
   Application
       Machine Translation
           Target word of 'bank'??
       Information Retrieval
           'court' → '법정' (court of law) / '테니스 코트' (tennis court) ??
   Resolution
       Selectional restriction
       Statistical methods
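
As one possible instance of a statistical method, the two senses of 'court' can be separated with a naive Bayes classifier over context words, using add-one smoothing. The four training sentences are a toy set written for this sketch, not corpus data, and the sense labels "law" and "tennis" and all function names are assumptions.

```python
import math
from collections import Counter, defaultdict

# Toy sense-labeled contexts, written for illustration only.
TRAINING = [
    ("law", "the judge entered the court and the trial began"),
    ("law", "the court ruled against the defendant"),
    ("tennis", "she served the ball across the court"),
    ("tennis", "the players left the tennis court after the match"),
]


def train(examples):
    """Count how often each sense and each (sense, word) pair occurs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for sense, sentence in examples:
        sense_counts[sense] += 1
        for word in sentence.split():
            word_counts[sense][word] += 1
            vocabulary.add(word)
    return sense_counts, word_counts, vocabulary


def disambiguate(context, model):
    """Return the sense with the highest log P(sense) + sum of log P(word | sense)."""
    sense_counts, word_counts, vocabulary = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)
        denominator = sum(word_counts[sense].values()) + len(vocabulary)
        for word in context.split():
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((word_counts[sense][word] + 1) / denominator)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


model = train(TRAINING)
print(disambiguate("the judge ruled in court", model))
print(disambiguate("she hit the ball on the court", model))
```

Once the sense is chosen, a translation system could map it to the matching target word, for example '법정' or '테니스 코트'.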
         State of the Art in NLP
   Corpus
   Autonomous Learning




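           Corpus Data
 Composed of a wide variety of sentences extracted from newspapers,
  magazines, textbooks, and similar sources
 Annotated with various kinds of linguistic markup
       Parts of speech, sentence constituents, parsing results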
   Korea Information Base System
       http://kibs.kaist.ac.kr
   British National Corpus
       http://info.ox.ac.uk/bnc

          Autonomous Learning NLP


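 Speech recognition
 Ambiguity resolution → classification problem
       Structural tagging, part-of-speech tagging, word sense
        disambiguation, prepositional-phrase attachment
   Language acquisition and understanding
       Rule induction, information extraction and retrieval, automatic
        summarization, machine translation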

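           Machine Learning Techniques

   Symbolic learning
       Instance-based learning, decision trees, inductive logic
   Non-symbolic learning
       Neural networks, genetic algorithms
   Probabilistic learning
       Bayesian networks, hidden Markov models, probabilistic grammars
   Transformation-based learning, active learning, boosting,
    reinforcement learning, constructive induction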
               Application of NLP
   Machine Translation
   Information Retrieval
   Bioinformatics
       Information Extraction
           Extract relationships among genes or proteins
            appearing in technical papers
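
A very small pattern-based sketch of this kind of extraction is shown below. It assumes relations are stated as "X verb Y" with a fixed verb list; the verb list, the pattern, and the function name `extract_relations` are hypothetical, and the sample sentence was written for this sketch rather than taken from a paper. Practical systems need entity recognition and parsing well beyond a single regular expression.

```python
import re

# Hypothetical trigger verbs for protein-protein relations.
RELATION_VERBS = ("activates", "inhibits", "binds")

# entity, one trigger verb, entity; entities are simple alphanumeric names.
PATTERN = re.compile(
    r"\b([A-Za-z0-9-]+) (" + "|".join(RELATION_VERBS) + r") ([A-Za-z0-9-]+)\b"
)


def extract_relations(text):
    """Return (subject, relation, object) triples found in the text."""
    return [match.groups() for match in PATTERN.finditer(text)]


sample = "MDM2 inhibits p53. In turn, p53 activates p21."
for triple in extract_relations(sample):
    print(triple)
```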



Machine Translation




Information Retrieval




                 Future of NLP
   [Figure: diagram with the labels Internet, Information Systems,
    Information Retrieval (Filtering), Information Extraction,
    Summarization, Text Mining, Machine Translation, DataBase,
    Knowledge Base]

				