          NL Question-Answering using
             Naïve Bayes and LSA
             Kaushik Krishnasamy
•   Problem
•   Key
•   Methodology
•   Naïve Bayes
•   LSA
•   Results & Comparison
•   Conclusion
             Problem: QA
• QA/discourse is complex.
• Overhead in Knowledge extraction &
  decoding (grammar based systems).
• Restrictions due to inherent language &
  culture constructs.
• Context of word usage
                 Key: IR
• Information is hidden in relevant documents.
• Frequency of a word - Its Importance.
• Neighborhood of a word - Context
• Question posed is a new document.
• How close is this document with respect to
  all documents in KB?
  – Naïve Bayes: Probabilistic Approach (C#)
  – LSA: Dimensionality Reduction (MATLAB)
• Closest document has the answer.
               Naïve Bayes
• vMAP = argmax P(vj | a1, a2, a3, …, an)
           vj ∈ V

• As all documents are equally likely target
  documents, P(v1) = P(v2) = … = P(vj) = constant, so
      vNB = argmax  ∏ P(ai | vj)
              vj ∈ V   i

• Words are assumed independent and identically
  distributed given the document (the "naïve" assumption).
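Because the prior is constant, the decision rule reduces to the product of per-word likelihoods; in practice a sum of logs avoids floating-point underflow. A minimal sketch of that rule, with hypothetical likelihood values:

```python
import math

# Hypothetical P(ai | vj) tables for two candidate documents
likelihoods = {
    "doc1": {"voltage": 0.10, "current": 0.08, "dc": 0.01},
    "doc2": {"voltage": 0.01, "current": 0.02, "dc": 0.12},
}

def v_nb(query, likelihoods):
    """vNB = argmax over docs of sum_i log P(ai | vj) (the log of the product)."""
    return max(likelihoods, key=lambda d: sum(math.log(likelihoods[d][w]) for w in query))

print(v_nb(["voltage", "current"], likelihoods))  # doc1: 0.10 * 0.08 > 0.01 * 0.02
```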
       Naïve Bayes - Algorithm
• Pre-process all documents.
• Store number of unique words in each document (Ni).
• Concatenate all documents and store words that occur at
  least 2 times as unique words. Count the number of such
  unique words as the "Vocabulary".
• For each of these unique words for each document,
  estimate P(word|document) using the formula,
  (Freq of the word in doc "i" + 1) / (Ni + Vocabulary)
• Store (word, doc, probability/frequency) to a file.
• Obtain an input query from the user.
• Retrieve individual words after pre-processing.
• Penalize input words that are not among the unique (vocabulary) words.
• For each doc estimate the product of the probabilities of all
  the retrieved words given this document from the file.
• The document having the maximum P(input|vi) is the
  document having the answer.
• WORDNET: Resolve unknown input words
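The pipeline above can be sketched in a few lines of Python. The corpus, document names, and query are hypothetical, and for brevity out-of-vocabulary query words are simply dropped rather than penalized:

```python
import math
import re
from collections import Counter

def preprocess(text):
    """Lowercase and tokenize into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def train(docs):
    """Estimate P(word|doc) with Laplace (add-one) smoothing, as in the slide."""
    counts = {name: Counter(preprocess(text)) for name, text in docs.items()}
    # Vocabulary: words occurring at least twice across the concatenated corpus
    total = Counter()
    for c in counts.values():
        total.update(c)
    vocab = {w for w, n in total.items() if n >= 2}
    probs = {}
    for name, c in counts.items():
        n_i = sum(c.values())  # Ni: word count of document i
        probs[name] = {w: (c[w] + 1) / (n_i + len(vocab)) for w in vocab}
    return probs, vocab

def answer_doc(query, probs, vocab):
    """Return the doc maximizing P(input|vi); log-space avoids underflow."""
    words = [w for w in preprocess(query) if w in vocab]
    scores = {name: sum(math.log(p[w]) for w in words) for name, p in probs.items()}
    return max(scores, key=scores.get)

# Hypothetical two-document corpus
docs = {
    "ohms_law": "voltage equals current times resistance voltage current",
    "power_supply": "a dc power supply converts ac to dc power supply dc",
}
probs, vocab = train(docs)
print(answer_doc("how do i use a dc power supply?", probs, vocab))
```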
 LSA: Latent Semantic Analysis
• Method to extract & represent contextual-
  usage meaning of words.
• Set of words are points in a very high
  dimensional “semantic space”.
• Uses SVD to reduce dimensionality.
• Application of correlation analysis to arrive
  at results.
               LSA: Algorithm
• Obtain (word, doc, frequency).
• Basic Matrix: Form the (word x doc) matrix with the
  frequency entries.
• Preprocess the input query.
• Query Matrix: Form the (word x (doc+1) ) matrix with the
  query as the last column with individual word frequencies.
• Perform SVD: A = U S Vᵀ
• Select the two largest singular values and reconstruct the
  matrix (a rank-2 approximation).
• Find the document that is maximally correlated to the
  query document column.
• This is the document having the answer to the query.
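The steps above can be sketched with NumPy: build the term-document matrix, append the query as an extra column, truncate the SVD to rank 2, and correlate the reconstructed query column against each document column. The tiny term list and counts are hypothetical:

```python
import numpy as np

terms = ["voltage", "current", "resistance", "power", "supply", "dc"]
# Basic matrix: rows = terms, columns = 3 documents (raw word frequencies)
A = np.array([[2, 0, 1],
              [2, 0, 1],
              [1, 0, 0],
              [0, 2, 0],
              [0, 2, 0],
              [0, 3, 0]], dtype=float)
# Query matrix: append the query's term-frequency vector ("dc power supply")
q = np.array([0, 0, 0, 1, 1, 1], dtype=float)
M = np.column_stack([A, q])

# SVD, keeping only the two largest singular values for reconstruction
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]

# Find the document column maximally correlated with the query column
query_col = M2[:, -1]
corr = [np.corrcoef(M2[:, j], query_col)[0, 1] for j in range(A.shape[1])]
best = int(np.argmax(corr))
print("best document:", best)  # column 1 shares the power/supply/dc terms
```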
          Results & Comparison
• Documents: Basic Electrical Engineering (EXP, Lessons)
• The documents average approx. 250 words and each deals
  with a new topic (cannot partition into training and
  testing docs) – (11 + 46 = 57 docs)
• Naïve Bayes:
   – Automated trivial input testing
   – Real input testing
• LSA:
   – Trivial input testing
   – Real input testing (to be tested for Lesson)
• Naïve Bayes:
  – Automated Trivial Input
   Start Position   No. of words   Accuracy
        10              10          48/57
        30               7           3/11
        20              20           5/11
         5              40          10/11
        30               7          36/46
        20              20          43/46
• Naïve Bayes
  – Real Input
  EXP docs (11 docs):
  Inputs with fewer than 10 words: (e.g. "how do i use a dc power supply?")
  Accuracy: 8/10
  Inputs with 10 to 15 words: (e.g. "what is the law that states that energy is neither
     created nor destroyed, but just changes from one form to another?")
  Accuracy: 8/10
  Lesson docs (46 docs): 5 to 15 words
  Accuracy: 14/20
• LSA (flawless with trivial input >20 words)
  – Without SVD (For EXP only)
     • Poor accuracy: 4/10 (<10 words)
     • Good accuracy: 8/10 (10 to 15 words)
  – With SVD
     • Very poor accuracy: 1/10 (<10 words)
     • Poor accuracy: 2/10 (10 to 15 words)
• Naïve Bayes
  – Fails for acronyms and irrelevant queries
  – Indirect references fail (no word context)
  – Keywords determine success
  – Documents with discrete concept content perform better (EXP)
• LSA
  – Fails badly for short sentences (<15 words)
  – Very effective for long sentences (>20 words)
  – Insensitive to indirect references or context
              Conclusion
• The Naïve Bayes and the LSA techniques
  were studied.
• Software was written to test these methods.
• Naïve Bayes is found to be very effective
  for short (Q-A type) sentences, with an
  approximate accuracy of 80%.
• LSA without SVD is better than with SVD
  for smaller sentences.