# NL Question-Answering Using Naïve Bayes and LSA

By Kaushik Krishnasamy
## Agenda

- Problem
- Key
- Methodology
- Naïve Bayes
- LSA
- Results & Comparison
- Conclusion
## Problem: QA

- Question answering over discourse is complex.
- Grammar-based systems carry a heavy overhead in knowledge extraction and decoding.
- Inherent language and cultural constructs impose restrictions.
- The context in which a word is used matters.
## Key: IR

- The information sought is hidden in the relevant documents.
- The frequency of a word signals its importance.
- The neighborhood of a word signals its context.
## Methodology

- Treat the posed question as a new document.
- Ask: how close is this document to each document in the knowledge base (KB)?
  - Naïve Bayes: probabilistic approach (C#)
  - LSA: dimensionality reduction (MATLAB)
- The closest document holds the answer.
## Naïve Bayes

- The maximum a posteriori target document:

  v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, a_2, …, a_n)

- Since every document is a possible target, the priors are uniform: P(v_1) = P(v_2) = … = P(v_j) = constant, so the prior drops out of the argmax:

  v_NB = argmax_{v_j ∈ V} ∏_i P(a_i | v_j)

- The words are assumed independent and identically distributed.
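Spelled out, the step from v_MAP to v_NB is the standard naive Bayes derivation (Bayes' rule, uniform prior, conditional independence):

```latex
v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1,\dots,a_n)
        = \arg\max_{v_j \in V} \frac{P(a_1,\dots,a_n \mid v_j)\,P(v_j)}{P(a_1,\dots,a_n)}
```

The denominator is the same for every candidate document and the prior P(v_j) is constant, so both drop out of the argmax; conditional independence then factors the likelihood:

```latex
v_{NB} = \arg\max_{v_j \in V} \prod_i P(a_i \mid v_j)
```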
## Naïve Bayes: Algorithm

- Pre-process all documents.
- Store the number of unique words in each document (N_i).
- Concatenate all documents and keep the words that occur at least twice as the unique words; the count of these unique words is the "Vocabulary".
- For each unique word in each document, estimate P(word | document) with the smoothed formula:

  P(word | doc i) = (frequency of the word in doc i + 1) / (N_i + Vocabulary)

- Store (word, doc, probability/frequency) to a file.
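The training steps above can be sketched as follows. This is a minimal Python sketch of the procedure on the slide, not the original C# implementation; the function and variable names are illustrative:

```python
from collections import Counter

def train(docs):
    """docs: list of pre-processed documents, each a list of words.

    Returns (vocab, probs) where probs[(word, i)] is the smoothed
    estimate of P(word | doc i).
    """
    # Concatenate all docs; keep only words that occur at least twice.
    total = Counter(w for doc in docs for w in doc)
    vocab = {w for w, c in total.items() if c >= 2}

    probs = {}
    for i, doc in enumerate(docs):
        n_i = len(set(doc))          # number of unique words in doc i
        freq = Counter(doc)
        for w in vocab:
            # Smoothed estimate from the slide:
            # (freq of the word in doc i + 1) / (N_i + Vocabulary)
            probs[(w, i)] = (freq[w] + 1) / (n_i + len(vocab))
    return vocab, probs
```

The add-one (Laplace) smoothing keeps every probability nonzero, so a word missing from one document cannot zero out that document's score.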
## Naïve Bayes: Algorithm (contd.)

- Obtain an input query from the user.
- Retrieve the individual words after pre-processing.
- Penalize words that are not among the unique ones.
- For each document, estimate from the file the product of the probabilities of all the retrieved words given that document:

  P(input | v_i) = P(w_1 | v_i) · P(w_2 | v_i) · … · P(w_n | v_i)

- The document with the maximum P(input | v_i) is taken as the answer.
- WORDNET: resolve unknown input words.
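The query side can be sketched the same way (again illustrative Python, not the original C#). Log-probabilities are summed instead of multiplying raw probabilities to avoid floating-point underflow, and unknown words are penalized here with a small floor probability; the exact penalty used in the original is not specified on the slide:

```python
import math

def answer_doc(query_words, docs_count, vocab, probs, penalty=1e-6):
    """Return the index of the document maximizing P(input | v_i)."""
    best, best_score = None, float("-inf")
    for i in range(docs_count):
        # log P(input | v_i) = sum_j log P(w_j | v_i)
        score = 0.0
        for w in query_words:
            if w in vocab:
                score += math.log(probs[(w, i)])
            else:
                score += math.log(penalty)  # penalize unknown words
        if score > best_score:
            best, best_score = i, score
    return best
```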
## LSA: Latent Semantic Analysis

- A method to extract and represent the contextual-usage meaning of words.
- Sets of words become points in a very high-dimensional "semantic space".
- Uses singular value decomposition (SVD) to reduce dimensionality.
- Correlation analysis on the reduced space yields the results.
## LSA: Algorithm

- Obtain (word, doc, frequency) triples.
- Basic matrix: form the (word × doc) matrix from the frequency entries.
- Pre-process the input query.
- Query matrix: form the (word × (doc + 1)) matrix, with the query as the last column holding its individual word frequencies.
- Perform the SVD: A = U S Vᵀ.
- Keep the two largest singular values and reconstruct the matrix.
## LSA: Algorithm (contd.)

- Find the document column that is maximally correlated with the query column.
- That document holds the answer to the query.
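The LSA pipeline above can be sketched with NumPy. The original work used MATLAB; this is a minimal sketch under the same steps (append query column, rank-2 SVD reconstruction, correlation against the query), with illustrative names:

```python
import numpy as np

def lsa_match(term_doc, query_vec, k=2):
    """term_doc: (words x docs) frequency matrix; query_vec: (words,)
    query word frequencies. Returns the index of the document most
    correlated with the query after a rank-k SVD reconstruction."""
    # Query matrix: append the query as the last column.
    A = np.column_stack([term_doc, query_vec])
    # SVD: A = U S V^T; keep only the k largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    # Correlate every document column with the reconstructed query column.
    q = A_k[:, -1]
    corrs = [np.corrcoef(A_k[:, j], q)[0, 1]
             for j in range(term_doc.shape[1])]
    return int(np.argmax(corrs))
```

Truncating to the two largest singular values projects documents and query into the low-rank "semantic space" before the correlation step.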
## Testing

- Documents: Basic Electrical Engineering (EXP, Lessons).
- The documents average approximately 250 words, and each deals with a new topic, so they cannot be partitioned into training and testing sets (11 + 46 = 57 docs).
- Naïve Bayes:
  - Automated trivial-input testing
  - Real-input testing
- LSA:
  - Trivial-input testing
  - Real-input testing (still to be run for the Lesson docs)
## Results

- Naïve Bayes: automated trivial input

| Start position | No. of words | Accuracy |
|----------------|--------------|----------|
| 10             | 10           | 48/57    |
| 30             | 7            | 3/11     |
| 20             | 20           | 5/11     |
| 5              | 40           | 10/11    |
| 30             | 7            | 36/46    |
| 20             | 20           | 43/46    |
## Results (contd.)

- Naïve Bayes: real input
  - EXP docs (11 docs):
    - Inputs of fewer than 10 words (e.g. "how do i use a dc power supply?"): accuracy 8/10
    - Inputs of 10 to 15 words (e.g. "what is the law that states that energy is neither created nor destroyed, but just changes from one form to another?"): accuracy 8/10
  - Lesson docs (46 docs), inputs of 5 to 15 words: accuracy 14/20
## Results (contd.)

- LSA (flawless with trivial input of more than 20 words)
  - Without SVD (EXP only):
    - Poor accuracy: 4/10 (fewer than 10 words)
    - Good accuracy: 8/10 (10 to 15 words)
  - With SVD:
    - Very poor accuracy: 1/10 (fewer than 10 words)
    - Poor accuracy: 2/10 (10 to 15 words)
## Comparison

- Naïve Bayes
  - Fails for acronyms and irrelevant queries.
  - Fails on indirect references, since word context is ignored.
  - Keywords determine success.
  - Documents with discrete concept content perform better (EXP).
- LSA
  - Fails badly for short sentences (fewer than 15 words).
  - Very effective for long sentences (more than 20 words).
  - Insensitive to indirect references or context.
## Conclusion

- The Naïve Bayes and LSA techniques were studied.
- Software was written to test both methods.
- Naïve Bayes proved very effective for short, question-answer style sentences, with an approximate accuracy of 80%.
- For shorter sentences, LSA without SVD performs better than LSA with SVD.
