
ISSN No. 2278-3083
Volume 1, No. 3, July-August 2012
International Journal of Science and Applied Information Technology
Available Online at http://warse.org/pdfs/ijsait05132012.pdf

A Pattern Taxonomy Model with New Pattern Discovery Model for Text Mining

Mrs. K. Mythili, Professor, Department of Computer Applications, Hindusthan College of Arts and Science, Coimbatore-6, Tamil Nadu, India
mythiliaru1@gmail.com

Mrs. K. Yasodha, Research Scholar, Department of Computer Applications, Hindusthan College of Arts and Science, Coimbatore-6, Tamil Nadu, India
Yaso194@gmail.com

ABSTRACT

Most mining techniques are proposed for the purpose of developing efficient algorithms that find particular patterns within a reasonable and acceptable time frame. With the large number of patterns generated by data mining approaches, how to effectively use and update these patterns is still an open research issue. In the existing system, an effective pattern discovery technique is introduced which first calculates the specificities of discovered patterns and then evaluates term weights according to the distribution of terms in the discovered patterns, rather than the distribution in documents, in order to solve the misinterpretation problem. It also considers the influence of patterns from the negative training examples to find ambiguous (noisy) patterns and tries to reduce their influence, addressing the low-frequency problem. The process of updating ambiguous patterns is referred to as pattern evolution. This approach can improve the accuracy of evaluating term weights because discovered patterns are more specific than whole documents. The technique uses two processes, pattern deploying and pattern evolving, to refine the patterns discovered in text documents. However, it does not consider the time series when ranking a given set of documents. In the proposed system, a temporal text mining approach is introduced. The system is evaluated in terms of its ability to predict forthcoming events in the documents. Here the optimal decomposition of the time period associated with the given document set is discovered, where each subinterval consists of consecutive time points having identical information content. Extracting sequences of events from news and other documents based on the publication times of these documents has been shown to be extremely effective in tracking past events.

Keywords: Temporal Text Mining, Pattern Deploying, Pattern Evolution.

1. INTRODUCTION

Text mining is a burgeoning new field that attempts to glean meaningful information from natural language text. It may be loosely characterized as the process of analyzing text to extract information that is useful for particular purposes. Compared with the kind of data stored in databases, text is unstructured, amorphous, and difficult to deal with algorithmically. Nevertheless, in modern culture, text is the most common vehicle for the formal exchange of information. The field of text mining usually deals with texts whose function is the communication of factual information or opinions, and the motivation for trying to extract information from such text automatically is compelling even if success is only partial.

A text summarizer strives to produce a condensed representation of its input, intended for human consumption. It may condense individual documents or groups of documents. Text compression, a related area, also condenses documents, but summarization differs in that its output is intended to be human-readable. The output of text compression algorithms is certainly not human-readable, and neither is it actionable: the only operation it supports is decompression, that is, automatic reconstruction of the original text. As a field, summarization differs from many other forms of text mining in that there are people, namely professional abstractors, who are skilled in the art of producing summaries and carry out the task as part of their professional life. Studies of these people and the way they work provide valuable insights for automatic summarization.

Text mining is the discovery of interesting knowledge in text documents. It is a challenging issue to find accurate knowledge (or features) in text documents that help users find what they want. In the beginning, Information Retrieval (IR) provided many term-based methods to address this challenge. There are two fundamental issues regarding the effectiveness of pattern-based approaches: low frequency and misinterpretation. Given a specified topic, a highly frequent pattern (normally a short pattern with large support) is usually a general pattern, while a specific pattern usually has low frequency. If we decrease the minimum support, a lot of noisy patterns will be discovered. Misinterpretation means that the measures used in pattern mining (e.g., "support" and "confidence") turn out to be unsuitable for using discovered patterns to answer what users want. The difficult problem, hence, is how to use discovered patterns to accurately evaluate the weights of useful features (knowledge) in text documents.

The objective of this work is to discover patterns from large databases effectively and to help improve the effectiveness of pattern-based approaches. The pattern evaluation approach can improve the accuracy of evaluating term weights because discovered patterns are more specific than whole documents. The technique uses two processes, pattern deploying and pattern evolving, to refine the patterns discovered in text documents. It also enables the user to track the previous time series from the sets of documents.

In the proposed system, a temporal text mining approach is used. Temporal text mining combines information extraction and data mining techniques over text repositories and incorporates time. The sequences of events in the sets of documents are extracted in order to track past events effectively. The optimal decomposition of the time period associated with the given document set is constructed. The notion of compressed level decomposition is introduced, where each subinterval consists of consecutive time points having identical information content. Several measures are defined based on the information computed as document sets are combined.

2. REVIEW OF LITERATURE

As the volume of electronic information increases, there is growing interest in developing tools to help people better find, filter, and manage these resources. Text categorization [9] is the assignment of natural language texts to one or more predefined categories based on their content, which is an important component in many information organization and management tasks. Machine learning methods, including Support Vector Machines (SVMs), have tremendous potential for helping people effectively organize electronic resources. Text mining often involves the extraction of keywords with respect to some measure of importance. Weblog data is textual content with a clear and significant temporal aspect. Text categorization [1] (also known as text classification or topic spotting) is the task of automatically sorting a set of documents into categories from a predefined set. This task has several applications, including automated indexing of scientific articles according to predefined thesauri of technical terms, filing patents into patent directories, selective dissemination of information to information consumers, automated population of hierarchical catalogues of Web resources, spam filtering, identification of document genre, authorship attribution, survey coding, and even automated essay grading. Automated text classification is attractive because it frees organizations from the need to organize document bases manually, which can be too expensive, or simply not feasible given the time constraints of the application or the number of documents involved. The accuracy of modern text classification systems rivals that of trained human professionals, thanks to a combination of information retrieval (IR) technology and machine learning (ML) technology. This literature outlines the fundamental traits of the technologies involved, of the applications that can feasibly be tackled through text classification, and of the tools and resources available to researchers and developers wishing to take up these technologies for deploying real-world applications. A web technology [4] extracts statistical information and discovers interesting user patterns, clusters users into groups according to their navigational behavior, discovers potential correlations between web pages and user groups, identifies potential customers for e-commerce, enhances the quality and delivery of Internet information services to the end user, improves web server system performance and site design, and facilitates personalization.

Identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of text evaluation, which may even be more important than opinions on each individual object. The comparative sentence identification [7] problem first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences in text documents. Experimental results using three types of documents (news articles, consumer reviews of products, and Internet forum postings) show a precision of 79% and a recall of 81%. Comparison is one of the most convincing ways of text evaluation, and extracting comparative sentences from text is useful for many applications. For example, in the business environment, whenever a new product comes into the market, the product manufacturer wants to know consumer opinions on the product and how it compares with those of its competitors. Much of such information is now readily available on the Web in the form of customer reviews, forum discussions, blogs, etc. Extracting such information can significantly help businesses in their marketing and product benchmarking efforts. Clearly, product comparisons are not only useful for product manufacturers, but also for potential customers, as they enable customers to make better purchasing decisions.

A statistical method called Latent Semantic Indexing (LSI) [3] models the implicit higher-order structure in the association of words and objects and improves retrieval performance by up to 30%. In addition, large performance improvements of 40% and 67% can be achieved using differential term weighting and iterative retrieval methods. These methods include restricting the allowable indexing and retrieval vocabulary and training intermediaries to generate terms from these restricted vocabularies, hand-crafting domain-specific thesauri to provide synonyms for users' search terms, constructing explicit models of domain-relevant knowledge, and automatically clustering terms and documents. The rationale for restricted or controlled vocabularies is that they are by design relatively unambiguous. However, they have high costs and marginal (if any) benefits compared with automatic indexing based on the full content of texts. The use of a thesaurus is intended to improve retrieval by expanding terms that are too specific.

Mining frequent patterns [5] in transaction databases, time-series databases, and many other kinds of databases has been studied extensively in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly for a large number of patterns and/or long patterns. A novel frequent-pattern tree (FP-tree) structure, an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, supports an efficient FP-tree-based mining method, FP-growth, which mines the complete set of frequent patterns by pattern-fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed, smaller data structure, the FP-tree, which avoids costly, repeated database scans; (2) FP-tree-based mining adopts a pattern-fragment growth method to avoid the costly generation of a large number of candidate sets; and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. The performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported frequent-pattern mining methods. SVMs [2] can be used to learn a variety of representations, such as neural nets, splines, and polynomial estimators, and are among the best approaches to data modeling.

A knowledge discovery model is developed to effectively use and update the discovered patterns [6] and is applied to the field of text mining. Text mining is the discovery of interesting knowledge in text documents. It is a challenging issue to find accurate knowledge (or features) in text documents that help users find what they want. The Rocchio [8] relevance feedback algorithm is one of the most popular and widely applied learning methods from information retrieval. A probabilistic analysis of this algorithm has been presented in a text categorization framework. The analysis gives theoretical insight into the heuristics used in the Rocchio algorithm, particularly the word weighting scheme and the similarity metric. It also suggests improvements which lead to a probabilistic variant of the Rocchio classifier. The Rocchio classifier, its probabilistic variant, and a naive Bayes classifier have been compared on six text categorization tasks. The results show that the probabilistic algorithms are preferable to the heuristic Rocchio classifier not only because they are more well-founded, but also because they achieve better performance.
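
To make the Rocchio heuristics concrete, the following Python sketch implements the basic centroid form of the classifier: tf*idf vectors of the training documents are averaged per category, and a new document is assigned to the category with the most similar centroid. This illustrates only the general method, not the probabilistic variant analyzed in [8]; all function names here are illustrative, not taken from the cited work.

import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """Build tf*idf vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rocchio_train(vectors, labels):
    """Average the vectors of each category into a centroid (prototype)."""
    sums, counts = defaultdict(Counter), Counter()
    for vec, label in zip(vectors, labels):
        counts[label] += 1
        for t, w in vec.items():
            sums[label][t] += w
    return {c: {t: w / counts[c] for t, w in s.items()} for c, s in sums.items()}

def rocchio_classify(vec, centroids):
    """Assign a document to the category with the most similar centroid."""
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))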




3. PROPOSED WORK

3.1 Feature Selection Method

In this method, documents are taken as input and the features for the set of documents are collected. Features are selected based on the TFIDF method. Information retrieval has developed many mature techniques which demonstrate that terms are important features in text documents. However, many terms with larger weights (e.g., under the term frequency and inverse document frequency (tf*idf) weighting scheme) are general terms, because they can be used frequently in both relevant and irrelevant information. The feature selection approach is used to improve the accuracy of evaluating term weights, because the discovered patterns are more specific than whole documents. In order to reduce the irrelevant features, many dimensionality reduction approaches have been conducted through the use of feature selection techniques.
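
The following is a minimal Python sketch of this tf*idf-based feature selection step, assuming documents are already tokenized into term lists. The top-k cutoff and the use of a term's best tf*idf score as its rank are illustrative choices, not the paper's exact procedure.

import math
from collections import Counter

def select_features(docs, k=100):
    """Rank terms by their best tf*idf weight across the collection
    and keep the top-k as the feature set."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Inverse document frequency: rarer terms score higher.
    idf = {t: math.log(n / df[t]) for t in df}
    score = Counter()
    for doc in docs:
        tf = Counter(doc)
        for t in tf:
            score[t] = max(score[t], tf[t] * idf[t])
    return [t for t, _ in score.most_common(k)]

# Usage: docs are tokenized documents (lists of terms).
docs = [["text", "mining", "pattern"], ["pattern", "taxonomy", "model"]]
print(select_features(docs, k=3))  # e.g. ['text', 'mining', 'taxonomy'] (ties ordered arbitrarily)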




3.2 Finding Frequent and Closed Sequential Patterns

When the feature selection process is completed, the frequent and closed patterns are discovered from the documents. For a termset X in a document d, ⌈X⌉ is used to denote the covering set of X for d, which includes all paragraphs dp ∈ PS(d) in which X occurs, i.e.,

    ⌈X⌉ = {dp | dp ∈ PS(d), X ⊆ dp}.

Its absolute support is the number of occurrences of X in PS(d), i.e.,

    sup_a(X) = |⌈X⌉|.

Its relative support is the fraction of the paragraphs that contain the pattern, that is,

    sup_r(X) = |⌈X⌉| / |PS(d)|.

Patterns can be structured into a taxonomy by using the subset relation. Smaller patterns in the taxonomy are usually more general, because they can be used frequently in both positive and negative documents, and larger patterns are usually more specific, since they may be used only in positive documents. The semantic information will be used in the pattern taxonomy to improve the performance of using closed patterns in text mining.

A sequential pattern X is called a frequent pattern if its relative support (or absolute support) is at least a minimum support. The properties of closed patterns can be used to define closed sequential patterns. The support-count computation is sketched below.
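
The support-count algorithm appears as a figure in the source; the following Python sketch instead re-implements the definitions above directly, treating each paragraph as a set of terms (the subsequence check needed for genuinely sequential patterns is omitted for brevity), with min_sup as an illustrative parameter.

def coverset(X, paragraphs):
    """Covering set of termset X: all paragraphs of PS(d) that contain X."""
    X = set(X)
    return [dp for dp in paragraphs if X <= set(dp)]

def absolute_support(X, paragraphs):
    return len(coverset(X, paragraphs))

def relative_support(X, paragraphs):
    return absolute_support(X, paragraphs) / len(paragraphs)

# A pattern is frequent if its relative support reaches min_sup.
def is_frequent(X, paragraphs, min_sup=0.2):
    return relative_support(X, paragraphs) >= min_sup

# Usage: PS(d) for a document d, one term list per paragraph.
ps_d = [["text", "mining", "model"], ["pattern", "mining"], ["text", "pattern", "mining"]]
print(absolute_support(["pattern", "mining"], ps_d))   # 2
print(relative_support(["pattern", "mining"], ps_d))   # 0.666...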

3.3 Pattern Taxonomy Model

In the PTM method, each document d is split into paragraphs p, which yields the paragraph set PS(d). Patterns can be structured into a taxonomy by using the subset relation. From the set of paragraphs in the documents, the frequent patterns and their covering sets are discovered. Smaller patterns in the taxonomy are usually more general, because they can be used frequently in both positive and negative documents; larger patterns in the taxonomy are usually more specific, since they may be used only in positive documents. The semantic information will be used in the pattern taxonomy to improve the performance of using closed patterns in text mining.
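
A small sketch of how the subset relation organizes patterns and how closed patterns follow from it: a pattern is discarded when some proper super-pattern has the same support. This uses termsets rather than sequences for brevity; for sequential patterns, the subsequence relation plays the same role as the subset relation here.

def closed_patterns(patterns):
    """Keep only closed patterns: discard any pattern that has a
    proper super-pattern with the same support.
    `patterns` maps a frozenset of terms to its support count."""
    closed = {}
    for X, sup in patterns.items():
        if not any(X < Y and sup == patterns[Y] for Y in patterns):
            closed[X] = sup
    return closed

# Usage: the smaller (more general) pattern {pattern} is absorbed by
# {pattern, mining}, which has the same support.
pats = {frozenset(["pattern"]): 2,
        frozenset(["pattern", "mining"]): 2,
        frozenset(["text"]): 3}
print(closed_patterns(pats))
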
3.4 D-Pattern Discovery

A d-pattern mining algorithm is used to discover the d-patterns from the set of documents. The efficiency of pattern taxonomy mining is improved by an SPMining algorithm that finds all the closed sequential patterns, using the well-known Apriori property to reduce the search space. The algorithm describes the training process of finding the set of d-patterns. For every positive document, the SPMining algorithm is first called, giving rise to a set of closed sequential patterns. The main focus is the deploying process, which consists of d-pattern discovery and term support evaluation. All discovered patterns in a positive document are composed into a d-pattern, giving rise to a set of d-patterns. Thereafter, term supports are calculated based on the normal forms for all terms in the d-patterns.
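
The following Python sketch shows the deploying step under one simple normal form, in which each discovered pattern spreads its support evenly over its terms; the paper's exact normalization may differ, and the helper names are illustrative.

from collections import defaultdict

def deploy_d_pattern(patterns):
    """Compose the closed patterns discovered in one positive document
    into a d-pattern (a term -> support mapping), with each pattern
    distributing its support evenly over its terms.
    `patterns` is a list of (term list, support) pairs."""
    d_pattern = defaultdict(float)
    for terms, sup in patterns:
        for t in terms:
            d_pattern[t] += sup / len(terms)
    return dict(d_pattern)

def term_supports(d_patterns):
    """Accumulate term supports over the d-patterns of all positive
    documents and normalize them to sum to 1."""
    support = defaultdict(float)
    for dp in d_patterns:
        for t, w in dp.items():
            support[t] += w
    total = sum(support.values())
    return {t: w / total for t, w in support.items()}

# Usage with two positive documents' discovered patterns.
d1 = deploy_d_pattern([(["text", "mining"], 2), (["pattern"], 3)])
d2 = deploy_d_pattern([(["pattern", "mining"], 1)])
print(term_supports([d1, d2]))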




3.5 IP Evaluation and Shuffling

In inner pattern evaluation, the similarity between the test document and the concept is estimated using the inner product. The relevance of a document d to a topic can be calculated by the function

    R(d) = d · V.

Weights are then assigned to all incoming documents d based on their corresponding weight functions W. A computer is capable of generating a "perfect shuffle", a random permutation of the set of documents. For a given noisy negative document nd, the time complexity is O(nm²). Algorithms for IP evaluation and for shuffling are sketched below.
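
A minimal Python sketch of inner-product (IP) evaluation and shuffling: R(d) = d · V scores a document's term-weight vector against the concept vector V of term supports, and a perfect shuffle is obtained from a standard Fisher-Yates permutation (Python's random.shuffle). The reweighting step for noisy negative documents is left out, and the function names are illustrative.

import random

def relevance(doc_vector, concept):
    """R(d) = d . V: inner product between a document's term-weight
    vector and the concept (term support) vector."""
    return sum(w * concept.get(t, 0.0) for t, w in doc_vector.items())

def rank_documents(doc_vectors, concept):
    """Rank incoming documents by their relevance to the topic."""
    return sorted(doc_vectors, key=lambda d: relevance(d, concept), reverse=True)

def perfect_shuffle(documents, seed=None):
    """Generate a random permutation of the document set."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    return docs
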
3.6 Temporal Sequential Pattern Mining

In the pattern discovery model, a dynamic programming algorithm is used to find the optimal information preserving decomposition and the optimal lossy decomposition. A close relationship is discovered between the decomposition of the time period associated with the document set and the significant information computed for temporal analysis; the problem of identifying a suitable time decomposition for a given document set does not seem to have received adequate attention. Therefore, time points, intervals, and decompositions are defined. A time point is given by a base granularity such as seconds, minutes, days, etc. The time interval between t1 and t2 is defined as {t | t1 ≤ t ≤ t2}.

A decomposition of the time interval T is given as a sequence of time intervals

    T1, T2, T3, T4, ..., Tn

and T is computed by

    T = T1 * T2 * T3 * T4 * ... * Tn.

The information is mapped from a keyword wi and a document dataset D as

    fm(wi, D) = v, where v ∈ R+.
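
A Python sketch of the information preserving decomposition over a time line: consecutive time points with identical information content are merged into one subinterval, which a single left-to-right pass achieves; the dynamic programming algorithm in the model is needed for the lossy case, where intervals with merely similar content are merged under a cost bound. The data layout (time point, keyword set) is an assumption for illustration.

def information_preserving_decomposition(timeline):
    """Merge consecutive time points carrying identical information
    content into one subinterval. `timeline` is a list of
    (time_point, info) pairs sorted by time; info is any comparable
    summary (e.g., a frozenset of significant keywords)."""
    intervals = []
    start, current = timeline[0][0], timeline[0][1]
    prev_t = start
    for t, info in timeline[1:]:
        if info != current:
            intervals.append((start, prev_t, current))
            start, current = t, info
        prev_t = t
    intervals.append((start, prev_t, current))
    return intervals

# Usage: four time points collapse into two subintervals.
timeline = [(1, frozenset({"merger"})), (2, frozenset({"merger"})),
            (3, frozenset({"lawsuit"})), (4, frozenset({"lawsuit"}))]
print(information_preserving_decomposition(timeline))
# [(1, 2, frozenset({'merger'})), (3, 4, frozenset({'lawsuit'}))]
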
4. PERFORMANCE EVALUATION

In this study, a document collection is used to evaluate the proposed approach, and various common measures are applied for performance evaluation. The evaluation compares the existing and proposed systems on parameters such as precision, recall, and the F-measure, which combines precision and recall. The experimental results show that the proposed method performs better than the existing system (Figure 1). The proposed system is more reliable and scalable for complex applications. Figure 1 compares the existing and proposed work on precision, recall, and F-measure, and shows better results for the proposed work.

Figure 1: Comparison of existing and proposed work
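
The measures plotted in Figure 1 can be computed as follows; this Python sketch assumes the retrieved and relevant results are given as sets of document identifiers.

def precision_recall_f1(retrieved, relevant):
    """Compute the evaluation measures compared in Figure 1."""
    tp = len(retrieved & relevant)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Usage on a toy result list.
print(precision_recall_f1({"d1", "d2", "d3"}, {"d2", "d3", "d4"}))
# (0.666..., 0.666..., 0.666...)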
                                                                                               Machines: Learning with Many Relevant Features, Proc.
5. CONCLUSION

A PTM model with a new pattern discovery model for text mining mainly focuses on the implementation of temporal text patterns. Here a dynamic programming algorithm for optimal information preserving decomposition and optimal lossy decomposition is introduced. This is used for analyzing the relationship between the decomposition of the time period associated with the document set and the significant information computed for temporal analysis. It quickly finds patterns for various ranges of parameters. It focuses on using information extraction to extract a structured database from a corpus of natural language text and then discovering patterns in the resulting database using traditional KDD tools. It also concerns record linkage, a form of data cleaning that identifies equivalent but textually distinct items in the extracted data prior to mining. It is also related to natural language learning. Further implementation will focus on text mining for bioinformatics; it will also include applying the discovered patterns to various time series analysis domains such as prediction, serving as pattern templates for numeric-to-symbolic (N/S) conversion, and summarization of the time series.

REFERENCES

1. M.F. Caropreso, S. Matwin, and F. Sebastiani. Statistical Phrases in Automated Text Categorization, Technical Report IEI-B4-07-2000, Istituto di Elaborazione dell'Informazione, 2000.

2. C. Cortes and V. Vapnik. Support-Vector Networks, Machine Learning, Vol. 20, No. 3, pp. 273-297, 1995.

3. S.T. Dumais. Improving the Retrieval of Information from External Sources, Behavior Research Methods, Instruments, and Computers, Vol. 23, No. 2, pp. 229-236, 1991.

4. J. Han and K.C.-C. Chang. Data Mining for Web Intelligence, Computer, Vol. 35, No. 11, pp. 64-70, Nov. 2002.

5. J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation, Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '00), pp. 1-12, 2000.

6. Y. Huang and S. Lin. Mining Sequential Patterns Using Graph Search Techniques, Proc. 27th Ann. Int'l Computer Software and Applications Conf., pp. 4-9, 2003.

7. N. Jindal and B. Liu. Identifying Comparative Sentences in Text Documents, Proc. 29th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '06), pp. 244-251, 2006.

8. T. Joachims. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Proc. 14th Int'l Conf. Machine Learning (ICML '97), pp. 143-151, 1997.

9. T. Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Proc. European Conf. Machine Learning (ECML '98), pp. 137-142, 1998.