					                                                           (IJCSIS) International Journal of Computer Science and Information Security,
                                                           Vol. 8, No. 5, August 2010

Document Image Segmentation Based on Gray Level Co-Occurrence Matrices and Feed Forward Neural Network

S. Audithan
Research Scholar, Dept of CSE, Annamalai University, Annamalai Nagar, Tamil Nadu, India

Dr. RM. Chandrasekaran
Professor, Dept of CSE, Annamalai University, Annamalai Nagar, Tamil Nadu, India

Abstract: This paper presents a new method for extracting text regions from document images using a combination of gray level co-occurrence matrices (GLCM) and artificial neural networks (ANN). GLCM features are used to evaluate textural parameters and representations quantitatively and to determine which parameter values and representations are best for extracting text regions. Text regions are detected by extracting statistical features from the GLCM of the document image and using these features as the input of a neural network for classification. Experimental results show that our method gives better text extraction than other methods.

Keywords: Document segmentation, GLCM, ANN, Haralick features

I. INTRODUCTION

The extraction of textual information from document images has many useful applications in document analysis and understanding, such as optical character recognition, document retrieval, and compression. To date, many effective techniques have been developed for extracting characters from monochromatic document images. Document image segmentation is an important component of document image understanding.

An efficient and computationally fast method for segmenting the text and graphics parts of document images based on textural cues is presented in [1]. The segmentation method uses multi-scale wavelet analysis and statistical pattern recognition. M-band wavelets are used, which decompose an image into M x M band-pass channels. Information from table of contents (TOC) pages can be extracted and used in a document database for effective retrieval of the required pages. Fully automatic identification and segmentation of TOC pages from scanned documents is discussed in [2].

Character segmentation is the first step of an OCR system; it seeks to decompose a document image into a sequence of sub-images of individual character symbols. Segmentation of monochromatic document images into four classes (background, photograph, text, and graph) is presented in [3]. The features used for classification are based on the distribution patterns of wavelet coefficients in high frequency bands.

A probabilistic latent semantic analysis (pLSA) model is presented in [4]. The pLSA model was originally developed for topic discovery in text analysis using a "bag-of-words" document representation. The model is useful for image analysis through a "bag-of-visual-words" image representation. The performance of the method depends on the visual vocabulary generated by feature extraction from the document image. Kernel-based methods have demonstrated excellent performance in a variety of pattern recognition problems. The application of kernel-based methods and Gabor wavelets to the segmentation of document images is presented in [5]. The feature images are derived from Gabor filtered images. Taking the computational complexity into account, the sampled feature image is subjected to a Spectral Clustering Algorithm (SCA). The clustering results serve as training samples to train a Support Vector Machine (SVM).

The steerable pyramid transform is presented in [6]. The features extracted from the pyramid sub-bands serve to locate and classify regions into text and non-text in noisy, deformed, multilingual, multi-script document images. These documents contain tabular structures, logos, stamps, handwritten text blocks, photos, etc. A novel scheme for the extraction of textual areas of an image using globally matched wavelet filters is presented in [7]. A clustering-based technique has been devised for estimating globally matched wavelet filters using a collection of ground truth images. This work is extended to the segmentation of document images into text, background, and picture components.

A classical approach to the segmentation of canonical syllables of Telugu document images is proposed in [8]. The model consists of zone separation and component extraction phases as independent parts. The relation between zones and components is established in the segmentation process of the canonical syllable. The segmentation efficiency of the proposed model is evaluated with respect to the canonical groups.

A new method for extracting characters from various real-life complex document images is presented in [9]. It applies a

                                                                                                        ISSN 1947-5500
multi-plane segmentation technique to separate homogeneous objects, including text blocks, non-text graphical objects, and background textures, into individual object planes. It consists of two stages: automatic localized multilevel thresholding, and multi-plane region matching and assembling. Two novel approaches for document image segmentation are presented in [10]. For text line segmentation a Viterbi algorithm is proposed, while an SVM-based metric is adopted to locate words in each text line.

Gray-level co-occurrence matrices (GLCM) are used in [11] to quantitatively evaluate textural parameters and representations and to determine which parameter values and representations are best for mapping sea ice texture. In addition, [11] presents three GLCM implementations and evaluates them with a supervised Bayesian classifier on sea ice textural contexts. Texture is one of the important characteristics used in identifying objects or regions of interest in an image, whether the image is a photomicrograph, an aerial photograph, or a satellite image. Some easily computable textural features are presented in [12].

II. METHODOLOGY

A. GLCM

The gray level co-occurrence matrix (GLCM) is the basis for the Haralick texture features [12]. This matrix is square with dimension Ng, where Ng is the number of gray levels in the image. Element [i,j] of the matrix is generated by counting the number of times a pixel with value i is adjacent to a pixel with value j and then dividing the entire matrix by the total number of such comparisons made. Each entry is therefore considered to be the probability that a pixel with value i will be found adjacent to a pixel of value j.

Since adjacency can be defined to occur in each of four directions in a 2D, square pixel image (horizontal, vertical, left and right diagonals, as shown in Figure 1), four such matrices can be calculated.

Figure 1: Four directions of adjacency as defined for calculation of the Haralick texture features.

The Haralick statistics are calculated for co-occurrence matrices generated using each of these directions of adjacency. In our proposed system, 10 features based on the gray level co-occurrence matrix are selected for text extraction.

B. ANN

An artificial neuron is a computational model inspired by natural neurons. Natural neurons receive signals through synapses located on the dendrites or membrane of the neuron. When the signals received are strong enough (surpass a certain threshold), the neuron is activated and emits a signal through the axon. This signal might be sent to another synapse, and might activate other neurons. The complexity of real neurons is highly abstracted when modeling artificial neurons. These basically consist of inputs (like synapses), which are multiplied by weights (the strength of the respective signals) and then combined by a mathematical function that determines the activation of the neuron. Another function (which may be the identity) computes the output of the artificial neuron (sometimes depending on a certain threshold). ANNs combine artificial neurons in order to process information.

A single-layer network has severe restrictions: the class of tasks that can be accomplished is very limited. This limitation is overcome by the two-layer feed-forward network. The central idea behind this solution is that the errors for the units of the hidden layer are determined by back-propagating the errors of the units of the output layer. For this reason the method is often called the back-propagation learning rule. Back propagation can also be considered a generalization of the delta rule for nonlinear activation functions and multi-layer networks.

A feed-forward network has a layered structure. Each layer consists of units which receive their input from units in the layer directly below and send their output to units in the layer directly above. There are no connections within a layer. The Ni inputs are fed into the first layer of Nh,1 hidden units. The input units are merely "fan-out" units; no processing takes place in them. The activation of a hidden unit is a function Fi of the weighted inputs plus a bias. The output of the hidden units is distributed over the next layer of Nh,2 hidden units, and so on until the last layer of hidden units, whose outputs are fed into a layer of No output units, as shown in Figure 2. In our proposed method, a feed-forward neural network with one input layer, two hidden layers and one output layer is used.

Figure 2: A multi-layer network with m layers of units (layer sizes labelled Ni, Nh,1, ..., Nh,m-2, Nh,m-1, No).

III. PROPOSED SYSTEM

The block diagram of the proposed text extraction from document images is shown in Figure 3, where Figure

3(a) depicts the feature extraction phase, while the classification phase is shown in Figure 3(b).

(a) Document Image -> Preprocessing -> GLCM Feature Extraction -> Feature Vector
(b) Document Image -> GLCM Feature Extraction -> Compare the Features with Feature Vector -> Extracted Text Region

Figure 3: Our proposed text extraction method. (a) Feature extraction phase (b) Classification phase

A. Preprocessing

Image pre-processing is the term for operations on images at the lowest level of abstraction. The aim of pre-processing is an improvement of the image data that suppresses undesired distortions or enhances image features relevant for further processing and analysis. First, the given document image is converted into a gray scale image. Then Adaptive Histogram Equalization (AHE) is applied to enhance the contrast of the image. AHE computes the histogram of a local window centered at a given pixel to determine the mapping for that pixel, which provides a local contrast enhancement.

B. Feature Extraction

Feature extraction is an essential pre-processing step in pattern recognition and machine learning problems. It is often decomposed into feature construction and feature selection. In our approach, GLCM features are used as a feature vector to extract the text region from the document images. The following section gives an overview of feature extraction for the text region and the graphics/image part.

The input to the feature extraction module is a document image containing text and graphics/image parts. A 20 x 20 non-overlapping window is used to extract the features. The GLCM features are extracted from the text regions and the graphics parts and stored separately for the training phase in the classifier stage.

The following 10 GLCM features are selected for the feature extraction phase. Let Pij be the (i, j)th entry in a normalized GLCM. The means and standard deviations for the rows and columns of the matrix are, in the standard formulation of [11], [12],

$$\mu_x = \sum_{i,j=0}^{N-1} i\,P_{ij}, \qquad \mu_y = \sum_{i,j=0}^{N-1} j\,P_{ij} \tag{1}$$

$$\sigma_x = \sqrt{\sum_{i,j=0}^{N-1} (i-\mu_x)^2\,P_{ij}}, \qquad \sigma_y = \sqrt{\sum_{i,j=0}^{N-1} (j-\mu_y)^2\,P_{ij}} \tag{2}$$

where N is the number of gray levels in the image.

1) Contrast

To emphasize a large amount of contrast, create weights so that the calculation results in a larger figure when there is great contrast.

$$\text{Contrast} = \sum_{i,j=0}^{N-1} (i-j)^2\,P_{ij} \tag{3}$$

When i and j are equal, the cell is on the diagonal and (i-j) = 0. These values represent pixels entirely similar to their neighbor, so they are given a weight of 0. If i and j differ by 1, there is a small contrast, and the weight is 1. If i and j differ by 2, contrast is increasing and the weight is 4. The weights continue to increase quadratically as (i-j) increases.

2) Cluster Prominence

Cluster prominence represents the peakedness or flatness of the graph of the co-occurrence matrix with respect to values near the mean value.

$$\text{Cluster prominence} = \sum_{i,j=0}^{N-1} (i+j-\mu_x-\mu_y)^4\,P_{ij} \tag{4}$$

3) Cluster Shade

Cluster shade represents the lack of symmetry in an image and is defined by (5).

$$\text{Cluster shade} = \sum_{i,j=0}^{N-1} (i+j-\mu_x-\mu_y)^3\,P_{ij} \tag{5}$$

4) Dissimilarity

In the contrast measure, weights increase quadratically (0, 1, 4, 9, etc.) as one moves away from the diagonal. In the dissimilarity measure, however, weights increase linearly (0, 1, 2, 3, etc.).

$$\text{Dissimilarity} = \sum_{i,j=0}^{N-1} |i-j|\,P_{ij} \tag{6}$$

5) Energy

Angular second moment (ASM) and Energy use each Pij as a weight for itself. High values of ASM or Energy occur when the window is very orderly.

$$\text{ASM} = \sum_{i,j=0}^{N-1} P_{ij}^2 \tag{7}$$

The square root of the ASM is sometimes used as a texture measure, and is called Energy.

   6) Entropy

Entropy is a notoriously difficult term to understand; the concept comes from thermodynamics. It refers to the quantity of energy that is permanently lost to heat ("chaos") every time a reaction or a physical transformation occurs. Entropy cannot be recovered to do useful work. Because of this, the term is used in non-technical speech to mean irremediable chaos or disorder. Also, as with ASM, the equation used to calculate physical entropy is very similar to the one used for the texture measure.

   7) Maximum Probability

   8) Sum Entropy

   9) Difference Variance

   10) Difference Entropy

Figure 4: (a) Input image (b) Segmented result
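
No formulas are given here for Entropy and the four features listed after it. Their usual definitions in the GLCM literature [11], [12] can be sketched as follows. The index range 0..N-1, the natural logarithm, the convention 0 ln 0 = 0, and the helper distributions p_{x+y} and p_{x-y} are assumptions rather than choices confirmed by the text.

```latex
% Helper distributions over the sum and the absolute difference of gray levels
\begin{align*}
p_{x+y}(k) &= \sum_{i+j=k} P_{ij}, \quad k = 0, \dots, 2N-2, &
p_{x-y}(k) &= \sum_{|i-j|=k} P_{ij}, \quad k = 0, \dots, N-1
\end{align*}

\begin{align*}
\text{Entropy} &= -\sum_{i,j=0}^{N-1} P_{ij}\,\ln P_{ij} \\
\text{Maximum probability} &= \max_{i,j} P_{ij} \\
\text{Sum entropy} &= -\sum_{k=0}^{2N-2} p_{x+y}(k)\,\ln p_{x+y}(k) \\
\text{Difference variance} &= \sum_{k=0}^{N-1} \bigl(k-\mu_{x-y}\bigr)^2\,p_{x-y}(k),
  \qquad \mu_{x-y} = \sum_{k=0}^{N-1} k\,p_{x-y}(k) \\
\text{Difference entropy} &= -\sum_{k=0}^{N-1} p_{x-y}(k)\,\ln p_{x-y}(k)
\end{align*}
```

With these definitions, entropy is largest when all Pij are equal and smallest when a single entry dominates, which is the opposite behaviour to ASM.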


In the classification stage, the extracted GLCM features are fed to the feed-forward neural network with two hidden layers, which classifies each region as text or non-text.
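
The forward pass of such a network can be sketched as below. This is only an illustration under stated assumptions: the layer sizes (10 inputs, hidden layers of 8 and 4 units, 1 output), the sigmoid activation, the random untrained weights and all function names are hypothetical and are not specified in the paper. The input vector is the Sample Text 1 column of Table 1. Training by back propagation is omitted.

```python
import math
import random


def sigmoid(x):
    """Logistic activation; an assumed choice, the paper does not name one."""
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng):
    """Random weights (one row per unit) and zero biases for one layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def layer_forward(inputs, weights, biases):
    """Each unit applies the activation to its weighted inputs plus a bias."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(features, layers):
    """Propagate a feature vector through every layer in order."""
    activation = features
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)
    return activation


rng = random.Random(0)
sizes = [10, 8, 4, 1]  # hypothetical: 10 GLCM features, two hidden layers, 1 output
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

# Sample Text 1 column of Table 1 (F1..F10)
features = [4.6819, 227.69, -25.84, 1.2986, 0.2308,
            2.5679, 0.4694, 1.8594, 1.8594, 1.4964]
output = forward(features, layers)
print(output[0])  # untrained score in (0, 1); meaningful only after training
```

In practice the features would likely need scaling before training, since values such as F2 are two orders of magnitude larger than the others and would saturate the sigmoid units.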
IV. EXPERIMENTAL RESULTS

The performance of the proposed system is evaluated in this section. A set of 50 document images is used for the experiments on text extraction performance. These test images were scanned from newspapers and magazines. They are converted into gray scale images and preprocessed by adaptive histogram equalization before feature extraction. Two document images are shown in Figures 4(a) and 5(a), and the text extracted from them is shown in Figures 4(b) and 5(b).

Figure 5: (a) Input image (b) Segmented result

Table 1 presents the feature vectors for text samples of 20x20 windows, and Figure 6 shows a pictorial representation of the feature set for sample text regions. Table 2 presents the feature vectors for non-text samples of 20x20 windows, and Figure 7 shows a pictorial representation of the feature set for sample non-text regions.
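
As a rough illustration of how entries of such a feature vector could be computed from a single window, the sketch below builds a normalized GLCM and evaluates a few of the statistics described earlier. It assumes horizontal adjacency counted in one direction only, integer gray levels, and 0-based level indices. The function names `glcm` and `glcm_features` are illustrative and do not come from the paper, and the 4x4 window is a toy stand-in for a 20x20 one.

```python
import math


def glcm(window, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix of a 2D list of integer gray levels.

    Counts ordered pairs (window[r][c], window[r + dy][c + dx]) and divides
    by the number of pairs, so the entries sum to 1.
    """
    rows, cols = len(window), len(window[0])
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[window[r][c]][window[r2][c2]] += 1
                total += 1
    return [[n / total for n in row] for row in counts]


def glcm_features(p):
    """A few texture statistics of a normalized co-occurrence matrix p."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    mu_x = sum(i * v for i, _, v in cells)  # row mean
    mu_y = sum(j * v for _, j, v in cells)  # column mean
    asm = sum(v * v for _, _, v in cells)
    return {
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "dissimilarity": sum(abs(i - j) * v for i, j, v in cells),
        "asm": asm,
        "energy": math.sqrt(asm),
        "entropy": -sum(v * math.log(v) for _, _, v in cells if v > 0),
        "max_probability": max(v for _, _, v in cells),
        "cluster_shade": sum((i + j - mu_x - mu_y) ** 3 * v
                             for i, j, v in cells),
        "cluster_prominence": sum((i + j - mu_x - mu_y) ** 4 * v
                                  for i, j, v in cells),
    }


# Toy 4x4 window with 4 gray levels
window = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p = glcm(window, levels=4)
feats = glcm_features(p)
for name, value in feats.items():
    print(f"{name}: {value:.4f}")
```

Repeating the computation with the offsets (0, 1), (1, 1) and (-1, 1) for `(dx, dy)` would give matrices for the other three directions of adjacency.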


Table 1: Feature vectors for sample text regions (20x20 windows)

Feature   Sample Text 1   Sample Text 2   Sample Text 3   Sample Text 4
F1        4.6819          5.9833          4.9583          5.6819
F2        227.69          254.88          222.69          191.23
F3        -25.84          -24.57          -23.46          -18.05
F4        1.2986          1.5277          1.3333          1.5652
F5        0.2308          0.1573          0.2210          0.1274
F6        2.5679          2.8408          2.6287          2.9532
F7        0.4694          0.3777          0.4597          0.3319
F8        1.8594          2.0552          1.8912          2.0778
F9        1.8594          5.9833          4.9583          5.6819
F10       1.4964          1.5856          1.4843          1.6306

Figure 6: Pictorial representations of feature set for sample text regions

Table 2: Feature vectors for sample non-text regions (20x20 windows)

Feature   Sample non-Text 1   Sample non-Text 2   Sample non-Text 3   Sample non-Text 4
F1        0.8333              3.9083              1.1416              2.7263
F2        35.725              64.613              27.882              88.857
F3        -6.3919             1.1596              3.0661              -2.0649
F4        0.4305              1.5611              0.6833              1.2152
F5        0.4239              0.0335              0.1259              0.0560
F6        1.5641              3.5657              2.3313              3.2723
F7        0.6361              0.0611              0.2236              0.1166
F8        1.2107              2.2177              1.6982              2.1497
F9        0.8333              3.9083              1.1416              2.7263
F10       0.8674              1.5356              1.0457              1.3863

Figure 7: Pictorial representations of feature set for sample non-text regions

V. CONCLUSION

In this paper, we have presented a novel technique for extracting the text part of a document image using a feed-forward neural network and GLCM features. The GLCM features serve as training samples for the neural network. The performance depends on the representation generated by feature extraction from the images. The experimental results indicate that the proposed algorithm can effectively process a variety of document images with different fonts, structures and components. When the method is used to select engines for document image recognition, a much faster computation time with a minimal error rate is preferable.

REFERENCES

[1]   Mausumi Acharyya and Malay K. Kundu, "Document Image Segmentation Using Wavelet Scale-Space Features", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 12, No. 12, December 2002.
[2]   S. Mandal, S. P. Chowdhury, A. K. Das and Bhabatosh Chanda, "Automated Detection and Segmentation of Table of Contents Page from Document Images", IEEE Seventh International Conference on Document Analysis and Recognition, 2003.
[3]   Jia Li and Robert M. Gray, "Context Based Multiscale Classification of Document Images Using Wavelet Coefficient Distributions", IEEE Transactions on Image Processing, Vol. 9, No. 9, September 2000.
[4]   Takuma Yamaguchi and Minoru Maruyama, "Feature extraction for document image segmentation by pLSA model", The Eighth IAPR Workshop on Document Analysis Systems, 2008.
[5]   Yu-Long Qiao, Zhe-Ming Lu, Chun-Yan Song and Sheng-He Sun, "Document Image Segmentation Using Gabor Wavelet and Kernel-based Methods", Systems and Control in Aerospace and Astronautics.
[6]   Mohamed Benjelil, Slim Kanoun, Rémy Mullot and Adel M. Alimi, "Steerable pyramid based complex documents images segmentation", IEEE 10th International Conference on Document Analysis and Recognition, 2009.
[7]   Sunil Kumar, Rajat Gupta, Nitin Khanna and Santanu Chaudhury, "Text Extraction and Document Image Segmentation Using Matched Wavelets and MRF Model", IEEE Transactions on Image Processing, Vol. 16, No. 8, August 2007.

[8]   L. Pratap Reddy, A.S.C.S. Sastry and A.V. Srinivasa Rao, "Canonical Syllable Segmentation of Telugu Document Images", TENCON 2008 - IEEE Region 10 Conference.
[9]   Yen-Lin Chen and Bing-Fei Wu, "Text Extraction from Complex Document Images Using the Multi-plane Segmentation Technique", IEEE International Conference on Systems, Man, and Cybernetics, October 8-11, 2006.
[10]  Themos Stafylakis, Vassilis Papavassiliou, Vassilis Katsouros and George Carayannis, "Robust text-line and word segmentation for handwritten documents images", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.
[11]  Leen-Kiat Soh and Costas Tsatsoulis, "Texture Analysis of SAR Sea Ice Imagery Using Gray Level Co-Occurrence Matrices", IEEE Transactions on Geoscience and Remote Sensing, Vol. 37, No. 2.
[12]  Robert M. Haralick, K. Shanmugam and Its'hak Dinstein, "Textural Features for Image Classification", IEEE Transactions on Systems, Man and Cybernetics, No. 6, November 1973.
[13]  Robert M. Haralick, "Statistical and structural approaches to texture", Proc. IEEE, Vol. 67, No. 5, pp. 786-804, 1979.

AUTHORS PROFILE

S. Audithan is currently a research scholar at the Department of Computer Science and Engineering, Annamalai University, Annamalai Nagar, Tamilnadu, India. He received his BE degree from Bharathidasan University in 2000 and his ME degree from Annamalai University in 2006. He worked as a network engineer at RBCOMTEC in Hyderabad from 2000 to 2003. He has presented and published more than 5 papers in conferences and journals. His research interests include Image Processing, Network Security, and Artificial Intelligence.

Dr. RM. Chandrasekaran is currently working as a Professor at the Department of Computer Science and Engineering, Annamalai University, Annamalai Nagar, Tamilnadu, India. From 1999 to 2001 he worked as a software consultant at Etiam, Inc, California, USA. He received his Ph.D degree in 2006 from Annamalai University, Chidambaram. He has conducted workshops and conferences in the areas of Multimedia, Business Intelligence, Analysis of Algorithms and Data Mining. He has presented and published more than 32 papers in conferences and journals and is the co-author of the book Numerical Methods with C++ Program (PHI, 2005). His research interests include Data Mining, Algorithms, Image Processing and Mobile Computing. He is a life member of the Computer Society of India, the Indian Society for Technical Education, the Institute of Engineers and the Indian Science Congress Association.
