International Journal of Computer Science and Information Security (IJCSIS), Vol. 8, No. 5, August 2010
Document Image Segmentation Based on Gray Level Co-Occurrence Matrices and Feed Forward Neural Network

S. Audithan
Research Scholar, Dept. of CSE, Annamalai University, Annamalai Nagar, Tamil Nadu, India
sarabar36@rediffmail.com

Dr. RM. Chandrasekaran
Professor, Dept. of CSE, Annamalai University, Annamalai Nagar, Tamil Nadu, India
aurmc@sify.com
Abstract: This paper presents a new method for extracting text regions from document images using a combination of gray level co-occurrence matrices (GLCM) and artificial neural networks (ANN). We use GLCM features to quantitatively evaluate textural parameters and representations and to determine which parameter values and representations are best for extracting text regions. Text regions are detected by extracting statistical features from the GLCM of the document image; these features are used as the input to a neural network for classification. Experimental results show that our method gives better text extraction than other methods.

Keywords: Document segmentation, GLCM, ANN, Haralick features

I. INTRODUCTION
The extraction of textual information from document images supports many useful applications in document analysis and understanding, such as optical character recognition, document retrieval, and compression. To date, many effective techniques have been developed for extracting characters from monochromatic document images. Document image segmentation is an important component of document image understanding.

An efficient and computationally fast method for segmenting the text and graphics parts of document images based on textural cues is presented in [1]. The segmentation method uses the notion of multiscale wavelet analysis and statistical pattern recognition. M-band wavelets are used, which decompose an image into M x M band-pass channels.

Information from the table of contents (TOC) pages can be extracted for use in a document database for effective retrieval of the required pages. Fully automatic identification and segmentation of TOC pages from scanned documents is discussed in [2].

Character segmentation is the first step of an OCR system; it seeks to decompose a document image into a sequence of sub-images of individual character symbols. Segmentation of monochromatic document images into four classes is presented in [3]. The classes are background, photograph, text, and graph. Features used for classification are based on the distribution patterns of wavelet coefficients in high frequency bands.

A probabilistic latent semantic analysis (pLSA) model is presented in [4]. The pLSA model was originally developed for topic discovery in text analysis using a "bag-of-words" document representation. The model is useful for image analysis through a "bag-of-visual-words" image representation. The performance of the method depends on the visual vocabulary generated by feature extraction from the document image.

Kernel-based methods have demonstrated excellent performance in a variety of pattern recognition problems. The application of kernel-based methods and Gabor wavelets to the segmentation of document images is presented in [5]. The feature images are derived from Gabor-filtered images. Taking the computational complexity into account, the sampled feature image is subjected to the Spectral Clustering Algorithm (SCA). The clustering results serve as training samples to train a Support Vector Machine (SVM).

The steerable pyramid transform is presented in [6]. The features extracted from pyramid sub-bands serve to locate and classify regions into text and non-text in noise-infected, deformed, multilingual, multi-script document images. These documents contain tabular structures, logos, stamps, handwritten text blocks, photos, etc. A novel scheme for the extraction of textual areas of an image using globally matched wavelet filters is presented in [7]. A clustering-based technique has been devised for estimating globally matched wavelet filters using a collection of ground truth images. This work is extended to text extraction for the segmentation of document images into text, background, and picture components.

A classical approach to the segmentation of canonical syllables of Telugu document images is proposed in [8]. The model consists of zone separation and component extraction phases as independent parts. The relation between zones and components is established in the segmentation process of the canonical syllable. The segmentation efficiency of the proposed model is evaluated with respect to the canonical groups.

A new method for extracting characters from various real-life complex document images is presented in [9]. It applies a
263 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
multi-plane segmentation technique to separate homogeneous objects, including text blocks, non-text graphical objects, and background textures, into individual object planes. It consists of two stages: automatic localized multilevel thresholding, and multi-plane region matching and assembling. Two novel approaches for document image segmentation are presented in [10]. For text line segmentation a Viterbi algorithm is proposed, while an SVM-based metric is adopted to locate words in each text line.

Gray-level co-occurrence matrices (GLCM) are used in [11] to quantitatively evaluate textural parameters and representations and to determine which parameter values and representations are best for mapping sea ice texture. In addition, [11] presents three GLCM implementations and evaluates them with a supervised Bayesian classifier on sea ice textural contexts. Texture is one of the important characteristics used in identifying objects or regions of interest in an image, whether the image is a photomicrograph, an aerial photograph, or a satellite image. Some easily computable textural features are presented in [12].

II. METHODOLOGY

A. GLCM
The gray level co-occurrence matrix (GLCM) is the basis for the Haralick texture features [12]. This matrix is square with dimension Ng, where Ng is the number of gray levels in the image. Element [i,j] of the matrix is generated by counting the number of times a pixel with value i is adjacent to a pixel with value j and then dividing the entire matrix by the total number of such comparisons made. Each entry is therefore considered to be the probability that a pixel with value i will be found adjacent to a pixel of value j.

Since adjacency can be defined to occur in each of four directions in a 2D, square pixel image (horizontal, vertical, left and right diagonals, as shown in Figure 1), four such matrices can be calculated.

Figure 1: Four directions of adjacency as defined for calculation of the Haralick texture features.

The Haralick statistics are calculated for co-occurrence matrices generated using each of these directions of adjacency. In our proposed system, based on the gray level co-occurrence matrix, 10 features are selected for text extraction.

B. ANN
An artificial neuron is a computational model inspired by natural neurons. Natural neurons receive signals through synapses located on the dendrites or membrane of the neuron. When the signals received are strong enough (surpass a certain threshold), the neuron is activated and emits a signal through the axon. This signal might be sent to another synapse, and might activate other neurons. The complexity of real neurons is highly abstracted when modeling artificial neurons. These basically consist of inputs (like synapses), which are multiplied by weights (strength of the respective signals), and then computed by a mathematical function which determines the activation of the neuron. Another function (which may be the identity) computes the output of the artificial neuron (sometimes in dependence of a certain threshold). ANNs combine artificial neurons in order to process information.

A single layer network has severe restrictions: the class of tasks that can be accomplished is very limited. The limitation is overcome by the two layer feed forward network. The central idea behind this solution is that the errors for the units of the hidden layer are determined by back propagating the errors of the units of the output layer. For this reason the method is often called the back propagation learning rule. Back propagation can also be considered as a generalization of the delta rule for nonlinear activation functions and multilayer networks.

A feed forward network has a layered structure. Each layer consists of units which receive their input from units in the layer directly below and send their output to units in the layer directly above. There are no connections within a layer. The Ni inputs are fed into the first layer of Nh,1 hidden units. The input units are merely "fan out" units; no processing takes place in these units. The activation of a hidden unit is a function Fi of the weighted inputs plus a bias. The output of the hidden units is distributed over the next layer of Nh,2 hidden units, and so on until the last layer of hidden units, of which the outputs are fed into a layer of No output units, as shown in Figure 2. In our proposed method, a feed forward neural network with one input layer, two hidden layers and one output layer is used.

Figure 2: A multilayer network with m layers of input (labels in the figure: Ni inputs; hidden layers Nh,1, ..., Nh,m-2, Nh,m-1; No outputs).

III. PROPOSED SYSTEM
The block diagram of the proposed text extraction method for document images is shown in Figure 3, where Figure
3(a) depicts the Feature Extraction Phase, while the Classification Phase is shown in Figure 3(b).

[Figure 3 block diagram. (a): Document Image -> Preprocessing -> GLCM Feature Extraction -> Feature Vector. (b): Document Image -> Preprocessing -> GLCM Feature Extraction -> Compare the Features with Feature Vector -> Extracted Text Region.]
Figure 3: Our proposed text extraction method. (a) Feature Extraction Phase (b) Classification Phase

A. Preprocessing
Image pre-processing is the term for operations on images at the lowest level of abstraction. The aim of pre-processing is an improvement of the image data that suppresses undesired distortions or enhances image features relevant for further processing and analysis. First, the given document image is converted into a gray scale image. Then Adaptive Histogram Equalization (AHE) is applied to enhance the contrast of the image. AHE computes the histogram of a local window centered at a given pixel to determine the mapping for that pixel, which provides a local contrast enhancement.

B. Feature Extraction
Feature extraction is an essential pre-processing step for pattern recognition and machine learning problems. It is often decomposed into feature construction and feature selection. In our approach, GLCM features are used as a feature vector to extract the text region from document images. This section gives an overview of feature extraction for the text region and the graphics/image part.

The input to the feature extraction module is a document image containing text and graphics/image parts. A 20x20 non-overlapping window is used to extract the features. The GLCM features are extracted from the text regions and the graphics parts and stored separately for the training phase of the classifier stage.

The following 10 GLCM features are selected for the feature extraction phase. Let $P_{ij}$ be the (i, j)th entry in a normalized GLCM. The means and standard deviations for the rows and columns of the matrix are

(2)

N is the number of gray levels in the image.

1) Contrast
To emphasize a large amount of contrast, create weights so that the calculation results in a larger figure when there is great contrast.

$$\mathrm{Contrast} = \sum_{i,j} P_{ij}\,(i-j)^2 \qquad (3)$$

When i and j are equal, the cell is on the diagonal and (i-j) = 0. These values represent pixels entirely similar to their neighbor, so they are given a weight of 0. If i and j differ by 1, there is a small contrast, and the weight is 1. If i and j differ by 2, contrast is increasing and the weight is 4. The weights continue to increase quadratically as |i-j| increases.

2) Cluster Prominence
Cluster Prominence represents the peakedness or flatness of the graph of the co-occurrence matrix with respect to values near the mean value.

(4)

3) Cluster Shade
Cluster Shade represents the lack of symmetry in an image and is defined by (5).

(5)

4) Dissimilarity
In the Contrast measure, weights increase quadratically (0, 1, 4, 9, etc.) as one moves away from the diagonal. In the Dissimilarity measure, however, weights increase linearly (0, 1, 2, 3, etc.).

$$\mathrm{Dissimilarity} = \sum_{i,j} P_{ij}\,|i-j| \qquad (6)$$

5) Energy
Angular second moment (ASM) and Energy use each $P_{ij}$ as a weight for itself. High values of ASM or Energy occur when the window is very orderly.

$$\mathrm{ASM} = \sum_{i,j} P_{ij}^{2} \qquad (7)$$

The square root of the ASM is sometimes used as a texture measure, and is called Energy.

$$\mathrm{Energy} = \sqrt{\mathrm{ASM}} \qquad (8)$$
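As a rough illustration of the GLCM construction in Section II-A and of features 1) to 5), the following Python sketch builds a normalized GLCM for one direction of adjacency and evaluates the five measures. Several choices here are assumptions rather than details given in the paper: a pixel distance of 1, symmetric counting of each pair, and the commonly used definitions of the row and column means, Cluster Shade and Cluster Prominence. The names `glcm`, `features_1_to_5` and `DIRECTIONS` are hypothetical.

```python
import math

# Offsets (row step, column step) for the four adjacency directions of
# Figure 1. A pixel distance of 1 is an assumption; the paper does not state it.
DIRECTIONS = {
    "horizontal": (0, 1),
    "vertical": (1, 0),
    "right_diagonal": (-1, 1),
    "left_diagonal": (-1, -1),
}


def glcm(window, levels, offset):
    """Return the normalized GLCM of a 2D list of ints in range(levels).

    Each adjacent pair is counted in both orders (symmetric GLCM, an
    assumption), then the matrix is divided by the total number of counts.
    """
    d_row, d_col = offset
    rows, cols = len(window), len(window[0])
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = window[r][c], window[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1
                total += 2
    if total == 0:
        raise ValueError("window too small for this offset")
    return [[n / total for n in row] for row in counts]


def features_1_to_5(p):
    """Contrast, cluster prominence, cluster shade, dissimilarity, energy."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    mu_x = sum(i * v for i, _, v in cells)  # mean over rows (assumed form)
    mu_y = sum(j * v for _, j, v in cells)  # mean over columns (assumed form)
    asm = sum(v * v for _, _, v in cells)   # angular second moment
    return {
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "cluster_prominence": sum((i + j - mu_x - mu_y) ** 4 * v for i, j, v in cells),
        "cluster_shade": sum((i + j - mu_x - mu_y) ** 3 * v for i, j, v in cells),
        "dissimilarity": sum(abs(i - j) * v for i, j, v in cells),
        "energy": math.sqrt(asm),
    }


# Usage on a small 4-level test window (illustrative values only).
window = [[0, 0, 1, 1],
          [0, 0, 1, 1],
          [0, 2, 2, 2],
          [2, 2, 3, 3]]
p_horizontal = glcm(window, levels=4, offset=DIRECTIONS["horizontal"])
print(features_1_to_5(p_horizontal))
```

In practice the same functions would be called once per direction in `DIRECTIONS` for every 20x20 window; how the four directional results are combined into one feature vector is not specified in the paper.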
Figure 4: (a) Input image
Figure 4: (b) Segmented Result

6) Entropy
Entropy is a notoriously difficult term to understand; the concept comes from thermodynamics. It refers to the quantity of energy that is permanently lost to heat ("chaos") every time a reaction or a physical transformation occurs. Entropy cannot be recovered to do useful work. Because of this, the term is used in nontechnical speech to mean irremediable chaos or disorder. Also, as with ASM, the equation used to calculate physical entropy is very similar to the one used for the texture measure.

(9)

7) Maximum Probability

(10)

8) Sum Entropy

(11)

9) Difference Variance

(12)

10) Difference Entropy

(13)
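Features 6) to 10) can be sketched in the same way. The definitions below are the ones commonly used in the texture literature (natural logarithm, sum and difference distributions over i + j and |i - j|) and are assumptions, since the corresponding equations are not reproduced in this text; `features_6_to_10` is a hypothetical name. Tables I and II list F9 equal to F1 in most columns, which may hint that an uncentred form of Difference Variance (which reduces to Contrast) was used; that is only a guess, and the centred form is shown here.

```python
import math


def features_6_to_10(p):
    """Entropy-type features of a normalized GLCM given as a list of lists."""
    n = len(p)
    # Distributions of i + j (values 0 .. 2n-2) and |i - j| (values 0 .. n-1).
    p_sum = [0.0] * (2 * n - 1)
    p_diff = [0.0] * n
    for i in range(n):
        for j in range(n):
            p_sum[i + j] += p[i][j]
            p_diff[abs(i - j)] += p[i][j]

    def shannon(dist):
        # -sum(v * ln v), skipping empty cells so that log(0) never occurs.
        return -sum(v * math.log(v) for v in dist if v > 0)

    mean_diff = sum(k * v for k, v in enumerate(p_diff))
    return {
        "entropy": shannon([v for row in p for v in row]),
        "max_probability": max(max(row) for row in p),
        "sum_entropy": shannon(p_sum),
        # Centred variance of the |i - j| distribution (assumed form).
        "difference_variance": sum((k - mean_diff) ** 2 * v
                                   for k, v in enumerate(p_diff)),
        "difference_entropy": shannon(p_diff),
    }


# Usage: a uniform 2x2 GLCM, for which every value can be checked by hand.
uniform = [[0.25, 0.25],
           [0.25, 0.25]]
print(features_6_to_10(uniform))
```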
In the classification stage, the extracted GLCM features are given to the two-layer feed forward neural network, which classifies each region as text or non-text.
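A minimal sketch of the forward pass of such a classifier is given below, assuming the structure described in Section II-B (10 inputs, two hidden layers, one output). The hidden layer sizes (8 and 4), the sigmoid activation, the random initial weights and the function names are all hypothetical; the weights are untrained, so the output only demonstrates the data flow, and back propagation training is not shown.

```python
import math
import random


def sigmoid(x):
    """Logistic activation, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def init_layer(n_in, n_out, rng):
    """One layer: each unit holds n_in weights followed by one bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def forward(layers, x):
    """Propagate feature vector x through the layers, one after another."""
    activation = list(x)
    for layer in layers:
        activation = [
            # zip stops after the weights, so unit[-1] (the bias) is added once.
            sigmoid(sum(w * a for w, a in zip(unit, activation)) + unit[-1])
            for unit in layer
        ]
    return activation


rng = random.Random(0)  # fixed seed so the sketch is repeatable
network = [init_layer(10, 8, rng),  # first hidden layer (size assumed)
           init_layer(8, 4, rng),   # second hidden layer (size assumed)
           init_layer(4, 1, rng)]   # single output: text vs. non-text

# Feature vector F1..F10 of "Sample Text 1" from Table I.
text_sample_1 = [4.6819, 227.69, -25.84, 1.2986, 0.2308,
                 2.5679, 0.4694, 1.8594, 1.8594, 1.4964]
output = forward(network, text_sample_1)
print(output)
```

With trained weights, an output above some threshold (for example 0.5) would be read as "text"; scaling the features to a common range before training is usually advisable because F2 is far larger than the others.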
IV. EXPERIMENTAL RESULTS
Figure 5: (a) Input Image
The performance of the proposed system is evaluated in this section. A set of 50 document images is used for the experiments on text extraction performance. These test images were scanned from newspapers and magazines. They are transformed into gray scale images and preprocessed by adaptive histogram equalization for feature extraction. The input document images are shown in Figures 4(a) and 5(a), and the text extracted from them is shown in Figures 4(b) and 5(b), respectively.
Figure 5: (b) Segmented Result
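The preparation steps used for the test images (gray scale conversion, adaptive histogram equalization, and the 20x20 non-overlapping windows of Section III-B) could be sketched as follows. This is an illustrative, unoptimized version under stated assumptions: the luma weights for gray conversion, the rank-based form of AHE, the neighbourhood radius, and the choice to drop incomplete border windows are not specified in the paper, and all function names are hypothetical.

```python
def to_gray(rgb):
    """Convert rows of (r, g, b) tuples to gray values (assumed luma weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb]


def adaptive_hist_eq(gray, radius=1, levels=256):
    """Naive AHE: map each pixel to its rank within its local window."""
    rows, cols = len(gray), len(gray[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Local window centered at (r, c), clipped at the image border.
            local = [gray[y][x]
                     for y in range(max(0, r - radius), min(rows, r + radius + 1))
                     for x in range(max(0, c - radius), min(cols, c + radius + 1))]
            rank = sum(1 for v in local if v <= gray[r][c])
            out[r][c] = round((levels - 1) * rank / len(local))
    return out


def tile_windows(gray, size=20):
    """Split the image into size x size non-overlapping windows.

    Incomplete windows at the right and bottom borders are dropped
    (an assumption; the paper does not say how borders are handled).
    """
    rows, cols = len(gray), len(gray[0])
    return [[row[c:c + size] for row in gray[r:r + size]]
            for r in range(0, rows - size + 1, size)
            for c in range(0, cols - size + 1, size)]


# Usage on a synthetic 40 x 45 gray image (values are illustrative only).
image = [[(x * y) % 256 for x in range(45)] for y in range(40)]
windows = tile_windows(adaptive_hist_eq(image, radius=2))
print(len(windows), len(windows[0]), len(windows[0][0]))
```

Each window returned by `tile_windows` would then be quantized to the chosen number of gray levels and passed to the GLCM feature extraction step.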
Table I presents the feature vectors for text samples of 20x20 windows, and Figure 6 shows a pictorial representation of the feature set for the sample text regions. Table II presents the feature vectors for non-text samples of 20x20 windows, and Figure 7 shows a pictorial representation of the feature set for the sample non-text regions.
TABLE I. FEATURE SET FOR TEXT REGION SAMPLES
Feature set   Sample Text 1   Sample Text 2   Sample Text 3   Sample Text 4
F1            4.6819          5.9833          4.9583          5.6819
F2            227.69          254.88          222.69          191.23
F3            -25.84          -24.57          -23.46          -18.05
F4            1.2986          1.5277          1.3333          1.5652
F5            0.2308          0.1573          0.2210          0.1274
F6            2.5679          2.8408          2.6287          2.9532
F7            0.4694          0.3777          0.4597          0.3319
F8            1.8594          2.0552          1.8912          2.0778
F9            1.8594          5.9833          4.9583          5.6819
F10           1.4964          1.5856          1.4843          1.6306

Figure 6: Pictorial representations of feature set for sample text regions

TABLE II. FEATURE SET FOR NON TEXT REGION SAMPLES
Feature set   Sample non Text 1   Sample non Text 2   Sample non Text 3   Sample non Text 4
F1            0.8333              3.9083              1.1416              2.7263
F2            35.725              64.613              27.882              88.857
F3            -6.3919             1.1596              3.0661              -2.0649
F4            0.4305              1.5611              0.6833              1.2152
F5            0.4239              0.0335              0.1259              0.0560
F6            1.5641              3.5657              2.3313              3.2723
F7            0.6361              0.0611              0.2236              0.1166
F8            1.2107              2.2177              1.6982              2.1497
F9            0.8333              3.9083              1.1416              2.7263
F10           0.8674              1.5356              1.0457              1.3863

Figure 7: Pictorial representations of feature set for sample non text regions

V. CONCLUSION
In this paper, we have presented a novel technique for extracting text regions with a feed forward neural network using GLCM features. The GLCM features serve as training samples for the neural network. The performance depends on the representation generated by feature extraction from the images. The experimental results indicate that the proposed algorithm can effectively process a variety of document images with different fonts, structures and components. When the method is used to select engines for document image recognition, it is preferable that the computation time be short and the error rate minimal.

REFERENCES
[1] Mausumi Acharyya and Malay K. Kundu, “Document Image Segmentation Using Wavelet Scale-Space Features”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 12, No. 12, December 2002.
[2] S. Mandal, S. P. Chowdhury, A. K. Das and Bhabatosh Chanda, “Automated Detection and Segmentation of Table of Contents Page from Document Images”, IEEE Seventh International Conference on Document Analysis and Recognition, 2003.
[3] Jia Li and Robert M. Gray, “Context Based Multiscale Classification of Document Images Using Wavelet Coefficient Distributions”, IEEE Transactions on Image Processing, Vol. 9, No. 9, September 2000.
[4] Takuma Yamaguchi and Minoru Maruyama, “Feature extraction for document image segmentation by pLSA model”, The Eighth IAPR Workshop on Document Analysis Systems, 2008.
[5] Yu-Long Qiao, Zhe-Ming Lu, Chun-Yan Song and Sheng-He Sun, “Document Image Segmentation Using Gabor Wavelet and Kernel-based Methods”, Systems and Control in Aerospace and Astronautics, 2006.
[6] Mohamed Benjelil, Slim Kanoun, Rémy Mullot and Adel M. Alimi, “Steerable pyramid based complex documents images segmentation”, IEEE 10th International Conference on Document Analysis and Recognition, 2009.
[7] Sunil Kumar, Rajat Gupta, Nitin Khanna and Santanu Chaudhury, “Text Extraction and Document Image Segmentation Using Matched Wavelets and MRF Model”, IEEE Transactions on Image Processing, Vol. 16, No. 8, August 2007.
[8] L. Pratap Reddy, A.S.C.S. Sastry and A.V. Srinivasa Rao, “Canonical Syllable Segmentation of Telugu Document Images”, TENCON 2008 - IEEE Region 10 Conference.
[9] Yen-Lin Chen and Bing-Fei Wu, “Text Extraction from Complex Document Images Using the Multi-plane Segmentation Technique”, IEEE International Conference on Systems, Man, and Cybernetics, October 8-11, 2006.
[10] Themos Stafylakis, Vassilis Papavassiliou, Vassilis Katsouros and George Carayannis, “Robust text-line and word segmentation for handwritten documents images”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.
[11] Leen-Kiat Soh and Costas Tsatsoulis, “Texture Analysis of SAR Sea Ice Imagery Using Gray Level Co-Occurrence Matrices”, IEEE Transactions on Geoscience and Remote Sensing, Vol. 37, No. 2, March 1999.
[12] Robert M. Haralick, K. Shanmugam and Its'hak Dinstein, “Textural Features for Image Classification”, IEEE Transactions on Systems, Man and Cybernetics, No. 6, November 1973.
[13] Robert M. Haralick, “Statistical and structural approaches to texture”, Proc. IEEE, Vol. 67, No. 5, pp. 786-804, 1979.
AUTHORS PROFILE
S. Audithan is currently a research scholar at the Department of Computer Science and Engineering, Annamalai University, Annamalai Nagar, Tamil Nadu, India. He received his BE degree from Bharathidasan University in 2000 and his ME degree from Annamalai University in 2006. He worked as a network engineer at RBCOMTEC in Hyderabad from 2000 to 2003. He has presented and published more than 5 papers in conferences and journals. His research interests include image processing, network security, and artificial intelligence.

Dr. RM. Chandrasekaran is currently working as a Professor at the Department of Computer Science and Engineering, Annamalai University, Annamalai Nagar, Tamil Nadu, India. From 1999 to 2001 he worked as a software consultant at Etiam, Inc., California, USA. He received his Ph.D. degree in 2006 from Annamalai University, Chidambaram. He has conducted workshops and conferences in the areas of Multimedia, Business Intelligence, Analysis of Algorithms and Data Mining. He has presented and published more than 32 papers in conferences and journals and is the co-author of the book Numerical Methods with C++ Program (PHI, 2005). His research interests include data mining, algorithms, image processing and mobile computing. He is a life member of the Computer Society of India, the Indian Society for Technical Education, the Institute of Engineers and the Indian Science Congress Association.