					International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-
6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 4, July-August (2013), © IAEME
ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online)
Volume 4, Issue 4, July-August (2013), pp. 556-565
Journal Impact Factor (2013): 6.1302 (Calculated by GISI)
© IAEME

                   TEXT RECOGNITION

                                     Vilas Naik1, Sagar Savalagi2
                  Department of CSE, Basaveshwar Engineering College, Bagalkot, India
                  Department of CSE, Basaveshwar Engineering College, Bagalkot, India


        With the growing popularity of sites like YouTube, video recording and sharing have become
widespread over the last several years. Unlike text documents, such multimedia content is difficult
to search and index, so content-based video retrieval systems are the need of the hour. Content-Based
Video Retrieval (CBVR) is an active research discipline focused on computational strategies for
searching relevant videos through multimodal content analysis, using visual, audio and text features
to represent and index video. Recent research on CBVR has presented many solutions based on these
features. Textual content appears in video in the form of embedded text and scene text, and both are
quite helpful for indexing. The proposed work is a content-based video retrieval system based on
textual cues. Text-based video retrieval enables search based on the textual information present in
the video. Regions of textual information are identified within the frames of the video, the text
within them is extracted with conventional OCR, and the video is annotated with the textual content
present in the images. This enables applications such as keyword-based search in multimedia
databases, and video indexing and retrieval are carried out with the help of these annotations.
Results show that the system is quite efficient, with an accuracy of around 90%. Textual queries
return higher accuracy than visual queries, which supports the concept.


         With the development of various multimedia compression standards and significant increases
in desktop computer performance and storage, the widespread exchange of multimedia information
is becoming a reality. Video is arguably the most popular means of communication and
entertainment. With this popularity comes an increase in the volume of video and an increased need
for the ability to automatically sift through large video databases in search of relevant material.
Even with increases in hardware capabilities, which make video distribution possible, factors such
as algorithm speed and storage costs are concerns that must still be addressed. Considering this, a
first step should be an attempt to increase speed when using existing compression standards.
Performing analysis in the compressed domain reduces the effort involved in decompression, and
providing a means of abstracting the data keeps the storage costs of the resulting feature set low.
Both of these problems are active areas of research. The aim of the proposed work is to develop a
new detection algorithm that boosts the speed of search and in turn reduces the cost of storage.
         Every day, both military and civilian equipment generates gigabytes of images. A huge
amount of information is out there. However, it is impossible to access or make use of the
information unless it is organized so as to allow efficient browsing, searching, and retrieval.
Image retrieval has been a very active research area since the 1970s, with the thrust coming from
two major research communities, database management and computer vision. These two communities
study image retrieval from different angles, one being text-based and the other visual-based. Many
advances, such as data modelling, multidimensional indexing, and query evaluation, have been made
along this research direction. However, there exist two major difficulties, especially when the
size of the image collection is large (tens or hundreds of thousands). One is the vast amount of
labour required for manual image annotation. The other difficulty, which is more essential, results
from the rich content of the images and the subjectivity of human perception. That is, different
people may perceive the same image content differently. The perception subjectivity and annotation
impreciseness may cause unrecoverable mismatches in later retrieval processes.
        The proposed mechanism is a scheme aimed at alleviating these hurdles: a new detection
algorithm with boosting that offers a retrieval system based on text. The work proceeds in the
following steps. Initially, frames are collected from the video clip. From these frames the text
part is segmented. Character segmentation then isolates the individual characters, which are
recognized by Optical Character Recognition (OCR). In order to increase the accuracy of
identification, color features are additionally extracted from the video clip. These color features
are combined with the text features and stored in the database. When the user enters a text query,
it is matched against the stored characters and the matching videos are displayed.


         Video retrieval is important in applications related to multimedia search engines, and
recognizing text is a crucial task in such applications. Over the last decade researchers have
proposed many different methods for video retrieval; some of the related work is summarized below.
         An approach that enables search based on the textual information present in the video is
introduced in [1]. In this method, regions of textual information are identified within the frames
of the video, and the video is then annotated with the textual content present in the images. An
approach that enables matching at the image level, thereby avoiding OCR, is also addressed. Videos
containing the query string are retrieved from a video database and sorted by relevance. Results
are shown for video collections in English, Hindi and Telugu.
         In [2] a method to automatically localize captions in JPEG compressed images and the
I-frames of MPEG compressed videos is proposed. Caption text regions are segmented from background
images using their distinguishing texture characteristics. Unlike previously published methods,
which fully decompress the video sequence before extracting the text regions, this method locates
candidate caption text regions directly in the DCT compressed domain using the intensity variation
information encoded there. Therefore, only a very small amount of decoding is required.
         The method in [3] is a news video retrieval solution that targets specific news videos
based on their contents as described by overlay text. The approach uses overlay text, which conveys
the direct meaning of the video, as a source of complementary information. The whole process is
divided into two steps. First, “metadata labels” are built by detecting and extracting the overlay
text. Second, these labels are used to index the news videos. The experiments are carried out on
news videos from NDTV News and on a large data set of video images containing artificial text
developed at the Image Processing Centre (IPC), a research facility at the National University of
Sciences and Technology (NUST), Pakistan. The FFMPEG library is used to extract the frames from the
news videos. An overlay scene is inserted on the video scene in the same way as overlay text, so a
transition region is observed for it as well.
         In [4] the authors propose three main components: 1. The integration of image and audio
analysis results to identify news segments. 2. Video OCR technology to detect text in frames, which
provides a good source of textual information for story classification when transcripts and closed
captions are not available. 3. Natural language processing (NLP) technologies, which are used to
perform automated categorization of news stories based on the texts obtained from closed captions
or the video OCR process. Based on these video structure and content analysis technologies, two
advanced video browsers are developed for home users: an intelligent highlight player and an
HTML-based video browser.
         An annotation-based indexing method that allows the user to retrieve video using textual
annotations is proposed in [5]. It takes a text-based query and compares it with the tags used for
indexing, and the event-based video is retrieved from a cricket video database. Experiments show
that annotation-based event retrieval methods can potentially improve retrieval accuracy using
different searching techniques, such as binary search or indexing, when the database is very large,
so that video retrieval can be carried out efficiently with this type of retrieval system.
         A technique has also been proposed to address the problems of extracting text from a video
and to design algorithms for each phase of text extraction using Java libraries and classes. First
the input video, which may be real-time or taken from a database, is split into a stream of images
using the Java Media Framework (JMF). Pre-processing algorithms are then applied to convert each
image to gray scale and to remove disturbances (superimposed lines over the text, discontinuities
and dots). The algorithms for localization, segmentation and recognition follow, the last of which
uses a neural network pattern matching technique. The performance of the approach is demonstrated
by presenting experimental results for a set of static images.
         The work “Improving Multimedia Retrieval with a Video OCR” presents a set of experiments
with a video OCR system (VOCR) tailored for video information retrieval and establishes its
importance in multimedia search in general and for some specific queries in particular. In the
method of [7], the analysis of video frames that produces candidate text regions is detailed. The
text regions are then binarized and sent to a commercial OCR, resulting in ASCII text that is
finally used to create search indexes. The system is evaluated using the TRECVID data. The
effectiveness of various textual sources for multimedia retrieval is evaluated by combining the
VOCR outputs with automatic speech recognition (ASR) transcripts. For general search queries, the
VOCR system coupled with ASR sources outperforms the other system by a very large margin. For
search queries that involve named entities, especially people's names, the VOCR system even
outperforms speech transcripts, demonstrating that source selection for particular query types is
essential.
         Another important consideration is the quality and complexity of the pictures containing
text that are used for evaluation. Some methods consider only large fonts in images, advertisements
and video clips. The methods also have limitations: the method in [8] does not detect low-contrast
text or small fonts, the techniques in [9] use text with different complex motions, and the methods
in [10] and [11] detect only caption text in news video clips.
         The proposed work extracts text from video frames by separating the text region from the
background and employs conventional OCR for text recognition.


In this section, an overview and a detailed description of all the blocks of the proposed system are given.

3.1 Overview of the Approach
        The proposed mechanism is a scheme that offers a video retrieval system based on embedded
text. The method uses the information conveyed by embedded text to identify the video to be
retrieved from a collection in response to a text query. The mechanism matches the query against
the text present in the video frames, using the features explained below. First, frames are
extracted from the video. The text part is segmented. Character segmentation extracts the
characters, and character recognition recognizes them. Color features are extracted from the video
scene. The color features, combined with the text features, are stored in the database. The user
inputs a text query, which is matched against the stored characters, and the matched videos are
displayed. The overall flow is as in Figure 1.

                 Fig. 1 Proposed algorithm for video retrieval by a text query

3.2 The Text Query Based Video Retrieval Algorithm.

The proposed algorithm is summarized in the following steps.
Step 1. Input a video and convert it into frames.
Step 2. Apply a median filter to each frame and perform Sobel edge detection to detect the edges
        of text regions in the frame. Then calculate the sum graph, i.e. the sums of the rows and
        columns of the binary image.
Step 3. Perform text region segmentation by applying the threshold
        Threshold = (sum(sum(B'))/prod(size(sum(B')))*50 + max(max(sum(B')))*30)/100
        where B' = input image.
Step 4. Apply OCR to recognize the text characters in the frames, normalizing characters to
        size 32x32. Store the recognized characters, together with the color features, in the
        database as text features.
Step 5. Given a text query, extract its characters. Match them, in one direction, against the
        character set associated with each video, and calculate the total number of character
        matches for each video.
Step 6. Retrieve the videos with the highest number of matches.

3.3 Text region localization
        As a first step, frames are extracted individually from each video in the collection. Each
video frame is converted into an image, because the frame is stored in a compressed format and has
to be decoded before it can be processed, and the image is then converted to greyscale. A median
filter is now applied to the image; the output of the median filter is shown in fig 4.2. The median
filter considers each pixel in the image in turn and looks at its nearby neighbours to decide
whether or not it is representative of its surroundings. Instead of simply replacing the pixel
value with the mean of the neighbouring pixel values, it replaces it with the median of those
values. The median is calculated by first sorting all the pixel values from the surrounding
neighbourhood into numerical order and then replacing the pixel being considered with the middle
pixel value. Next the Sobel operator is used. It is an edge detection technique that is applied to
the greyscale image to detect the edges of text regions.
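
The two localization operations described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the 3x3 window size, the clamped border handling, the |Gx| + |Gy| magnitude approximation, and the names `median_filter` and `sobel_magnitude` are all assumptions made for the example. The image is a list of rows of grey values.

```python
from statistics import median


def median_filter(img, k=3):
    """Replace each pixel by the median of its k x k neighbourhood.

    Border pixels reuse the nearest valid pixel (clamping); this border rule
    is an assumption, the paper does not state one.
    """
    h, w, r = len(img), len(img[0]), k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ]
            out[y][x] = median(window)
    return out


def sobel_magnitude(img):
    """Approximate edge strength |Gx| + |Gy| using the 3x3 Sobel kernels.

    The outermost rows and columns are left at 0.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = abs(gx) + abs(gy)
    return out


# Toy greyscale frame: dark left part, bright right part, one noisy pixel.
frame = [[0, 0, 0, 255, 255] for _ in range(5)]
frame[2][1] = 255                      # isolated "salt" noise

smoothed = median_filter(frame)        # the isolated pixel is removed
edges = sobel_magnitude(smoothed)      # strong response along the step edge
```

On this toy frame the median filter removes the isolated bright pixel without blurring the step between the dark and bright halves, which is the property that makes it a reasonable pre-processing step before edge detection.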

3.3 Text detection and Segmentation
        After the text region is localized, the text area has to be segmented for further
recognition. The output of this step is a binary image in which black text characters appear on a
white background. This stage extracts the actual text regions as follows. A median filter is
applied again, this time to the edge-detected image, to give a smooth image, and the horizontal and
vertical histograms are then computed. The horizontal and vertical histograms are the column-wise
and row-wise histograms respectively. They represent the sum of differences of gray values between
neighbouring pixels of the image, column-wise and row-wise. First the horizontal histogram is
calculated. To find it, the algorithm traverses each column of the image. In each column, the
algorithm starts with the second pixel from the top and calculates the difference between the
second and first pixels. If the difference exceeds a certain threshold, it is added to the total
sum of differences. The algorithm then moves downwards to calculate the difference between the
third and second pixels, and so on until the end of the column, accumulating the total sum of
differences between neighbouring pixels. At the end, an array containing the column-wise sums is
created. The same process is carried out to find the vertical histogram; in this case, rows are
processed instead of columns. A threshold value is then calculated from the normalized sum as shown
below.

         Threshold = (sum(sum(B'))/prod(size(sum(B')))*50 + max(max(sum(B')))*30)/100
         where B' = input image.

        The rows and columns that satisfy the threshold value are retained. This gives the rows
and columns in which text appears. The text block is then extracted, as shown in figure 2(d), and
the resulting image is stored in a result folder. All regions are extracted separately. The sum
graph is then computed on each region, its maxima are used to extract the individual characters,
and the characters are normalized to size 32x32.
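
The histogram and threshold steps of this section can be illustrated with a short sketch. Several points are assumptions rather than statements from the paper: the MATLAB-style threshold expression is read here as (mean of the profile * 50 + maximum of the profile * 30) / 100, the absolute value of each neighbour difference is used, and `min_diff` as well as all function names are hypothetical.

```python
def column_profile(gray, min_diff=10):
    """For each column, sum the differences between vertically neighbouring
    pixels that exceed min_diff (min_diff is a hypothetical value)."""
    h, w = len(gray), len(gray[0])
    profile = [0] * w
    for x in range(w):
        for y in range(1, h):              # start at the second pixel from the top
            d = abs(gray[y][x] - gray[y - 1][x])
            if d > min_diff:
                profile[x] += d
    return profile


def row_profile(gray, min_diff=10):
    """Same computation with rows processed instead of columns."""
    transposed = [list(col) for col in zip(*gray)]
    return column_profile(transposed, min_diff)


def band_threshold(profile):
    """Assumed reading of the threshold: (mean * 50 + max * 30) / 100."""
    return (sum(profile) / len(profile) * 50 + max(profile) * 30) / 100


def select(profile):
    """Indices of the rows or columns whose profile value reaches the threshold."""
    t = band_threshold(profile)
    return [i for i, v in enumerate(profile) if v >= t]


# Toy 6 x 8 frame with one bright block standing in for a text region.
gray = [[0] * 8 for _ in range(6)]
for y in (2, 3):
    for x in (3, 4, 5):
        gray[y][x] = 200

text_columns = select(column_profile(gray))   # columns 3, 4 and 5
text_rows = select(row_profile(gray))         # rows 2 and 3
```

The selected rows and columns bound the candidate text block; in the pipeline described above this block would then be cropped and passed on to character extraction.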

  Fig. 2 Overview of text detection and segmentation: (a) original frame; (b) gray scale image
  with noise reduction and edge detection; (c) feature vector graph when text is detected in the
  frame; (d) detected text

3.4 Text Recognition with Optical Character Recognition (OCR)
        This stage performs the actual recognition of the extracted characters by combining the
various features extracted in the previous stages to give the actual text. The output of the
segmentation stage is given as input to this stage. Optical Character Recognition (OCR) takes an
input image and recognizes the characters in it. When a text image is given as input to the OCR,
the image undergoes four stages of processing: pre-processing, feature extraction, classification
and post-processing. Of these four stages, feature extraction is an important one, since it is on
the basis of the extracted features that the OCR is able to recognize characters. We have used
template matching for feature extraction; this is one of the simplest approaches to pattern
recognition.

Template matching: This process involves the use of a database of characters or templates. There
exists a template for every possible input character. For recognition to occur, the current input
character is compared to each template to find either an exact match or the template with the
closest representation of the input character. If I(x, y) is the input character and Tn(x, y) is
template n, then the matching function s(I, Tn) returns a value indicating how well template n
matches the input character. The outputs generated by the OCR are ASCII characters, which are used
as keywords for future indexing and retrieval. Figure 3(a) shows a region identified as a text
block. It is separated out from the rest of the image and binarized. When this detected block is
given as input to the OCR, the corresponding ASCII output is as shown in Figure 3(c). It is observed
that the text extraction part of the system detects the text blocks accurately even against a
complex background, and that the OCR recognizes 90% of the text correctly. As seen in Figure 3(d),
some words were misrecognized due to the presence of noise. The mean and standard deviation of the
R, G and B components of the frames are also extracted, and this color feature is stored in the
text database along with the text features.
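
A minimal sketch of template matching follows. The paper does not define the matching function s(I, Tn), so the sketch assumes one simple choice, the fraction of pixels on which the binary input character and the template agree. The 3x3 templates are hypothetical stand-ins for the 32x32 normalized characters, and the names `match_score` and `recognize` are illustrative only.

```python
def match_score(glyph, template):
    """s(I, Tn): fraction of pixels on which glyph and template agree.

    Both arguments are equally sized binary images (lists of rows of 0/1).
    This particular score is an assumption, not the paper's definition.
    """
    total = agree = 0
    for g_row, t_row in zip(glyph, template):
        for g, t in zip(g_row, t_row):
            total += 1
            agree += (g == t)
    return agree / total


def recognize(glyph, templates):
    """Return (label, score) for the template that matches the glyph best."""
    scored = ((label, match_score(glyph, t)) for label, t in templates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical 3x3 templates; a real system would hold one per character.
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

noisy_t = [[1, 1, 1], [0, 1, 0], [0, 1, 1]]   # a 'T' with one flipped pixel
label, score = recognize(noisy_t, TEMPLATES)  # closest template is 'T'
```

Because the closest template is returned even when no exact match exists, a single noisy pixel does not change the result here, while heavier noise can push the glyph towards the wrong template, which is consistent with the misrecognition attributed to noise above.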

         Fig 3 (a) Frame containing text. (b) Original frame. (c) Text extraction done using
                                OCR. (d) Text recognition by OCR

3.4 Text querying
        A text query entered by a user is processed as shown in figure 4: the query text is
extracted and recognized and then sent to the matching process, which is the next stage, as shown
in fig 3.3. In the database, each individual video has its own character set, recognized by the
OCR, and the matching process has direct access to the database, as shown in fig 3.3. The character
set associated with each video is stored in the database together with the color feature (mean and
standard deviation) extracted at the first level, during frame extraction. The process matches the
query characters against the stored character set in one direction: for example, it matches the
character ‘C’, followed by ‘R’, and so on, comparing each character from the query with the
characters from the video text dataset. It then calculates the total number of character matches
for each video and displays the names of the videos with the highest number of matches, as shown in
figure 5.
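
The one-direction character matching and ranking can be sketched as follows. The exact matching rule is not given in the paper, so this sketch assumes a single left-to-right scan in which each query character must be found after the position of the previous match. The index contents, video names and function names are hypothetical.

```python
def ordered_matches(query, recognized):
    """Count the query characters found in `recognized` during one
    left-to-right pass (each match must follow the previous one)."""
    count, pos = 0, 0
    for ch in query.upper():
        found = recognized.find(ch, pos)
        if found != -1:
            count += 1
            pos = found + 1
    return count


def rank_videos(query, index):
    """Return (names of the best-matching videos, per-video match counts)."""
    scores = {name: ordered_matches(query, text.upper())
              for name, text in index.items()}
    best = max(scores.values(), default=0)
    return sorted(n for n, s in scores.items() if s == best), scores


# Hypothetical database: video name -> text recognized by OCR in its frames.
index = {
    "match1.avi": "CRICKET WORLD CUP",
    "news1.avi": "BREAKING NEWS",
    "ad1.avi": "COLA DRINK",
}

best, scores = rank_videos("cricket", index)
print(best)     # ['match1.avi']
print(scores)   # {'match1.avi': 7, 'news1.avi': 3, 'ad1.avi': 4}
```

Counting matches rather than requiring an exact string match means a video can still be retrieved when the OCR has misread a few characters, at the cost of unrelated videos receiving non-zero scores.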

                             Fig 4. Block diagram of query processing
          (blocks: Query text recognition, Matching process, Recognized text from video,
                                           Video names)

                                        Fig 5. Result of query


        This section presents quantitative results on the performance of the text extraction
system. The performance can be measured in terms of true positives (TP), text regions identified
correctly as text regions; false positives (FP), non-text regions identified as text regions; and
false negatives (FN), text regions missed by the system. Using these basic definitions, recall and
precision of retrieval can be defined as follows:

                                      Recall = TP/(TP+FN) and
                                       Precision = TP/(TP+FP)

        While the above definitions are generic, different researchers use different units of text
for calculating recall and precision. Wong and Chen consider the number of characters, while some
of the other authors count the number of text boxes or text regions. Jain and Yu calculate recall
and precision by considering either characters or blocks, depending on the type of image. This work
adopts the second definition, in which text regions are the units for counting. The ground truth is
obtained by manually marking the correct text regions. Recall and precision were calculated on a
large number of text-rich images. For video processing, the system was tested on different types of
MPEG videos such as news clips, sports clips and commercials. The videos contain both caption text
and scene text of different fonts, colors and intensities. Table 1 shows the performance of our
proposed method on four types of video. It is seen that our method has an overall average recall of
82% and precision of 87%. The method is able to detect text under a large number of different
conditions, such as text with small fonts, low intensity, different colors and cluttered
backgrounds, text from noisy video, news captions with horizontal scrolling, and frames containing
both caption text and scene text.

                      Table 1 Recall and precision of text block extraction
                    No. of text    TP     FP    FN    Recall %    Precision %
       SPORTS       780            624    60    24    80%         92%

       Where TP = True positive, FP = False positive, FN = False negative
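
As a worked check, the two formulas can be applied to the SPORTS row of Table 1. The function names are illustrative. Note that with these counts TP/(TP+FN) does not reproduce the tabulated 80% recall, whereas TP divided by the listed number of text regions does; which denominator was used for the table is not stated in the paper, so the sketch simply shows both.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


# Counts copied from the SPORTS row of Table 1.
total_regions, tp, fp, fn = 780, 624, 60, 24

print(round(100 * precision(tp, fp), 1))     # 91.2
print(round(100 * recall(tp, fn), 1))        # 96.3
print(round(100 * tp / total_regions, 1))    # 80.0
```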

                              Table 2 Execution time for retrieval
       Videos with different    Text extraction          OCR       Retrieval    Total time
       background                                                               in sec
       Complex                  57 sec for 100 frames    20 sec    1.55 sec     1:08:55 sec
       Plain                    23.78 sec for 60         10 sec    1.20 sec     00:34:98 sec

         The primary advantage of the proposed method is that it is very fast, since most of the
computationally intensive algorithms are applied only to the regions of interest. Table 2 shows the
processing time for different types of video clips on a 1.83 GHz Intel Core 2 Duo machine. As
shown, the total time required by the algorithms, including retrieval, is 1:08:55 sec for a complex
background and 00:34:98 sec, about half a minute, for a plain one. An average is taken over a
number of different image sizes. Frames are processed at a rate of about 5.6 per second, OCR takes
about 20 sec for a complex background and 10 sec for a simple one, and retrieval takes about 1.55
sec. The algorithm therefore requires little time for processing each frame and for retrieval.


        The proposed work uses the textual content of video as the basis for a retrieval system
built on extracting text from video, recognizing the text in the extracted images, and then
matching the text stored in the database against the query text. Besides this matching, the system
performs matching based on color features, so that irrelevant videos are not retrieved. The
proposed work uses a median filter and the Sobel operator for text region localization, histograms
for text segmentation, and OCR for recognition of embedded text in sports video. The results show
significant efficiency in detection, with 80% recall and 92% precision for text regions. The time
taken for a retrieval is 1.55 sec for a complex background and 1.20 sec for a simple background.
The system can be further improved by implementing a better OCR technique, aiming at 100% accuracy
in text recognition from videos, which would significantly improve the quality of the process.


[1]    C. V. Jawahar, Balakrishna Chennupati, Balamanohar Paluri and Nataraj Jammalamadaka, 2006
       “Video Retrieval Based on Textual Queries”.
[2]    Yu Zhong, Hongjiang Zhang and Anil K. Jain, April 2000 “Automatic Caption Localization in
       Compressed Video”, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[3]    Nilesh Bhojne, Pravinkumar Kamde and Dr. S. P. Algur, 2012 “News Video Indexing and
       Retrieval using Overlay Text”.
[4]    Wei Qi, Lie Gu, Hao Jiang, Xiang-Rong Chen and Hong-Jiang Zhang, 1998 “Integrating Visual,
       Audio and Text analysis for news video”.
[5]    Shi-Yong Neo, Jin Zhao, Min-Yen Kan and Tat-Seng Chua, 1998 “Video Retrieval using High
       Level Features: Exploiting Query Matching and Confidence-based Weighting”.
[6]    Pranali Kosamkar, Vikram Wathodkar and Rajendra Shinde, April 2012 “Annotation Based Event
       Retrieval in Cricket Video”, International Journal of Advances in Computing and
       Information Researches.
[7]    Jayshree Ghorpade, Raviraj Palvankar, Ajinkya Patankar and Snehal Rathi, June 2011
       “Extracting Text From Video”, Signal & Image Processing: An International Journal (SIPIJ).

[8]    D. Xu and Shih-Fu Chang, 2007 “Visual Event Recognition in News Video using Kernel Methods
       with Multi-Level Temporal Alignment”, IEEE Conference on Computer Vision and Pattern
       Recognition.
[9]    H-K. Kim, Dec 1996 “Efficient Automatic Text Location Method and Content-Based Indexing
       and Structuring of Video Database”, Journal of Visual Communication and Image
       Representation.
[10]   H. Li, D. Doermann and O. Kia, Jan. 2000 “Automatic Text Detection and Tracking in Digital
       Video”, IEEE Transactions on Image Processing.
[11]   T. Sato, T. Kanade, E. Hughes and M. Smith, 1999 “Video OCR: Indexing Digital News
       Libraries by Recognition of Superimposed Captions”, Multimedia Systems, Vol. 7, pp. 385-394.
[12]   Vilas Naik, Prasanna Patil and Vishwanath Chikaraddi, “Action Event Retrieval from Cricket
       Video using Audio Energy Feature for Event Summarization”, International Journal of
       Computer Engineering & Technology (IJCET), Volume 4, Issue 4, 2013, pp. 267 - 274,
       ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[13]   Vilas Naik, Vishwanath Chikaraddi and Prasanna Patil, “Query Clip Genre Recognition using
       Tree Pruning Technique for Video Retrieval”, International Journal of Computer Engineering
       & Technology (IJCET), Volume 4, Issue 4, 2013, pp. 257 - 266, ISSN Print: 0976 – 6367,
       ISSN Online: 0976 – 6375.
[14]   Vilas Naik and Raghavendra Havin, “Entropy Features Trained Support Vector Machine
       Based Logo Detection Method for Replay Detection and Extraction from Sports Videos”,
       International Journal of Graphics and Multimedia (IJGM), Volume 4, Issue 1, 2013,
       pp. 20 - 30, ISSN Print: 0976 – 6448, ISSN Online: 0976 –6456.

