					   Imaging and OCR

     K.T.Anuradha
     National Centre for Science Information
     Indian Institute of Science
     Bangalore – 560 012
     (E-Mail: anu@ncsi.iisc.ernet.in)



15-20 April 2002    Imaging and OCR      PI-3   1
   Goals of This Presentation
       To give an overview of Imaging and
        Optical Character Recognition process




           What Will You Learn?
   You will get an overview of Imaging and
    OCR process
   What you need to do in the lab:
        Scan some specific documents and, using the
         OCR software installed, convert the scanned
         images to text




           Historical Perspective

   D. Shepard invented GISMO, a robot reader-writer,
    in 1951
   J. Rabinow developed a prototype machine in 1954
        able to read uppercase typewritten output at the
        “fantastic” speed of one character per minute
   IBM, Recognition Equipment, Inc., Farrington,
    Control Data, and Optical Scanning Corp, marketed
    OCR systems by 1967
   NASA used imaging systems to enhance and
    manipulate satellite images
   Several standards were developed
       Character Set for Optical Character Recognition (OCR-A).
        ANSI X3.17-81
       Character Set for Optical Character Recognition (OCR-B).
        ANSI X3.49-75
       Paper Used in Optical Character Recognition Systems.
        ANSI X3.62-87.
       Optical Character Recognition (OCR) Inks. ANSI X3.86-80.
       Optical Character Recognition (OCR) Character Position.
        ANSI X3.93-81

            Applications

   Industries and institutions in which control of
    large amounts of paperwork is critical
       Banking, Credit cards, Insurance industries
   The medical community
       To capture, store and transmit radiology images
   Libraries and archives
       For conservation and preservation of vulnerable
        documents and for the provision of access to
        source documents
         Glossary
   Glyph – the image of a character rendered in pixels.
   Raster – the scanned image created by a kinescope (a
    CRT, Cathode Ray Tube, such as that used in computer
    displays)
   Text image – the content of a text record, often the
    contents of a page of text.
   Pixel (PICture ELement), also called a pel – an
    image sample area that is almost always square. Arranged
    in a grid, pixels form a raster image. Scanning a page of a
    paper or microform document creates a digital image that
    is a raster of pixels.
           More about Pixels
   All pixels are identical in size and arrangement.
   All pixels are processed the same way.
   All pixels are scanned, displayed, and printed the
    same way.
   Each pixel has a location and a colour.
       Both are given as numbers.
       Location: row and column in the grid, much like latitude and longitude
       Colour: amounts of red, green and blue
          Maximum on all 3 is white, minimum on all 3 is black
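The idea that every pixel is just a location plus three colour numbers can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the tiny 2 by 3 raster, the 0-255 channel range and the helper names `pixel_at` and `to_gray` are all assumptions chosen for the example.

```python
# A tiny 2x3 raster: each pixel is an (R, G, B) triple, assumed to use 0-255 per channel.
WHITE = (255, 255, 255)   # maximum on all three channels
BLACK = (0, 0, 0)         # minimum on all three channels
RED = (255, 0, 0)

raster = [
    [WHITE, BLACK, RED],
    [BLACK, WHITE, WHITE],
]

def pixel_at(image, row, col):
    """Return the colour stored at the given row and column (hypothetical helper)."""
    return image[row][col]

def to_gray(pixel):
    """Average the three channels to get one gray level between 0 and 255."""
    r, g, b = pixel
    return (r + g + b) // 3

print(pixel_at(raster, 0, 2))
print(to_gray(WHITE), to_gray(BLACK), to_gray(RED))
```

Averaging the channels is only one simple way to derive a gray level; real imaging software often weights the channels differently.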


       Bit-Mapped Images
   A bit-mapped image is a raster of
    pixels.
      Printed as a raster.
      Can be created by raster scanning.
      Can be created by a RIP (Raster Image Processor) in a printer.


           How many shades
   Six main types of image shades
        1 bit black and white or bi-tonal: no shades
         between black and white
        4 bit gray scale: 16 shades of gray
        8 bit gray scale: 256 shades of gray
        8 bit colour: each pixel can be one of 256 colours
        24 bit colour: 16.7 million colours
        32 & 42 bit colour: not used much; preferred by
         photographers
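The counts in this list follow from one rule: a pixel stored in n bits can take 2 to the power n distinct values. A short Python sketch of that arithmetic (the function name `shades` is made up for this example):

```python
def shades(bits_per_pixel):
    """Number of distinct values a pixel of the given bit depth can hold."""
    return 2 ** bits_per_pixel

for bits in (1, 4, 8, 24):
    print(f"{bits} bit: {shades(bits):,} values")
```

For 24 bits this gives 16,777,216, which is where the rounded "16.7 million colours" figure comes from.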
        Resolution
   The number of dots per inch (dpi) determines the
    resolution
   The higher the dpi, the larger the file size
   A 1 bit black and white image at 100 dpi
    requires 10 Kb of storage, and a 24 bit colour
    image at 400 dpi requires 475 Kb of storage
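The relationship between resolution, bit depth and storage can be sketched as a small calculation. The slide does not say what page area its figures refer to; the sketch below assumes an uncompressed image and takes the area as an explicit parameter. For one square inch it gives 10,000 bits at 100 dpi and 1 bit, and 480,000 bytes at 400 dpi and 24 bit. These are close to the quoted numbers, so the slide's figures may be per square inch, but that reading is an assumption.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Storage for an uncompressed scan: pixel count times bits per pixel, in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel / 8

# One square inch, the two cases mentioned above
print(uncompressed_bytes(1, 1, 100, 1))    # bytes for 1 bit at 100 dpi
print(uncompressed_bytes(1, 1, 400, 24))   # bytes for 24 bit at 400 dpi

# A letter-sized page (8.5 x 11 inches, an assumed example) at 300 dpi, 1 bit
print(uncompressed_bytes(8.5, 11, 300, 1))
```

Doubling the dpi quadruples the pixel count, which is why the file size grows so quickly with resolution.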


       Image transmission and Access
   On the Net via a standard protocol such as TCP/IP
   Transferring a single archival image over a 56 Kbps
    line requires about 18 minutes; a thumbnail transfers within
    seconds. The LAN should support 10 Mbps to 100 Mbps
   A 19 inch colour monitor that supports a resolution of 1024 by
    768 is ideal.
   Printers range from desktop monochrome laser printers at 300 to
    600 dpi to the more expensive gray scale and colour
    laser printers
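The transfer times quoted for the 56 Kbps line can be checked with simple arithmetic: time is file size in bits divided by line speed in bits per second. In the sketch below the file sizes (a 7.5 MB archival image and a 25 KB thumbnail) are assumptions picked to match the orders of magnitude on these slides, and protocol overhead is ignored.

```python
def transfer_seconds(size_bytes, line_bits_per_second):
    """Ideal transfer time, ignoring protocol overhead and line congestion."""
    return size_bytes * 8 / line_bits_per_second

ARCHIVAL = 7_500_000   # assumed size of an archival image, in bytes
THUMBNAIL = 25_000     # assumed size of a thumbnail, in bytes

print(transfer_seconds(ARCHIVAL, 56_000) / 60)     # minutes over a 56 Kbps line
print(transfer_seconds(THUMBNAIL, 56_000))         # seconds over a 56 Kbps line
print(transfer_seconds(ARCHIVAL, 10_000_000))      # seconds over a 10 Mbps LAN
```

Under these assumptions the archival image takes just under 18 minutes on the dial-up line and about 6 seconds on a 10 Mbps LAN.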
           Types of images
   Thumbnail
       Lets the viewer judge whether to open the full image; requires
        about 10-35 Kb of storage space for each image
   Service
       Designed to convey information; typically
        compressed, requiring up to 300 Kb for each image
   Archival
       Uncompressed image free of the artifacts resulting from
        compression; the highest quality images require several Mb
        each
           Indexing of Images

   Images are indexed to identify and retrieve
    images
       E.g. purchase order number, policy number,
        account number, profile number, ISSN
   MARC format for bibliographic records has
    some limitations in indexing images
   Two alternatives to MARC are Dublin Core
    and EAD (Encoded Archival Description)
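Indexing images for retrieval can be sketched as follows. The two records and their values are invented for illustration; only the field names (identifier, title, date, format, subject) are taken from the Dublin Core element set. The helper `build_index` is a hypothetical name, not part of any library.

```python
# Hypothetical catalogue records using a few Dublin Core element names.
records = [
    {
        "identifier": "img-0001",
        "title": "Purchase order 4711, page 1",
        "date": "2002-04-15",
        "format": "image/tiff",
        "subject": ["purchase order", "invoice"],
    },
    {
        "identifier": "img-0002",
        "title": "Insurance policy 88-123, cover sheet",
        "date": "2002-04-16",
        "format": "image/tiff",
        "subject": ["insurance policy"],
    },
]

def build_index(records, field):
    """Map each value of one metadata field to the identifiers of matching images."""
    index = {}
    for record in records:
        values = record.get(field, [])
        if isinstance(values, str):      # single-valued fields become one-item lists
            values = [values]
        for value in values:
            index.setdefault(value.lower(), []).append(record["identifier"])
    return index

subject_index = build_index(records, "subject")
format_index = build_index(records, "format")
print(subject_index["invoice"])
print(format_index["image/tiff"])
```

The image itself is never searched; retrieval works entirely on the metadata attached to it.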
         Image formats
   Raster
        Bit mapped graphics, composed of coloured dots.
        Common formats include .tiff (tagged image file format:
         basis for all image files), .jpg (Joint Photographic Experts
         Group, for gray scale images), .gif (for colour images), .mpg
         (Moving Picture Experts Group), .bmp, .pdf
        Images are edited in paint and photo programs like Adobe
         PhotoShop and Metacreations Painter
   Vector
        Mathematically defined with coded instructions that
         define the angles and relationships between every
         line in the image.
        Common vector formats include .wmf and .cgm
        Images are edited in drawing programs like Adobe
         Illustrator and CorelDraw.
         Image formats: uses
         and advantages
   Raster
        Used for continuous tone images, e.g. photographs, and on the
         web, where there are no vector formats currently supported
        The only format that will show the smooth gradients and subtle
         detail necessary in photographic images; allows for
         colour correction much more easily than vector images
   Vector
        Used for logos with a few solid colours that need to be
         shown at a variety of sizes; creating specialized text
         effects; 3D and CAD programs
        Resolution independent; smooth curves; small file sizes
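Resolution independence can be illustrated with a rough sketch: a vector line is stored once, as two end points in unit coordinates, and can be rendered onto a pixel grid of any size, while each rendered grid is a raster fixed at that size. The function `rasterize_line` and its simple sampling scheme are assumptions made for this example, not how any particular drawing program works.

```python
def rasterize_line(x0, y0, x1, y1, size):
    """Render a line given in unit coordinates (0..1) onto a size x size grid of 0/1 pixels."""
    grid = [[0] * size for _ in range(size)]
    steps = size * 2                      # sample densely enough to touch every pixel on the line
    for i in range(steps + 1):
        t = i / steps
        col = min(size - 1, int((x0 + (x1 - x0) * t) * size))
        row = min(size - 1, int((y0 + (y1 - y0) * t) * size))
        grid[row][col] = 1
    return grid

# The same vector description rendered at two resolutions
for size in (4, 8):
    for row in rasterize_line(0, 0, 1, 1, size):
        print("".join("#" if p else "." for p in row))
    print()
```

The vector description stays four numbers long whatever the output size, whereas the raster grows with the square of the resolution.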



           Image capture interfaces

   IDE
       Widely used, low cost, poorest seek time
   SCSI
       Faster seek time, costs more, 40Mb-160Mb/sec
   USB (Universal Serial Bus)
       Ease of setup, 15Mb/sec
   IEEE 1394
       Initially developed by Apple, 3.2Gb/sec, not all PCs
        support it

           Image Drivers

   An image driver is required for an image capture
    device to communicate with software applications.
    Two standards are available
   ISIS
       Proprietary product developed by Pixel Translations
   TWAIN
       Developed and designed by the TWAIN Working Group;
        TWAIN 1.7 was adopted as the current standard in 1999


            Selecting an Imaging System
   Imaging system selection depends on the type of
    application
       Workflow or transaction processing systems: focus on
        processing documents and automating the process,
        capturing and storing images without alteration. E.g.
        purchase orders, invoices, credit card charges and
        insurance policies
       Storage and retrieval systems: store and retrieve large
        numbers of documents in a variety of types and formats,
        capturing and enhancing them to facilitate readability. E.g.
        the medical and library communities
         Types of Imaging System

   Drum Scanners: High-end scanners
      Use photomultipliers
      Expensive and sensitive devices
   Flatbed Scanners
      Ideal for odd-sized images
   Sheetfed Scanners
      Can scan only loose sheets
      Compact in size and easy to install
   Handheld scanners
      Provide portability and functionality at low cost

    What, Why and When of OCR
   Allows one to scan printed, typewritten or
    handwritten text (numerals, letters or symbols)
    and/or convert the scanned image to a
    computer-processable format, either
    plain text, a Word document or
    an Excel spreadsheet, which can be edited,
    used or reused in other documents
   It uses raster images
   OCR is used when recreating a document in
    electronic form by hand would take more time
   The converted text files take less space than
    the original image file and can be indexed
   Bridges the gap between the paperless and
    the papered

            How of OCR

   It has three components:
       Image scanner, OCR hardware/software, Output
        interface





   A scanner has 4 components:
       a detector, an illumination source, a scan lens
        and a document transport
   OCR hardware/software performs three
    operational steps:
       Document analysis, Character recognition,
        Contextual processing
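The three operational steps can be sketched as a toy pipeline in Python. Everything here is an illustrative assumption rather than how a real OCR product works: glyphs are 3 by 5 bitmaps, document analysis is a fixed-width split, character recognition is nearest-template matching by Hamming distance, and contextual processing is a dictionary lookup with the standard library's `difflib.get_close_matches`. The templates and the three-word lexicon are made up for the example.

```python
import difflib

# Hypothetical 3x5 glyph templates ('#' = ink, '.' = paper)
TEMPLATES = {
    "A": (".#.", "#.#", "###", "#.#", "#.#"),
    "C": (".##", "#..", "#..", "#..", ".##"),
    "O": (".#.", "#.#", "#.#", "#.#", ".#."),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}
LEXICON = ["CAT", "DOG", "TOY"]   # made-up dictionary for contextual processing

def segment(page, width=3, gap=1):
    """Document analysis: cut a one-line page image into fixed-width glyph cells."""
    count = (len(page[0]) + gap) // (width + gap)
    return [
        tuple(row[k * (width + gap): k * (width + gap) + width] for row in page)
        for k in range(count)
    ]

def hamming(a, b):
    """Number of pixels that differ between two glyphs of equal size."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def recognise(glyph):
    """Character recognition: pick the template with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda ch: hamming(glyph, TEMPLATES[ch]))

def contextual(word, lexicon):
    """Contextual processing: replace the word by its closest lexicon entry, if any."""
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=0.6)
    return matches[0] if matches else word

# Simulate a scan of "CAT" in which the A lost ink and now looks more like an O
noisy_a = (".#.", "#.#", "#.#", "#.#", "...")
glyphs = [TEMPLATES["C"], noisy_a, TEMPLATES["T"]]
page = [".".join(g[r] for g in glyphs) for r in range(5)]

raw = "".join(recognise(g) for g in segment(page))
print(raw, "->", contextual(raw, LEXICON))   # prints: COT -> CAT
```

The damaged glyph is misread at the recognition step, and the dictionary step repairs the word. This is the role contextual processing plays in the pipeline, although production systems use far richer language models.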



   Output Interface
       Allows character recognition results to be
        electronically transferred into the domain that
        uses the results




               Types of OCRs
   Two types of OCRs
        Task specific readers
        General purpose readers
   Task specific readers
        Read only specific documents: bank cheques, mail
         addresses
        Used primarily for high-volume applications which
         require high system throughput: assigning ZIP Codes to
         letter mail, reading data entered in forms, e.g. tax
         forms, and automatic accounting procedures used in
         processing utility bills

   General purpose page readers
        High end OCR (usually for offices)
               Speed and Accuracy are important
               Format preservation
               Good proof reading solutions
        Low end OCR (usually for home use)
               Speed is not required
               Proof reading is done manually

      Factors affecting OCR quality

   Scanner quality
   Scan resolution
   Type of printed documents, whether laser printer
    outputs or photocopied
   Paper quality
   Fonts used in the text
   Linguistic complexities
   Dictionary used

            Evaluating OCRs

   Neat interface
   Easy-to-use wizards
   Accurate recognition
   Scan resolution setting (600 dpi is advisable)
   Time taken from scanning to deliver the final
    product
   Enhanced usability of the product
   Ability to modify the scan setting
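"Accurate recognition" can be made measurable by comparing OCR output with a hand-checked transcription. One common way, sketched below, is character accuracy based on edit distance (the number of single-character insertions, deletions and substitutions needed to turn one string into the other). The slides do not prescribe a metric; this one is an assumption, and the sample strings are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def character_accuracy(truth, ocr_text):
    """Share of ground-truth characters recovered, from 0.0 to 1.0."""
    if not truth:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1 - edit_distance(truth, ocr_text) / len(truth))

truth = "Imaging and OCR"
ocr_text = "lmaging and 0CR"    # typical confusions: I/l and O/0
print(edit_distance(truth, ocr_text))
print(round(character_accuracy(truth, ocr_text), 3))
```

Running the same test pages through each candidate package and comparing the scores gives a like-for-like basis for the evaluation.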
        Summarizing

   We learnt the basics of imaging systems and images
   The different steps involved in scanning and the OCR
    technique
   Conversion of raster images to text using OCR
    techniques
   Types of imaging systems and OCR software
   Evaluation of imaging systems and OCR software




                           Questions?
                          Comments?
                          Discussions?
                   (Please fill in the feedback form)
                          Thank You!


