Optical character recognition

From Wikipedia, the free encyclopedia

Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text. OCR is a field of research in pattern recognition, artificial intelligence and machine vision. Though academic research in the field continues, the focus on OCR has shifted to implementation of proven techniques. Optical character recognition (using optical techniques such as mirrors and lenses) and digital character recognition (using scanners and computer algorithms) were originally considered separate fields. Because very few applications survive that use true optical techniques, the OCR term has now been broadened to include digital image processing as well. Early systems required training (the provision of known samples of each character) to read a specific font. "Intelligent" systems with a high degree of recognition accuracy for most fonts are now common. Some systems are even capable of reproducing formatted output that closely approximates the original scanned page including images, columns and other non-textual components.

History
In 1929 Gustav Tauschek obtained a patent on OCR in Germany, followed by Handel, who obtained a US patent on OCR in 1933 (U.S. Patent 1,915,993). In 1935 Tauschek was also granted a US patent on his method (U.S. Patent 2,026,329). Tauschek’s machine was a mechanical device that used templates. A photodetector was placed so that when the template and the character to be recognized were lined up for an exact match and a light was directed towards them, no light would reach the photodetector.

In 1950, David H. Shepard, a cryptanalyst at the Armed Forces Security Agency in the United States, was asked by Frank Rowlett, who had broken the Japanese PURPLE diplomatic code, to work with Dr. Louis Tordella to recommend data automation procedures for the Agency. This included the problem of converting printed messages into machine language for computer processing. Shepard decided it must be possible to build a machine to do this and, with the help of his friend Harvey Cook, built "Gismo" in his attic during evenings and weekends. This was reported in the Washington Daily News on 27 April 1951 and in the New York Times on 26 December 1953, after his U.S. Patent 2,663,758 was issued. Shepard then founded Intelligent Machines Research Corporation (IMR), which went on to deliver the world’s first several OCR systems used in commercial operation. Both Gismo and the later IMR systems used image analysis, as opposed to character matching, and could accept some font variation. Gismo, however, was limited to reasonably close vertical registration, whereas the commercial IMR scanners that followed analyzed characters anywhere in the scanned field, a practical necessity on real-world documents.

The first commercial system was installed at Reader’s Digest in 1955; many years later, Reader’s Digest donated it to the Smithsonian, where it was put on display. The second system was sold to the Standard Oil Company of California for reading credit card imprints for billing purposes, and many more systems were sold to other oil companies. Other systems sold by IMR during the late 1950s included a bill stub reader for the Ohio Bell Telephone Company and a page scanner for the United States Air Force, used for reading typewritten messages and transmitting them by teletype. IBM and others were later licensed on Shepard’s OCR patents.

In about 1965 Reader’s Digest and RCA collaborated to build an OCR document reader designed to digitize the serial numbers on Reader’s Digest coupons returned from advertisements.
The documents were printed by an RCA drum printer using the OCR-A font. The reader was connected directly to an RCA 301 computer (one of the first solid state computers). This reader was followed by a specialized document reader installed at TWA, where it processed airline ticket stock (a task made more difficult by the carbonized backing on the ticket stock). The readers processed documents at a rate of 1500 per minute and checked each one, rejecting those they were not able to process correctly. The product became part of the RCA product line as a reader designed to process "Turn around Documents" such as utility and insurance bills returned with payments.

The United States Postal Service has been using OCR machines to sort mail since 1965, based on technology devised primarily by the prolific inventor Jacob Rabinow. The first use of OCR in Europe was by the British General Post Office (GPO). In 1965 it began planning an entire banking system, the National Giro, using OCR technology, a process that revolutionized bill payment systems in the UK. Canada Post has been using OCR systems since 1971. OCR systems read the name and address of the addressee at the first mechanized sorting center and print a routing bar code on the envelope based on the postal code. After that, the letters need only be sorted at later centers by less expensive sorters that read only the bar code. To avoid interference with the human-readable address field, which can be located anywhere on the letter, special ink is used that is clearly visible under ultraviolet light. This ink looks orange in normal lighting conditions. Envelopes marked with the machine-readable bar code may then be processed.

In 1974, Ray Kurzweil started the company Kurzweil Computer Products, Inc. and led development of the first omni-font optical character recognition system, a computer program capable of recognizing text printed in any normal font.
He decided that the best application of this technology would be to create a reading machine for the blind, which would allow blind people to understand written text by having a computer read it to them out loud. However, this device required the invention of two enabling technologies: the CCD flatbed scanner and the text-to-speech synthesizer. On January 13, 1976, the finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind. Called the Kurzweil Reading Machine, the device covered an entire tabletop, but functioned exactly as intended. On the day of the machine’s unveiling, Walter Cronkite used the machine to give his signature sign-off, "And that’s the way it was, January 13, 1976." While listening to The Today Show, musician Stevie Wonder heard a demonstration of the device and personally purchased the first production version of the Kurzweil Reading Machine.

In 1978 Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload paper legal and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which had an interest in further commercializing paper-to-computer text conversion. Kurzweil Computer Products thus became a subsidiary of Xerox known as Scansoft (now Nuance).

Current state of OCR technology
The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem. Typical accuracy rates exceed 99%, although certain applications demanding even higher accuracy require human review for errors. Other areas, including recognition of hand printing, cursive handwriting, and printed text in other scripts (especially those with a very large number of characters), are still the subject of active research.

Accuracy rates can be measured in several ways, and how they are measured can greatly affect the reported accuracy rate. For example, without the use of word context (basically a dictionary of words) to correct "spelling" errors, an error rate of 1% (or 99% accuracy) measured letter-by-letter may result in an error rate of 5% or more (or 95% accuracy or less) if the measurement is based instead on whether each whole word was recognized with no incorrect letters.[1]

Optical character recognition (OCR) is sometimes confused with on-line character recognition[2] (see Handwriting recognition). OCR is an instance of off-line character recognition, where the system recognizes the fixed static shape of the character, while on-line character recognition instead recognizes the dynamic motion during handwriting. For example, on-line recognition, such as that used for gestures in the Penpoint OS or the Tablet PC, can tell whether a horizontal mark was drawn right-to-left or left-to-right. On-line character recognition is also referred to by other terms such as dynamic character recognition, real-time character recognition, and Intelligent Character Recognition or ICR.

On-line systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years (see Tablet PC history). Among these are the input devices for personal digital assistants such as those running Palm OS. The Apple Newton pioneered this kind of product. The algorithms used in these devices take advantage of the fact that the order, speed, and direction of individual line segments at input are known. Also, the user can be retrained to use only specific letter shapes. These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem. Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.

Recognition of cursive text is an active area of research, with recognition rates even lower than those for hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the amount line of a cheque (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. Knowledge of the grammar of the language being scanned can also help determine whether a word is likely to be a verb or a noun, for example, allowing greater accuracy.
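The cheque example above can be illustrated with a small sketch: when the set of admissible words is tiny, even a noisy character-level reading can usually be snapped to the right word. The code below is only an illustration under stated assumptions, not a description of how any particular OCR product works. The word list, the `correct_word` and `correct_line` helpers, and the 0.6 similarity cutoff are all hypothetical, and the matching relies on the `difflib` module from the Python standard library rather than on a real handwriting recognizer.

```python
import difflib

# Hypothetical small lexicon for the written-out amount line of a cheque.
AMOUNT_WORDS = [
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty",
    "fifty", "sixty", "seventy", "eighty", "ninety", "hundred", "thousand",
    "and", "dollars",
]


def correct_word(raw, lexicon=AMOUNT_WORDS, cutoff=0.6):
    """Return the lexicon word most similar to the raw reading, or None.

    The cutoff (an assumed value) is the minimum similarity ratio that
    difflib must report before a candidate is accepted.
    """
    matches = difflib.get_close_matches(raw.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def correct_line(raw_line):
    """Correct each whitespace-separated token; keep unmatched tokens as read."""
    return " ".join(correct_word(tok) or tok for tok in raw_line.split())


if __name__ == "__main__":
    # Simulated recognizer output with single-character misreadings.
    print(correct_line("f0rty hundrcd"))
```

Restricting the candidates to a few dozen words is what makes this work: the same misreadings checked against a general-purpose dictionary would have many more near neighbours to be confused with.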
The shapes of individual cursive characters themselves simply do not contain enough information to recognize all handwritten cursive script accurately (at rates greater than 98%).

OCR is a basic technology that is also used in advanced scanning applications. An advanced scanning solution can therefore be unique, patented, and not easily copied, even though it is based on this basic OCR technology.
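The earlier remark that a 1% character error rate can become a word error rate of 5% or more follows from simple arithmetic. The sketch below assumes, purely for illustration, that character errors are independent of one another and that a word counts as correct only if every one of its letters is correct. The function name and the example word lengths are hypothetical, and real error patterns are rarely this uniform.

```python
def word_error_rate(char_error_rate, word_length):
    """Probability that at least one of word_length characters is wrong,
    assuming each character is misread independently with the same rate."""
    return 1.0 - (1.0 - char_error_rate) ** word_length


if __name__ == "__main__":
    # 1% character error rate (99% character accuracy), several word lengths.
    for n in (5, 6, 8):
        print(n, round(word_error_rate(0.01, n), 4))
```

Under these assumptions a five-letter word is misread about 4.9% of the time and a six-letter word about 5.9% of the time, which is consistent with the "5% or more" figure quoted for whole-word measurement.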

For more complex recognition problems, intelligent character recognition systems are generally used, as artificial neural networks can be made indifferent to both affine and non-linear transformations.[3]

See also
• Automatic number plate recognition
• CAPTCHA
• Computational linguistics
• Computer vision
• Machine learning
• Music OCR
• OCR SDK
• Optical mark recognition
• Raster to vector
• Raymond Kurzweil
• Speech recognition
• Book scanning
• Institutional Repository
• Digital Library

References
[1] Suen, C.Y., et al. (1987-05-29), Future Challenges in Handwriting and Computer Applications, 3rd International Symposium on Handwriting and Computer Applications, Montreal, May 29, 1987, http://users.erols.com/rwservices/pens/biblio88.html#Suen88, retrieved on 2008-10-03
[2] Tappert, Charles C., et al. (1990-08), The State of the Art in On-line Handwriting Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 12 No 8, August 1990, pp 787-ff, http://users.erols.com/rwservices/pens/biblio90.html#Tappert90c, retrieved on 2008-10-03
[3] LeNet-5, Convolutional Neural Networks

External links
• ICDAR’07, ICDAR’09, a comprehensive conference on all aspects of document recognition
• Unicode OCR - Hex Range: 2440-245F Optical Character Recognition in Unicode
• Linux OCR: A review of free optical character recognition software
• 17 Things: Explanation of basic handwriting recognition principles and history

OCR software
Name | License | Operating systems | Notes
ExperVision TypeReader & OpenRTK | Commercial | Windows, Mac OS X, Unix, Linux, OS/2 | ExperVision Inc. was founded in 1987; its OCR technology and product won the highest marks in the independent testing performed by UNLV for the consecutive years that ExperVision participated. "ExperVision’s OCR has one big advantage: speed. This corporate-level OCR application processes faster than any product of its type we’ve ever tested: It converted a scanned image of a 700-page book into an editable Word file in a startling 6 minutes!" said Gary Berline, PC Magazine, 08.12.08
ABBYY FineReader OCR | Commercial | Windows | For working with localized interfaces, corresponding language support is required.
OmniPage | Commercial | Windows, Mac OS | Product of Nuance Communications (Nuance EULA)
Readiris | Commercial | Windows, Mac OS | Product of I.R.I.S. Group of Belgium. Asian and Middle Eastern editions.
SmartZone (formerly known as Zonal OCR) | Commercial | Windows | SmartZone is the process by which Optical Character Recognition (OCR) applications "read" specifically zoned text from a scanned image.
Computhink’s ViewWise | Commercial | Windows | Document Management system
CuneiForm | BSD variant | Windows, Linux, BSD, MacOSX | Enterprise-class system, multi language, can save text formatting and recognizes complicated tables of any structure
GOCR | GPL | Many (open source) | Early development
Microsoft Office Document Imaging | Commercial | Windows, Mac OS X |
Microsoft Office OneNote 2007 | Commercial | Windows |
NovoDynamics VERUS | Commercial? | ? | Specializes in languages of the Middle East
Ocrad | GPL | Unix-like, OS/2 |
Brainware | Commercial | Windows | Template-free data extraction and processing of data from documents into any backend system; sample document types include invoices, remittance statements, bills of lading and POs
HOCR | GPL | Linux | Hebrew OCR
OCRopus | Apache | Linux | Pluggable framework which can use Tesseract
ReadSoft | Commercial | Windows | Scan, capture and classify business documents such as forms, invoices and POs.
Alt-N Technologies’ RelayFax Network Fax Manager | Commercial | Windows | Multi-language OCR Plug-in is used to convert faxed pages into editable document formats (doc, pdf, etc.) in many different languages. For working with localized interfaces, corresponding language support is required.
Scantron Cognition | Commercial | Windows |
SimpleOCR | Freeware and commercial versions | Windows |
SmartScore | Commercial | Windows, Mac OS | For musical scores
Tesseract | Apache | Windows, Mac OS X, Linux, OS/2 | Under development by Google

OCR software language support
Name | Latest version | Release year | Recognition languages | Dictionaries
ExperVision TypeReader & OpenRTK | 7.0 | 2007 | English, French, German, Italian, Spanish, Portuguese, Danish, Dutch, Swedish, Norwegian, Hungarian, Polish, Simplified Chinese, Traditional Chinese, Russian, Finnish and Polynesian |
ABBYY FineReader OCR | 9.0 | 2007 | Abkhaz, Adyghian, Afrikaans, Agul, Albanian, Altai, Armenian (Eastern, Western, Grabar), Avar, Aymara, Azerbaijani (Cyrillic), Azerbaijani (Latin), Bashkir, Basic, Basque, Belarusian, Bemba, Blackfoot, Breton, Bugotu, Bulgarian, Buryat, C/C++, COBOL, Catalan, Cebuano, Chamorro, Chechen, Chinese Simplified, Chinese Traditional, Chukchee, Chuvash, Corsican, Crimean Tatar, Croatian, Crow, Czech, Dakota, Danish, Dargwa, Dungan, Dutch (Netherlands and Belgium), English, Eskimo (Cyrillic), Eskimo (Latin), Esperanto, Estonian, Even, Evenki, Faroese, Fijian, Finnish, Fortran, French, Frisian, Friulian, Gagauz, Galician, Ganda, German (Luxemburg), German (new and old spelling), Greek, Guarani, Hani, Hausa, Hawaiian, Hebrew, Hungarian, Icelandic, Ido, Indonesian, Ingush, Interlingua, Irish, Italian, JAVA, Japanese, Jingpo, Kabardian, Kalmyk, Karachay-balkar, Karakalpak, Kasub, Kawa, Kazakh, Khakass, Khanty, Kikuyu, Kirghiz, Kongo, Koryak, Kpelle, Kumyk, Kurdish, Lak, Latin, Latvian, Lezgi, Lithuanian, Luba, Macedonian, Malagasy, Malay, Malinke, Maltese, Mansy, Maori, Mari, Maya, Miao, Minangkabau, Mohawk, Moldavian, Mongol, Mordvin, Nahuatl, Nenets, Nivkh, Nogay, Norwegian (nynorsk and bokmal), Nyanja, Occidental, Ojibway, Ossetian, Papiamento, Pascal, Polish, Portuguese (Portugal and Brazil), Provencal, Quechua, Rhaeto-romanic, Romanian, Romany, Rundi, Russian, Russian (old spelling), Rwanda, Sami (Lappish), Samoan, Scottish Gaelic, Selkup, Serbian (Cyrillic), Serbian (Latin), Shona, Simple chemical formulas, Slovak, Slovenian, Somali, Sorbian, Sotho, Spanish, Sunda, Swahili, Swazi, Swedish, Tabasaran, Tagalog, Tahitian, Tajik, Tatar, Thai, Tok Pisin, Tongan, Tswana, Tun, Turkish, Turkmen, Tuvinian, Udmurt, Uighur (Cyrillic), Uighur (Latin), Ukrainian, Uzbek (Cyrillic), Uzbek (Latin), Welsh, Wolof, Xhosa, Yakut, Zapotec, Zulu | Armenian (Eastern, Western, Grabar), Bashkir, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch (Netherlands and Belgium), English, Estonian, Finnish, French, German (new and old spelling), Greek, Hebrew, Hungarian, Italian, Latvian, Lithuanian, Norwegian (nynorsk and bokmal), Polish, Portuguese (Portugal and Brazil), Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Tatar, Thai, Turkish, Ukrainian
OmniPage | 16 | 2007 | Afrikaans, Albanian, Aymara, Basque, Bemba, Blackfoot, Breton, Bugotu, Bulgarian, Byelorussian, Catalan, Chamorro, Chechen, Corsican, Croatian, Crow, Czech, Danish, Dutch, English, Esperanto, Estonian, Faroese, Fijian, Finnish, French, Frisian, Friulian, Gaelic (Irish), Gaelic (Scottish), Galician, Ganda/Luganda, German, Greek, Guarani, Hani, Hawaiian, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Italian, Inuit, Kabardian, Kasub, Kawa, Kikuyu, Kongo, Kpelle, Kurdish, Latin, Latvian, Lithuanian, Luba, Luxembourgian, Macedonian, Malagasy, Malay, Malinke, Maltese, Maori, Mayan, Miao, Minankabaw, Mohawk, Moldavian, Nahuatl, Norwegian, Nyanja, Occidental, Ojibway, Papiamento, Pidgin English, Polish, Portuguese (Brazilian), Portuguese, Provencal, Quechua, Rhaetic, Romanian, Romany, Ruanda, Rundi, Russian, Sami Lule, Sami Northern, Sami Southern, Sami, Samoan, Sardinian, Serbian (Cyrillic), Serbian (Latin), Shona, Sioux, Slovak, Slovenian, Somali, Sorbian, Sotho, Spanish, Sundanese, Swahili, Swazi, Swedish, Tagalog, Tahitian, Tinpo, Tongan, Tswana, Tun, Turkish, Ukrainian, Visayan, Welsh, Wolof, Xhosa, Zapotec, Zulu |
Readiris | 12 Pro & Corporate | 2009 | American English, British English, Afrikaans, Albanian, Aymara, Balinese, Basque, Bemba, Bikol, Bislama, Brazilian, Breton, Bulgarian, Byelorussian, Catalan, Cebuano, Chamorro, Corsican, Croatian, Czech, Danish, Dutch, Esperanto, Estonian, Faroese, Fijian, Finnish, French, Frisian, Friulian, Galician, Ganda, German, Greek, Greenlandic, Haitian (Creole), Hani, Hiligaynon, Hungarian, Icelandic, Ido, Ilocano, Indonesian, Interlingua, Irish (Gaelic), Italian, Javanese, Kapampangan, Kicongo, Kinyarwanda, Kurdish, Latin, Latvian, Lithuanian, Luxemburgh, Macedonian, Madurese, Malagasy, Malay, Maltese, Manx (Gaelic), Maori, Mayan, Minangkabau, Nahuatl, Norwegian, Numeric, Nyanja, Nynorsk, Occitan, Pidgin English, Polish, Portuguese, Quechua, Rhaeto-Roman, Romanian, Rundi, Russian, Samoan, Sardinian, Scottish (Gaelic), Serbian, Serbian (Latin), Shona, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tahitian, Tok Pisin, Tonga, Tswana, Turkish, Ukrainian, Waray, Wolof, Xhosa, Zapotec, Zulu, Bulgarian - English, Byelorussian - English, Greek - English, Macedonian - English, Russian - English, Serbian - English, Ukrainian - English + Moldovan, Bosnian (Cyrillic and Latin), Tetum, Swiss-German and Kazak |
Readiris | 12 Pro & Corporate Middle-East | 2009 | Arabic, Farsi and Hebrew |
Readiris | 12 Pro & Corporate Asian | 2009 | Simplified Chinese, Traditional Chinese, Japanese and Korean |
SmartZone | v2 | 2008 | English, Danish, Dutch, Finnish, French, German, Italian, Norwegian, Portuguese, Spanish, and Swedish |
Computhink’s ViewWise | 6.1 | 2008 | |
CuneiForm | 12 | 2007 | English, German, French, Spanish, Italian, Portuguese, Dutch, Russian, Mixed Russian-English, Ukrainian, Danish, Swedish, Finnish, Serbian, Croatian, Polish and others |
GOCR | 0.47 | 2009 | |
Microsoft Office Document Imaging | Office 2007 | 2007 | Language availability is tied to the installed proofing tools. For languages not included in your version of MS Office you’d need the corresponding Proofing Tools kit (separate purchase). |
Microsoft Office OneNote 2007 | | | |
NovoDynamics VERUS Professional | Middle East | 2005 | Arabic, Persian (Farsi, Dari), Pashto, Urdu, including embedded English and French. It also recognizes the Hebrew language, including embedded English. |
NovoDynamics VERUS Professional | Asia | 2009 | Simplified and Traditional Chinese, Korean and Russian languages, including embedded English |
Ocrad | | | |
Brainware | | | |
HOCR | 0.10.13 | 2008 | Hebrew |
OCRopus | 0.3.1 | 2008 | All the languages and scripts that Tesseract supports through the Tesseract plugin, and it supports Latin script and English for its native recognizers |
ReadSoft | | | |
Alt-N Technologies’ RelayFax Network Fax Manager | | | |
Scantron Cognition | | | |
SimpleOCR | 3.5 | 2008 | English and French |
SmartScore | | | |
Tesseract | 2.03 | 2008 | Can recognize 6 languages, is fully UTF8 capable, and is fully trainable |

Retrieved from "http://en.wikipedia.org/wiki/Optical_character_recognition". This page was last modified on 20 May 2009, at 02:34 (UTC). All text is available under the terms of the GNU Free Documentation License.
