Proceedings of the Annual Meeting of the American Statistical Association, August 5-9, 2001

    A Preliminary Look at the Effects of Optical Character Recognition (OCR) and Keying on the Quality of
                               Industry and Occupation Coding in Census 2000

                     Thomas Scopp, Kevin Haley, and Donald Dalzell, U.S. Census Bureau1
                   Thomas Scopp, HHES Division, U.S. Census Bureau, Washington, DC 20233

1 This paper reports the results of research and analysis undertaken by Census Bureau staff. It has undergone a more limited review than official Census Bureau publications. This report is released to inform interested parties of research and to encourage discussion. A more detailed version of this report is available on request from the authors.

Keywords: Data capture, industry, occupation, optical character recognition (OCR).

Abstract: In Census 2000 the Census Bureau used optical character recognition (OCR) technology for the first time to capture information written onto census questionnaires. Lockheed Martin Corporation developed the OCR software under contract for Census 2000. The write-in responses on the census forms included information about people’s jobs, that is, their industries and occupations.

This paper examines a sample of write-in responses that were first captured by the OCR. Responses that did not match certain pre-specified standards were then recaptured by data keying. The responses captured before and after keying were assigned industry and occupation (I&O) codes. The I&O codes were assigned each time by an automated coding system (called the “autocoder”) and by clerical staff.

This paper presents results showing the effects of OCR alone and OCR combined with keying on the production rates and quality of the I&O codes assigned by both the autocoder and the clerical coders.

History and Background

The Census Bureau has a long tradition of using innovative technologies for processing the data it collects in decennial censuses, particularly for “capturing” the handwritten information from paper questionnaires. This tradition goes back over a century. For the 1890 census, Dr. Herman Hollerith, then an employee of the Census Bureau, developed the first form of mechanized data processing when he invented punch cards and machines for reading the information contained on them. For the 1950 census, the Census Bureau purchased the Univac I, the very first electronic computer used for non-military mass data processing. For the 1960 census, the Census Bureau invented the first optical data capture methodology, called the Film Optical Sensing Device for Input to Computers (FOSDIC). In today’s terminology this technology would be called Optical Mark Recognition (OMR). It looked for marks on microfilm images of the paper questionnaires, and converted these marks into input that the computer could understand.

For Census 2000, the Census Bureau continued this tradition of innovation by contracting with the Lockheed Martin Corporation to develop a process that would not only read marks on the questionnaires, but also read the actual words that respondents provided in the census. This process, usually known as Optical Character Recognition (OCR), was supplemented by OMR software to interpret the data provided in check-boxes, and by the more traditional data keying from images (KFI) to capture the information that could not be successfully captured by either OMR or OCR. The success and ultimate contribution of the OCR and KFI data capture methodologies to the processing of industry and occupation (I&O) data from Census 2000 is the subject of this paper.

Description of the Industry and Occupation Responses in Census 2000

Census 2000 used two different questionnaire formats to collect information. These two questionnaires were simply called the “short forms” and the “long forms.” The short forms contained only the six basic questions needed to count the population as required by the Constitution of the United States. These questions, called the “100% data” because they were asked of everyone, included each person’s name, relationship, sex, age/date of birth, ethnicity (Hispanic or not), and race. The long forms, which approximately 17 percent of the households received nationwide, contained all these questions, plus many more on a variety of subjects. Among these “sample subjects” were questions about people’s jobs, or more specifically, about the kind of industry or business they worked for, and the kind of work they performed (their occupation).

Some of the questions on both the short and long forms, including those for industry and occupation (I&O), required the respondents to write their answers in detail, rather than simply to mark a box. Respondents were supposed to place these “write-in entries” within a
confined area shown as a set of segmented boxes on the questionnaire.

Four of the questions in the I&O series required write-in answers needing OCR interpretation. For overall design considerations, each of these questions provided three rows of boxes for answers, with blank space separating each row. For questions to be read by OCR, respondents were supposed to enter their answers so that only one character appeared in each segment. Unfortunately, when some respondents saw the three unconnected rows instead of one continuous section, they may have believed that three separate answers were expected. This belief could have caused them to try to squeeze their one response into one row or to write outside the rows. Responses frequently ran across the segments within the answer boxes, and sometimes went outside the boxes altogether. These forays outside the lines, combined with the many different ways that people write and print letters and numbers, provided many challenges to the system designed for capturing and interpreting the answers from the paper and getting that information into the computers. This lack of consistency in the way answers were provided on the questionnaires should not be surprising, because no directions were provided in the census packages mailed to the respondents.

The ultimate goal of the I&O information processing is to organize the multiple ways that people describe their jobs into a specified and finite set of industry and occupation categories. These categories are then tabulated for data files and publications. The process of assigning the open-ended responses to categories is called I&O coding. In Census 2000 the Census Bureau used two methods for coding the I&O entries. The first was an automated coding system that interpreted the I&O responses based on a programmed set of rules and previously coded samples, and assigned codes without human intervention. Responses that the automated software (the “autocoder”) could not successfully code went to a staff of clerks who assigned the remaining, or “residual,” I&O codes.

Whether assigned by a computer or by human beings, the quality of all these I&O codes was directly dependent on the quality of the data going into the coding process. In other words, if the information coming out of the data capture process was so garbled and distorted that neither the automated nor the clerical coders could accurately interpret into which category a response should go, then the quality of the data output for tabulation purposes would suffer. Put still more simply, “Garbage in, garbage out.”

Description of Census 2000 Data Capture

The data capture system designed by Lockheed Martin for Census 2000 (DCS2000) included six major components:

   1. Checking in all the forms
   2. Preparing the documents for scanning
   3. Scanning the forms
   4. Passing the scanned images through OMR to convert the marked boxes into a computer-readable format
   5. Passing the scanned images through OCR to convert the write-ins into a computer-readable format
   6. Keying (KFI) any data that the OCR software could not successfully interpret.

Data capture of census forms took place in four Data Capture Centers (DCCs). Preliminary testing of the system showed that keying was taking approximately twice as long as originally projected. This fact caused concern that trying to capture all the census information in one sweep, including the sample write-ins, might delay the determination of the 100% census counts. Therefore, the Census Bureau decided to process the images of the questionnaires in two passes. The first pass operations (Pass 1) included the check-in of all forms, scanning to create images of all forms, and KFI of the 100% data when needed. The second pass operation (Pass 2) included a rerun of the long form images originally scanned during Pass 1 through the OMR/OCR interpretive processes, and KFI for the sample data. Only the sample write-ins were eligible to go through KFI if needed.

During the check-in phase, the different form types were sorted, barcodes on the envelopes that identified the type of form were read, the envelopes were opened, and files were created that contained the results of the check-in. In the document preparation phase, the questionnaires were removed from the envelopes and checked for damage, staples were removed from the long forms, and all the forms were placed in short or long form batches by type. There was no clerical review of the questionnaire entries, nor any grooming of the entries to correct their format or position on the documents.

In the next phase, the questionnaire sheets were fed through cameras. This scanning process created a digital image of each side of each sheet. These digital images were the input to all the remaining steps in the process that interpreted (or attempted to interpret) the information provided in the census. The paper
questionnaires were not touched again until the final checkout after all data capture was done.

The first data interpretation step was the OMR process. This software read the check box marks. Three I&O items contained check boxes. The second data interpretation step was the OCR process. This software attempted to read the handwritten entries, one character at a time, in the fields that accepted open-ended answers. Early examination of the output from this step indicated that OCR was having greater difficulty than expected interpreting alphanumeric fields like the I&O responses. For example, the software could not distinguish well between the upper-case letter “I,” the lower-case letter “l,” and the number “1.” Sometimes the software could use the context of the entry to make distinctions between characters, but not always. For each character seen, the OCR software calculated a “confidence level” which was determined by the original entry’s clarity, position, and context. But these confidence scores did not seem to be consistent with the observed lack of quality of the output needed to assign I&O codes. In other words, too much “garbage” seemed to come through OCR as “acceptable.”

To help OCR better determine the acceptability of its output, the Census Bureau developed dictionaries of words commonly found in the world of work. These dictionaries included the words that appeared most frequently in samples of keyed I&O responses. The intent of this exercise was to provide a means for OCR to compare each word in its output to the dictionaries, to be sure that the output consisted of real (readable) language. Even with the help of these dictionaries, more than half of the OCR-interpreted I&O responses were considered to be of low confidence, and were sent to KFI.

When write-in entries were sent to KFI, the keyers did not see the entire questionnaire, but only “snippets” of the entries they were to key. Using their best judgement, the keyers captured all the characters as they understood them from the image. Some keying rules were provided to assist the keyers in interpreting the information.

Data for This Study and Results

As described above, the data capture steps were completed in two passes. For the I&O data, Pass 1 produced the OMR output and the preliminary OCR output with no keying of unacceptable OCR entries. Pass 2 produced the final output of acceptable OCR interpretations and KFI entries of unacceptable OCR reads. For the study described in this paper, we took a sample of the I&O responses as interpreted by OCR in Pass 1, that is, data captured only by the OCR technology without any human interpretation, and coded all the responses using both the autocoder and clerical coders. We then compared the autocoder and clerical results based on the Pass 1 (OCR) data to autocoder and clerical results based on a sample of cases from Pass 2, which included the effect of KFI on OCR-rejected entries.

The sample of Pass 1 data consisted of 923,611 long-form records.2 Because it was chosen systematically across the country, we assumed the sample included a proportional representation of all industries and occupations. The sample of Pass 2 data used for comparison was called the “Validation Sample.” It consisted of approximately 200 cases for each I&O code, and was used originally to validate the accuracy of the autocoder for the entire census. The Validation Sample consisted of 152,936 cases.

2 Throughout this paper the words “record,” “case,” and “person” are used interchangeably. Each record/case/person needs two codes (industry and occupation).

To measure the accuracy of the codes assigned by the autocoder, they were compared to a standard we considered the “truth.” This kind of standard is traditionally determined by clerically coding the I&O responses independently three times. That is, three different clerks assign the I&O codes without knowing what the others have assigned. When the three codes are compared, the majority rules. If all three clerks assign the same code, or if two out of the three clerks assign the same code, the code agreed upon is considered correct, or the “truth.” If the autocoder assigned the “truth” code to the same response, then the autocoder was also considered correct. Both the Pass 1 sample and the Validation (Pass 2) sample were triple-coded by a staff of about 100 clerks in our National Processing Center (NPC) in Jeffersonville, Indiana. The Pass 1 sample got the nickname “Three-way sample.”

Besides accuracy, production is also important. Responses that were so garbled or vague that the autocoder could not assign a code at all would have to be coded clerically. Clerical coding is much more expensive and time-consuming than automated coding. Therefore, the more codes the autocoder can assign at an acceptable level of accuracy, the fewer codes have to be assigned by clerks, reducing the overall cost of the processing.

Whenever the autocoder assigned a code, it also assigned a “confidence score,” similar in concept to the “confidence level” assigned by the OCR software. In
general, the higher the confidence score, the higher the probability that the code assigned is correct. We confirmed this relationship by comparing the codes assigned by the autocoder to the codes assigned by the majority of the clerical coders in both the 3-Way and Validation samples. For each confidence score assigned by the autocoder, we calculated the percent of correct codes assigned that confidence score. By tracking down from the highest possible score, we could determine a confidence level below which the accuracy dropped to an unacceptable level for each code. The score that defined this drop-off point became the “cutoff score.”

In actual production coding we accept only those codes assigned by the autocoder that are associated with confidence scores at or above the cutoff score. The cases to which the autocoder assigns a code with a score below the cutoff, or to which it assigns no code at all, go to clerical coding. This methodology enables us to determine in advance what the accuracy of the autocoder will be. The autocoder actually attempts to code everything, but we decide what will be accepted from the autocoder’s output before we assign the residual work to the clerical staff.

The combination of the two goals described above (high accuracy and high production) defined the overall autocoder performance for both samples in our study. We could determine in advance an acceptable accuracy rate and the resultant overall production at that accuracy level, or we could pre-set the production rate and determine the resultant accuracy. Either way, the two criteria balance in opposite directions: as production goes up, accuracy goes down, and vice versa.

For full production I&O processing in Census 2000 we used the Validation sample to determine the autocoder’s performance. We chose to fix the autocoder’s accuracy rates to match those of the clerks coding the same cases. Some limits on the rates were necessary. For example, in the interest of production, we did not force the autocoder’s accuracy rate to be higher than 96 percent for any code, even if the corresponding clerical rate for the same code was higher. In the interest of overall accuracy, we did not allow the autocoder’s accuracy rate to be lower than 82 percent for any code, even if the corresponding clerical rate for the same code was lower. Within this 96 to 82 percent range, our goal was to produce an autocoder accuracy rate for each code within two percentage points of the corresponding clerical accuracy rate. Finally, as we tracked down from the highest possible confidence score for any given code, we found that the accuracy rate associated with a score could at times dip below the target rate, but then rise again at the next lower score. In other words, the relationship between the scores and the accuracy rates was not always linear. To account for this phenomenon, we allowed a 20 percent “offset” between the accuracy rate for a given confidence score and the target accuracy rate, as long as the overall accuracy rate including the next lower score rose again above the acceptable level. The various percentages quoted above provided the name for the method used for determining the cutoff score for each I&O code in the census: we called it the “96-82-m2-20 method.”

From this method, the overall accuracy rates, that is, the average accuracy rates across all codes chosen for Census 2000, were 94.0 percent for industry and 92.3 percent for occupation. For the 3-Way sample in this study we applied the same cutoff scores that we used for the Validation sample, thereby maintaining the same standard accuracy rates. This matching of accuracy rates allowed us to use the observed production rates as the criterion for determining whether there was an improvement in the autocoder’s performance when Pass 2 data were used instead of Pass 1 data. We determined the percentage of sample cases successfully coded at the same accuracy rate for both samples, as shown in Table 1.

Table 1. Comparison of the Autocoder Results for the Validation (Pass 2) Sample to the 3-Way (Pass 1) Sample.

      Autocoder Criterion                    Pass 1      Pass 2
      Accuracy (fixed):
        Industry                             94.0 %      94.0 %
        Occupation                           92.3 %      92.3 %
      Industry:
        Assigned a code (gross rate)         77.8 %      86.4 %
        Accepted the code (acceptance rate)  51.0 %      67.9 %
        Net production (gross x acceptance)  39.7 %      58.6 %
      Occupation:
        Assigned a code (gross rate)         70.2 %      80.8 %
        Accepted the code (acceptance rate)  47.9 %      69.3 %
        Net production (gross x acceptance)  33.6 %      56.0 %
      Sample size, gross rates3              923,611     22,498,251
      Sample size, acceptance rates          236,595     152,936

3 Note: the gross production rates shown in this table for the Pass 2 data are from the entire census.
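The tracking-down procedure described above can be expressed compactly. The sketch below is ours, not the production system: the function and variable names are hypothetical, the target accuracy is the clerical rate clamped to the 82–96 percent band from the text, and the 20-point “offset” refinement of the 96-82-m2-20 method is omitted for brevity.

```python
def cutoff_score(by_score, clerical_rate):
    """Simplified sketch of the cutoff-score search (hypothetical names).

    by_score: list of (score, n_cases, n_correct) tuples for one I&O code,
    sorted from the highest confidence score downward.  We accept
    successively lower scores while the cumulative accuracy of everything
    accepted so far stays at or above the target accuracy rate.
    """
    target = min(96.0, max(82.0, clerical_rate))  # the 96/82 percent limits
    cases = correct = 0
    cutoff = None
    for score, n, c in by_score:
        # Would accepting this score drop cumulative accuracy below target?
        if 100.0 * (correct + c) / (cases + n) < target:
            break
        cases += n
        correct += c
        cutoff = score  # lowest confidence score accepted so far
    return cutoff
```

For example, with per-score tallies `[(9, 100, 98), (8, 100, 95), (7, 100, 80)]` and a clerical rate of 94.0, the search accepts scores 9 and 8 (cumulative accuracy 96.5 percent) but stops before score 7, so the cutoff is 8.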
Table 1 clearly shows that the ability of the autocoder to assign I&O codes was enhanced by taking advantage of the keyed responses instead of those captured only by the OCR. The keyed data increased the number of responses to which the autocoder was able to assign a code (the gross production rates), and improved the proportion of the coded cases whose codes were accepted based on the cutoff scores that determined what the accuracy would be. Net production, as measured by the combination of gross production and acceptance, improved by 18.9 percentage points for industry and 22.4 percentage points for occupation.

These net production rates have a direct impact on cost. In the I&O coding sequence, every case that does not get an acceptable industry and occupation code from the autocoder has to go to clerical coding. The lower quality of the Pass 1 data capture, if not improved by the Pass 2 keying, would result in a larger proportion of cases that would need at least one code assigned clerically.

The total I&O coding workload (the number of responses that needed both I&O codes) for Census 2000 was 22.5 million cases. Using the Pass 2 data, approximately 57.3 percent of the I&O codes (the average of the 58.6 percent for industry and 56.0 percent for occupation shown in Table 1) were assigned acceptably by the autocoder. Of the residual workload that went to clerical coding, approximately two-thirds needed only one code (either industry or occupation), and one-third needed both I&O codes. An additional 7.5 percent quality assurance sample was clerically coded, resulting in a total workload of 15.3 million records (people) and 20.6 million codes in the clerical coding phase of the process.

Table 1 shows that, if we had used Pass 1 instead of Pass 2 data for coding, the autocoder would have successfully assigned only 36.7 percent instead of 57.3 percent of the codes (the average of industry and occupation). Therefore, 63.3 percent of the coding workload (28.5 million of the total 45 million codes) would have gone to clerical coding instead of 42.7 percent (19.2 million codes), an increase of almost 50 percent. After adding the 7.5 percent QA to both figures, over 30 million codes might have needed clerical assignment instead of 20.6 million. Even worse, this increase in the number of codes needing assignment could conceivably have resulted in over 21 million records (virtually the entire sample) being sent to clerical coding, since almost 6 million more records would have needed at least one code. See Table 2.

Table 2. Comparison of actual and projected clerical coding workloads, by the source of the data to be coded.

      Workloads                        Projected Pass 1    Actual Pass 2
      Total I&O Workload:
        Records/people                 22.5 million        22.5 million
        Codes                          45.0 million        45.0 million
      Clerical Workload:
        Records/people needing
          at least one code            21.1 million        15.3 million
        Codes needed, including QA     30.6 million        20.6 million
        Estimated cost                 $11.8 million       $8.6 million

We estimated that on average the I&O clerical coding costs $0.56 per case. As pointed out above, some of these cases needed only one code, while others needed both codes. Both kinds of cases were batched together into work units, so it is not possible to calculate a direct cost per code assigned.

At a cost of $0.56 per case, the clerical cost of coding 15.3 million cases is estimated to be $8.6 million. If we had used Pass 1 data instead, and as a result ended up sending 21.1 million cases to clerical coding (even though some of these cases would need only one code), the total clerical costs would have risen to $11.8 million. In other words, using Pass 2 keyed data may have saved over $3 million in clerical coding costs.

Time is also an important factor to consider for determining the most efficient data processing method. The increased workload that would have resulted from using Pass 1 data would have required either a longer time period to complete the coding, or more clerks, computer work stations, and workspace to keep within the planned schedule.

Estimating workloads, time, and cost includes more than just the residual coding operation. Some cases are so difficult that even a human coder cannot easily assign a code. Sometimes neither the autocoder nor the residual coders are able to determine the correct category for a response. These cases become “problem referrals” that must go to a second set of clerks who have additional coding reference materials to help them. They are paid at a higher grade and are allowed to use more judgement than the first set of clerks.

The difficult cases in Census 2000, therefore, went through three coding steps: automated, residual (first clerk), and referral (second clerk). Among the coding
difficulties encountered, referral cases included entries that the first clerk could not understand, either because they were not in English or because they were garbled beyond recognition. Since the Pass 1 OCR output included more response interpretations that were unrecognizable to the autocoder than the Pass 2 OCR/KFI output, it is reasonable to assume that many of the same responses would be unrecognizable to the residual coders as well. A larger proportion of unreadable entries from OCR would most likely result in higher clerical referral rates than those we experienced with the Pass 2 data that benefited from keying.

The clerical coding of the Validation and 3-Way samples confirmed this expectation. For the Validation (Pass 2) sample, the referral rates averaged 4.8 percent for industry and 3.9 percent for occupation, or 4.3 percent on average for both. For the 3-Way (Pass 1 OCR only) sample, the referral rates were 8.1 percent for industry and 6.4 percent for occupation, or 7.2 percent on average for both. The referral rates increased by almost 67 percent when coding Pass 1 data instead of Pass 2 data. See Table 3 below.

Higher referral rates result in higher coding costs. Although the clerical coding cost per case of $0.56 quoted previously for coding Pass 2 data was used for all the computations shown in Table 2 above, the cost of coding Pass 1 data would be even higher than this rate. Therefore, the total cost (for the entire workload) would be even higher than the figure shown in the Pass 1 column of Table 2.

As explained above, a quality assurance check was done by coding both samples independently three times. When two out of three coders agreed, the majority codes were considered to be correct, while the minority code was considered incorrect. The “minority rate,” therefore, was a measure of clerical coding error. Given that the Pass 1 data were more difficult to understand, it is reasonable to assume that clerical accuracy would also suffer when coding Pass 1 data instead of Pass 2 data.

Table 3 shows minority rates for both the Validation and 3-Way samples. These rates confirm our assumption about increased errors with the Pass 1 data. The minority rates for the Pass 1 sample were 9.2 percent for industry and 11.4 percent for occupation, or 10.3 percent for both. The minority rates for the Pass 2 sample were 8.6 percent for industry and 10.4 percent for occupation, or 9.5 percent for both.

Table 3. Comparison of clerical referral and minority rates.

      Measure                  Pass 1      Pass 2
      Referral rates:
        Industry                8.1 %       4.8 %
        Occupation              6.4 %       3.9 %
        Average                 7.2 %       4.3 %
      Minority/error rates4:
        Industry                9.2 %       8.6 %
        Occupation             11.4 %      10.4 %
        Average                10.3 %       9.5 %
      Sample size             236,595     152,936

4 For the computation of the minority rates, three-way differences (three different codes assigned to the same response) were excluded from the sample because no coder could be considered in error. The three-way difference rates were 4.8 percent for both samples described in this paper.

The results shown in the previous section demonstrate clearly the advantages of using the Pass 2 output, which combined OCR with keyed data, over the Pass 1 OCR output by itself. For Census 2000 the clerical workload was reduced substantially, clerical problem referrals were reduced, and overall coding quality was improved. All these improvements contributed to reduced coding costs.

Unfortunately, the Census 2000 system for reporting the data processing costs was not detailed enough to provide an estimate of the specific cost of keying the I&O write-in entries in Pass 2. Intuitively, however, we do not expect that the cost of keying the I&O data exceeded the cost savings derived from coding Pass 2 instead of Pass 1 data. We do not believe the keying costs for the four I&O write-in responses were more than the $3 million saved in coding the same responses. Choosing OCR with KFI support over OCR alone, therefore, should result in a net cost savings for all data processing. Even if the keying and coding costs merely balanced out, the savings in coding time and the improvement in data quality would make the OCR/KFI combination more worthwhile.

Given these advantages for I&O coding, any future census or survey data capture should continue to include keying until the OCR technology has clearly demonstrated a substantial improvement in the quality of its output for write-in responses.
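As a cross-check, the headline figures quoted in the results can be reproduced directly from the published rates. The short sketch below (variable names are ours; every input number is taken from Tables 1 and 2) recomputes the net production rates, their averages, and the clerical cost comparison at $0.56 per case:

```python
# Net production = gross production rate x acceptance rate (Table 1).
pass1 = {"industry": (0.778, 0.510), "occupation": (0.702, 0.479)}
pass2 = {"industry": (0.864, 0.679), "occupation": (0.808, 0.693)}

net1 = {k: g * a for k, (g, a) in pass1.items()}  # 39.7% and 33.6%
net2 = {k: g * a for k, (g, a) in pass2.items()}  # 58.6% and 56.0%

# Average net production: 57.3% for Pass 2 versus 36.7% for Pass 1.
avg2 = (net2["industry"] + net2["occupation"]) / 2
avg1 = (net1["industry"] + net1["occupation"]) / 2

# Clerical cost at $0.56 per case (Table 2): $8.6M actual versus $11.8M
# projected, a saving of roughly $3.2 million.
cost_pass2 = 15.3e6 * 0.56
cost_pass1 = 21.1e6 * 0.56
savings = cost_pass1 - cost_pass2
```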