

     Chari Paul
     English 516
     Research Project

      Automated Essay Scoring: Equal to Human Graders?
     “AES systems can be a great assistance to teachers in responding to large number
 of essays and assign frequent writing assignments without worrying about scoring the
                      first and subsequent drafts” (Dikli par. 50)

     The Reality

     Teaching English is a job that differs from teaching in other

disciplines for one main reason—the stacks and stacks of essays that need

constant attention but never seem to get done. These essays cannot

be graded by student aides or anyone else but the teacher because the

teacher is the person who is trained to grade them. This task adds hours to

the work week of an English teacher. How can teachers get around this? There

are probably many ways, like giving credit just for doing the assignment,

giving fewer writing assignments, or skimming the essays. But would this

be helping students? Isn’t it important that students are given meaningful

feedback on their essays?

     Semire Dikli in her online article titled “Automated Essay Scoring,”

says “Revision and feedback are essential aspects of the writing process.

Students need to receive feedback in order to increase their writing

quality” (Dikli par. 2). Scholars in the field like Tom Romano and Peter

Elbow agree that the sooner students receive feedback, the better chance

of a revision with meaningful changes (Bush par. 5).

      This fact leads to the problem of turnaround time for teachers

grading essays. For example, if a teacher has 175 students and assigns an

essay of one to four pages, how much time will he/she spend grading and

making comments on essays? How long will it take for the students to get

their papers back with comments? And in addition to the initial stack of

essays, there are always other works in progress since writing is a part of

everyday life in an English classroom. The writing never stops coming in.

These are issues I have personally dealt with for several years. Unless I

commit to several hours a night in the couple of days after essays are

turned in, the papers might not be returned for weeks. And this is only for

a first draft.

      Skepticism vs. Opening the Mind

      The scoring of student essays requires meticulous attention and

hours of manual labor for English teachers. Recent advances in technology

have enabled the automated scoring of student essays. But what does this

mean? It’s hard to imagine a paper being fed into a machine, a computer,

to be corrected. Many people would probably be horrified at the idea of a

computer grading human writing, complete with emotions and

personality. After all, a computer is emotionless and can’t speak back

about the writing, right? Not according to some of these systems.

      Some automated essay scoring systems (AES) are claiming to be

extremely accurate and say that within seconds not only is the essay

graded, but graded with feedback. This is the reality of technology right

now, and many of the AES vendors claim that they exceed the accuracy of

human scorers in a fraction of the time it takes for manual scoring. This

has created a robust automated essay scoring market. The K-12 testing

and assessment segment alone is estimated at $2 billion (Content Analyst Company).


     I have heard of automated scoring systems before, but have never

looked into them in great detail mostly because I would never consider

using one. Has this been a close-minded way to move through the world

as a teacher of writing? Definitely. So, I have opened my mind, thrown my

preconceived notions aside to research these systems, which claim to be as

accurate as humans. Oftentimes people shy away from things they don’t

understand and deem them useless when sometimes the fear is worse

than the reality. But what if automated essay scoring systems were

useful to me? The opportunity to look deeper into them presented itself

and I decided to at least give them a chance.

     In my newly open state of mind I wondered, how do these systems

work? Where did they come from? How long have they been around? Are

they reliable? These are just a few of the important questions teachers

need answered in order to decide if they are an appropriate measure to

bring into their own classrooms.

     The Systems

     One of the earliest mentions of computer grading of essays

appeared in a 1966 article by Ellis Page describing

Project Essay Grade (PEG); since then, many more powerful essay

grading systems have emerged. In their web article, “Automated Essay

Grading Systems Applied to a First Year University Subject: How Can

We do it Better?” Dreher, Palmer, and Williams say among the most

serious contenders in the field aside from PEG are E-rater and Intelligent

Essay Assessor (IEA) (1222). Others include Criterion and IntelliMetric.

     One of the most common confusions among the general public is how a

computer could possibly grade an essay; it is a machine, not a person.

The general way these systems work is that the computers are given the

assigned topic (a writing prompt) and then are fed a few hundred essays

that have been previously graded by humans. The ungraded essays are

then ready to be entered into the system for grading (PG News).

     Gregory Chung & Harold O’Neil in their online article,

“Methodological Approaches to Online Scoring of Essays,” say that each

system differs, focusing on different aspects of the essays, but

all claim to be relatively accurate when compared to human graders (3).

One fact is that these systems are improving yearly, if not monthly. Attali

and Burstein, in their online article, “Automated Essay Scoring With e-

rater V.2,” say there have been revised versions of several of the systems

already and they continue to address the problematic issues people have

with the systems such as accuracy, reliability, determining bogus essays,

and more (par. 7).

     Project Essay Grade

     Project Essay Grade, created by Ellis Page and others in 1966, was

developed to make the large-scale essay scoring process more practical

and effective. PEG takes a sample of the essays to be graded, which are

marked by a number of human judges. PEG relies primarily on style

analysis of surface linguistic features of blocks of text and uses proxy

measures to predict the intrinsic quality of the essays such as average

word length, essay length, and the number of semicolons or commas

(Dikli par. 7). So an essay graded using PEG would be graded

predominantly on the basis of writing quality over writing content. The

design approach is based on the concept of “proxes” (computer

approximations of “trins,” the intrinsic variables of interest within the

essay), which a human grader would look for but the computer can’t

directly measure. As mentioned, these proxes include essay length

(numbers of words), and counts of prepositions, relative pronouns, and

other parts of speech (to indicate complexity of sentence structure). Proxes

are then calculated from a set of training essays that are transformed and

used in a standard multiple regression, along with the already assigned

human grades for the training essays, to calculate the regression

coefficients. PEG relies on a statistical approach based on the assumption

that the quality of the essays is reflected by the measurable proxes

(Cucchiarelli, Neri, & Valenti 321).

     These multiple regression techniques are used to compute (from the

proxes) an equation to predict a score for each student essay. The goal is to

identify those variables that would prove effective in predicting human

raters’ scores (Dreher et al. 1222).
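
The proxy-and-regression approach described above can be sketched in a few lines of Python. This is only a toy illustration, not PEG’s actual feature set; the proxy measures, essays, and scores below are invented for the example:

```python
import numpy as np

def proxes(essay: str) -> list[float]:
    """Crude proxy measures of the kind PEG counts (illustrative only)."""
    words = essay.split()
    return [
        float(len(words)),                                # essay length
        sum(len(w) for w in words) / max(len(words), 1),  # average word length
        float(essay.count(",") + essay.count(";")),       # punctuation counts
    ]

# Hypothetical training essays with human-assigned scores.
training = [
    ("Short essay.", 1.0),
    ("A longer essay, with a comma, and more words to count.", 3.0),
    ("An even longer essay; it has punctuation, varied words, "
     "and clauses that add still more length.", 4.0),
]

# Multiple regression by least squares: fit coefficients that map
# the proxes to the human grades (an intercept column is prepended).
X = np.array([[1.0] + proxes(text) for text, _ in training])
y = np.array([score for _, score in training])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(essay: str) -> float:
    """Predicted score for a new essay, computed from its proxes."""
    return float(np.array([1.0] + proxes(essay)) @ coef)
```

A real system would train on hundreds of human-graded essays and many more proxes, but the mechanics are just this regression.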

     Among the strengths of PEG are that its predicted scores are

comparable to those of human raters and that the system is computationally tractable;

it is able to track the writing errors made by the users. PEG has an 87%

correlation with human graders (Cucchiarelli et al. 321).

     From the information gathered on PEG, I believe it is clearly an

accurate system; however, I found that it gives less feedback than

I would like in my own classroom. This is one of the systems

that is being upgraded on a continual basis and better feedback is one of

its aims for future versions. It seems its only downfall is its lack of student

feedback, meaning it doesn’t point directly to what should be changed

beyond surface errors (Koul 229). However, with an 87% correlation to

human graders, it’s still a very reliable system.


     E-rater

     The Electronic Essay Rater (e-rater) was developed by the

Educational Testing Service (ETS) in the late 1990s to evaluate an essay by

identifying linguistic features in the text (Dikli par. 22). E-rater uses a

combination of statistical and natural language processing (NLP)

techniques to bring out the linguistic features of the essays to be graded.

With e-rater, student essays are evaluated against a benchmark set of human-

graded essays. E-rater has modules that extract essay vocabulary,

discourse structure information, as well as syntactic information. Multiple

linear regression techniques are used to calculate a score for the essay,

based upon the features extracted. The system is run to extract

characteristic features from human-scored essay responses for each new

essay. Fifty-seven features of the benchmark essays are used to build the

regression model, based on six score points in an ETS scoring guide for

manual grading. Using stepwise regression techniques, the important

predictor variables are determined. The scores from these variables from

the student essays next are substituted into the particular regression

equation to obtain the predicted score. Ratios are tabulated for each

syntactic type on a per essay and per sentence basis (Dreher et al. 1223).

     Another scoring criterion of e-rater relates to having well-

developed arguments in the essay. Discourse analysis techniques are used

to examine the essay for discourse units by looking for surface cue

words and non-lexical cues. Based upon individual content arguments,

the cues are used to break the essay up into partitions (Burstein &

Shermis 116).
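
The cue-word partitioning idea can be sketched as follows. The cue list and the sentence splitter here are invented for the example and are far simpler than e-rater’s actual discourse analysis:

```python
import re

# Hypothetical surface cue words of the kind discourse analysis looks for.
CUE_WORDS = ["first", "second", "finally", "in conclusion", "however"]

def partition(essay: str) -> list[str]:
    """Split an essay into discourse units at sentences opening with a cue word."""
    sentences = re.split(r"(?<=[.!?])\s+", essay.strip())
    units, current = [], []
    for sentence in sentences:
        # Start a new unit whenever a sentence begins with a cue word.
        if current and any(sentence.lower().startswith(c) for c in CUE_WORDS):
            units.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        units.append(" ".join(current))
    return units
```

Each returned unit then corresponds to one content argument that the scorer can evaluate separately.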

     E-rater has been evaluated and has been found to achieve a level of

agreement with human raters of between 87% and 94%, which is claimed

to be comparable to the agreement of human graders with each other (Dreher et al. 1223).

     E-rater is currently used by ETS for operational scoring of the

Graduate Management Admissions Test (GMAT). Before this, GMAT

essays were scored by two human graders on a 6 point holistic scale (6

being highest) and if the graders were off by more than one point, a third

rater would be brought in. The same is true for e-rater; one human grader

is used along with the system and if there is a discrepancy of one point,

another human rater is brought in. Since e-rater’s deployment, the

discrepancy rate between e-rater and human raters has been less than 3

percent (Dikli par. 23).

     E-rater is a well-respected AES, considering it is primarily used for

important admissions essays to graduate schools and others (GMAT,

GRE). I like how e-rater looks for complex sentence structure as part of its

criteria, and its reliability is arguably as good as that of human graders. This is

also a system that has been making progress, and its improvements are

noticeable. The innovations include a small, intuitive, and meaningful

set of features used for scoring. Attali and Burstein claim a single scoring

model and standards can be used across all prompts of an assessment and

that modeling procedures are transparent and flexible. This new e-rater

has made many improvements based on previous research into where

AESs are lacking (Attali & Burstein 2).

     Intelligent Essay Assessor

     Intelligent Essay Assessor is a Latent Semantic Analysis (LSA)

based system that represents essays and their word contents in a large, two-

dimensional matrix semantic space. Using a matrix algebra technique

known as Singular Value Decomposition, new relationships between

words and essays are uncovered and current relationships are changed to

more accurately reflect their true significance (Dreher et al. 1224). In other

words, LSA measures of similarity are considered highly correlated with

human meaning similarities among words and texts. It accurately imitates

human word selection and the idea is that the meaning of a passage is

very much dependent on its words. Changing just one word can result in

meaning differences in the passage (Dikli par. 11). To say this more

simply, LSA captures the important relationships between text documents

and word meaning, which must be accessed to evaluate the quality of

content. It is a mathematical/statistical technique for extracting and

representing the similarity of meaning of words and passages by analysis

of large bodies of text (Deerwester, Dumais, Furnas, Landauer &

Harshman 398). LSA improves automatic information retrieval

significantly by allowing user requests to find pertinent text on a desired

topic even when the text contains none of the words used in the query

(Shermis 12).

     It is claimed that IEA’s main focus is more on content-related

features than on form-related ones, but this does not mean that IEA

doesn’t provide feedback on grammar and punctuation. Though the

system does use an LSA-based approach to evaluate the quality of an

essay, it also includes scoring and feedback on style and mechanics (Dikli

par. 13).

     An example of IEA’s success is illustrated in its implementation at

New Mexico State University. Students submitted essays on a particular

topic and within 20 seconds received feedback with an estimated grade.

Along with the grade was a set of questions and statements of additional

subtopics that were missing from their essays, according to the system.

With this instant computer feedback, students were able to revise their

essays and immediately resubmit. It was found that student essays

improved with each revision and final grades were raised by at least 7

percentage points (Foltz et al. par. 7).

     This system has been evaluated by Landauer and Foltz (who helped

create the LSA approach) and both agree that this system is comparable to

human graders. Landauer reports on the GMAT, saying the agreement

between human graders and the system was between 85% and 91% (Dreher

et al. 1228).


     Criterion

     Criterion is a web-based system that relies on other ETS technologies

called e-rater and Critique Writing Analysis Tools. Criterion uses these two

complementary applications based on NLP (Burstein & Marcu par. 8).

     During the 2002-2003 school year, Criterion was used by thousands

of students in grades 6-12 nationally for the purpose of evaluating the

effectiveness of the automated feedback and revision features of Criterion.

One of the main issues was figuring out if students could use feedback

given to them by a computer (Attali 1).

     Feedback for the Criterion system covers five major areas of writing

quality: organization & development, grammar, usage, mechanics, and

style. A study conducted by Yigal Attali showed that students

were able to significantly lower the rate of most of the 30 specific error

types identified by the system, reducing their error rates by

about one quarter (Attali 3). Students also increased the rate of background and

conclusion elements and increased the number of main points and

supporting ideas following the feedback from this particular AES (Attali

4). The results of this study show that students are able to understand and

revise their papers positively based on feedback from Criterion. The

places where the system fell a little short were in helping students with

thesis statements, and it didn’t significantly improve their grammar scores.

     In general, this system has proven effective in that students

improve in the development of their essays due to the feedback received

from Criterion. The system has also been upgraded and now boasts

innovative detection of bogus essays. In their

article “Advanced Capabilities for Evaluating Student Writing: Detecting

Off-Topic Essays Without Topic-Specific Training,” Burstein and Higgins

say Criterion has gotten “smarter” as time has gone on (par. 12).

     They say students sometimes try to trick the computer by going off

topic or using other tactics to fool the automatic grader. They

describe an algorithm that detects when a student’s essay is off topic

without requiring a set of topic-specific essays for training. It also

distinguishes between two types of off-topic writing. The first is a well-

formed, well-written essay on a topic that does not respond to the

expected test question (an unexpected-topic essay), which happens when a

student accidentally cuts and pastes the wrong essay he/she has prepared

beforehand. The second type is a bad-faith essay; the example the

article gives is an essay like, “You are stupid. You can’t add,” basically

gibberish. The article explains that, in the past, the automatic scoring

system sometimes wouldn’t catch off-topic essays if the topics were

similar: topics like “school” and “teachers” use similar vocabulary

and won’t be distinguished. This system sets out to correct that problem.

Again, this article shows the quick advancement and improvements these

systems are making (Burstein & Higgins par. 12-16).
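
One simple way to approximate this kind of off-topic detection is to compare an essay’s vocabulary with the prompt’s by cosine similarity of bag-of-words vectors. This is only the underlying idea, not Criterion’s actual algorithm, and the threshold below is an invented placeholder:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def looks_off_topic(essay: str, prompt: str, threshold: float = 0.1) -> bool:
    """Flag an essay whose word choice barely overlaps the prompt's."""
    essay_vec = Counter(essay.lower().split())
    prompt_vec = Counter(prompt.lower().split())
    return cosine(essay_vec, prompt_vec) < threshold
```

A gibberish, bad-faith essay shares almost no vocabulary with the prompt and falls under the threshold, while the hard case the article raises (two genuinely similar topics) would still fool this sketch; that is the gap the newer algorithm addresses.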


     IntelliMetric

       IntelliMetric, developed by Vantage Learning, is the first AES based

on artificial intelligence. Like e-rater, IntelliMetric depends on NLP, which

determines the meaning of a text “by parsing the text in known ways

according to known rules conforming to the rules of English Language,”

(Dikli par. 31). IntelliMetric internalizes the shared knowledge of human

raters by using a mix of artificial intelligence, NLP, and statistical

technologies. This system is trained with previously scored essay

responses containing “known score” marker papers for each score point.

These papers are used as a basis for the system to infer the rubric in the

collective judgments of human scorers (Burstein & Shermis 71).

       IntelliMetric’s agreement with human raters is high, between 97% and 99%.

It is also capable of evaluating essays in multiple languages

including Dutch, French, German, Spanish, and many more (Dikli par.


       Who Trusts or Mistrusts Them?

       The automated essay scoring systems have been used for the GRE

and the GMAT since 1999. Some argue that a machine can’t give feedback

as good as humans’, because humans are irreplaceable (Matthews par.

11), but others disagree saying that the machines are even better than

humans (Burstein & Shermis 12). Obviously these systems have gained

credibility if they are being used for important admissions examinations.

     During the course of researching, I have asked many other English

teachers what they thought of automated essay scoring systems. A

common theme emerged: distrust. Many of them feel that it is impossible

for a computer to assess writing. Some laughed, others just shook their

heads. However, I believe these people, like me, didn’t know how they

worked or what was involved in the process. When we’re uninformed,

we’re skeptical and distrustful.

     Personal Experimentation

     There is a system called the Holt Automated Online Scoring System

available for my own use at the high school where I teach. This is a

resource I never considered using since I was of the school of thought that

I was to grade the papers. I believed it was my job and wondered how

a computer could possibly help me. Multiple-choice tests went through a

Scantron machine; essays came to me. There was no alternative in my

mind. The thought of a computer acting as an essay Scantron machine

made me wince. But as I researched, it didn’t seem quite so preposterous.

     In the midst of my research I decided to put my insecurities aside

and try it. I looked into the specifics of this particular scoring system, since

it was available to me. I found out that Holt uses the Intelligent Essay

Assessor (IEA), a popular AES. For the Holt Online Scoring System,

writing prompts are available in eight modes: expository, persuasive,

how-to, descriptive, writing about literature, writing about nonfiction,

narrative, and biographical narrative. Every year, their development team

identifies which prompts and writing modes are most popular with their

users while they also take the time to review current state writing

assessments for newly released writing prompts, as well as changes in

scoring rubrics. Using this research, they develop prompts designed to

best meet their users’ needs.

     The speed is pretty unbelievable; most essays are scored within a few

seconds, and the score arrives with feedback. Students receive several

types of feedback, including a holistic score and an analytic assessment in

each of five different writing traits: content and development; focus and

organization; effective sentences; word choice; and grammar, usage, and

mechanics. The system also provides level-specific writing activities to

help students revise their writing, interactive model essays for each

writing prompt, and special advisories to alert teachers and students to

unusual writing styles. In addition, the system offers graphic organizers,

tips on how to revise, pre-writing activities, and much more (Holt Online

Scoring System).

     One negative of the system is that teachers can’t develop their own

writing prompts, because each new prompt would require training

on hundreds of human-scored student essays in order for the system to “learn” how

to score other essays written on that topic. Only writing prompts that have

been reviewed by a writing assessment expert and have gone through a

rigorous training process are posted to their site.

     I decided to test this system with my students. I took a group of 32

10th graders down to the computer lab to write a comparison/contrast

essay on two stories we read in class. This essay topic (prompt) was pre-

determined by the scoring system. I had my students write rough drafts

in class, do peer responses, and then type them in Microsoft Word before

transferring them to the program.

     Once the essays were entered into the system, students received scores and

comments (very specific ones at times) and were excited to try again and improve. I was

amazed. I had one student say, "Can I try this again? I think I know what I

have to do to make it better."

     To test myself as a grader, I asked five students for their papers. I

graded them according to the rubric I use for most essays (a 4-point rubric,

just like Holt—there is also a 6-point option) and compared my scores to

those that Holt gave the students. I gave each of the five the exact

score the automated system did. Even many of the comments were

similar. For example, on one of the essays that wasn’t really staying

focused, Holt said, “attempts to address problem, but frequently loses

focus,” where I had written something very similar in the margins.

     Paradigm Shift

     With my newfound knowledge about automated scoring systems in

general, I have changed my mind about them. I came into this research

believing I would walk away knowing they were unreliable and just a tool

to use if teachers were short on time or to try something different. But

now I believe in each system I’ve researched and would use them in my

own classroom. Each system has its downfalls, some being that they give

less feedback than I would prefer, others being that they focus too much

on surface errors. But in reality, that is the truth for English teachers

anyway. I know students run into some teachers who are focused on

grammar and others who focus on content and ignore surface errors. So

maybe these systems are more human than not, in that way.

     Since I found out that Holt uses Intelligent Essay Assessor, which

I’ve learned to be between 85% and 91% accurate when compared to

human graders, I will use it often. In fact, I wish I had taken

advantage of this resource long ago. Now that I know how the system

works and how much time and energy and research is put into these types

of systems, I’ve gained a new respect for them.

     I give many writing assignments in my classroom that I could not

use this system for, simply because I use writing prompts that Holt

does not offer. But for the ones that I use with my textbook or ones I see

that look interesting (you can browse prompts and the variety is

impressive), I think they will benefit my students. P.W. Foltz says in his

article “Automated Essay Scoring: Applications to Educational

Technology”: “By providing instantaneous feedback about the quality of

their essays, as well as indications of information missing from their

essays, students can use the IEA as a tool to practice writing content-based

essays,” (Foltz et al. par. 11). Because of this and everything else I’ve

learned, I now see this as a valuable tool.


     What I see for the future of AES is that they will continue to improve

and I’m sure at some point, the writing prompts will be much more

adaptable. These systems aren’t without their problems, and according to

Cucchiarelli, et al., “The most relevant problem in the field of automated

essay grading is the difficulty of obtaining a large corpus of essays each

with its own grade on which experts agree,” (328). I believe as time goes

on, this problem will dissipate because the systems will be used more as

people understand how they work and what’s involved in using them, in

turn gaining respect for them and seeing them as reliable. Because of this,

I think more essays will be tested and more writing prompts will emerge.

     After doing this research, though it seemed much of the literature

was scientific and not geared toward English teachers, I was able to

understand it and realize that these AES are indeed a resource

that more high school teachers and teachers at other levels should be

using. They do work, and when you consider the inherently subjective

nature of grading, the scoring systems seem almost more

reliable. Grading student essays is a subjective, controversial situation as it

is. Many researchers say the subjective nature of essay assessment leads to

variation in grades given by different human graders (teachers), which is

seen by some students as a source of skepticism and unfairness

(Cucchiarelli et al. 319).

      Though I plan to use an AES in my classroom more often, I will not

use it exclusively. I think its purpose in my classroom will be evident,

however, I will also continue grading essays on my own, especially the

ones that don’t have Holt prompts. We will continue to draft essays in

multiple stages and use peer responses. I see this system as an

enhancement in my classroom, and I have gained a new trust in systems

that, a couple of months ago, I was convinced had no use for me.


      Although I understand that I’ve only scratched the surface of

automated essay scoring systems in this assessment of them, this

opportunity has opened my mind and given me a new resource to believe

in—I had no idea how much research had been done and it extends well

beyond this essay. I am enlightened to new possibilities and I believe the

Holt system available to me will help my students grow as writers—and

as they grow I look forward to watching the AES market in general

continue to improve and grow as well.

                                       Works Cited

       Holt Online Scoring System:

      PG News:

       Attali, Y. & Burstein, J. (2006). “Automated Essay Scoring With e-rater V.2.”
Journal of Technology, Learning, and Assessment, 4(3), 1-29. Available from

      Attali, Yigal. “Exploring the Feedback and Revision Features of Criterion.” 20
pages. Paper presented at the National Council on Measurement in Education (NCME) 1-
20. April 2004. San Diego, CA.

      Burstein, Jill and Derrick Higgins. “Advanced Capabilities for Evaluating
Student Writing: Detecting Off-Topic Essays Without Topic-Specific Training.”
Proceedings of the International Conference on Artificial Intelligence in Education,
July 2005, Amsterdam, The Netherlands.

      Burstein, Jill C. and Mark D. Shermis, eds. “Automated Essay Scoring: A Cross-
Disciplinary Perspective.” Mahwah, New Jersey: Lawrence Erlbaum Associates,

       Burstein, J., & Marcu, D. (July 2000). “Benefits of Modularity in an Automated
Scoring System” (PDF). In Proceedings of the Workshop on Using Toolsets and
Architectures to Build NLP Systems, 18th International Conference on Computational
Linguistics, Luxembourg.

       Bush, John. “A Free Conversation.” Interview with Peter Elbow. 2004.

       Chung, Gregory K. W. K., and Harold F. O’Neil, Jr. “Methodological
Approaches to Online Scoring of Essays.” (1997) 1-31. National Center for Research on
Evaluation, Standards, and Student Testing.

    Content Analyst Company, 2005.

      Cucchiarelli, Alessandro, Neri, Francesca, and Valenti, Salvatore. “An
Overview of Current Research on Automated Essay Grading.” Journal of Information
Technology Education, Vol. 2, 2003.

      Deerwester, S.C., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman
R.A. (1990). “Indexing by latent semantic analysis.” Journal of the American Society for
Information Science, 41.6, 391-407.

       Dreher, Heinz, John Palmer and Robert Williams. “Automated Essay Grading
Systems Applied to a First Year University Subject: How Can We do it Better?”
Informing Science, 1221-1229. June 2002.

      Dikli, Semire. “Automated Essay Scoring.” Turkish Online Journal of Distance
Education (TOJDE) 7.1, article 5 (2006): 13 pages. Retrieved 3-10-06.

      Foltz, P.W., Laham, D. & Landauer, T.K. (1999). “Automated Essay Scoring:
Applications to Educational Technology.” Proceedings of EDMedia ’99. Retrieved on
3-15-06 from

      Koul, Ravinder, Roy B. Clariana, and Salehi Roya. “Comparing Several Human
and Computer- Based Methods for Scoring Concept Maps and Essays.” Journal of
Educational Computing Research. V32.3 p. 227-239 (2005).

       Matthews, Jay. "Computers Weighing in on the Elements of Essay." Washington
Post 1 Aug. 2004: AO1.

    Page, Ellis B. “Grading Essays by Computer: Progress Report.” Notes from the 1966
Invitational Conference on Testing Problems, 1966, 87-100.

   Shermis, Mark D. “Trait Ratings of Automated Essay Grading.” Educational and
Psychological Measurement. 62.1 p. 5-18 (2002).

                         Additional Works Consulted

      Burstein, J., Wolff, S., & Lu, C., & Chodorow, M. (1998 August). “ Enriching
Automated Scoring Using Discourse Marking” (PDF). In proceedings of the Annual
Meeting of the Association of Computational Linguistics . Montreal, Canada.

         Dexter, Sara L., Doering, Aaron, Riedel, Eric, and Cassandra Scharber.
“Experimental Evidence on the Effectiveness of Automated Essay Scoring in Teacher
Education Cases.” Paper prepared for the 86th Annual Meeting of the American
Education Research Association, April 11-15, 2005, Montreal, Canada.

         Kelly, Adam P. “Computerized Scoring of Essays for Analytical Writing
Assessments: Evaluating Score Validity.” Paper presented at the Annual Meeting of
the National Council on Measurement in Education (Seattle, WA, April 11-13, 2001).

         Laham, Darrell, MacCuish, Don, Psotka, Joseph, and Lynn Streeter. “The
Credible Grading Machine: Automated Essay Scoring in the DoD.” http://www.k-a-

   Lee, Yong-Won. “The Essay Scoring and Scorer Reliability in TOEFL CBT.” Paper
presented at the Annual Meeting of the National Council on Measurement in
Education (Seattle, WA, April 11-13, 2001).

      Rudner, Lawrence and Phill Gagne. "An Overview of Three Approaches to
Scoring Written Essays by Computer." Eric Digest (2001): 1-6.

   Rudner, L.M. & Liang, T. (2002). Automated Essay Scoring Using Bayes’
Theorem. Journal of Technology, Learning, and Assessment, 1(2). Available from

   Rizavi, Saba and Stephen Sireci. “Comparing Computerized and Human Scoring
of Students’ Essays.” Paper presented at the Annual Meeting of the National Council
on Measurement in Education (Seattle, WA, April 11-13, 2001).

      Wall, Janet E. “Technology-Delivered Assessment: Diamonds or Rocks?” Eric
Digest (2000):

  Wang, Chih-yen. “How to Grade Essay Examinations.” Performance
Improvement. 39.1 p. 12-15 (2000).

