Docstoc

qol-in-cities-likert-scales-2000

Document Sample
qol-in-cities-likert-scales-2000 Powered By Docstoc
					Cummins, R.A. & Gullone, E. (2000). Why we should not use 5-point Likert scales: The case
for subjective quality of life measurement. Proceedings, Second International Conference on
      Quality of Life in Cities (pp.74-93). Singapore: National University of Singapore.




 Why we should not use 5-point Likert scales: The case
      for subjective quality of life measurement


            Robert A. Cummins                                   Eleonora Gullone

           School of Psychology              and           Department of Psychology
            Deakin University                                 Monash University




                  Key words: Likert scale, history, reliability, sensitivity.




Correspondence:

Robert A. Cummins
Professor of Psychology
Deakin University
221 Burwood Highway
Melbourne, Victoria 3125
Australia

E-mail: robert.cummins@deakin.edu.au
                                           Abstract

An argument is presented that the Likert scales commonly employed to measure subjective
quality of life (SQOL) are not sufficiently sensitive for the purpose of using SQOL as a
measure of outcome. A review of the literature indicates that expanding the number of
choice-points beyond 5- or 7-points does not systematically damage scale reliability, yet such
an increase does increase scale sensitivity. It is also argued that naming the Likert scale
categories detracts from the interval nature of the derived data. As a consequence it is
recommended that SQOL be measured using 10-point, end-defined scales.




                                     Acknowledgement

I thank Betina Gardner for her assistance in the preparation of this document.
Subjective quality of life (SQOL) is gaining prominence as a measure of intervention
effectiveness. This is most evident in the fields of human service delivery and medicine
where a very large number of instruments have been devised to measure SQOL in some form.
Typically, each instrument will comprise 10 to 20 items, with each item scored on a 3- to 5-
point Likert scale.

While this recognition of personal, perceived well-being as a measure of outcome
effectiveness represents a major conceptual advance, it is time to take the forms of
instrumentation more seriously. Cummins (1997) lists around 400 instruments that have been
devised to measure SQOL or related constructs, and almost all rely on the use of „Likert-type‟
scales. However, a glance at this collection reveals an apparent absence of rules by which
researchers have designed their scales. The number of choice points can vary from two to
100, some use unidimensional scales (e.g. from „no satisfaction‟ to „complete satisfaction),
some use bidimensional scales (e.g. from „complete dissatisfaction‟ to „complete
satisfaction‟), some use a neutral scale mid-point while others do not, some use extreme
anchors (e.g. „Terrible‟) while others use mild anchors (e.g. „Dissatisfied‟), and so on. Each
of these different forms is known to influence the response pattern that people make to Likert
scales, and yet there has been almost no attention to such issues in relation to SQOL
measurement.

At one level it could be argued this is relatively unimportant. The Likert scale has a fairly
robust character which has proved to be reliable over a wide variety of forms, and the larger
issues concern such matters as the item content of scales related to their construct validity.
While such concerns are valid there is one aspect of Likert scale construction which is at least
of equal importance when the data are used as measures of outcome. As suggested by Guyatt
and Jaeschke (1990), this is the issue of measurement sensitivity, and it is really quite curious
that this crucial parameter has been virtually ignored.

The typical Likert scale offers 5- or 7-choice points which, of itself, is hardly likely to exploit
the discriminative capacity of most people in terms of their perceived well-being. But even
this modest array of response choices is actually reduced by a number of factors, some of
which are as follows:

1. SQOL is not free to vary over its entire range subject to the contingency of personal
   circumstance. It is constrained in range by its trait characteristic (see for example
   McCauley & Bremer, 1991) which causes it to behave as though it is held under
   homeostatic control. For example, the population mean for life satisfaction, a common
   measure of SQOL, is held within the normative range of 70-80 percentage of the scale
   maximum within Western populations (Cummins, 1995, 1998).                Consequently,
   respondents will normally experience SQOL variations that are limited to only a
   proportion of the available scale.

2. As noted above, SQOL data display a natural negative skew when measured using
   ordinary unipolar or bidimensional Likert scales (Watson, 1930; Cummins, 1997).
   Consequently, again the majority of respondents are employing only the higher or
   positive half of the scale in order to register their judgement.

3. Responses to Likert scales comprise a high degree of response-set. For example,
   Peabody (1962) found an average correlation of .54±.24 between individuals‟ intensity of
   agreement or disagreement to sets of attitudinal questions. Moreover, he calculated the
    relative contribution of response direction (positive or negative around a neutral mid-
    point) and response intensity (response distance from the neutral mid-point to an end of
    the scale) to a composite score, and found that intensity contributed only 10 percent to the
    composite variation. This has important implications for the detection of differences
    between QOL domains, such as provided by the Comprehensive Quality of Life Scale
    (Cummins, 1997). Maximum scale sensitivity is required in order to measure such
    differential levels of intensity.

4. Vokman (1951) first drew attention to the „variable series effect‟ (see also Upshaw,
   1962). Here it is proposed that the stimulus range is a major determinate of the value
   assigned to items in a series. Essentially, if people are presented with a scale in which
   their attitude is non-mid-point, then they will subjectively divide the range between the
   values that they recognize as being consistent with their own attitude range and, as a
   consequence, exhibit a narrower response range than the presented scale intends.

So the conclusion to be drawn is that Likert scales, as currently constructed, are likely to
represent very blunt instruments by which to judge change in SQOL. In order to see why this
has come about it is necessary to outline some of the history of scale development.


A history of Scale Construction

Likert (1932) was not the first to obtain subjective ratings on a printed scale, and it is most
interesting to note that the early scale developers used far more sensitive scales than we
currently employ. Freyd (1923) discusses the various forms of scale available at that time
and notes that they tended to be based on 10-point or 100-point formats. For example, the
„Decile‟ scale comprises a number of statements corresponding to different levels of
construct „strength‟ against numbers from 0 – 10. This numbering system is undoubtedly the
most intuitive and easy to conceptualize. The traditional counting task for children involves
their fingers or toes. It also has the advantage of having a perception of equal psychometric
distance between the scale points. This is an essential supposition when such scale are used
in combinations with parametric statistics, even though this condition is known to be violated
using Likert‟s scale to varying degrees (see e.g. Ferguson, 1941).

Freyd then introduced his „Graphic rating method‟ which had the following form:

[item] Does he appear neat or slovenly in his dress?

 Extremely neat     Appropriately and   Inconspicuous     Somewhat        Very slovenly
and clean. Almost    neatly dressed.       in dress.    careless in his   and unkempt.
     a dude.                                                dress.



The above scale was intended to be used in conjunction with job interviews, and raters were
instructed as follows:

    “When you have satisfied yourself on the standing of this person in the trait on which
    you are rating him, place a check at the appropriate point on the horizontal line. You
    do not have to place your check directly above a descriptive phrase. You may place
    your check at any point on the line.” (p.88).
He then recommended scoring the responses by dividing the line into 10 or 20 equal
intervals.

A few years later, Watson (1930) published a similar scale to measure an aspect of SQOL as
follows:




                 About three-fourths      The average       Happier, on the
Most miserable     of the population     person of your    whole, than three-       Happiest
    of all       are happier than you   own age and sex      fourths of the          of all
                           are                            population of similar
                                                              age and sex



Instructions to respondents were:

   “Comparing yourself with other persons of the same age and sex how do you feel you
   should rate your own general happiness? Place a short vertical mark across the line
   below to indicate about where you belong. Consider your average state over several
   months.” (p.84)

The scale was then scored 0-100.

Then, in 1932, Likert produced his scale which had the following form:



   Strongly           Approve            Undecided           Disapprove            Strongly
   Approve                                                                        Disapprove


While this format is clearly derivative from the previous examples, it importantly and
drastically reduces the number of effective choice-points in two ways. First, the scoring
system is no longer continuous. Respondents are now required to mark the line only adjacent
to one of the labels, making the scoring system 1-5. Second, he has introduced the
bidimensional scale with a neutral mid-point.

More than six decades have passed since Likert‟s formulation was published and it
instructive to consider why this original form has remained so popular. The reasons include
the type of psychometric investigation to which it has been subjected, the difficulty of
generating substantially larger numbers of labeled choice points, and the complex nature of
alternative scales. Each of these issues will now be considered with the aim of demonstrating
that the 5-point format has survived because it has utility for some types of measurement, but
that these have not included the issue of scale sensitivity in the measurement of SQOL.


Psychometric properties of the Likert scale

The three basic properties of Likert scales are reliability, validity, and sensitivity. And the
extent to which psychometric research has concentrated on the former is astonishing. Almost
40 years after the scale had been published, Jacoby and Matell (1971) stated “ … most of the
psychometric literature, dealing with the number-of-alternatives problem emphasizes
reliability as the major criterion in the number of scale points. However, the ultimate
criterion is the effect a change in the number of scale points has on the validity of the scale.
An intensive literature search failed to reveal any empirical investigation addressed to this
question.” (p.496).

While the situation has changed somewhat over the intervening period it is still true that the
kinds of psychometric data researchers report in support of their scales are generally
restricted to reliability (internal and test-retest) and convergent/divergent validity which is,
itself, highly dependent on reliability. This narrow focus has led, inexorably, to the view that
smaller, rather than larger number of scale points are advantageous to measurement (e.g.
Cronbach, 1946). The reason for this has been twofold. First, early research showed that
increased numbers of choice-points means that there is more scope for people to display
response-sets (see Cronbach, 1950, for a review), and this scope is minimized in
dichotomously scored scales. Second, later research produced considerable divergence of
opinion on the merit of shorter vs. longer scales. Given such ambiguity, the most pragmatic
choice favors shorter scales since they reduce response time in large surveys where they are
most commonly employed.

But now it is time to revisit this literature from a different perspective. Not whether evidence
can be found to support the use of simple-choice Likert scales in surveys, but whether
evidence can be found to support an expanded number of choice points for the purpose of
SQOL measurement. In terms of this appraisal there are two pertinent issues. First whether
such expansion is likely to enhance measurement sensitivity, and second whether such
expansion is detrimental to the other psychometric characteristics of scale reliability and
validity.

When Cronbach (1946, 1950) railed at the use of multiple-point scales he did so largely in the
context of educational tests where response-sets were evident in knowledge-based questions,
most particularly where the student did not know the answer and so engaged in such
response-sets as acquiescence, by favoring „true‟ response modes over false. But as
Cronbach acknowledged “If a situation is structured for the student, so that he knows the
answer required, he responds directly to the content of the item and response sets probably
are unimportant.” (1946, p.483). Surely this, then, places the SQOL scales in a quite
different context. The SQOL questions are simple, respondents have a view that they can
express on the scale provided, and the questions are not knowledge based. And, indeed, the
reliability data on non-knowledge based items do seem to indicate that response-sets are
generally „unimportant‟ as will be demonstrated.

In terms of sensitivity the issue seems intuitive. Few people would feel their discriminative
capacity for perceived well-being to be limited to five levels of experience. Moreover, as
indicated earlier, there are a number of scale-construction factors which tend to reduce the
effective choice still further. So, it would be expected that the empirical literature would
support this view, and it does. Indeed, opinion seems to be unanimous; increasing the
number of scale points increases scale sensitivity. Thus, for example, Diefenbach et al.
(1993) found a 7-point scale to be more sensitive than a 5-point scale while Russell and
Bobko (1992) found that data from a 15-point scale increased regression analysis effect sizes
by 93 percent over those from a 5-point scale. As noted by Jaeschke and Guyatt (1990) in the
context of medical QOL scales, it seems clear that 5-point scales, at the least, do not provide
sufficient sensitivity to detect small, clinically significant differences. So what, then, are the
psychometric impediments to using expanded Likert scales?

The main issue here is scale reliability, and it is true that some authors (e.g. Bardo & Yeager,
1982a&b; Bardo, et al., 1982) have concluded that scale reliability decreases as the number
of choice-points exceeds two. However, these authors had the specific intention of detecting
response-sets with a methodology that involved asking people to respond randomly to
presented scales, an approach which has little relevance for SQOL measurement.

Another source of data generally favoring simple scales has been derived from computer
simulations. For example, Lissitz and Green (1975) determined that Likert-scale reliability
increased from 2- to 5-points, but that no further gains were made beyond this point.
However, later and larger-scale simulations (Cicchetti, et al., 1985; Jenkins & Taber, 1977)
extended this upward range. Using monte carlo methodology these authors found that while
the largest changes occurred over the range 2-points to 5/7-points, gradual increases beyond
5/7-points were evident in a range of psychometric parameters including inter-rater reliability
and judgement accuracy. Such findings imply no impediment to the generation of more
complex scales from consideration of reliability.

The conclusions reached by researchers using human-generated data have been more
variable. On the minimalist side, Peabody (1962) argued the case that agree/disagree scales
should simply be scored dichotomously according to the direction of response. This was
based on his finding that, when people responded to attitude scales, a „composite score‟
derived by combining direction of response (agree or disagree) with the intensity of response
(extent of agreement or disagreement) was dominated by the former. In fact, only 10 percent
of composite score variation could be attributed to intensity, as opposed to the 70-80 percent
attributed to direction. He also found a high degree of response-set within the intensity
dimension. While all of this led him to recommend dichotomous scales two factors should be
noted. First, he used a 6-point scale and acknowledged that with an increased number of
choice-points to 9 or 11 “the reasoning used with regard to six-point scales would imply that
the extremeness component might play a more significant role.” (p.72). Second, his data
formed near-normal distributions, as opposed to the determined negative skew of SQOL data.
In relation to this he acknowledged that a similar calculation would be unreliable since
subjects tend to respond in the same direction to nearly all items.

Other studies which have argued for fewer numbers of choice points are McKelvie (1978)
and Chang (1994). The former compared 5-, 7-, and 11-point scales over two studies, and
two rating tasks. In the main no differences were found between the scales on inter-rater
reliability (agreement), test-retest reliability, and validity (a tone judgement task). Where
effects were found on reliability, they tended to be contradictory, showing quite different
trends for raw and transformed data. In terms of validity, the 5-point scale was clearly
inferior. Curiously, however, the author concluded the 5-point scale to be superior, despite
the weight of empirical evidence to the contrary.

Chang (1994) took a more sophisticated approach. Using multitrait-multimethod to separate
trait and method variance he found lower internal reliability within a 6-point agree/disagree
scale than within a 4-point scale, but no difference between the scales in terms of criterion-
related reliability. The author concludes that increasing the number of scale points creates
opportunities for response sets to arise. However, he also acknowledges “The issue of
selecting 4- versus 6-point scales may not be generally resolvable, but may rather depend on
the empirical setting.” (p.205).

In summary for the negative, none of these authors have presented a strong generalist case
against the use of more complex scales in terms of reliability, and certainly not in the context
of negatively skewed data. On the other hand, other researchers have found no change in
reliability over scales ranging from 2 to 19 points (Test-retest and internal, Matell & Jacoby,
1971) or even 5 to 100 points (Diefenbach, et al., 1993). Still others have found increased
rater reliability (the ability of single raters to discriminate differences between the rated
stimuli:Bendig, 1954) and test reliability (the consistency of individual differences in the
assignment of high and low ratings) when employing from two to five categories (Bendig,
1954), and an increased internal reliability from two- to six-points (Komorita & Graham,
1963). Finally, in a meta-analysis of 131 studies in the marketing research literature,
Churchill and Peter (1984) found a positive relationship between internal reliability and the
number of scale choice points (mean number of points 5.8±2.3, range 1-20). Their regression
analysis revealed that the number of choice points explained five percent of the reliability
variance.

From all of these data it may be reasonably concluded that increasing the response options
beyond 7-points does not systematically detract from scale reliability, a conclusion shared by
others (e.g. Russell & Bobko, 1992). So, since many people will have a discriminative
capacity that exceeds 7 points, restricting people to such scales results in a loss of potentially
discriminative data. This is most particularly relevant in relation to SQOL measurement
where the data are skewed. However, there are some potential difficulties in generating
expanded scales, and one of these involves the tradition of using category names.


The issue of categorical naming

Likert (1932) named all of his five categories and the received wisdom is that this is a good
idea. Andrews and Withey (1976), authors of the most widely cited document in the QOL
literature, endorse category naming because, they suggest, it enhances comparability in the
way respondents use the scales, and it makes it possible to know exactly what the respondent
was endorsing. However they also note that it is not really true. The use of such terms as
„good‟ does not, in fact, ensure a standardized point of reference. For example, a „good‟
house has very different meanings depending on a respondent‟s socio-economic status, and
high ambiguity in the allocation of values to scale categories has been documented for some
time (e.g. Jones & Thurstone, 1955). So these reasons do not withstand scrutiny. Not only
that, but there are other, powerful reasons to actually avoid category naming and these ideas
have not been given the publicity they deserve.

The Likert scale makes the assumption that the psychometric distance between categories is
equal. This aspect of scale construction is always dutifully portrayed as an equally-spaced
visual image, comprising marks on a horizontal line or a series of boxes. It may even be
reinforced by a linear numbering system, for example from 1 to 7, with each successive
integer corresponding to the next printed category. And then we add category names.

The clear implication is that these categorical names exhibit the same internal scaling as the
printed scale and numbers suggest. This, however, is wrong, sometimes very wrong, and the
data demonstrating this have been available for a considerable period of time (Cronbach,
1946).


Table 1
Estimates of frequency (percent)

                                                     STUDY
         Category              A             B           C            D        Maximum
                                                                               Difference

    Never                                 0.2*±.6
    Rarely                       6.7      5.3±3.6                                    1.4
    Very unusual                                             9.0
    Unusual                                                 22.0
    Seldom                     25.6

    Very unlikely                                                  28.4±11.3
    Once or twice              28.9
    Infrequently               31.1
    Unlikely                                                       31.4±10.4
    Once in a while            32.2

    Now and then               47.8
    Sometimes                  50.0      33.4±17.0          35.0                    16.6
    Occasionally               55.6      19.5±15.1          33.0                    36.1
    Pretty often               56.7
    Often                      60.0      60.9±13.1          58.5                     2.4

    Frequently                 65.6                                81.2±10.4
    Repeatedly                 72.2
    Usually                               75.9±9.8          30.0                    45.9
    Very probable                                                  82.5±8.8
    Almost always                         94.2±4.9

    Most of the time           95.6
    Always                               99.5*±2.0                 83.1±8.8         16.4

   * Participants were told that always = 100% and never = 0%

   Table code:

   A:    Spector (1976) from College students.
   B:    Kenney (1981) from „professionals‟. Respondents were told that always = 100%, and never
         = 0%.
   C:    Nakao & Axelrod (1983) from „non-physicians‟.
   D:    Tavana et al. (1997) from financial strategy experts.


The data in Table 1 indicate the percentage allocated to category labels of „frequency‟ by
respondents in four separate studies. The following observations can be made:
1.   With two exceptions („rarely‟ and „often‟) there is considerable discrepancy between the
     average category values derived from the studies. The worst labels in this regard are
     „occasionally‟ and „usually‟ which differ by 36.1 percent and 45.9 percent respectively.

2.   The relative ordering of terms is not consistent between studies. See, for example,
     „sometimes‟ in relation to „occasionally‟. The relative ordering of terms within studies,
     however, is always consistent with the presented Likert scale order.

3.   Study B (Kenney, 1981) demonstrates a clear pattern of increasing intra-category
     variance towards the middle categories. But this is likely to be an artifact of their
     methodology where participants were told that „never = 0%‟ and „always = 100%‟. In
     the absence of such cuing, Jones and Thurstone (1955) found exactly the opposite result,
     that the standard deviation of score allocation on a 9-point scale (+4 to –4) increased
     from the more neutral, central categories (e.g. good) to the extreme categories (e.g. Best
     of all). Actually, the highest variance was accorded to extreme negative values (e.g.
     Loath, despise), but perhaps this is because they are unusual words and their relative
     meaning was uncertain to some respondents.

The conclusion from all of this is obvious. People have such a varying interpretation of the
numerical value accorded to frequency category labels, that their application in Likert scales
is simply detracting form the interval-nature of such scales. The reason that the intra-study
category values always accord with the presented order is that respondents are being cued by
the relative position of category labels on the printed scale. Support for this has been
provided by Solomon and Kopelman (1984) who found higher scale interval reliability when
items were grouped by content and labeled, rather than being randomly distributed.

It is also notable that the category labels of „frequency‟ which form Table I, should provide at
least some response cues to their numerical equivalence. This type of cue is far less evident
in the category labels used in most SQOL scales, and so the latter are probably even less
consistently employed than the frequency data. Some data are available to support this.
Ware and Gander (1994) used the Thurstone method of equal-appearing intervals to calculate
the following distances between category labels used in the SF-36 (Ware & Sherbourne,
1992) as follows: Poor (1.0), Fair (2.3), Good (3.4), Very Good (4.3), and Excellent (5.0). It
can be seen that the distance between the lowest two categories (1.3) is about double that
between the highest two categories (0.7). A comparison can also be made between the
ratings of these same categories by Spector (1976). In equivalent units he found the
separation of „poor‟ and „fair‟ to be 2.3, while that between „good‟ and „excellent‟ to be 1.1
units. It can be seen that the disparities between low and high adjacent category labels is in
the same direction but of even greater magnitude.

It can also be noted that the business of naming Likert categories constitutes a severe
impediment to the expansion of scales due to the difficulty of finding appropriate categorical
names. Consider, for example, the 9-point scale generated by Roy Morgan Research (1993)
as: Delighted, very pleased, pleased, mostly satisfied, mixed feelings, mostly dissatisfied,
unhappy, very unhappy, terrible. This scale assumes that the psychometric distance from
neutrality to „pleased‟ and to „unhappy‟ are equivalent. There are no data known to me that
support this view and yet such assumptions are forced by the inability of our language to
make fine discriminations between affective states. So it can be concluded that the addition
of category names to Likert scales not only detracts from the interval nature of the scale but
also makes it difficult to generate expanded choice formats. Therefore since expanded Likert
scales are desirable for SQOL measurement, as has been previously argued, the appropriate
scale format may be a 10-point, end-defined scale.


Justification for a 10-point, end-defined scale

The use of end-defined scales was pioneered by Jones and Thurstone (1955). They generated
a 9-point scale in relation to food preference, anchored by „greatest like/dislike‟ (+4/-4), with
a central category of „Neither like nor dislike‟ (0), and with the intermediate categories
labeled only by their appropriate integer. So, do such scales produce data that is different
from conventionally labeled Likert scales?

The answer appears to be in the negative. Matell and Jacoby (1972) used verbally anchored
adjective statements related to civic beliefs, with the number of intervening points varying
from two to 19. Apart from the fact that testing time increased with >12-point formats, no
differences were found on the proportion of scale utilized (>3-points) or in the proportion of
„uncertain‟ responses (>5-points). Similarly, Wyatt and Meyers (1987) using a 5-point scale,
and Dixon et al. (1984) using a 6-point scale, found no systematic differences between the
data from end-defined and conventional Likert scales. From this it can be concluded that the
end-defined format seems not to bias the data in any particular way. It is also interesting to
note that an increasing number of recent authors (e.g. Hooker & Siegler, 1993; Watkins, et
al., 1998) are using 10-point end-defined scales.


                                          Conclusion

Likert scales in common use, within the field of SQOL measurement, were devised for a
quite different context. Of particular importance in this regard is the fact that SQOL data are
negatively skewed, which means that most people will respond only to a restricted portion of
the conventional scale. Moreover, when SQOL is used as a measure of outcome, scale
sensitivity becomes a critical concern since this construct has a high trait component, and
small deviations are highly meaningful. So, it is proposed, the number of choice points needs
to be expanded.

Such expansion appears not to systematically influence scale reliability, and is therefore
psychometrically feasible, but is made difficult by the convention of naming all response
categories. It has been argued that this naming is quite unnecessary and actually detracts
from the interval nature of the scale. So the solution proposed is to adopt 10-point, end-
defined scales. These offer a form of rating (one to ten) which lies within common
experience and produce increased sensitivity of the measurement instrument.
                                         References

Andrews, F.M., & Withey, S.B. (1976). Social Indicators of well-being: Americans’
       perceptions of life quality. New York: Plenum Press.
Bardo, J.W., & Yeager, S.J. (1982a). Consistency of response style across types of response
       formats. Perceptual and Motor Skills, 55, 307-310.
Bardo, J.W., & Yeager, S.J. (1982b). Note on reliability of fixed-response formats.
       Perceptual and Motor Skills, 54, 1163-1166.
Bardo, J.W., Yeager, S.J., & Klingsporn, M.J. (1982). Preliminary assessment of format-
       specific control tendency and leniency error in summated rating scales. Perceptual
       and Motor Skills, 54, 227-234.
Bendig, A.W. (1954). Reliability of short rating scales and the heterogeneity of the rated
       stimuli. Journal of Applied Psychology , 38, 167-170.
Chang, L. (1994). A psychometric evaluation of 4-point and 6-point Likert-type scales in
       relation to reliability and validity. Applied Psychological Measurement, 18, 205-216.
Cicchetti, D.V., Showalter, D., & Tyrer, P.J. (1985). The effect of number of rating scale
       categories on levels of interater reliability: A Monte Carlo investigation. Applied
       Psychological Measurement, 9, 31-36.
Cronbach, L.J. (1946). Response sets and test validity. Educational and Psychological
       Measurement, 6, 475-494.
Cronbach, L.J. (1950). Further evidence on response sets and test design. Educational and
       Psychological Measurement, 10, 3-31.
Cummins, R.A. (1995). On the trail of the gold standard for subjective well-being. Social
       Indicators Research, 35, 179-200.
Cummins, R.A. (1997). The Directory of Instruments to measure quality of life and cognate
       areas of study. Fourth Edition. Melbourne: Deakin University.
Cummins, R.A. (1997). The Comprehensive Quality of Life Scale – Adult. Fifth Edition.
       Melbourne: Deakin University.
Cummins, R.A. (1998). The second approximation to an international standard for life
       satisfaction. Social Indicators Research, 43, 307-334.
Diefenbach, M.A., Weinstein, N.D., & O‟Reilly, J. (1993). Scales for assessing perceptions
       of health hazard susceptibility. Health Education Research, 8, 181-192.
Dixon, P.N., Bobo, M., & Stevick, R.A. (1984). Response differences and preferences for
       all-category-defined and end-defined Likert formats. Educational and Psychological
       Measurement, 44, 61-66.
Ferguson, L.W. (1941). A study of the Likert technique of attitude scale construction.
       Journal of Social Psychology, 13, 51-57.
Freyd, M. (1923). The graphic rating scale. Journal of Educational Psychology, 14, 83-102.
Guyatt, G.H., & Jaeschker, R. (1990). Measurement in clinical trials: Choosing the
       appropriate approach. In: B. Spilker (Ed.), Quality of life assessment in clinical trials
       (pp.37-46). New York: Raven Press.
Hooker, K., & Siegler, I.C. (1993). Life goals, satisfaction, and self-rated health: Preliminary
       findings. Experimental aging research, 19, 97-110.
Jacoby, J., & Matell, M.S. (1971). Three-point Likert scales are good enough. Journal of
       Marketing Research, 8, 495-500.
Jaeschke, R., & Guyatt, G.H. (1990). How to develop and validate a new quality of life
       instrument. In: B. Spilker (Ed.) Quality of life assessment in clinical trials (pp.47-
       57). New York: Raven Press.
Jenkins, G.D., & Taber, T.D. (1977). A monte carlo study of factors affecting three indices
       of composite scale reliability. Journal of Applied Psychology, 62, 392-398.
Jones, L.V., & Thurstone, L.L. (1955). The psychophysics of semantics. Journal of Applied
        Psychology, 39, 31-36.
Kenney, R.M. (1981). Between never and always. New England Journal of Medicine, 305,
        1097-8.
Komorita, S.S., & Graham, W.K. (1965). Number of scale points and the reliability of scales.
        Educational and Psychological Measurement, 25, 987-995.
Likert, R. (1932). A technique for the measurement of attitudes. Archives in Psychology,
        140, 1-55.
Lissitz, R.W., & Green, S.B. (1975). Effect of the number of scale points on reliability: A
        Monte Carlo approach. Journal of Applied Psychology, 60, 10-13.
Matell, M.S., & Jacoby, J. (1971). Is there an optimal number of alternatives for Likert
        scale items? Study 1: Reliability and validity. Educational and Psychological
        Measurement, 31, 657-674.
McCauley, C., & Bremer, B.A. (1991). Subjective quality of life for evaluating medical
        intervention. Evaluation and the Health Professions, 14, 371-387.
McKelvie, S.J. (1978). Graphic rating scale – How many categories? British Journal of
        Psychology, 69, 185-202.
Nakao, M.A., & Axelrod, S. (1983). Numbers are better than words: Verbal specifications of
        frequency have no place in medicine. American Journal of Medicine, 74, 1061-1065.
Peabody, D. (1962). Two components in bipolar scales: Direction and extremeness.
        Psychological Review, 69, 65-73.
Roy Morgan Research (1993). International values audit, 22/23 May. Melbourne: Roy
        Morgan Research Centre.
Russell, C., & Bobko, P. (1992). Moderated regression analysis and Likert scales: Too
        coarse for comfort. Journal of Applied Psychology, 77, 336-342.
Solomon, E., & Kopelman, R.E. (1984). Questionnaire format and scale reliability: An
        examination of three modes of item presentation. Psychological Reports, 54, 447-
        452.
Spector, P.E. (1976). Choosing response categories for summated rating scales. Journal of
        Applied Psychology, 61, 374-375.
Tavana, M., Kennedy, D.T., & Mohebbi, B. (1997). An applied study using the analytic
        hierarchy process to translate common verbal phrases to numerical probabilities.
        Journal of Behavioral Decision Making, 10, 133-150.
Upshaw, H.S. (1962). Own attitude as an anchor in equal-appearing intervals. Journal of
        Abnormal and Social Psychology, 64, 85-96.
Vokman, P. (1951). In: Upshaw, H.S. (1962). Own attitude as an anchor in equal-appearing
        intervals. Journal of Abnormal and Social Psychology, 64, 85-96.
Ware, J.E., & Gander, B. (1994). The SF-36 health survey: Development and use in mental
        health research and the IQOLA project. International Journal of Mental Health, 23,
        49-73.
Ware, J.E., & Sherbourne, C.D. (1992). The MOS 36-item short-form health survey (SF-
        36). Medical Care, 30, 473-481.
Watkins, D., Akande, A., Fleming, J., et al. (1998). Cultural dimensions, gender, and the
        nature of self-concept: A fourteen-country study. International Journal of
        Psychology, 33, 17-31.
Watson, G.B. (1930). Happiness among adult students of education. Journal of Educational
        Psychology, 21, 79-109.
Wyatt, R.C., & Meyers, L.S. (1987). Psychometric properties of four 5-point Likert-type
        response scales. Educational and Psychological Measurement, 47, 27-35.

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:16
posted:7/13/2011
language:English
pages:13