Clinical Reliability of Manual Muscle Testing
Middle Trapezius and Gluteus Medius Muscles

The purposes of this study were to develop a protocol to examine the reliability of manual muscle testing in a clinical setting and to use that protocol to assess the interrater reliability of manually testing the strength of the middle trapezius and gluteus medius muscles. One hundred ten patients with various diagnoses participated as subjects, and 11 physical therapists participated as examiners in this study. The results showed that interrater reliability for right and left middle trapezius and gluteus medius muscles was low. The percentage of therapists obtaining a rating of the same grade or within one third of a grade ranged from 50% to 60% for the four muscles. This study indicates that using manual muscle testing to make accurate clinical assessments of patient status is of questionable value.

Key Words: Manual muscle testing, Muscle hypotonia, Physical therapy.

Mrs. Frese is Instructor, Department of Physical Therapy, St. Louis University, 1504 S Grand Blvd, St. Louis, MO 63104 (USA). She was a master's degree student, Program in Physical Therapy, School of Medicine, Washington University, St. Louis, MO, when this study was completed.
Dr. Brown is Instructor, Program in Physical Therapy, PO Box 8083, School of Medicine, Washington University, 660 S Euclid Ave, St. Louis, MO 63110.
Mrs. Norton is Instructor, Program in Physical Therapy, School of Medicine, Washington University.
This study was completed in partial fulfillment of the requirements for Mrs. Frese's master's degree, Washington University.
This article was submitted April 14, 1986; was with the authors for revision 10 weeks; and was accepted August 27, 1986. Potential Conflict of Interest: 4.

Manual muscle testing is an important clinical tool used by physical therapists to determine a patient's muscle strength. Muscle testing originated in the United States in the early 1900s during the study of muscle function in patients with poliomyelitis. Despite the change in the role of manual muscle testing with the end of the last poliomyelitis epidemic in this country, it remains an important clinical tool for assessing the muscular causes of movement dysfunction. Testing of muscles is considered to be an essential prerequisite for treatment program planning and modification. The results of manual muscle testing also are used to make clinical judgments concerning the patient's progress or deterioration, as well as to assess the effectiveness of a particular treatment.

The study of the reliability of examiners performing manual muscle tests is necessary if the tests are to be used. Manual muscle testing reliability in a clinical setting has been studied minimally. Lilienfeld et al found muscle test grades from Zero to Normal assigned by 12 to 39 examiners in four different trials to be within one grade, although the testing method was controlled because the examiners were trained by the same instructor.1 Iddings et al also found manual muscle testing to be reliable among 10 examiners whose ratings were within one grade in 90.6% of the trials.2 All of the subjects in both of these studies had the diagnosis of poliomyelitis, and the examiners were highly skilled in manual muscle testing.

The reliability of manual muscle tests has been the most difficult to achieve for grades greater than Fair because of the examiner's subjective judgment of the amount of resistance applied during the test. One of the problems central to manual muscle testing is the variable "frame of reference" for making an assessment. Such subjective judgments include determining what is normal muscle strength for an individual given the person's age and size, in addition to the relative strengths of the tester and patient.3-6

Many other factors influence the reproducibility of a manual muscle test. The testing method may vary among therapists (eg, Kendall and McCreary7 vs Daniels and Worthingham8), both because the therapists' training may have differed and because physical therapists tend to develop their own techniques and standards for grading muscle strength. Other variables that influence the accuracy of a muscle test are 1) the point and line of force application, 2) the magnitude of resistive force, 3) the speed of resistive force application, 4) the duration of the contraction, 5) the degree of cooperation from the patient, 6) fatigue, 7) various distracting influences, 8) the type of instructions given, 9) the tone of the therapist's voice, and 10) the amount of interaction between the therapist and patient.4,9-15

Beasley attempted to increase objectivity in manual muscle testing by developing a standardized scale of norms for muscle strength.16 Using an electronic myodynagraph, Beasley found a discrepancy between the percentage of Normal strength assigned in a manual muscle test and the percentage of strength found by a quantitative measure.16 The Good muscle strength group, usually rated at 75% of Normal in the manual muscle testing system,7 had only 43% of the Normal value on Beasley's standardized scale. The Fair group had a rating of only 9% of Normal, rather than 50% of Normal usually assigned. The Poor group, ordinarily rated at 25% of Normal on the manual scale, had a rating of only 2.6% of Normal on the standardized scale. The standard deviations showed considerable overlap in the percentage of Normal scores in grades below Fair, indicating poor differentiation in grades below Fair, the range in which manual muscle testing supposedly is more accurate.16

The purposes of this study were to develop a protocol to examine the reliability of manually testing muscle strength in a physical therapy department and to use that protocol to assess the interrater reliability for manually testing the middle trapezius and gluteus medius muscles. We chose the two muscles indicated 1) because we wanted to examine muscles from both the upper and lower extremities and 2) because the selected muscles are difficult to test owing to the stabilization required by other muscle groups during testing. In addition, the two muscles selected for study frequently are found to be weak in patients. The hypothesis was that a staff of physical therapists working together in a physical therapy department would demonstrate interrater reliability in testing the middle trapezius and gluteus medius muscles.

METHOD

Subjects

One hundred ten patients, who were referred for physical therapy at St. Louis University Hospital, participated in the study. The patients had various musculoskeletal and neurological disorders including low back pain, degenerative joint disease, cervical pain, gunshot wound, chondromalacia, rheumatoid arthritis, and connective tissue disease. The patients had to exhibit sufficient range of motion to allow the body part to be placed in the test position and either pain-free motion or pain that did not interfere with the muscle test. The test group consisted of 50 female and 60 male subjects, aged 15 to 76 years, with a mean age of 41 years (± 15 years).

Examiners

Eleven staff physical therapists at St. Louis University Hospital served as the examiners. All examiners were graduates of accredited university programs. Seven were graduates of the same university, 2 others graduated from another university, and the remaining 2 therapists graduated from two other different schools. The mean number of years of experience of the staff members was 2.3 ± 1.2 years. Eight of the therapists preferred the Kendall and McCreary muscle testing technique,7 2 preferred that of Daniels and Worthingham,8 and 1 used both.

Each therapist received a work sheet with 10 spaces for 10 different patients. Next to each space was the name of the therapist with whom the examiner had been paired randomly for that particular patient. A different therapist's name appeared in each space so every examiner was paired with 1 of 10 different therapists. Each therapist also received a second work sheet with 10 spaces to be used for recording muscle grades of another therapist's patient when her name appeared on that therapist's list for that patient. Each examiner then selected 10 patients to be included in the study. The Appendix gives the muscle testing scale and definitions that all of the therapists used.

Testing Procedures

Manual muscle testing was performed during the patient's daily treatment session. A rest period of at least three minutes was allowed between the two examiners' tests, and the two therapists kept their results confidential. The examiners used a "break" test, and for the gluteus medius muscle test, the patient's hip was placed in as much extension as possible.

Testing Sequence

The testing sequence involved the following steps:
1. The examiner first identified a patient suitable for the study.
2. The examiner performed the middle trapezius and gluteus medius muscle tests bilaterally. The side and muscle to be tested first were assigned randomly before the beginning of the test phase. The examiner used her accustomed technique of muscle testing to determine the appropriate grade and repeated the test several times, if needed, to assign a grade. She then recorded the grades in the appropriate space on her work sheet.
3. A second therapist, who had been paired randomly with her for that patient, then performed the same two muscle tests in the same order, but using her own testing technique. The second therapist also repeated the test several times, if necessary, to determine a grade. She then recorded the grades on her work sheet.

Data Analysis

Cohen's weighted Kappa (Kw) determination17 was used as an index of agreement for interrater reliability. This index weights disagreements by the amount of disagreement. A weight matrix (Tab. 1) was designed for the scores obtained in the study and gave the ratio-scaled degrees of disagreement assigned to each cell. Each cell in the matrix represents one score for each examiner. For example, the cell for Normal-Normal had a weight value of 1.0, the cell for Good-Normal was 0.7, and the cell for Poor minus-Normal was 0.0.

To determine whether eliminating the pluses and minuses would improve the reliability coefficient, we compressed the original scores into a five-point scale. Pluses and minuses were assigned the same score as the main grade (eg, Fair plus and Fair minus became Fair), and a weight matrix was designed for these scores.

The muscle test scores of every patient whom Therapist 1 examined were compared with the scores of each of the other 10 therapists with whom she was paired. An interrater reliability coefficient then was computed for Therapist 1. This process was repeated for each therapist so that an interrater reliability coefficient was computed for all 11 examiners. By doing so, we wanted to determine whether any particular therapist appeared to be less reliable compared with the other 10, and whether the school the therapist graduated from or her years of experience were factors affecting reliability.

RESULTS

Table 2 summarizes the percentages of the total number of subjects on whom the examiners agreed, in addition to percentages of agreement within several ranges of disparity (ie, fractions of grades they were apart). The percentage of subjects on whom the same grade was obtained by two examiners ranged from 28% to 47% for the four muscles, and for 89% to 92% of the subjects we found either complete agreement or agreement within one grade.

The percentage of patients who were rated Fair plus or above by one or both examiners was 88% for the right middle trapezius muscle, 90% for the left middle trapezius muscle, 91% for the right gluteus medius muscle, and 95% for the left gluteus medius muscle. One or both examiners assigned a grade of Normal in 50% of the tests for the right middle trapezius muscle, in 44% of the tests for the left middle trapezius muscle, in 67% of the tests for the right gluteus medius muscle, and in 70% of tests for the left gluteus medius muscle.

Table 3 gives the interrater reliability coefficients for both the original and the
compressed muscle testing scores. The reliability for the original scores was low, ranging from .11 to .58. Compressing the scores did not change the interrater reliability coefficient appreciably (.26-.42). Even for grades below Fair, we found poor interrater reliability.

Table 4 summarizes the results of comparing each of the examiners with every other examiner for each test. Reliability coefficients ranged from .04 to .66 with no pattern of high reliability being established by any one therapist. Those therapists with more clinical experience did not demonstrate any greater level of reliability than those who had graduated more recently. The school from which the therapist graduated did not appear to affect reliability because those therapists who graduated from the same university did not demonstrate any greater reliability among each other than the therapists who graduated from different schools. Therapist 3 demonstrated low reliability coefficients on all four tests (.08-.19).

TABLE 1
Weight Matrix for Original Scores(a)

Muscle Test Scores            Muscle Test Scores for Examiner 2
for Examiner 1       P-    P     P+    F-    F     F+    G-    G     G+    N-    N
P-                   1.0   0.9   0.8   0.7   0.6   0.5   0.4   0.3   0.2   0.1   0.0
P                    0.9   1.0   0.9   0.8   0.7   0.6   0.5   0.4   0.3   0.2   0.1
P+                   0.8   0.9   1.0   0.9   0.8   0.7   0.6   0.5   0.4   0.3   0.2
F-                   0.7   0.8   0.9   1.0   0.9   0.8   0.7   0.6   0.5   0.4   0.3
F                    0.6   0.7   0.8   0.9   1.0   0.9   0.8   0.7   0.6   0.5   0.4
F+                   0.5   0.6   0.7   0.8   0.9   1.0   0.9   0.8   0.7   0.6   0.5
G-                   0.4   0.5   0.6   0.7   0.8   0.9   1.0   0.9   0.8   0.7   0.6
G                    0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0   0.9   0.8   0.7
G+                   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0   0.9   0.8
N-                   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0   0.9
N                    0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0

(a) Eleven possible scores ranging from P- to N.

TABLE 2
Percentage of Agreement Among Scores for Subjects(a)

                                              Muscles(b)
                                   RMT          LMT          RGM          LGM
Grade                             n    %       n    %       n    %       n    %
Same grade                       31   28      32   29      52   47      50   45
1/3 grade apart                  24   22      27   25      11   10      17   15
2/3 grade apart                  19   17      23   21      24   22      15   14
1 grade apart                    25   23      15   14      14   13      16   15
1 1/3 grades apart                6    5       8    7       4    4       7    6
1 2/3 grades apart                5    5       4    4       1    1       5    5
Within 1 grade apart             68   62      65   60      49   45      48   44
Same grade or within 1 grade     99   90      97   89     101   92      98   89

(a) Each grade was divided into thirds with the use of pluses and minuses; therefore, the difference between 2 and 2+ was considered 1/3, the difference between 2- and 2+ was 2/3, and the difference between 2 and 3 was one grade.
(b) RMT = right middle trapezius, LMT = left middle trapezius, RGM = right gluteus medius, LGM = left gluteus medius.

TABLE 3
Interrater Reliability of Original and Compressed Scores

                                 Muscles(a)
Conditions       N        RMT      LMT      RGM      LGM
                          Kw(b)    Kw       Kw       Kw
Original        110       .58      .29      .25      .11
Compressed      110       .26      .26      .30      .42

(a) RMT = right middle trapezius, LMT = left middle trapezius, RGM = right gluteus medius, LGM = left gluteus medius.
(b) Kw = weighted Kappa coefficient.

DISCUSSION

Using Cohen's weighted Kappa determination, we found interrater reliability for manually testing the strength of middle trapezius and gluteus medius muscles in a clinical setting to be poor. When the results were expressed as percentages of agreement, however, they were similar to the findings of Lilienfeld et al1 and Iddings et al2 who reported good reliability within one grade among experienced examiners (more experienced than those in our study). The results (28%-47% agreement) did not agree with those of Williams,10 who found that two examiners agreed completely between 60% and 75% of the time. The examiners in our study agreed more frequently on the gluteal muscle tests than on the middle trapezius muscle tests for reasons we could not determine. We also found poor interrater reliability in grades below Fair, which agrees with Beasley's16 finding of poor differentiation in grades below Fair.
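
The weighted Kappa index described under Data Analysis can be illustrated with a short sketch. It assumes the linear agreement weights shown in the weight matrix (Tab. 1), where identical grades score 1.0 and each third of a grade of separation lowers the weight by 0.1, and it uses the standard form Kw = (Po - Pe) / (1 - Pe), with Po the weighted observed agreement and Pe the weighted agreement expected by chance. The function and variable names are hypothetical, and this is not the authors' computation; it is only one way the index could be calculated from paired grades.

```python
from collections import Counter

# The 11 grades of the original scale, ordered from Poor minus to Normal.
GRADES = ["P-", "P", "P+", "F-", "F", "F+", "G-", "G", "G+", "N-", "N"]


def agreement_weight(i, j, k):
    """Linear agreement weight: 1.0 on the diagonal, 0.0 at maximum distance.

    With k = 11 this reproduces the published matrix, e.g. Good vs Normal = 0.7.
    """
    return 1.0 - abs(i - j) / (k - 1)


def weighted_kappa(ratings_1, ratings_2, grades=GRADES):
    """Cohen's weighted Kappa for two examiners rating the same subjects."""
    if len(ratings_1) != len(ratings_2) or not ratings_1:
        raise ValueError("both examiners must rate the same non-empty set of subjects")
    k = len(grades)
    index = {grade: position for position, grade in enumerate(grades)}
    n = len(ratings_1)

    # Joint counts of (examiner 1 grade, examiner 2 grade) and each examiner's marginals.
    pairs = Counter((index[a], index[b]) for a, b in zip(ratings_1, ratings_2))
    row = Counter(index[a] for a in ratings_1)
    col = Counter(index[b] for b in ratings_2)

    # Weighted observed agreement and weighted chance-expected agreement.
    observed = sum(agreement_weight(i, j, k) * c / n for (i, j), c in pairs.items())
    expected = sum(
        agreement_weight(i, j, k) * row[i] * col[j] / n**2 for i in row for j in col
    )
    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement is complete")
    return (observed - expected) / (1.0 - expected)
```

Passing a shorter `grades` list applies the same linear weighting to a compressed scale; the article does not give the weight matrix used for the compressed scores, so that use would be an additional assumption.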
Compressing the scores by eliminating pluses and minuses did not appreciably change the interrater reliability coefficients. The coefficient for the right middle trapezius muscle decreased, possibly because the interval widened between grades with pluses and minuses when they were compressed (eg, Fair plus-Good minus was changed to Fair-Good).

The distribution of the scores might have affected the reliability or agreement coefficient. Because the majority of the subjects' scores were Fair plus or greater for all of the muscles (88%-95%), the scores were not well distributed across all possible muscle grades. This skewed distribution might have reduced spuriously the magnitude of the Kappa coefficient. A broader range of scores should improve the chances of demonstrating an accurate measure of agreement. Because we established the criterion of pain not interfering with the muscle test, some of the weaker patients may have been excluded from the study.

One procedural problem that could have affected our results was the difficulty of positioning some of the patients for a particular test. Different therapists adjusted the procedure differently to solve the problem.
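
The point that a skewed score distribution can depress Kappa can be seen in a small numerical sketch. It uses the simpler unweighted Cohen's Kappa on two hypothetical 2 x 2 tables ("Fair plus or above" vs "below Fair plus"), each with 100 subjects and the same 80% raw agreement. The counts are invented for illustration and are not data from this study.

```python
def cohen_kappa(table):
    """Unweighted Cohen's Kappa from a square table of counts.

    table[i][j] = number of subjects graded category i by examiner 1
    and category j by examiner 2.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_share = [sum(table[i]) / n for i in range(k)]
    col_share = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    expected = sum(row_share[i] * col_share[i] for i in range(k))
    return (observed - expected) / (1 - expected)


# Hypothetical counts: both tables have 80 agreements and 20 disagreements.
balanced = [[40, 10],
            [10, 40]]  # grades spread evenly across the two categories
skewed = [[75, 10],
          [10, 5]]     # most subjects graded in the higher category

kappa_balanced = cohen_kappa(balanced)  # chance agreement 0.50, Kappa 0.60
kappa_skewed = cohen_kappa(skewed)      # chance agreement 0.745, Kappa about 0.22
```

With identical raw agreement, the skewed table yields a much lower coefficient because chance-expected agreement is already high when nearly all grades fall in one category.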

TABLE 4
Interrater Reliability Among Therapists

                         Muscles(a)
Therapist       RMT      LMT      RGM      LGM
                Kw(b)    Kw       Kw       Kw
    1           .22      .15      .31      .58
    2           .21      .52      .34      .55
    3           .19      .16      .08      .13
    4           .06      .33      .52      .44
    5           .42      .25      .48      .38
    6           .28      .30      .26      .34
    7           .42      .63      .50      .29
    8           .15      .47      .66      .37
    9           .37      .46      .25      .29
   10           .14      .28      .56      .49
   11           .62      .04      .20      .11

(a) RMT = right middle trapezius, LMT = left middle trapezius, RGM = right gluteus medius, LGM = left gluteus medius.
(b) Kw = weighted Kappa coefficient.

APPENDIX
Muscle Testing Scale and Definitions*

Normal          (5)     able to hold the test against gravity and maximum resistance, or to move the part
                        into the test position and hold against gravity and maximum resistance
Normal minus    (5-)    same as for Normal except slightly less resistance can be given
Good plus       (4+)    same as for Good but slightly more resistance can be given
Good            (4)     same as for Normal except able to hold against moderate resistance
Good minus      (4-)    same as for Good but slightly less resistance can be given
Fair plus       (3+)    able to hold the test position against gravity, or to move the part into the test
                        position and hold against gravity and slight resistance
Fair            (3)     able to hold the test position against gravity, or to move the part into the test
                        position and hold against gravity
Fair minus      (3-)    able to release gradually from the test position against gravity, or to move the
                        part toward the test position against gravity almost through full range
Poor plus       (2+)    able to move the part through full range with gravity eliminated, but against
                        slight resistance
Poor            (2)     able to move the part through full range with gravity eliminated
Poor minus      (2-)    able to move the part through partial range with gravity eliminated
Trace           (1)     muscle contraction can be palpated
Zero            (0)     no contraction can be elicited

* Adapted from Kendall and McCreary7 and Daniels and Worthingham.8

The patients' age did not appear to be a factor in the low interrater reliability coefficients because the scores for the youngest and the oldest subjects in the study were not consistently any farther apart than those of the subjects in the middle age range.

Achieving reliability within one grade, as in this study, has questionable clinical value, especially when considering the differences between Poor and Fair, or Fair and Good, versus the difference between Good and Normal. The interval between each of these pairs of grades is one grade, although the therapists' subjective judgments of patient function may have been quite different. The accuracy of assessments of patient progress or deterioration, therefore, would be questionable despite reliability within one grade.

Manual muscle testing is an inexpensive, relatively quick, and convenient method for assessing a patient's muscle strength. In view of the results of this study, however, physical therapists should consider supplementing their manual muscle test scores with isokinetic testing, dynamometry, or tensiometry. Griffin et al compared the results of manual muscle testing with isokinetic testing for knee extensor muscles in patients with neuromuscular disease and found that a lack of strength improvement or a decrease in strength was demonstrated by both manual muscle testing and isokinetic testing.18 They also found, however, that in patients with a manual muscle test score of 9 to 10 (Normal minus-Normal), isokinetic testing revealed either muscle strength deficits or improvement not detectable with manual muscle testing methods. They concluded that isokinetic testing adds valuable information when patients have manual muscle test scores of Normal. Bohannon found a significant reliability correlation between manual muscle test scores and dynamometer test scores for knee extensor muscles, which indicated that both testing methods measure muscle strength similarly.19 He found a significant difference, however, between theoretical percentage manual muscle test scores and calculated dynamometer percentage test scores, which indicated that theoretical percentage scores based on manual muscle testing are likely to overestimate a patient's muscle strength. Supplementing manual muscle test scores with isokinetic testing, dynamometry, or tensiometry would decrease the subjectivity in assessing a patient's disability.

Further study is needed in this area with each therapist being paired more than twice with another therapist. One potential study might incorporate several staff in-service training sessions before the start of testing to help standardize the muscle testing techniques among the staff members as much as possible. Reliability then could be reassessed to determine whether any improvement is noted. Garraway et al were able to increase the proportion of examinations for stroke assessment, which included motor function, in which total agreement was reached from 41% to 68% after standardizing definitions, discussion and interpretation of instructions by the examiners, and practice.20

CONCLUSIONS

The results of this study do not support the research hypothesis that staff physical therapists can perform manual muscle tests reliably in a clinical setting. The results do demonstrate that the therapists are reliable within one grade; however, this degree of reproducibility may not be adequate for making clinical judgments. Supplementing muscle test scores with isokinetic testing, dynamometry, or tensiometry is suggested.

The development of a standardized method of muscle testing is needed so that different examiners can obtain comparable results in a clinical setting. Standardizing the resistance given in grades of Good and Normal so that subjective judgment is minimized is an area in which further study is needed.

Acknowledgments. We thank the physical therapy staff of St. Louis University Hospital for their cooperation and Carolyn Heriza for her advice in planning the study.


