Voice quality measurement: Understanding VoIP

By Alan Clark
CEO and President
Telchemy Inc.
E-mail: alan.clark@telchemy.com

Voice-over-IP systems can suffer from significant call-quality and performance-management problems. To successfully monitor, manage and diagnose these problems, network managers and engineers need to understand basic call-quality measurement techniques.

IP call quality can be affected by noise, distortion, too high or low signal volume, echo, gaps in speech and a variety of other problems. When measuring call quality, three basic categories are studied. Listening quality refers to how users rate what they “hear” during a call. Conversational quality refers to how users rate the overall quality of a call based on listening quality and their ability to converse during a call. This includes any echo- or delay-related difficulties that may affect the conversation. Transmission quality refers to the quality of the network connection used to carry the voice signal. This is a measure of network service quality as opposed to the specific call quality.

The objective of call-quality measurement is to obtain a reliable estimate of one or more of the above categories using either subjective or objective testing methods, i.e. using human test subjects or computer-based measurement tools.

Listening quality
Subjective testing is the “time-honored” method of measuring voice quality, but is a costly and time-consuming process. One of the better-known subjective test methodologies is the absolute category rating (ACR) test.

In an ACR test, a pool of listeners rate a series of audio files using a five-grade scale, from 1 (bad) to 5 (excellent).

After obtaining individual scores, the average or mean opinion score (MOS) for each audio file is calculated. To achieve a reliable result for an ACR test, a large pool of test subjects should be used (16 or more) and the test should be conducted under controlled conditions using a quiet environment. Generally, scores become more stable as the number of listeners increases. To reduce the variability in scores and to help with scaling of results, tests commonly include reference files that have industry-accepted MOSes.

Figure 1 shows the raw votes from an actual ACR test with 16 listener votes that resulted in a MOS of 2.4. The high number of votes for opinion scores 2 and 3 is consistent with the MOS of 2.4. However, many listeners did vote scores of 1 and 4.

[Chart: number of votes versus opinion score, where 1 = Bad, 2 = Poor, 3 = Fair, 4 = Good and 5 = Excellent.]
Figure 1: The high number of votes for opinion scores 2 and 3 is consistent with the MOS of 2.4.

When conducting a subjective test, it is important to understand that the test is truly subjective and that the test results can vary considerably. Within the telephony industry, manufacturers often quote MOSes associated with codecs. In reality, these scores are a value selected from a given subjective test.

Test labs typically use high-quality audio recordings of phonetically balanced source text, such as the Harvard Sentences, for input to the VoIP system being tested. The Harvard Sentences are a set of English phrases chosen so that the spoken text will contain the range of sounds typically found in speech. Recordings are obtained in quiet conditions using high-resolution (16-bit) digital recording systems and are adjusted to standardized signal levels and spectral characteristics. The International Telecommunication Union (ITU) and the Open Speech Repository (OSR) are sources of phonetically balanced speech material.

In addition to ACR, other types of subjective tests include the degradation category rating (DCR) and comparison category rating (CCR). DCR methodology looks at the level of degradation for the impaired files and produces a DMOS. The CCR test compares pairs of files and produces a CMOS.

To differentiate between listening and conversational scores, ITU introduced the terms MOS-listening quality (MOS-LQ) and MOS-conversational quality (MOS-CQ) with the additional suffixes (S)ubjective, (O)bjective and (E)stimated. Hence, a listening quality score from an ACR test is a MOS-LQS.

Conversational quality
Conversational quality testing is more complex and, hence, is used less frequently. In this type of test, a pool of listeners are typically placed into interactive communication scenarios and asked to complete a task over a telephone or VoIP system. Testers introduce effects such as delay and echo, and the test subjects are asked for their opinion on the quality of the connection.

The effect of delay on conversational quality is very task-dependent. For non-interactive tasks, one-way delays of several hundred milliseconds can be tolerated; for highly interactive tasks, even short delays can introduce conversational difficulty.

The task-dependency of delay introduces some questions over the interpretation of conversational call-quality metrics. For example, two identical VoIP system connections have 300ms of one-way delay; however, one supports a highly interactive business negotiation, while the other supports an informal chat between friends. In the first example, users may say that call quality was bad. In the second case, the users probably would not even notice the delay.

In an effort to supplement subjective listening quality testing with lower-cost objective methods, the ITU developed P.861 (PSQM) and the newer P.862. These measurement techniques determine the distortion introduced by a transmission system or codec by comparing an original reference file sent into the system with the impaired signal that came out. Although these techniques were developed for lab testing of codecs, they are widely used for VoIP network testing.

The P.861 and P.862 algorithms divide the reference and impaired signals into short overlapping blocks of samples, calculate Fourier transform coefficients for each block and compare the sets of coefficients. P.862 produces a PESQ score that has a similar range to MOS; however, it is not an exact mapping. The new PESQ-LQ score is more closely aligned with listening-quality MOS. These algorithms require access to both the source and output files to measure the relative distortion.

In 2004, the ITU standardized P.563, a single-ended objective measurement algorithm that operates on the received audio stream alone. The MOSes produced by P.563 are more widely spread than those produced by P.862 and it is necessary to average the results of multiple tests to achieve a stable quality metric. This approach is not suited for measuring individual calls, but can produce reliable results when used over many calls to measure service quality.

As this type of algorithm requires significant computation for every sample (i.e. processing 8KSps for narrowband voice and 16KSps for wideband voice), the processing load (of the order of 100MIPS per call stream) and memory requirements are quite significant. For many applications, this is impractical, in which case, packet-based approaches should be used.

E model, VQmon
VQmon is an efficient VoIP call-quality monitoring technology based on the E model. It is able to obtain call-quality scores using typically less than one-thousandth of the processing power needed by the P.861/862/563 approaches. The E model was originally developed within the European Telecommunications Standards Institute (ETSI) as a transmission planning tool for telecommunication networks; however, it is widely used for VoIP service quality measurement.

Based on several earlier opinion models, the E model has a lengthy history. The E model was standardized by the ITU as recommendation G.107 in 1998 and is being updated and revised annually. Some extensions to the E model that enable its use in VoIP service quality monitoring were developed by Telchemy Inc. and have been standardized in ETSI TS 101 329-5 Annex E.

The objective of the E model is to determine a transmission quality rating, i.e. the “R” factor, which incorporates the “mouth-to-ear” characteristics of a speech path. The range of the R factor is nominally 0-120. The typical range for R factors is 50-94 for narrowband telephony and 50-110 for wideband telephony. The R factor can be converted to estimated conversational and listening quality MOSes (MOS-CQ and MOS-LQ).

The E model is based on the premise that the effects of impairments are additive. The basic E model equation is:

     R = Ro - Is - Id - Ie + A

where Ro is a base factor determined from noise levels and loudness; Is represents signal impairments occurring simultaneously with speech, including loudness, quantization (codec) distortion and non-optimum sidetone level; Id represents impairments that are delayed with respect to speech, including echo and conversational difficulty due to delay; and Ie is the “equipment impairment factor” and represents the effects of VoIP systems on transmission signals. A is the “advantage factor” and represents the user’s expectation of quality when making a phone call. For example, a mobile phone is convenient to use; hence, people are more forgiving of quality-related problems.

VQmon is an extended version of the E model that incorporates the effects of time-varying IP network impairments and provides a more accurate estimate of user opinion. VQmon also incorporates extensions to support wideband codecs.

Figure 2 shows the relationship between the R factor generated by the E model and MOS. The official mapping function provided in ITU G.107 gives a MOS of 4.4 for an R factor of 93 (corresponding to a typical unimpaired G.711 connection, the equivalent of a regular telephone connection).

[Chart: mean opinion score (1 to 5) versus R factor (0 to 100), with three curves: ITU G.107, Typical ACR and Japan TTC.]
Figure 2: The official mapping function provided in ITU G.107 gives a MOS of 4.4 for an R factor of 93 (corresponding to an unimpaired connection).

Recent ACR subjective test data suggests that a MOS of 4.1 to 4.2 would be more appropriate for unimpaired G.711. This would provide a slightly different mapping for the typical ACR than the one shown in Figure 2.

In Japan, the TTC committee developed an R factor to MOS mapping methodology that provides a closer match based on the results of subjective tests conducted in Japan. The TTC scores are traditionally lower than those in the U.S. and Europe due in some part to cultural perceptions of quality and voice transmission. Therefore, the chart shows three potential mappings from R to MOS: ITU G.107 mapping, ACR mapping and Japanese TTC mapping.

Another complication is introduced when wideband codecs are used. An ACR test is on a fixed 1-5 scale and is relative to some reference conditions. In a wideband test, the same scale is used; thus, a wideband codec may have a MOS of 3.9 even if it sounds much better than a narrowband codec with a MOS of 4.1. This is not the case for R factors, which have a scale that encompasses both narrowband and wideband. Thus, a wideband codec may result in an R factor of 105, while a typical narrowband codec may result in an R factor of 93.

Figure 3 shows the relationship between R factor and the percentage of subscribers that would typically regard the call as being good or better, poor or worse, or terminate the call early. For example, at an R factor of 60, over 40 percent of subscribers would regard the call quality as good. Nearly 20 percent of the subscribers would regard the call quality as poor and almost 10 percent would terminate the call early.

[Chart: percentage of subscribers versus R factor (0 to 100), with three curves: Terminate early, Poor or worse and Good or better.]
Figure 3: At an R factor of 60, over 40 percent of subscribers would regard the call quality as good, 20 percent would regard it as poor and almost 10 percent would terminate the call early.

Acceptable voice-quality levels
Generally, an R factor of 80 or above represents a good objective. However, there are some key things to note:
• Since R factors are conversational metrics, the statement that R factors should be 80 or more implies both a good listening quality and low delay. Stating that (ITU-scaled) MOS should be 4.0 or better is not the same, as it assumes that this is MOS-LQ and does not incorporate delay. Saying that R should be 80 or more and MOS should be 4.0 or more is not consistent. Telchemy introduced the notation R-LQ and R-CQ to deal with this; hence, an R-LQ of 80 would be comparable with a MOS of 4.0.
• The MOS typically quoted by manufacturers for G.729A is 3.9, implying that G.729A could not meet the ITU-scaled MOS for “satisfied.” However, the G.729A is widely used and appears to be quite acceptable. This problem is due to the scaling of MOS and not the codec. Typical ACR scores for codecs should be compared to an ACR-scaled range. For example, “satisfied” would range from 3.7 to 4.1 and hence, the G.729A MOS of 3.9 would be within the “satisfied” range.

User opinion                     R factor    MOS (ITU-scaled)    MOS (ACR-scaled)
Maximum obtainable for G.711     93          4.4                 4.1
Very satisfied                   90-100      4.3-5               4.1-5
Satisfied                        80-90       4-4.3               3.7-4.1
Some users satisfied             70-80       3.6-4               3.4-3.7
Many users dissatisfied          60-70       3.1-3.6             2.9-3.4
Nearly all users dissatisfied    50-60       2.6-3.1             2.4-2.9
Not recommended                  0-50        1-2.6               1-2.4
Figure 4: Generally, an R factor of 80 or more represents a good objective.

When specifying call-quality objectives, it is important to be clear about terminology: either specify R-CQ or MOS-CQ, or the combination of MOS-LQ and delay. If using wideband and narrowband codecs, be aware that there is a need to interpret MOS scores as narrowband MOS or wideband MOS to avoid confusion.
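
As a rough illustration of the E model arithmetic discussed in this article, the following Python sketch computes R = Ro - Is - Id - Ie + A, converts R to an ITU-scaled MOS and looks up the user-opinion band from Figure 4. The conversion polynomial is the narrowband R-to-MOS formula as commonly cited from ITU G.107; it is not given in this article and should be checked against the recommendation before use. The function names and the sample impairment values are hypothetical, chosen only for illustration, and the sketch does not represent how VQmon itself is implemented.

```python
"""Sketch of E model arithmetic; names and sample values are illustrative."""


def r_factor(ro: float, is_: float, id_: float, ie: float, a: float = 0.0) -> float:
    """Basic E model equation from the article: R = Ro - Is - Id - Ie + A."""
    return ro - is_ - id_ - ie + a


def r_to_mos(r: float) -> float:
    """Map an R factor to an ITU-scaled MOS.

    Assumption: this is the narrowband polynomial commonly cited from
    ITU G.107; the article refers to the mapping but does not reproduce it.
    """
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    mos = 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6
    # The polynomial dips slightly below 1 for very small R, so clamp it.
    return max(1.0, mos)


# Lower R bound of each user-opinion band, taken from Figure 4.
# A value on a shared boundary (e.g. 80) is assigned to the higher band.
OPINION_BANDS = [
    (90.0, "Very satisfied"),
    (80.0, "Satisfied"),
    (70.0, "Some users satisfied"),
    (60.0, "Many users dissatisfied"),
    (50.0, "Nearly all users dissatisfied"),
]


def user_opinion(r: float) -> str:
    """Return the Figure 4 user-opinion label for an R factor."""
    for lower_bound, label in OPINION_BANDS:
        if r >= lower_bound:
            return label
    return "Not recommended"


if __name__ == "__main__":
    # Hypothetical impairment values, not measurements from a real call.
    r = r_factor(ro=93.0, is_=0.0, id_=2.0, ie=11.0)
    print(f"R = {r:.1f}, MOS = {r_to_mos(r):.2f}, opinion: {user_opinion(r)}")
```

With this polynomial, an R factor of 93 maps to a MOS of about 4.4 and the band edges 50, 60, 70, 80 and 90 map to roughly 2.6, 3.1, 3.6, 4.0 and 4.3, which agrees with the ITU-scaled column of Figure 4. The ACR-scaled and TTC mappings would need their own curves, which this sketch does not attempt.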