SOU Southern Oregon University by nikeborome


									               Voice Transformations
  Definition: modifying a signal to intentionally change its characteristics

• Challenges: Signal processing techniques have advanced
  faster than our understanding of the physics
• Examples:
   – Change the rate of articulation while maintaining the formant structure
   – Alter F0 and modify the spacing between the harmonic components.
     Change between male, female, and child voices.
   – Modify the intensity: multiplying the amplitudes of signal sections
   – Voice Transformation: Alter a person’s speech to sound like another’s
   – Voice Morphing: Morph audio spoken by one speaker to sound like
     the same audio spoken by another
          Helium’s Effect on Speech
•   Changes the formants (resonances of the vocal tract), but not the pitch
•   Vocal fold tension, geometry, and length affect the pitch
•   The speed of sound is greater in helium, so the resonances shift higher
•   Diagram: Second formant shifted to the right, off the diagram.
•   Less power at lower frequencies; vowels articulate differently

          Normal voice spectrum     Helium voice spectrum

             The vertical lines are harmonics of F0
           Voice Characteristics
• Breathy voice: The amplitude of the first F0
  harmonic is much larger than the amplitude
  of the second F0 harmonic (large glottal opening)

• Creaky voice: A small or negative value results when
  subtracting the amplitude of the higher F0 harmonics
  from the amplitude of the first F0 harmonic (spectral tilt)
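
The harmonic comparison above can be illustrated with a small sketch. It assumes the two harmonic amplitudes have already been measured and expresses their difference in decibels. The function name `h1_minus_h2_db` and the sample amplitudes are hypothetical, and the sketch covers only the first-versus-second harmonic case.

```python
import math


def h1_minus_h2_db(h1_amplitude, h2_amplitude):
    """Difference in dB between the first and second F0 harmonic amplitudes."""
    return 20.0 * math.log10(h1_amplitude / h2_amplitude)


# Illustrative (made-up) linear amplitudes, not measurements.
print(h1_minus_h2_db(2.0, 1.0))  # positive: the first harmonic dominates
print(h1_minus_h2_db(0.8, 1.0))  # negative: the second harmonic dominates
```

A large positive difference points toward the breathy pattern and a small or negative one toward the creaky pattern. No decision thresholds are given here, so none are assumed.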
                 Vowel Acoustics
• Each person has a unique acoustic space:
  vowels exhibit patterns within that space
• Vowels are primarily distinguished by their
  first two formant frequencies: F1 and F2
  – F1 corresponds to vowel height
      o A lower F1 frequency implies a higher vowel
      o A higher F1 frequency implies a lower vowel
  – F2 corresponds to a front or back vowel
      o A lower F2 frequency implies a back vowel
      o A higher F2 frequency implies a front vowel
  – Lip rounding tends to lower both F1 and F2
                at different pitches

       [Figure: spectra at pitches of 100 Hz, 120 Hz, and 150 Hz]

F1 moves slightly to the right and F2 to the left as F0 increases
          Combined Formant Averages

       [Figure: combined formant averages; axis ticks run from 3000 down to 1000]

       Men: lower F0, Women: higher F0
            Synthesizing Speech
• Source-filter model
   – Excitation: glottal signal (source)
   – Time varying linear filter (vocal tract)
• Simplest form
   – Excitation
      • Quasi-periodic pulse sequences (voiced speech)
      • Noise (unvoiced speech)
   – Time varying linear filter (Linear prediction)
• Challenge: define an excitation sequence that
  produces natural sounding speech
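
The simplest form above can be sketched in a few lines. This is a minimal illustration, assuming a fixed (not time-varying) all-pole filter with made-up coefficients, an 8 kHz sampling rate, and F0 = 100 Hz. The names `pulse_train`, `white_noise`, and `all_pole_filter` are hypothetical.

```python
import random


def pulse_train(n_samples, period):
    """Voiced excitation: a unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def white_noise(n_samples, seed=0):
    """Unvoiced excitation: uniform white noise in [-1, 1]."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def all_pole_filter(excitation, coeffs):
    """Linear-prediction synthesis: y[n] = e[n] + sum_i coeffs[i-1] * y[n-i]."""
    y = []
    for n, e in enumerate(excitation):
        acc = e
        for i, a in enumerate(coeffs, start=1):
            if n - i >= 0:
                acc += a * y[n - i]
        y.append(acc)
    return y


# Illustrative values only: the coefficients give one stable resonance.
COEFFS = [1.3, -0.9]
voiced = all_pole_filter(pulse_train(400, period=80), COEFFS)
unvoiced = all_pole_filter(white_noise(400), COEFFS)
```

Both outputs pass through the same filter; only the excitation differs. That is why the choice of excitation sequence largely determines how natural the result sounds.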
           Synthesis Approaches
• Multi-pulse sequences of zeros and ones to better
  represent the glottal excitation
• Combine a series of sinusoids to create a “glottal-like” excitation
• Determine F0 and use harmonics of F0 as excitation
• Concatenation and unit selection approaches

 Most modern synthesis implementations use unit
 selection. However, because of a desire to implement
 voice transformation algorithms, there is a renewed
 focus on digital signal processing techniques.
            Pitch and Rate of Change
• TD-PSOLA (Time-Domain Pitch-Synchronous Overlap and Add)
• Advantages
   – Does a good job when changes are less than a factor of two
   – Time domain algorithm; very efficient
• Disadvantages: Not sufficient for complex transformations
       • Amplitude and phase relationships between formants must be maintained
       • Repeated fricative frames start to sound tonal. Reversing or
         randomizing fricative spectra helps, but not for voiced fricatives.
       • Increasing the articulation rate compresses vowels by 50% and
         consonants by 25% (consonants are protected because they carry
         more information).
       • The pitch values and contour are affected.
       • Non-linearities between sub-glottal resonances
       • Unexpected artifacts contained in the synthesized signal
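
The overlap-and-add idea behind TD-PSOLA can be sketched for the easiest case. The sketch assumes a constant, already known pitch period and changes duration only. A real implementation would detect pitch marks and handle unvoiced frames. The name `psola_stretch` and the test tone are hypothetical.

```python
import math


def hann(n):
    """Hann window of n samples (n odd, so two half-overlapped copies sum to 1)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def psola_stretch(signal, period, factor):
    """Stretch duration by `factor` while keeping the pitch period unchanged."""
    # Analysis pitch marks: one per period (assumed known and constant).
    marks = list(range(period, len(signal) - period, period))
    window = hann(2 * period + 1)
    out = [0.0] * int(len(signal) * factor)
    # Synthesis marks keep the original spacing, so the pitch is preserved.
    for out_mark in range(period, len(out) - period, period):
        # Reuse the analysis grain nearest to the corresponding input time.
        src = min(marks, key=lambda m: abs(m - out_mark / factor))
        for j in range(-period, period + 1):
            out[out_mark + j] += signal[src + j] * window[j + period]
    return out


# 100 Hz tone at an assumed 8 kHz sampling rate (period = 80 samples).
tone = [math.sin(2.0 * math.pi * n / 80.0) for n in range(800)]
longer = psola_stretch(tone, period=80, factor=1.5)
```

Grains are repeated when stretching and skipped when compressing. Repeating noise-like grains is what produces the tonal artifact noted for fricatives.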
          Energy Modification
• Naïve approach: Multiply each sample by
  some constant.
• Problems:
  – When we speak louder, we emphasize some parts
    of the signal more than others; we stress
    consonants more than vowels.
  – More sub-glottal pressure will stress higher
    frequencies more than those that are lower
  – Pitch tends to rise as speech becomes louder.
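
The naive approach is easy to sketch, and the sketch also shows why it falls short. It assumes the gain is given in decibels; the names `apply_gain` and `rms` and the sample values are hypothetical.

```python
import math


def apply_gain(samples, gain_db):
    """Naive energy change: multiply every sample by one constant."""
    factor = 10.0 ** (gain_db / 20.0)
    return [s * factor for s in samples]


def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


quiet = [0.1, -0.2, 0.05, -0.1]  # made-up samples
loud = apply_gain(quiet, 6.0)

# The overall level rises by the gain factor...
print(rms(loud) / rms(quiet))
# ...but every sample-to-sample ratio is unchanged, so the balance between
# consonants and vowels, the spectral tilt, and the pitch all stay the same.
print(loud[1] / loud[0], quiet[1] / quiet[0])
```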
   Harmonic Plus Excitation Model
• Speech harmonic and excitation components
   – Harmonic: Vocal tract as a linear prediction filter
   – Noise component: collection of sinusoids with time varying
     amplitudes and frequencies
• Harmonic component: Linear prediction
   – $y_n = r_n + \sum_{i=1}^{P} a_i y_{n-i}$ or $y_n \approx \sum_{i=1}^{P} a_i y_{n-i}$
   – Residue $r_n$: excitation and nasal/sub-glottal non-linearities
• Excitation Signal Estimate: $e(t) = \sum_{k=0}^{K(t)} m_k(t)\, e^{i\varphi_k(t)}$
   – $K(t)$ is the number of sinusoids at time $t$
   – $m_k(t)$ is the amplitude of the $k$th sinusoid at time $t$
   – $\varphi_k(t)$ is the phase of the $k$th sinusoid at time $t$
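
The residue in the linear prediction bullet is whatever the predictor cannot explain: each sample minus the weighted sum of the previous P samples. A minimal sketch, assuming the coefficients are already known (estimating them is not shown). The names `lp_predict` and `lp_residual` and the values are hypothetical.

```python
def lp_predict(y, coeffs, n):
    """Predict y[n] from the previous samples: sum_i coeffs[i-1] * y[n-i]."""
    return sum(a * y[n - i] for i, a in enumerate(coeffs, start=1) if n - i >= 0)


def lp_residual(y, coeffs):
    """Residue r[n] = y[n] - prediction, i.e. the part left to the excitation."""
    return [y[n] - lp_predict(y, coeffs, n) for n in range(len(y))]


# Made-up example: the impulse response of a 2nd-order all-pole filter.
coeffs = [1.3, -0.9]
signal = [1.0, 1.3, 0.79, -0.143]
print(lp_residual(signal, coeffs))  # approximately [1, 0, 0, 0]: a single pulse
```

When the predictor matches the filter that produced the signal, the residue collapses to the original excitation pulse.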
             The Harmonic Model
         Excitation signal: $e(t) = \sum_{k=0}^{K(t)} m_k(t)\, e^{i\varphi_k(t)}$

• Questions to answer:
   –   How do we determine which sine waves to use?
   –   How do we determine the phases and amplitudes?
   –   How many sine waves should we use?
   –   How do we represent unvoiced speech?
• Note: $\varphi_k(t) = 2\pi k F_0 t$ (with $F_0$ treated as constant within a frame)
   – The sinusoids are harmonics of F0 (fundamental frequency)
   – Otherwise this would be a sinusoidal model (not a harmonic model)
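
A minimal sketch of the harmonic excitation, taking the real part of the complex sum and assuming F0 and the amplitudes stay constant over the interval of interest. The name `harmonic_excitation` and the amplitude values are hypothetical; index 0 of the list is the k = 0 (constant) term.

```python
import math


def harmonic_excitation(amplitudes, f0, t):
    """Real part of sum_k m_k * exp(i * 2*pi*k*f0*t), for k = 0..K."""
    return sum(m * math.cos(2.0 * math.pi * k * f0 * t)
               for k, m in enumerate(amplitudes))


# Made-up amplitudes for k = 0, 1, 2 with F0 = 100 Hz.
amps = [0.0, 1.0, 0.5]
print(harmonic_excitation(amps, 100.0, 0.0))     # all harmonics in phase
print(harmonic_excitation(amps, 100.0, 0.0025))  # a quarter period later
```

Because every frequency is an integer multiple of F0, the only frequency parameter is F0 itself. A general sinusoidal model would need one frequency per sinusoid.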
                 Linear Interpolation
 Goal: Compute partial phases/amplitudes at time t
• Formula: (y-y0)/(x-x0) = (y1-y0)/(x1-x0)
• Application:
   –   Assume window size = w ms
   –   Frame n represents time nw
   –   Frame n+1 represents time (n+1)w
   –   nw <= t <= (n+1)w is time of interest
   –   x0, x1 = the frame times nw, (n+1)w
   –   y0, y1 = a partial's phase (or amplitude) at times nw, (n+1)w
   –   x = t; y = the interpolated phase (or amplitude) at time t
    Note: Cubic interpolation uses the successive and previous
    windows and interpolates points between them.
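
Solving the formula for y gives y = y0 + (y1 - y0)(x - x0)/(x1 - x0). The sketch below applies it with time as x and a partial's amplitude or phase as y. It assumes one (amplitude, phase) pair per frame and phases that are already unwrapped. The names `lerp` and `partial_at` and the frame values are hypothetical.

```python
def lerp(x, x0, y0, x1, y1):
    """Linear interpolation: the value at x on the line through two points."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def partial_at(t, w, frames):
    """Amplitude and phase at time t; frames[n] holds the values at time n*w."""
    n = min(int(t // w), len(frames) - 2)  # frame at or just before t
    t0, t1 = n * w, (n + 1) * w
    (a0, p0), (a1, p1) = frames[n], frames[n + 1]
    return lerp(t, t0, a0, t1, a1), lerp(t, t0, p0, t1, p1)


# Made-up frames: (amplitude, unwrapped phase) every w = 10 ms.
frames = [(0.0, 0.0), (1.0, 2.0), (0.5, 4.0)]
print(partial_at(15.0, 10.0, frames))  # halfway between frames 1 and 2
```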
      McAulay-Quatieri Algorithm
Perform an FFT on the signal
Extract peak frequencies with phases/amplitudes
Find the F0 whose harmonics closely represent the partials
Connect partials of successive and previous windows
Generate time-varying sine waves using cubic interpolation
Apply the vocal tract filter to generate synthesized speech

• Death of a track: no matching partial in the successive window
• Birth of a track: no matching partial in the previous window
Partial: An FFT peak extracted with its phases and amplitudes
Track: Connections between partials of adjacent windows
Note: Typical number of partials for synthesis is from 20 to 160.
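
Peak extraction and track connection can be sketched as follows. This is a simplified illustration, not the full McAulay-Quatieri procedure. It assumes a magnitude spectrum is already available and uses greedy nearest-frequency matching with a hypothetical tolerance `max_delta`. The names `pick_peaks` and `match_partials` are also hypothetical.

```python
def pick_peaks(magnitudes, threshold=0.0):
    """Indices of local maxima in a magnitude spectrum."""
    return [i for i in range(1, len(magnitudes) - 1)
            if magnitudes[i] > threshold
            and magnitudes[i] > magnitudes[i - 1]
            and magnitudes[i] >= magnitudes[i + 1]]


def match_partials(prev_freqs, next_freqs, max_delta):
    """Connect partials of adjacent windows; report track deaths and births."""
    continued, births = [], []
    unmatched_prev = set(range(len(prev_freqs)))
    for j, freq in enumerate(next_freqs):
        candidates = [i for i in unmatched_prev
                      if abs(prev_freqs[i] - freq) <= max_delta]
        if candidates:
            best = min(candidates, key=lambda i: abs(prev_freqs[i] - freq))
            unmatched_prev.remove(best)
            continued.append((best, j))  # the track continues
        else:
            births.append(j)             # no previous partial: birth
    deaths = sorted(unmatched_prev)      # no successor partial: death
    return continued, deaths, births


# Made-up peak frequencies (Hz) in two adjacent windows.
print(match_partials([100.0, 200.0, 300.0], [102.0, 310.0, 450.0], 15.0))
```

In this example the 200 Hz partial has no successor (a death) and the 450 Hz partial has no predecessor (a birth).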
Sinusoid Death and Birth
               Unvoiced Speech
• Problem:
   – Unvoiced speech resembles noise
   – Noise requires too many sinusoids for an accurate
     representation
   – Signal transformations (such as stretching) applied to closely
     related harmonics produce a sound heard as “wormy”
     or “jittery”
   – Unvoiced tracks span only a small number of windows
     so interpolation methods become problematic
• Solution: Bandwidth enhanced oscillators
• Carrier signal: A sinusoidal signal transmitted
  at a steady frequency

• Modulation: the process of varying one or
  more properties of a high-frequency periodic
  carrier waveform

• Oscillation: a repetitive variation, typically
  in time
   Bandwidth Enhanced Oscillation
• Technique: A partial’s energy is increased relative to its
  spectral amplitude and spread across adjacent frequencies
• Details: (a) The center frequency stays the same, (b) Energy is
  spread evenly on both sides (c) Random modulations
• Parameters: widening amount, fall off intensities
• Result: A closer representation to the original signal

   (a) Partial with no widening (b) Partial with moderate widening
                (c) Partial with large amount of widening
            Algorithm Refinements
• Add bandwidth enhanced oscillation
• Vary the spread of bandwidths based on the amount
  of voicing in the signal
• Formula: $Y_t = \sum_{k=0}^{K-1} \sum_{n=0}^{N} \left(A_k(t) + \beta_t\right) \sin\left(k N_n F_0 + \theta_k(t)\right)$
   –   $Y_t$ is the synthesized signal at time $t$
   –   $A_k(t)$ is the carrier frequency amplitude at time $t$
   –   $k$ is a harmonic multiple of $F_0$ (partial); $K$ = number of partials
   –   $\theta_k(t)$ is the phase of the $k$th partial
   –   $N$ is the number of oscillations for introducing noise
   –   $N_n$ is the output of a random number generator to modulate $F_0$
   –   $\beta_t$ is a noise modulation factor
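
The sketch below does not reproduce the formula term for term. It substitutes a simpler amplitude-modulation form of a bandwidth-enhanced oscillator, in which each harmonic's amplitude is perturbed by noise scaled by β. Treat the uniform noise source, the harmonic numbering from k = 1, and the name `bandwidth_enhanced_sample` as assumptions.

```python
import math
import random


def bandwidth_enhanced_sample(t, f0, amps, phases, beta, rng):
    """One output sample: sum over partials of (A_k + beta*noise) * sin(phase)."""
    total = 0.0
    for k, (amp, theta) in enumerate(zip(amps, phases), start=1):
        noise = rng.uniform(-1.0, 1.0)  # fresh random modulation per partial
        total += (amp + beta * noise) * math.sin(2.0 * math.pi * k * f0 * t + theta)
    return total


rng = random.Random(1)
amps, phases = [1.0, 0.5], [0.0, 0.0]  # made-up values
clean = bandwidth_enhanced_sample(0.001, 100.0, amps, phases, 0.0, rng)
noisy = bandwidth_enhanced_sample(0.001, 100.0, amps, phases, 0.2, rng)
print(clean, noisy)
```

With β = 0 the output reduces to the plain harmonic sum. Raising β spreads energy around each partial, which matches the idea of varying the spread with the amount of voicing.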
