4.10  Applications

Application 4.1. FFT leakage and picket-fence effects. This application explores a script that illustrates the results discussed in Section 4.3.4.0 and can be found at folder Applications/FFTLeakagePicketFenceEffects. The version ak_window4_noGUI.m runs on both Matlab and Octave while ak_window4gui.m incorporates a Matlab GUI.12

Basically the software varies the frequency Ωc of a cosine and compares the magnitudes of its DTFT and FFT. The goal is to show how the cosine is represented by two frequency components ±Ωc, and the interaction of the corresponding “positive” and “negative” sinc functions for composing the DTFT by their sum. Also, the script indicates how the FFT discretizes the frequency axis, which creates the picket-fence effect, and the leakage that an FFT user observes when Ωc does not coincide with a frequency bin.

Figure 4.45 is a screenshot when the cosine frequency is Ωc = 1.7279 rad and the corresponding DTFT magnitude. The cosine is represented by two spectral lines (tones), at normalized frequencies ±Ωcπ = ±0.55. The (dashed) reference line indicates the value N(A2) = 24 given the FFT-length N = 8 and cosine amplitude A = 6. Note that the DTFT, which is the sum of two sinc functions, surpasses this reference value for normalized frequencies around ± 0.55.

PIC

Figure 4.45: DTFT magnitude of a cosine of frequency Ωc = 1.7279 rad.

Figure 4.46 corresponds to a situation where Ωc = 2.1206 rad and the FFT magnitude values are superimposed to the DTFT. Besides, the two sinc functions centered at normalized frequencies ± 0.675 are also displayed such that one can see the DTFT being composed by their summation.

PIC

Figure 4.46: DTFT and FFT magnitudes of a cosine of frequency Ωc = 2.1206 rad.

In Figure 4.46, Ωc does not coincide with an FFT bin and the leakage is observed by the FFT values, which are samples of the DTFT. The picket-fence is clearly seen because in this case the resolution is low, and only N = 8 DTFT values in the range [π,π[ are obtained by the FFT. Increasing N alleviates this effect. The reader is invited to run the code with different settings.    

Application 4.2. Using Welch’s method to estimate the mean square (MS) spectrum. As explained, to decrease the variance of the estimated spectrum, it is useful to adopt Welch’s method. In this case, one should consider the effect of the window w[n] of L samples on the estimation.

In Matlab/Octave it is convenient to estimate a MS spectrum using pwelch, which takes care of segmenting the input signal. Matlab has support for MS spectrum estimation using pwelch while Octave does not.

When used for PSD estimation, pwelch.m scales the periodogram dividing it by the energy of the window:

E = n=0L1w2[n].

In contrast, when estimating a MS spectrum, the periodogram should be divided by the square of the window DC value

WDC2 = ( n=0L1w[n])2.

The reason is that the window is convolved with each power spectrum peak and WDC should be 1 to avoid modifying the peak height.

Matlab has the undocumented option of invoking pwelch.m with the argument ’ms’ for MS spectrum, such as in:

1H = pwelch(x,window,[],[],Fs,'twosided','ms');

which uses the adequate scaling factor to estimate a MS spectrum.

Because ’ms’ is not supported in Octave, Listing 4.29 illustrates that a workaround is to multiply the estimated spectrum by EWDC2.

Listing 4.29: MatlabOctaveCodeSnippets/snip_frequency_msspectrum.m. [ Python version]
1N=1024; x = transpose(10*cos(2*3/64*(0:N-1))); %generate a cosine 
2Xk = (abs(fft(x,N))/N).^2; %MS spectrum: |DTFS|^2 
3Fs = 2*pi; myWindow = hamming(N); %specify Fs and a Hamming window 
4H = Fs*pwelch(x,myWindow,[],N,Fs,'twosided');%Welch's estimate. Third 
5%parameter is [] because in Matlab is num. samples while Octave is % 
6H = H * sum(myWindow.^2)/sum(myWindow)^2; %scale for MS 
7plot(Xk,'x-'), hold on, plot(H,'or-') %compare
  

Using either Matlab or Octave, this code provides a pair of peaks of approximately 25 W for H, estimated via Welch’s method. Note that the cosine is not bin-centered and there is leakage. Matlab users13 can try a flat top window in place of Hamming’s with the command myWindow=flattopwin(N). When estimating the MS spectrum, a flat top window helps because it widens any peak in the original spectrum, such that the wider range of values has more chances of coinciding with a FFT bin. In other words, the flat top is not accurate to locate the sinusoid in frequency (bad frequency resolution) but helps when the goal is to find the sinusoid amplitude.

Another example with a sinusoid that is not bin-centered is provided below. It allows to observe the better performance of averaging segments of the signal using pwelch.m, as indicated in Listing 4.30.

Listing 4.30: MatlabOctaveCodeSnippets/snip_frequency_not_bin_cent_pwelch.m. [ Python version]
1N=1000; x = 10*cos(2*pi/64*(0:N-1)); %generate a cosine 
2Xk = (abs(fft(x,N))/N).^2; %MS spectrum |Xk|^2 
3Fs = 2*pi; window = hamming(128); %specify Fs and window 
4H=2*pi*pwelch(x,window,[],128,Fs,'twosided'); %Welch's estimate 
5H = H * sum(window.^2)/sum(window)^2; %scale for MS 
6disp(['Peak from pwelch = ' num2str(max(H)) ' Watts']) 
7disp(['Peak when using one FFT = ' num2str(max(Xk)) ' W'])
  

This code informs that Xk has peaks of 15.66 W and H (estimated via pwelch.m with segments of M = 128 samples) has peaks of 25.02 W (recall that the correct value is 25 W). In this and many practical cases, averaging windowed segments with pwelch.m leads to better results than using a single FFT.   

Application 4.3. Smoothing an FFT result by segmenting the signal and averaging the individual FFTs. When the task is to calculate the FFT of a noisy signal with N samples, it helps to segment it in M blocks and average the result of M FFTs. This is a basic strategy in spectral estimation, which is adopted in Welch’s method, for example. Here, the noisy signal is assumed to be the one recorded in the file impulseResponses.wav, as suggested in Application 1.4.

For isolating the response to an impulse using the signal described in Application 1.4, after zooming Figure 1.66 it was decided to consider the range of samples from n = 12650 to 22050 (given the third impulse was generated at n = 22051). This way the selected signal does not have many samples with small amplitudes before it actually begins.

In case you do not have available the companion file impulseResponses.wav and cannot generate yours due to the lack of a loopback cable, then use the signal suggested in the script below by changing the if instruction in line 3:

Listing 4.31: MatlabOctaveCodeSnippets/snip_systems_smothing_FFT.m. [ Python version]
1nstart=12650;%chosen after zooming the signal in impulseResponses.wav 
2nend=22050;%this was the chosen segment. Adjust them for your data! 
3if 1 %if you have impulseResponses.wav available 
4    [h,Fs,b]=readwav('impulseResponses.wav'); 
5    h=double(h(nstart:nend)); %segment and cast h to double 
6else %use signal with few samples extracted from impulseResponses.wav 
7    duration = floor(nend-nstart+1); %same duration as h above 
8    h =[-1051, 4155, -32678, -11250, 5536, -4756, 2941, -3162]; 
9    h = [h zeros(1,duration-length(h))]; %pad with zeros 
10    h = h + 10*randn(1,length(h)); %add some noise 
11    Fs=44100; %chosen sampling frequency 
12end 
13N=1024; %number of FFT points 
14M=floor(length(h)/N); %number of segments of N samples each 
15h=h(1:N*M); h=h(:); %make sure h is a column vector with N*M samples 
16xsegments=reshape(h,N,M); %segment h into M blocks 
17X=abs(fft(xsegments)); %obtain the magnitude for each segment 
18X=mean(transpose(X)); %transpose: the mean has to be over the FFTs 
19f=Fs/N*(0:N/2)/1000; %create abscissa in kHz. Fs/N is the bin width 
20plot(f,20*log10(X(1:N/2+1))),xlabel('f (kHz)'),ylabel('|H(f)| (dBW)')
  

Instead of the FFT, another option is to use the command pwelch(h,N,N/2,N,Fs) to observe the PSD.

PIC

Figure 4.47: Sound system magnitude frequency response |H(f)| estimated from an impulse response.

Figure 4.47 is the result obtained by running the previous script. It can be seen that the filters along the processing chain that includes the DAC and ADC present strong attenuation after 20 kHz and extra gain from DC to approximately 800 Hz. Choosing Fs different than 44.1 kHz would move the cutoff frequencies fc of the anti-aliasing and reconstruction filters. These filters are analog, but are programmable in the sense that fc can be modified by software (after the user chooses a new Fs in Audacity, for example). For example, switched-capacitor filters use a technology that allows this programmability feature. To check this feature in your sound board, repeat the experiment with Fs = 8,000 and 22,050 Hz.

Another alternative for obtaining an estimate of |H(f)| is to use a white noise as input to the system and use Eq. (4.38).

PIC

Figure 4.48: Sound system magnitude frequency response |H(f)| estimated from a white noise input.

Audacity can conveniently generate white noise via the menu “Generate - Noise”. Another option is to create the noise in Matlab/Octave, save as a WAVE file and read it with Audacity. Figure 4.48 was obtained using the former procedure to generate noise, play it back and record the system response using a loopback cable as for Figure 4.47. Then, the transient in the beginning of the recorded signal was discarded and approximately 32 thousand samples were saved as a WAVE file for processing in Matlab/Octave via the script:

Listing 4.32: MatlabOctaveCodeSnippets/snip_systems_recorded_noise.m. [ Python version]
1[x,Fs]=readwav('filteredNoise.wav'); %recorded output 
2N=1024; %number of FFT points 
3M=floor(length(x)/N); %number of segments of N samples each 
4x=x(1:N*M); x=x(:); %make sure x is a column vector with N*M samples 
5xsegments=reshape(x,N,M); %segment x into M blocks 
6X=abs(fft(xsegments)); %obtain the magnitude for each segment 
7X=mean(transpose(X)); %transpose: the mean has to be over the FFTs 
8f=Fs/N*(0:N/2)/1000; %create abscissa in kHz. Fs/N is the bin width 
9plot(f,20*log10(X(1:N/2+1))),xlabel('f (kHz)'),ylabel('|H(f)| (dBW)')
  

Figure 4.48 and Figure 4.47 are consistent but differ with respect to ordinate values and variance of the estimate. Eq. (4.43) states that the input noise level scales the output PSD. Take this in account and repeat the experiment with an input noise of controlled power, such that you can use Eq. (4.43) to properly scale the estimate and make the values closer to the ones in Figure 4.47. Investigate other factors that improve the relation between the two figures.   

Application 4.4. Speech formant frequencies via LPC analysis.

An interesting application of LPC analysis is to estimate the formant speech frequencies. The formants are related to the peaks of the spectrum of a vowel sound and are relatively well-defined in a sentence such as “We were away”, which does not have consonant sounds.

PIC

Figure 4.49: Spectrogram and tracks of the first four formant frequencies estimated via LPC for a speech sentence “We were away”.

Listing 4.33 was used to obtain Figure 4.49.

Listing 4.33: MatlabOctaveCodeSnippets/snip_frequency_formant_frequencies.m. [ Python version]
1[s,Fs,numbits]=readwav('WeWereAway.wav'); %read wav file 
2s = s - mean(s(:)); %extract any eventual DC level 
3Nfft=1024; %number of FFT points 
4N = length(s); %number of samples in signal 
5frame_duration = 160; %frame duration 
6step = 80; %number of samples the window is shifted 
7order = 10 %LPC order 
8numFormants = 4 %desired number of formants 
9%calculate the number of frames in signal 
10numFrames = floor((N-frame_duration) / step ) + 1 
11window = hamming(frame_duration); %window for LPC analysis 
12formants = zeros(numFrames,numFormants); %pre-allocate 
13for i=1:numFrames %go over all frames 
14    startSample=1+(i-1)*step; %first sample of frame 
15    endSample=startSample + frame_duration - 1; %frame end 
16    x = s(startSample:endSample); %extract frame samples 
17    x = x.* window; %windowing 
18    a = aryule(x,order); %LPC analysis 
19    poles = roots(a); %Roots of filter 1/A(z) 
20    freqsInRads = atan2(imag(poles),real(poles)); %angles 
21    freqsHz = round(sort(freqsInRads*Fs/(2*pi)))'; %in Hz 
22    frequencies=freqsHz(freqsHz>5); %keeps only > 5 Hz 
23    formants(i,:) = frequencies(1:numFormants); %formants 
24end 
25window = blackman(64); %window for the spectrogram 
26specgram(s,Nfft,Fs,window,round(3/4*length(window))); 
27ylabel('Frequency (Hz)'); xlabel('Time (sec)'); hold on; 
28t = linspace(0, N / Fs, numFrames); %abscissa 
29for i=1:numFormants %plot the formants 
30    text(t,formants(:,i),num2str(i),'color','blue') 
31end
  

The code illustrates how a long signal can be segmented via windowing. It is interesting to notice that the Hamming window is typically adopted in LPC analysis, and the code uses another (Blackman) window for the spectrogram. An LPC of order 10 was used to estimate four formants. The code eliminates frequencies below 5 Hz because LPC sometimes returns real poles that correspond to a zero frequency.

Figure 4.49 indicates that at the sentence endpoints and at the pauses between words, when the vowel sounds are not well-defined, the formant estimation is noisy. In the middle of the sentence, around 0.5 s, the four formants are clearly identified. In the region around 0.2 s and also the one around 0.8 s, the estimation is problematic because more than four formants are necessary to describe the signal.    

Application 4.5. Spectral distortion. When the task is to compare two PSDs, the spectral distortion (SD) can be used. It is given by

SD = 1 2πππ [10log 10 (Sx(ejΩ) Sy(ejΩ) )]2dΩ
(4.68)

where x[n] and y[n] are discrete-time signals and Sx(ejΩ) and Sy(ejΩ) their respective PSDs. SD is given in dB and corresponds to the root mean square value of the error between 10log 10Sx(ejΩ) and 10log 10Sy(ejΩ) over frequency Ω. It should be noted that the SD can also be used to compare two MS spectra.

Because both the PSD and MS spectrum are normalized versions of the squared magnitude of a DTFT and the SD divides two spectra, any normalization factor is canceled out in Eq. (4.68). Therefore, Eq. (4.68) can be written in terms of DTFTs as

SD = 1 2πππ [10log 10 (|X(ejΩ)|2 |Y (ejΩ)|2 )] 2dΩ (4.69) = 1 2πππ (20log 10|X(ejΩ)| 20log 10|Y (ejΩ)|)2dΩ (4.70)

where X(ejΩ) and Y (ejΩ) are the respective DTFTs.

Eq. (4.70) can be approximated by

SD 1 N k=0N1 (20log 10|X[k]| 20log 10|Y [k]|)2
(4.71)

where X[k] and Y [k] are the respective N-points FFTs. Listing 4.34 illustrates a software routine to calculate SD.

Listing 4.34: MatlabOctaveFunctions/ak_spectralDistortion.m
1function distortion = ak_spectralDistortion(x,y,Nfft,thresholdIndB) 
2% function distortion = ak_spectralDistortion(x,y,Nfft,thresholdIndB) 
3%Returns the spectral distortion (SD) in dB as defined at 
4% http://en.wikipedia.org/wiki/Log-spectral_distance 
5X=abs(fft(x,Nfft)); %Calculate magnitude values of DTFTs 
6Y=abs(fft(y,Nfft)); 
7logX=20*log10(X); %Convert to dB and use factor=20 to convert ... 
8logY=20*log10(Y); %DTFTs into power spectra using log(X^2)=2*log(X) 
9maximumValue = max([max(logX) max(logY)]); %Avoid numerical problems 
10minimumValue = maximumValue - thresholdIndB; %such as log of 0 
11logX(logX<minimumValue)=minimumValue; %Impose the floor value 
12logY(logY<minimumValue)=minimumValue; 
13distortion = sqrt(mean((logX-logY).^2)); % SD as a mean-squared error
  

One issue when using logarithms is to deal with argument values equal to zero. For example, in Matlab/Octave, log(0) gives -Inf and can lead to NaN after operations such as 0*log(0). To avoid numerical problems, Listing 4.34 uses a floor value based on a threshold to limit the minimum value of an argument for logarithm functions. Some lines of Listing 4.34 that are not being shown also deal with special situations and are an example of how to prevent problems via proper exception treatment when using software. For example, some lines of Listing 4.34 deal with the situation where both signals x[n] and y[n] have only zero values.   

Application 4.6. Spectral distortion of speech autoregressive models. In speech coding applications, the signal is segmented into frames. Given two speech signals, it is discussed here how to compare their spectra using AR models in a frame-by-frame basis. While Application 4.5 used one value of SD for the whole duration of the signals, here the SD is calculated for each frame.

The pair of signals can be found at folder Applications/SpeechAnalysis. One of the files correspond to the digit “eight” spoken by a male speaker. The other file was generated by a computer (more specifically, using the Klatt speech synthesizer) and aims at sounding indistinguishable from the first, or target speech.

Listing 4.35 shows a code snippet with the part that segments the signals into frames and calculates the SD as described in Listing 4.34 and also two other SD versions. These SD versions are comparisons between autoregressive models and are calculated with the function ak_ARSpectralDistortion.m. Listing 4.35 uses the array isNotSilence, which was obtained from function Applications/SpeechAnalysis/endpointsDetector.m, to avoid computing the SD values for frames that have a low power and can be considered as silence, not speech.

Listing 4.35: Applications/SpeechAnalysis/spectralDistortion.m
26for i=1:M %do not use last frame, because it may use zero padding 
27    if isNotSilence(i)==0 
28        continue; %skip because this is a silence frame 
29    end 
30    firstSample = 1+(i-1)*S; 
31    lastSample = firstSample + L - 1; 
32    x=target(firstSample:lastSample); %frame from target signal 
33    y=synthetic(firstSample:lastSample); %frame from the other signal 
34    sd_all(i)=ak_spectralDistortion(x,y); 
35    sdAR_all(i)=ak_ARSpectralDistortion(x,y,10,Nfft); 
36    sdARp_all(i)=ak_ARSpectralDistortion(x,y,10,Nfft,60,1); 
37    disp(['#' num2str(i) ': SD=' num2str(sd_all(i)) ', AR SD=' ... 
38        num2str(sdAR_all(i)) ' and AR SD (power)= ' ... 
39        num2str(sdARp_all(i))]); 
40end
  

Autoregressive models are widely used in speech coding applications and their quantization is often evaluated according to the AR version of the SD. In this context, transparent quantization is obtained when the average SD is not larger than 1 dB, having no outliers with SD larger than 4 dB and at most 2% of frames with SD between 2 and 4 dB.