Tor's homepage         Tor's MASc thesis   /   Tor's Float-to-Fixed Conversion Software         ECE 341 Course Webpage

A Simple Speech Recognition Algorithm for ECE341...

Last revision: April 15, 2003, 2:17pm
Previous revision: November 29th, 2001, 6:13pm

  • Download Tor's Reference Implementation (achieves 100% accuracy on the following benchmarks)
  • Training and Testing Inputs (Tor saying "one", "two", "three", "left", "right", twice - 4000 samples/word, 10 words)
  • Spectrogram of Above
  • Tor's Float-to-Fixed Conversion Software (currently unavailable)
  • Citings (some folks that have used this information in their projects)
  • There are several points that need to be considered. Let us start with the fundamental, then progress to the details.

    Basic Algorithm:

    There are two phases: training, and recognition.

    To train, you store a spoken word's waveform sampled in the time domain (eg. lab M11) in a temporary buffer and ask the trainer which word this was (ie. have them type some text, or perhaps if there is a small fixed vocabulary, the program will tell the trainer which words to say). Next you simplify the waveform into a "fingerprint". This fingerprint represents characteristics of the sound in the frequency domain as time evolves. Specifically, we are interested in how the power within a few adjacent frequency bands evolves over time. This fingerprint is just a vector of numbers. Each number represents the energy, or average power that was heard in a particular frequency band, during a particular interval of time. Don't worry how you get these numbers just yet. For each word, you must create one of these "fingerprints" and store it in memory in a "dictionary". When you have a fingerprint for each word you want to recognize, you are ready to start doing crude but effective speech recognition...

    To recognize a word, you speak into your microphone and perform the same fingerprinting process which gives you another one of these vectors. You take this vector, call it X, and compare it to the ones you saved earlier in your dictionary. By compare I mean do the following: find the Euclidean distance squared between X and a vector stored in the dictionary. The vector in your dictionary that is closest to X, in other words, has the minimum Euclidean distance squared, was the closest match. To find the Euclidean distance squared, you are basically taking the difference (ie. subtracting) elementwise of the two vectors, squaring the difference, and taking the sum of all the squared differences (you should know this).

    Framing the input...

    Before we worry about converting this time-domain input into a "fingerprint", we need to think about one practical issue: Framing. By framing, what I mean is searching the input stream looking for the beginning of a word. This is pretty hard in general as background noise complicates things, even when you tell the group next to you to shut-up for a moment. At the very least, what you have to do is monitor the input waveform and look for an increase in the amplitude above a certain level. A good threshold level can be something like a factor of 2-4 times the average backgound noise amplitude (ie. a dynamic quantity).

    So, how do i get one of these "fingerprint" things?

    Here there are a number of alternatives. One is you use the Fast Fourier Transform to get the power spectrum in the frequency domain. The benefit of this method is you can use off-the-shelf FFT code. One alternative is to use bandpass filters, using code like that below. The benefit of the latter approach is that it requires far less storage (you don't really need to store each input sample into a queue, or buffer like you did in lab M11), and will almost certainly work with with my floating-point to fixed-point conversion software. The alternative to using this conversion software is basically using floating-point emulation which gcc can in priciple support using the -msoft-float option (you may need to supply your own floating-point emulation library--written in assembly!)

    There is a version of gcc on the ugsparc's that generates assembly for the motorola 68k: m68k-coff-gcc -S foo.c -o foo.s This generates motorola 68k assembly. Unfortunately it does not use the same syntax that a68 understands (a68 is the assembler used in the labs). You will have to modify the outputed assembly when using m68k-coff-gcc.

    Okay before going into the FFT's etc, instead of sampling at 48 kHz (as done in previous labs), let's drop 5 out of every 6 samples to get a sample rate of around 8 kHz.

    Bandpass filter method

    The benefit of this method is that it is really pretty simple, and you don't really need to store the whole waveform so you may get faster recognition times given all the signal processing that needs to be done. This is what you need to do: You use Matlab to create, let us say five 4th-order filters and convert them to a form that has good numerical properties (especially important after conversion to fixed-point!). For example:

    matlab>> [B1,A1] = cheby2(4,40,0.2); % lowpass with cutoff above 500 Hz
    matlab>> [B2,A2] = cheby2(2,20,[0.1 0.25]); % bandpass from 500 to 1000 Hz
    matlab>> [B3,A3] = cheby2(2,20,[0.25 0.45]); % bandpass from 1kHz to 1.7kHz
    matlab>> [B4,A4] = cheby2(2,20,[0.45 0.65]); % bandpass from 1.7kHz to 2.5kHz
    matlab>> [B5,A5] = cheby2(4,20,0.47,'high'); % highpass with cutoff below 2.5kHz
    matlab>> [sos1,g1] = tf2sos(B1,A1,'up','inf'); % create second-order filter sections...
    matlab>> [sos2,g2] = tf2sos(B2,A2,'up','inf');
    matlab>> [sos3,g3] = tf2sos(B3,A3,'up','inf');
    matlab>> [sos4,g4] = tf2sos(B4,A4,'up','inf');
    matlab>> [sos5,g5] = tf2sos(B5,A5,'up','inf');

    If you want to see the filter transfer function use:

    matlab>> freqz(B1,A1,1000,8000);

    The "sosN" matrices and "gN" values are what you need. This is the syntax:

    sosx = [ b01   b11   b21   1   a11   a21
    b02   b12   b22   1   a12   a22 ]

    For each of these sosN/gN pairs, you write ANSI C code that looks like this:

        float filterN( float x )
        {
            static float d01 = 0.0;
            static float d11 = 0.0;
            static float d02 = 0.0;
            static float d12 = 0.0;
            float y1, y2, t0, t1;

            /* first 2nd-order filter stage */
            t0 = x - a11*d01 - a21*d11;
            y1 = b01*t0 + b11*d01 + b21*d11;
            d11 = d01;
            d01 = t0;

            /* second 2nd-order filter stage */
            t1 = y1 - a12*d02 - a22*d12;
            y2 = b02*t1 + b12*d02 + b22*d12;
            d12 = d02;
            d02 = t1;
            return g*y2; // g is "gN" from tf2sos
        }

    For each sample you get from the codec you call filterN(), square its output, then add it to an accumulator:

            y1 = filter1(x);
            s1 += y1*y1;

    You want to do this for around 500 samples, then you append s1..s5 onto your fingerprint vector, and reset s1...s5 back to zero. (Actually, empirically, it is better to take the logarithm of the magnitudes s1...s5 before storing into the fingerprint--this is what I basically do in my reference implementation. The intuition is that hearing is fundamentally logarithmic in nature. This small change increased recognition accuracy from 60% to 100% for the example given). Repeating this 16 times you have a fingerprint that is 80 elements. You might want to play around with the filter cutoff frequencies etc... also with the number of time intervals.

    FFT Method:

    You want to do around 16 separate 1024-point FFT's so you end up sampling around 2 seconds of sound. The FFT code I can giving you will provide you with two output arrays, one real, one imaginary (email me if you want the code, alternatively use the FFT source in "Numerical Recipies in C"). Sum up 128 contiguous magnitude-squared samples of the output (remember they are complex values). Do this 4 times to cover the first 512 frequency domain values. The other 512 frequency values don't give you much more information, so ignore them. Thus you have taken the output of one 1024-point FFT and converted it into 4 numbers. In total you'll have 64 numbers for your fingerprint.

    Interfacing compiler generated code with assembly

    You will have to write some of your code in assembly and some in ANSI C. If you are using my "Float-to-Fixed Conversion Tool" you will pass the code you wrote in ANSI C through that with some sample inputs eg. speech waveforms sampled with a PC at home using Microsoft Sound Recorder. Matlab can read .WAV files so you can manipulate these and do what you want with them using "wavread" and "wavwrite" (type "help wavread" for details). Try to use 16-bit values like you end up using in the lab. The Float-to-Fixed Conversion Tool has been tested on Windows 2000, and Windows 98. Then you take the output from this (integer only ANSI C) and you get your 68000 assembly using m68k-coff-gcc and appropriate flags (eg. use -S to generate assembly output). Unfortunately, you need to modify the output so that the syntax is compatible with a68. To interface your assembly you should look at the assembly generated by the compiler and figure out what registers or stack locations to write/read to.

    Citations:

    1. http://instruct1.cit.cornell.edu/courses/ee476/FinalProjects/s2006/avh8_css34/avh8_css34/index.html
    2. http://instruct1.cit.cornell.edu/courses/ee476/FinalProjects/s2006/XL76_SL362/XL76%20SL362/index.html