The Two Methods for Capturing, Processing, and Indexing Audio
Many call-center-based businesses are looking to adopt speech analytics. There is a wealth of information about the advantages this technology gives a business, from data-mining opportunities to improved compliance. While the descriptions of these applications are impressive, the idiosyncrasies of the technology itself can be confusing, and since no two systems are built exactly alike, it helps to dig into the basics. So here is a simple post explaining the difference between the two primary methods speech analytics technology uses to capture, process, and index call audio.
The most basic function of a speech analytics system is to capture what is being said. This is what allows the audio to be searched by keyword or phrase and what makes it useful for compliance tracking and dispute defense. Once the audio is captured by the system, it must be processed and organized before it can be indexed. Two methods are typically used to do this: phonetic indexing and speech-to-text.
The phonetic method breaks the audio into strings of phonetic sounds (or phonemes). Processing phonetically can be very fast: the sounds don't have to be checked against a language model or preinstalled vocabulary before indexing, and there are far fewer phonemes than words. Detecting different dialects can also be faster through the analysis of phonetic sounds. The downside is that this method requires a degree of manual filtering, because errors frequently occur when the phonetic sounds are later reassembled into words.
Here is an example of a string of phonetic sound units:
K AE N AY HH EH L P Y UW W IH DH EH N IY TH IH NG
To illustrate the problem that can occur with this method, try to locate the word “knee.” There are a couple of strategies you could use: break the word into phonetic units and search for them in the right order, or look for the first phoneme and narrow it down from there. Either way, you will get the same result. The phonetic translation of “knee” is N IY. Did you locate it?
K AE N AY HH EH L P Y UW W IH DH EH N IY TH IH NG
That didn’t seem so difficult, did it? But what if I told you the word “knee” isn’t part of that sentence at all? The phonetic string above is a translation of the audio: Can I help you with anything? The N IY you found is simply the middle of “anything.” Now it is clear why manual intervention can be needed. Because of issues like this, searching can present a challenge for systems that use the phonetic method.
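To make the false-match problem concrete, here is a minimal Python sketch of a phoneme-sequence search. It is not how any particular vendor's engine works; the phoneme string and the search term “knee” come from the example above, and the matching logic is a simplified stand-in for a real phonetic index.

```python
# Minimal illustration of a phoneme-sequence search producing a false hit.
# The phonemes below are the example string from the post; the search
# routine is a simplified stand-in for a real phonetic index.

indexed_phonemes = "K AE N AY HH EH L P Y UW W IH DH EH N IY TH IH NG".split()
knee_phonemes = "N IY".split()  # phonetic spelling of "knee"

def find_phoneme_sequence(haystack, needle):
    """Return every position where the needle phonemes appear in order."""
    hits = []
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            hits.append(i)
    return hits

print(find_phoneme_sequence(indexed_phonemes, knee_phonemes))
# [14] -- a match inside "anything", even though "knee" was never spoken
```

The search is doing exactly what it was asked to do; the problem is that phoneme sequences don't respect word boundaries, which is why a person often has to review the hits.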
The other method speech analytics technology uses to process call audio is commonly referred to as “speech-to-text,” but also as “Large Vocabulary Continuous Speech Recognition” (LVCSR) or “Automatic Speech Recognition” (ASR). This method transcribes the audio into text before indexing. Unlike the phonetic method, speech-to-text checks words against a large dictionary and language model to guard against transcription errors, and because of this, processing can be a bit slower. The speech-to-text translation of the audio from the example above is: Can I help you with anything? As you can see, there is no chance of this method mistakenly locating the word “knee,” because the audio is indexed as complete words. Accuracy isn’t the only benefit of this approach: because it indexes the conversation in its entirety, it creates a larger picture for analysis.
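Continuing the sketch above, here is what the same search looks like once the audio has been transcribed to words. Again, this is an illustrative assumption about how a word-level index behaves, not any specific product's implementation.

```python
# The same audio indexed as words after speech-to-text transcription.
# Searching for "knee" finds nothing, because the index holds whole words,
# while ordinary keyword searches still work.

transcript_words = "can i help you with anything".split()

def find_word(words, term):
    """Return every position where the exact word appears."""
    return [i for i, w in enumerate(words) if w == term.lower()]

print(find_word(transcript_words, "knee"))  # [] -- no false hit
print(find_word(transcript_words, "help"))  # [2] -- keyword search still works
```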
It’s hard to say which is the superior method; each has something a little different to offer. Call centers considering a speech analytics system should measure the benefits of each against their specific business needs. If quickly detecting different dialects benefits your business model and you have sufficient monitoring staff, then the phonetic method may be the best choice for you. If you’re looking to use your audio to create a broad analysis of your call floor and rapidly flag specific phrases, then speech-to-text may be the better option.