... the subject of a lot of work over the years, and has been complicated by the fact that the conversion requirements for every language are different. Thus the requirements for Spanish, which has a regular writing system, differ from those of English, which has a very irregular spelling system.
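The contrast between a regular and an irregular writing system can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule table is a deliberately incomplete subset of Castilian Spanish letter-to-sound rules, the English lexicon holds just four words, and the function names are invented for the example. Spanish can be handled largely by rewrite rules, whereas English words such as "though" and "cough" have to be looked up individually.

```python
# Hypothetical, heavily simplified letter-to-sound conversion.
# Multi-letter rules come first so that "ch" is tried before "c" or "h".
SPANISH_RULES = [
    ("qu", "k"), ("ch", "tʃ"), ("ll", "ʎ"),
    ("ce", "θe"), ("ci", "θi"),
    ("h", ""), ("c", "k"), ("z", "θ"),
    ("j", "x"), ("v", "b"), ("ñ", "ɲ"),
]

# English spelling is too irregular for a short rule table, so a
# pronunciation lexicon is used instead (four sample entries only).
ENGLISH_LEXICON = {
    "though": "ðoʊ",
    "through": "θruː",
    "cough": "kɒf",
    "bough": "baʊ",
}


def spanish_to_phonemes(word: str) -> str:
    """Apply the first matching rule at each position, left to right."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        for graphemes, phonemes in SPANISH_RULES:
            if word.startswith(graphemes, i):
                out.append(phonemes)
                i += len(graphemes)
                break
        else:
            # No rule matched: assume the letter maps to itself.
            out.append(word[i])
            i += 1
    return "".join(out)


def english_to_phonemes(word: str) -> str:
    """Look the word up; raises KeyError for words outside the lexicon."""
    return ENGLISH_LEXICON[word.lower()]
```

A real front end would combine both ideas, using a lexicon for known words and rules as a fallback for unknown ones.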
The back-end speech synthesis system is where the biggest advances have taken place over the last few years. It is this system that dictates the naturalness and intelligibility of the synthesised speech, and it is why we have moved from the very mechanical, robotic-sounding synthesised speech of a decade or so ago to speech that is often barely distinguishable from a human voice.
This naturalness and intelligibility has been particularly important where synthesised speech is used in automated telephone response systems. These are now often extremely sophisticated, and have started to replace human operators in some call centre applications. Such applications are also driving the development of speech synthesis mark-up languages such as the XML-based SSML developed by the W3C.
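As a rough illustration of what such mark-up looks like, the sketch below builds a short SSML prompt of the kind a telephone response system might speak. The element names (speak, s, break, prosody, say-as) come from the SSML specification; the function name, the wording of the prompt, and the "digits" value for interpret-as are assumptions made for the example, and individual synthesisers may support different interpret-as values.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Serialise SSML elements without a prefix (xmlns="...").
ET.register_namespace("", SSML_NS)


def tag(name: str) -> str:
    """Return the namespace-qualified tag name ElementTree expects."""
    return f"{{{SSML_NS}}}{name}"


def build_prompt(caller_name: str, account_digits: str) -> str:
    """Build a hypothetical call-centre greeting as an SSML string."""
    speak = ET.Element(tag("speak"), {"version": "1.0", XML_LANG: "en-GB"})

    greeting = ET.SubElement(speak, tag("s"))
    greeting.text = f"Hello {caller_name}."

    # Short pause between the greeting and the account details.
    ET.SubElement(speak, tag("break"), {"time": "300ms"})

    # Read the account details slowly, digit by digit.
    slow = ET.SubElement(speak, tag("prosody"), {"rate": "slow"})
    slow.text = "Your account number ends in "
    digits = ET.SubElement(slow, tag("say-as"), {"interpret-as": "digits"})
    digits.text = account_digits
    digits.tail = "."

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_prompt("Ann", "4471"))
```

The resulting string would be passed to whatever synthesiser the system uses; that step is engine-specific and is not shown here.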
The two main technologies used in new-generation back-end systems are concatenative synthesis and formant synthesis. Concatenative synthesis strings together many small segments of pre-recorded speech, and the output can often be indistinguishable from a real human voice. However, this comes at the cost of very large speech databases, often involving gigabytes of data and as much as a hundred hours of recorded speech.
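The core idea of stringing segments together can be sketched as follows. This is a toy model, not a real synthesiser: the "recorded" units are short sine tones standing in for speech segments, the unit names and the UNIT_DB dictionary are hypothetical, and a real system would select among many candidate recordings for each unit. The short linear crossfade at each join is one simple way to reduce the clicks that abrupt splices can cause.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed for this sketch)


def fake_unit(freq_hz: float, duration_ms: int) -> list[float]:
    """Stand-in for a recorded speech segment: a plain sine tone."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical unit database keyed by segment name.
UNIT_DB = {
    "h-e": fake_unit(220, 50),
    "e-l": fake_unit(330, 50),
    "l-o": fake_unit(440, 50),
}


def concatenate(unit_names: list[str], overlap: int = 80) -> list[float]:
    """Join units in order, crossfading `overlap` samples at each boundary."""
    out = list(UNIT_DB[unit_names[0]])
    for name in unit_names[1:]:
        nxt = UNIT_DB[name]
        start = len(out) - overlap
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # fade-in weight for the next unit
            out[start + i] = (1 - w) * out[start + i] + w * nxt[i]
        out.extend(nxt[overlap:])
    return out


waveform = concatenate(["h-e", "e-l", "l-o"])
```

Each unit here is 800 samples long, so three units with two 80-sample overlaps give a waveform of 2,240 samples.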
Formant synthesis, on the other hand, uses rule-based techniques to generate the voice waveforms, and so does not suffer from the acoustic glitches that often appear at the joins between segments in concatenative systems. This technique also offers better control over intonation, tone and emotion, but has proved far more complex and has until quite recently produced very robotic-sounding speech. However, recent work, particularly by Japanese developers of humanoid robots, has produced more sophisticated electromechanical analogues of the human vocal tract that promise much more natural-sounding formant speech synthesis.
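A minimal software sketch of the rule-based approach is given below. It follows the classic digital source-filter arrangement, in which a pulse train at the voice pitch is passed through a cascade of second-order resonators, one per formant; it does not model the electromechanical vocal tracts mentioned above. The formant frequencies and bandwidths are rough illustrative figures for an "ah"-like vowel, and the function names are invented for the example.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed for this sketch)


def resonator(signal: list[float], freq_hz: float, bandwidth_hz: float) -> list[float]:
    """Second-order digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / SAMPLE_RATE
    c = -math.exp(-2 * math.pi * bandwidth_hz * t)
    b = 2 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2 * math.pi * freq_hz * t)
    a = 1 - b - c  # gives unity gain at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesise_vowel(formants, f0_hz: int = 120, duration_ms: int = 300) -> list[float]:
    """Filter a pulse train at pitch f0_hz through one resonator per formant."""
    n = SAMPLE_RATE * duration_ms // 1000
    period = SAMPLE_RATE // f0_hz  # samples between glottal pulses
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]  # normalise to the range [-1, 1]


# Illustrative (frequency, bandwidth) pairs in Hz; not measured data.
AH_FORMANTS = [(700, 130), (1200, 70), (2600, 160)]
vowel = synthesise_vowel(AH_FORMANTS)
```

Because every parameter is explicit, pitch and vowel quality can be changed simply by altering f0_hz or the formant table, which is the kind of control over intonation that the rule-based approach offers.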
Given the relative sophistication of current speech synthesis technology, it is not surprising that voice response systems are now incorporated into a very wide range of products, from toys and computer games to aircraft and automobile alert and warning systems. Text-to-speech systems, meanwhile, are used for applications ranging from the generation of complex scripted phone messages to reading aids for the blind.
Speech recognition
Speech recognition, on the other hand, is a much harder task, and commercial off-the-shelf systems have only been available since the 1990s. Because every person's voice is different, and words can be spoken with a range of different nuances, tones and emotions, the computational task of successfully recognising spoken words is considerable, and has been the subject of many years of continuing research around the world.
A variety of different approaches are used, including dynamic algorithms, neural networks, and knowledge bases, with the most widely used underlying technology being...






