Talking PCs? Talk to the hand

By Nick Hampshire, ZDNet UK
Tuesday, June 13, 2006 03:04 PM

Being able to chat with a computer in plain English has been the standard fare of science fiction for decades, and yet, despite many promises from forecasters and other experts, we're still a long way from turning fantasy into fact.

Voice synthesis has been around for a long time. Bell Labs demonstrated a computer-based speech synthesis system running on an IBM 704 in 1961. The author Arthur C. Clarke saw the demonstration, and it inspired the talking computer HAL 9000 in his book and film "2001: A Space Odyssey".

Forty-five years later, voice synthesis technology can be found in products as diverse as talking dolls, car information systems and various text-to-speech conversion services such as the one recently launched by BT. Many of these modern systems can convert text into a computer synthesized voice of quite respectable quality.

However, the problems faced by voice technology developers lie primarily not in getting a computer to talk, but in getting it to listen. Voice recognition has turned out to be a much harder task than researchers realised when work began on the problem over 40 years ago. Even so, limited voice recognition applications are starting to creep into everyday use: voice-input telephone menu systems are now commonplace, speech-to-text dictaphones are increasingly used for note-taking by doctors and lawyers, and voice input has started to appear in computer games systems.

"The adoption of speech recognition [will] eliminate most manual transcription for healthcare in North America this decade."
--Paul Ricci
Nuance

The success of some of these limited-application voice recognition systems has recently prompted two software heavyweights, Microsoft and IBM, to make further investments. IBM has hired more than a hundred extra speech technology researchers, with the aim of developing a system capable of matching the human level of speech recognition by 2010. And Bill Gates recently said that "we [Microsoft] aim to have computer systems capable of matching a human level of speech recognition by 2011".

If these predictions hold, then within five years we could see the science fiction writers' vision of speech interaction with computers become a reality. However, there are still a lot of technological hurdles to overcome; to understand what these are, we need to delve further into the technology.

Speech synthesis
Speech synthesis, or text-to-speech (TTS), systems all consist of two parts: the front end, which converts the text into a "symbolic linguistic representation", and the back end, which takes this symbolic representation and converts it into a speech waveform.

The front end first converts things like numbers and abbreviations into their written word equivalents to produce a normalized text. The next step is to phonetically transcribe each word, and divide the text into prosodic units such as phrases, clauses and sentences. The trouble is that text is full of words that are pronounced differently depending upon the context in which they are used, and this has required the development of sophisticated heuristic techniques that look at neighbouring words and statistics of frequency of occurrence in order to guess the proper pronunciation. The sequence of phonemes is then produced using either a dictionary or a rule-based approach.
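To make the front-end steps described above concrete, here is a minimal Python sketch of text normalisation followed by phonetic transcription. It is only an illustration under stated assumptions, not how any commercial TTS product works: the abbreviation table, the tiny lexicon, the phoneme symbols, the one-symbol-per-letter fallback rule and the "read" heuristic are all hypothetical toy data invented for this example.

```python
import re

# Hypothetical toy tables, invented for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "no.": "number"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]

# Toy pronunciation dictionary (symbols are illustrative).
LEXICON = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "has": ["HH", "AE", "Z"],
    "books": ["B", "UH", "K", "S"],
}

# Context heuristic for one heteronym: "read" after have/has/had
# is treated as past tense. A real system would use richer statistics.
HETERONYMS = {
    "read": lambda prev: ["R", "EH", "D"]
    if prev in {"have", "has", "had"} else ["R", "IY", "D"],
}


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def normalize(text: str) -> list[str]:
    """Expand abbreviations and small numbers into plain words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
            continue
        token = token.strip(".,;:!?\"'")
        if re.fullmatch(r"[0-9]{1,2}", token):
            words.extend(number_to_words(int(token)).split())
        elif token:
            words.append(token)
    return words


def letter_rules(word: str) -> list[str]:
    """Naive rule-based fallback: one symbol per letter."""
    return [ch.upper() for ch in word if ch.isalpha()]


def to_phonemes(words: list[str]) -> list[list[str]]:
    """Heteronym heuristic first, then dictionary, then rules."""
    result = []
    for i, word in enumerate(words):
        prev = words[i - 1] if i > 0 else ""
        if word in HETERONYMS:
            result.append(HETERONYMS[word](prev))
        elif word in LEXICON:
            result.append(LEXICON[word])
        else:
            result.append(letter_rules(word))
    return result


if __name__ == "__main__":
    words = normalize("Dr. Smith has read 42 books.")
    for word, phones in zip(words, to_phonemes(words)):
        print(word, phones)
```

The lookup order (context heuristic, then dictionary, then rules) is a design choice for this sketch: the dictionary handles irregular spellings, while the rule-based fallback guarantees some output for words the dictionary has never seen.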

The development of the front end speech synthesis system has been the subject of a lot of work over the years, and has been complicated by the fact that the conversion requirements for every language are different. Thus the requirements for Spanish, which has a regular writing system, differ from those of English, which has a very irregular spelling system.

The back end speech synthesis system is where the biggest advances have taken place over the last few years. It is this system that dictates the naturalness and intelligibility of the synthesized speech. Improvements here are why we have moved from the very mechanical, robotic-sounding synthesized speech of a decade or so ago to speech that is often barely distinguishable from the voice of a human.

This naturalness and intelligibility has been particularly important where synthesized speech is used in automated telephone response systems. These are now often extremely sophisticated, and have started to be used to replace human operators in some call centre applications. Applications which are also driving the development of

