Computers taught to lip-read many languages
Scientists have created lip-reading computers that can tell different languages apart. Computers that can read lips are already in development, but according to Brian Bell of London Press Service, this is the first time they have been ‘taught’ to recognise more than one language.
The discovery in Britain could have practical uses for deaf people, for law enforcement agencies and in noisy environments.
Led by Stephen Cox and Jake Newman at the University of East Anglia’s School of Computing Sciences, the research was presented at a recent conference in Taiwan.
The technology was developed by statistical modelling of the lip motions made by a group of 23 bilingual and trilingual speakers.
The system identified which language an individual speaker was using with very high accuracy. The languages included English, French, German, Arabic, Mandarin, Cantonese, Italian, Polish and Russian.
Professor Cox said: “This is an exciting advance in automatic lip-reading technology and the first scientific confirmation of something we already intuitively suspected: that when people speak different languages, they use different mouth shapes in different sequences.
“For example, we found frequent ‘lip-rounding’ among French speakers and more prominent tongue movements among Arabic speakers,” he added.
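The report does not spell out the team’s models, but a common statistical-modelling baseline for this kind of task fits one generative model per language to lip-motion features and picks the language whose model best explains a new utterance. The minimal sketch below uses Gaussian mixture models from scikit-learn; the feature dimensions, language set and mixture size are illustrative assumptions, not details of the UEA system.

```python
# Hypothetical sketch: one GMM per language over lip-motion feature
# vectors. Not the UEA implementation; data and features are assumed.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_language_models(training_data, n_components=8, seed=0):
    """Fit one GMM per language.

    training_data: dict mapping language name -> array of shape
    (n_frames, n_features) of lip-motion features pooled across speakers.
    """
    models = {}
    for language, features in training_data.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag",
                              random_state=seed)
        gmm.fit(features)
        models[language] = gmm
    return models

def identify_language(models, utterance_features):
    """Score an utterance (n_frames, n_features) against every language
    model and return the language with the highest average per-frame
    log-likelihood."""
    scores = {lang: gmm.score(utterance_features)
              for lang, gmm in models.items()}
    return max(scores, key=scores.get)

# Illustrative usage with random stand-in features
# (two languages, 10-dimensional lip-motion vectors).
rng = np.random.default_rng(0)
data = {"English": rng.normal(0.0, 1.0, size=(500, 10)),
        "French": rng.normal(0.5, 1.2, size=(500, 10))}
models = train_language_models(data)
test_utterance = rng.normal(0.5, 1.2, size=(80, 10))
print(identify_language(models, test_utterance))
```

A frame-independent mixture model ignores the ordering of mouth shapes that Professor Cox highlights; a sequence model such as a hidden Markov model over the same features would capture those sequences.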
The research is part of a wider project on automatic lip-reading. The next step will be to make the system more robust to variations in an individual’s physiology and way of speaking.
Cox is director of the Speech, Language and Virtual Humans Laboratory at the University of East Anglia in eastern England. The laboratory covers several closely linked areas involving speech, language and music.
Automatic lip-reading presents a number of demanding scientific challenges. The current project addresses several key questions, including:
- What is the relationship between facial gesture and perceived speech?
- How is that relationship affected by the language of the speaker and the context of the discourse?
- What is the effect of language, speaker pose and discourse context on recognition accuracy?
The joint research project brings together expertise from the Centre for Vision, Speech and Signal Processing at the University of Surrey, the School of Computing Sciences at the UEA, and the Home Office Scientific Development Branch.
The project will build on computer vision and speech recognition to investigate and evaluate automated lip-reading from video.
The goal is to develop tools and techniques to allow automatic, language-independent lip-reading of subjects from video streams. The project will also seek to quantify both human and automatic lip-reading ability.
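The project’s own tools are not described in this report, but as a rough sketch of the computer-vision front end such a pipeline needs, the snippet below uses OpenCV’s bundled Haar face detector to crop an approximate mouth region from each video frame. The video path and the lower-third crop heuristic are assumptions for illustration.

```python
# Hypothetical front-end sketch: crop a rough mouth region from each
# frame of a video using OpenCV's bundled Haar face detector.
# The crop heuristic is an assumption, not the project's method.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mouth_patches(video_path, size=(64, 32)):
    """Yield resized grayscale mouth-region crops, one per frame in
    which a face is detected."""
    capture = cv2.VideoCapture(video_path)
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                         minNeighbors=5)
        for (x, y, w, h) in faces[:1]:  # first detected face only
            # Crude heuristic: the mouth sits in the lower third of
            # the face bounding box.
            mouth = gray[y + 2 * h // 3 : y + h, x : x + w]
            yield cv2.resize(mouth, size)
    capture.release()

# Each crop could then be reduced to a feature vector (for example,
# flattened pixels or shape parameters) and scored by per-language
# models like those sketched earlier.
```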