Researchers develop computer program that can read lips with superhuman accuracy

James Titcomb

Scientists at the University of Oxford have developed software that can read lips correctly 93.4 per cent of the time - a level that far surpasses the best professionals.

Big Brother may soon be watching, but as a computer program that can read your lips with superhuman accuracy.


The researchers said the program, dubbed “LipNet”, could have “enormous practical potential”, being used to improve hearing aids, enable conversations in noisy places or to add speech to silent movies.

However, it may also have more sinister uses, enabling mass surveillance of citizens’ speech in public via CCTV or allowing anyone to pick up on private conversations.

The researchers, working with Google’s artificial intelligence division DeepMind, trained the software on more than 30,000 videos of test subjects speaking sentences. Over time, it would match certain words with particular lip movements to learn what words were being spoken.

The researchers then played it further videos of people speaking sentences, and LipNet succeeded with 93.4 per cent accuracy. This compares with 52.3 per cent for hearing-impaired students, and surpasses other lip-reading programs.

Unlike previous software, LipNet digests phrases as full sentences, allowing it to put words in context rather than decipher them individually, which yields much greater accuracy. It also means the software does not need to split a video into separate clips for each word.
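The standard way to map a stream of per-frame predictions to a whole sentence without segmenting the video word by word is connectionist temporal classification (CTC) decoding, which LipNet's authors describe in their paper. A minimal sketch of the CTC "collapse" step, with an illustrative blank symbol and frame labels (not taken from the actual model):

```python
def ctc_collapse(frame_labels, blank="-"):
    """Turn frame-by-frame labels into an output sequence:
    merge consecutive duplicates, then drop blank symbols."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# A video yields one label per frame; repeats and blanks fall away,
# so no explicit per-word segmentation of the video is needed.
frames = ["-", "p", "p", "-", "l", "a", "a", "-", "c", "e", "-"]
print("".join(ctc_collapse(frames)))  # prints "place"
```

In a full system a neural network emits a probability distribution over labels at each frame and a beam search replaces the simple greedy collapse shown here, but the segmentation-free principle is the same.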

The software isn't quite ready for the real world, however. The research and tests covered only a specific collection of videos featuring just 34 speakers, with every sentence following a set structure: command, colour, preposition, letter, digit, adverb, e.g. “place blue in m 1 soon”.
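Because every sentence in the test corpus follows the same six-slot template, the full space of possible utterances is small and easy to enumerate. A sketch of a generator for such template sentences, with illustrative word lists (the actual corpus vocabulary may differ):

```python
import random

# Six fixed slots, as the article describes: command, colour,
# preposition, letter, digit, adverb. Word choices here are
# illustrative assumptions, not the real corpus vocabulary.
GRID_SLOTS = [
    ("command", ["place", "set", "bin", "lay"]),
    ("colour", ["blue", "green", "red", "white"]),
    ("preposition", ["in", "at", "by", "with"]),
    ("letter", ["a", "b", "m"]),
    ("digit", [str(d) for d in range(10)]),
    ("adverb", ["soon", "now", "again", "please"]),
]

def make_sentence(rng=random):
    """Draw one word per slot to build a template sentence."""
    return " ".join(rng.choice(words) for _, words in GRID_SLOTS)
```

A constrained grammar like this is part of why accuracy is so high: the model only ever has to distinguish a handful of candidates per slot, which is far easier than open-vocabulary speech.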

To be able to understand more complex and variable sentences, as well as more people with different accents, it would need a much bigger bank of videos with more speakers.

With British accents, the software struggled to differentiate between the “aa” and “ay” phonemes (the sounds in “odd” and “hide”) and the “ih” and “ae” sounds (from “it” and “hut”).

It is unclear how it will deal with Irish accents.