There is coming, I think, a great change in how we discover things on the Internet. It is one which will play a major part in making the moving image central to knowledge and research, which is the goal that I am trying to pursue professionally. The great change will be brought about by speech-to-text technologies (also known as speech recognition).
Technologies that convert the spoken word into readable text have been around for a while now. Dictation tools such as those produced by Dragon do a fine job for the single voice which the software has been trained to recognise, and the new generation of smartphones now incorporates voice command technologies working on much the same principles. Speech-to-text systems are used in call centres, and by broadcasters to generate a rough transcript from which subtitles are then produced. But the great challenge has been how to apply speech-to-text to large-scale collections of speech-based audio and video, such as are held by broadcasters, archives and libraries.
Take the British Library for example. We have around one million speech recordings in our Sound Archive. They have catalogue records, so you can find out basic information about their contents, but providing more detailed descriptions, or even transcriptions, is enormously time-consuming and labour-intensive – with the rate of production naturally falling way behind the rate of acquisition. On the video side of things, we have a rapidly growing collection of television news, amounting to 25,000 hours. Around half of this comes with subtitles captured as part of our off-air recording, so we can offer a pretty accurate, word-for-word (and word-searchable) description for those programmes. But for the other half – such as most 24-hour news channels – there are no subtitles. All you get is a one-line description taken from the Electronic Programme Guide saying something like “the latest news from around the world”, and that’s it. We need to open up those recordings to match the level of discovery we can offer for subtitled news programmes.
This isn’t just about opening up speech archives – it’s about levelling the playing field. The digitisation and digital production of text means that full-text searching across a vast corpus is now a reality, as we see with such sites as Project Gutenberg, Hathi Trust, Gallica, Trove, British Newspaper Archive, Papers Past, the Internet Archive and more. If video and sound are to be treated equally by libraries and archives, then that means they need to be discoverable to an equivalent level of depth, and researchers need to be able to pursue subjects through books, manuscripts, newspapers, video and sound recordings on an equal footing. We need to know what those audiovisual records are saying.
Over the past couple of years speech-to-text technologies have developed to a stage where we are very close to achieving such a goal. University departments, broadcasters’ R&D divisions, the video industry, and the major web companies have all been in pursuit of this particular holy grail. It is no easy matter, as the human voice is a complex thing, and the huge variety of voices represented by any large video or sound archive will cover many different accents, languages, arrangements (i.e. multiple voices), instances of background noise, and so on. It’s interesting to see what’s driving much of this activity. It’s not an idealistic wish to push back the barriers of research. Go to the websites of the developers and service providers and again and again you’ll see the same thing – demonstrations of how good they are at reading Arabic. It is surveillance demands that are pushing this particular industry forwards.
Over the past year I have been leading a project at the British Library, entitled ‘Opening up Speech Archives’, which has looked at the application of speech-to-text technologies for research, particularly in the arts and humanities. Funded by the Arts & Humanities Research Council, the project is not about assessing the best technical solutions. One thing you quickly learn when studying this field is that different applications work best in different situations. Instead the project has been looking at things from the researcher’s perspective, and asking some basic questions. How useful are the results to researchers? What are the methodological and interpretative issues involved? And how can speech-to-text technology be adopted in UK research in a form that is readily accessible and affordable?
For the project we have been interviewing researchers, either on a one-to-one basis, or in group sessions, and getting them to try out research topics on a variety of speech-to-text and related systems. We have been surveying the field, trying to get a good sense of the options and the possibilities. We have been in discussion with various vendors and service providers, and sending them test content. We are working on creating a demonstrator service. And we plan to publish our findings and share them with the research community at large.
So, what’s out there? Well, I have gathered together just a few examples of the interesting work going on. Microsoft have been working in this area for some time, with Microsoft Research having produced a system that they call MAVIS. Speech recognition systems tend to fall into two camps – either they are dictionary-based, so that they match the sounds played to them to words in their dictionary, or they work from individual sound elements, or phonemes. MAVIS is dictionary-based. It is now marketed as inCus by a company called GreenButton. You can try the system out at the ScienceCinema site, which is a collection of videos from the US Department of Energy and the European Organisation for Nuclear Research (CERN). Type in any term (pick a scientific one – ‘chemistry’ is a good example), and the results will be presented as a list of sentences in which your search term appears. Click on any one, and it takes you directly to that point in the video.
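To make that last step concrete, here is a minimal Python sketch of how a timed transcript can turn a word search into jump-to points in a video. It is my own illustration, not a description of how MAVIS or inCus actually works: the `Word` class, the `search` function and the sample transcript are all invented for the purpose.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str     # the recognised word
    start: float  # seconds from the start of the video


def search(words, term, context=3):
    """Return (start_time, snippet) for each occurrence of term."""
    term = term.lower()
    hits = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if word.text.lower().strip(".,;:!?") == term:
            lo = max(0, i - context)
            hi = min(len(words), i + context + 1)
            snippet = " ".join(w.text for w in words[lo:hi])
            hits.append((word.start, snippet))
    return hits


# Invented sample: a few recognised words with their timings.
transcript = [
    Word("the", 12.0), Word("chemistry", 12.3), Word("of", 12.9),
    Word("the", 13.0), Word("reaction", 13.2), Word("is", 13.8),
]

for start, snippet in search(transcript, "chemistry", context=2):
    print(f"{start:.1f}s: {snippet}")
```

The point is simply that once every recognised word carries a time, each search hit becomes both a snippet to read and a place to start playing.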
Demonstration video for Nexidia’s Dialogue Search
An example of phoneme-searching is the solution provided by Nexidia, well-known in the video production business for its association with AVID. This sort of search system looks not for words but how words are constructed – so, for example, if you search for ‘barack obama’ it will look for ‘buhr-ock-oh-bah-muh’, returning the instances in a soundtrack where those phonemes come together. There isn’t a transcript you can browse, because the system of course doesn’t produce transcripts. I haven’t got an online example that I can point to, but the demonstration video above for the latest incarnation of its service, Dialogue Search, shows how it works.
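As a rough illustration of the principle, and no more than that, the Python sketch below turns a query into a phoneme sequence and looks for that sequence in a timed stream of recognised phonemes. I do not know how Nexidia implements its search; the pronunciation table, the phoneme symbols (loosely ARPAbet-style), the function names and the sample stream are all assumptions of mine.

```python
# Hypothetical pronunciation table; a real system would use a far
# larger one, or rules for turning spelling into sounds.
PRONUNCIATIONS = {
    "barack": ["B", "AH", "R", "AA", "K"],
    "obama": ["OW", "B", "AA", "M", "AH"],
}


def to_phonemes(query):
    """Spell out a query as one flat list of phonemes."""
    phonemes = []
    for word in query.lower().split():
        phonemes.extend(PRONUNCIATIONS[word])  # KeyError if unknown
    return phonemes


def find_phonemes(stream, query):
    """Return start times where the query's phonemes occur in order.

    stream is a list of (phoneme, start_seconds) pairs.
    """
    target = to_phonemes(query)
    symbols = [phoneme for phoneme, _ in stream]
    n = len(target)
    return [
        stream[i][1]
        for i in range(len(symbols) - n + 1)
        if symbols[i:i + n] == target
    ]


# Invented sample soundtrack, already reduced to timed phonemes.
stream = [
    ("S", 0.0), ("B", 0.5), ("AH", 0.6), ("R", 0.7), ("AA", 0.8),
    ("K", 0.9), ("OW", 1.0), ("B", 1.1), ("AA", 1.2), ("M", 1.3),
    ("AH", 1.4), ("Z", 1.5),
]

print(find_phonemes(stream, "barack obama"))
```

This version demands an exact match, which is the simplest thing that could work. I would expect a production system to tolerate near misses and rank them by likelihood, but that is a guess on my part.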
Or you can go down the route taken by BBC Research & Development. They have been working with the most popular open source speech-to-text toolkit, called CMU Sphinx (CMU = Carnegie Mellon University). The archives they have been testing this on are the radio archives of the World Service. Instead of using the software to produce a pseudo-transcript, they use it to generate keywords (drawn from DBpedia) which then serve as tags for individual radio recordings. They combine these with tags created from their own catalogue records, then ask users to listen to the programmes and select those machine-read tags that accurately reflect the contents of the programme, rejecting those tags that are inaccurate. It’s an ingenious combination of machine and crowdsourcing that might just work as a model to build up the self-generating archive discovery system. (It also does ingenious things such as identify different speakers according to their vocal patterns, so you can break up a speech file into different people). This exists as a prototype for invitees only, and the programmes themselves are not openly available. (There’s more information on the BBC R&D blog about the project).
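The combination of machine tags and human judgement can be sketched in a few lines of Python. This is only my guess at the general shape of such a workflow, not the BBC’s code: the function names, the voting threshold and the sample tags are invented.

```python
from collections import Counter


def candidate_tags(machine_tags, catalogue_tags):
    """Pool machine-generated and catalogue-derived tags, without duplicates."""
    return sorted(set(machine_tags) | set(catalogue_tags))


def accepted_tags(candidates, votes, min_votes=2):
    """Keep tags that enough listeners judged, and that most approved.

    votes is a list of (tag, approved) pairs, one per listener judgement.
    min_votes is an assumed threshold, not a figure from the project.
    """
    yes = Counter(tag for tag, approved in votes if approved)
    no = Counter(tag for tag, approved in votes if not approved)
    return [
        tag for tag in candidates
        if yes[tag] + no[tag] >= min_votes and yes[tag] > no[tag]
    ]


# Invented example for one radio programme.
machine = ["Nelson Mandela", "Tim Buckley"]
catalogue = ["South Africa"]
votes = [
    ("Nelson Mandela", True), ("Nelson Mandela", True),
    ("Tim Buckley", False), ("Tim Buckley", False),
    ("South Africa", True), ("South Africa", True),
    ("South Africa", False),
]

print(accepted_tags(candidate_tags(machine, catalogue), votes))
```

The machine proposes, the listeners dispose: a mis-heard tag is voted out, while a correct one gains the authority of several human ears.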
One of the most prominent examples of speech-to-text in action can be found on the BBC News website. Democracy Live is a collection of videos of proceedings in the UK Houses of Parliament, Scottish Parliament, Welsh Assembly, Northern Ireland Assembly, select committees and the European Parliament. These videos are all word-searchable, using a system developed by Autonomy, one of the major players in this select field. The site is a good example of how speech-to-text systems can work in practice, not just finding words quickly but presenting them in their context, with links to MPs or other representatives, to associated themes, and so on.
Demonstration video for Voxalead
This enrichment of the search experience by generating subject terms, associations and other research tools out of the raw data generated by speech-to-text and other data sources is demonstrated by Voxalead. A product of French search engine Exalead, produced by Dassault Systèmes, this is a test service for searching video news online (Al Jazeera, BBC World Service, France 24, NBC, CBS etc). Type in any subject and it searches for programme descriptions, closed captions (subtitles) and audio tracks which it has indexed using speech-to-text. It presents these descriptions alongside related terms and people’s names that it has extracted from the records, linking these to other instances of those names. It takes the geographical records (i.e. place names) and positions these on a map. It also gives you a timeline showing when your search term is at its most frequent (search for ‘Mali’ for example and see how interest in the word has mushroomed over the past few weeks).
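The timeline is the easiest of these features to imagine in code. The Python sketch below counts how often a term appears per month across a set of dated transcripts. It is an illustration under my own assumptions, not Voxalead’s method: the `timeline` function and the sample records are invented.

```python
import re
from collections import Counter
from datetime import date


def timeline(records, term):
    """Count whole-word occurrences of term per (year, month).

    records is a list of (broadcast_date, transcript_text) pairs.
    """
    pattern = re.compile(r"\b" + re.escape(term.lower()) + r"\b")
    counts = Counter()
    for day, text in records:
        n = len(pattern.findall(text.lower()))
        if n:
            counts[(day.year, day.month)] += n
    return sorted(counts.items())


# Invented sample records.
records = [
    (date(2012, 11, 3), "Talks on Mali continue"),
    (date(2013, 1, 12), "French troops enter Mali. Mali dominates the headlines."),
    (date(2013, 1, 20), "Mali latest"),
    (date(2013, 1, 21), "Weather"),
]

for (year, month), count in timeline(records, "Mali"):
    print(f"{year}-{month:02d}: {count}")
```

Plot those counts against time and you have the mushrooming curve described above.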
Voxalead is a really interesting demonstration of how these systems can work, though it’s only hosted as part of the labs site, and might disappear at any moment. That’s what happened to Google’s Gaudi service, which was available on the Google Labs site a few years ago, showing how you could search across videos of the first Obama election using speech-to-text, with instances of the word marked along the timeline of the video. This doesn’t mean Google has abandoned speech-to-text. Quite the opposite – the indications are that it is planning something major. There’s the almost casual mention in a recent Guardian piece on Amit Singhal, head of Google Search, that the company “has been assiduously accumulating as much human voice recording as possible, in all the languages and dialects under the sun, in order to power its translation and voice recognition projects”. That, plus the news of the recruitment of Ray Kurzweil, language processing expert, as director of engineering, with a brief “to create technology that truly understands human language and its real meaning” and a blank cheque to enable him to do so, tends to point to something big, eventually.
Click on the automatic caption box on the right-hand side to see what I’m not saying about the British Library “brokaw snooze” service…
There are other indicators of big-ness. In January 2012 a US law was passed which says that all TV broadcasts from the USA when published on the Web need to come with closed captions, to enable accessibility for all. This isn’t speech-to-text, but it is making a major tier of web video word-searchable, and where such a law can be passed in the USA, can Europe be far behind in its thinking? Already you can see on any YouTube video a new automatic captions service provided on the navigation bar. This provides an automatic transcription of what the person is saying. Often the results are quite comical, as you may see from the quite awful video of me above, which I use as an example only to spare the blushes of others.
Comical mis-transcriptions of speech-to-text are a joy. I particularly treasure one service transcribing “Chechnya” as “sexy eye”, and Voxalead reporting that French troops had liberated the city of Tim Buckley (Timbuktu). This Voxalead ‘transcription’ of some Al Jazeera news headlines from 4 February 2013 gives an idea of some of the delights to be found:
The Syrian president has accused Israel of trying to destabilize has country Bashar Al-Assad has told the Iranian foreign minister. The Serbian army and capable of handling any intervention. It comes as Israel’s defense minister, Ehud Barak hinted that Israeli involvement in an air strike near Damascus last week. British scientists are due to announce its results on whether the skeleton of King Richard the 3rd has been found buried under a cow pot. The first thing century ruler was immortalized in Shakespeare’s clay the hunchback you soak who had his 2 young nephews murdered archeologists think they find might lead to a real evaluation of the Monuc tension as a villain. You can find the latest on all those stories and more.
This must serve as a reminder that speech-to-text is not perfect, and probably never will be. Accuracy rates tend to range from 60% to 90%, depending on whether it is one person speaking to camera, or multiple voices. One reason that news programmes feature so heavily in such services is that the use of news presenters is ideal for them. Speech-to-text systems are not about perfect transcriptions in any case – they are about improving search. You have to concentrate on what they get right, and where that leads you.
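For anyone wondering what an accuracy figure of this kind might mean in practice, one common approach is to count the word-level edits (substitutions, insertions and deletions) needed to turn the machine’s output into a correct human transcript. I don’t know how the figures quoted above were measured, so treat the Python sketch below as one plausible method; the function names are my own.

```python
def word_errors(reference, hypothesis):
    """Return (edit_distance_in_words, number_of_reference_words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Standard Levenshtein distance, computed over words not letters.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1], len(ref)


def word_accuracy(reference, hypothesis):
    """One minus the word error rate; can dip below zero."""
    errors, total = word_errors(reference, hypothesis)
    if total == 0:
        raise ValueError("reference transcript is empty")
    return 1 - errors / total


spoken = "french troops have liberated the city of timbuktu"
heard = "french troops have liberated the city of tim buckley"
print(word_accuracy(spoken, heard))
```

By this measure the Tim Buckley sentence scores 75%: two errors against eight words spoken, which shows how a transcript can be mostly right and still miss the one word you were searching for.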
We’re holding a conference at the British Library on 8 February, entitled Opening up Speech Archives, where we will be discussing some of these services, and trying to assess how best they can benefit academic research. There will be speakers from BBC R&D, the Netherlands Institute for Sound & Vision, Oxford University, Cambridge Imaging Systems, the BUFVC and Autonomy. The event is full, but we will be publishing conclusions from the conference and the project overall on the British Library website hopefully by the end of February – certainly not long after. You’ll also be able to follow me – and other participants – on Twitter, through the hash tag #ousa2013.
Will all of this change how we discover things on the Internet in a truly significant way? I think it will. It’s not just that we’ll be able to uncover huge amounts of speech-based content (assuming that these systems become affordable on a mass scale). It’s how these records will be discoverable alongside all the other text-based records in libraries, archives, and the Web, that is going to be so revolutionary. We will have two levels of discovery – the basic level (a catalogue description, essentially), and the full-text level, in which every word in a document, of whatever medium, is discoverable. And from the words our systems will then build, or enable us to build, further associations by extracting key terms – subjects, names, locations, dates, time periods, concepts – which can then create links to other files, and be used themselves to visualise data, to map associations, to learn new things about the familiar and to discover the hidden and unsuspected.
Such interconnected systems won’t just make us able to do what we do now, which is to search in a rather linear way (query > listing > answer), but will immerse us in data, radiating out from whatever it is we are thinking about. We will have to see things differently, ask new questions, discover things we hadn’t even realised we were looking for. And the moving image will be central to all this. Of course moving images aren’t all about speech (there’s a whole other long post to be written about image recognition systems), but words give moving images their specificity, and they connect the medium to the traditional modes of discovering knowledge, which is to say through books and manuscripts. 120 years or so after motion picture film was established, we might finally be in a position to start learning from it.
- There’s background information on the Opening up Speech Archives project on my British Library Moving Image blog
- The ‘Speech to Text’ group on the Playback network for sound enthusiasts describes several of the systems out there
- ‘Google and the future of Search’, an article from The Guardian on 19 January 2013, gives some idea of the extent of its future ambitions
- See where speech-to-text comes on this graph of ‘Gartner’s Hype Cycle for Emerging Technologies’, alongside natural-language question answering, automatic content recognition, 3D bioprinting and mobile robots
- Japanese mobile network NTT Docomo is to introduce a service in which users can speak in one language down the phone and the listener can hear the results in another language, thanks to a mixture of speech recognition and language software. Is this the way new wars might begin…?
- The talk I gave at the Opening up Speech Archives conference, adapted from this blog post, is available here: https://lukemckernan.com/wp-content/uploads/speechtotext.pdf