SpeechBot - Indexing Audio Conversations


John Dowdell points to an interesting research project at HP Labs called SpeechBot. As the site describes it, "SpeechBot is a search engine for audio & video content that is hosted and played from other websites".

Digging a little deeper into the technical documentation for SpeechBot, I came across this summary:

SpeechBot (http://www.compaq.com/speechbot) is the first Internet search site for indexing streaming spoken audio on the web. Unlike previous attempts to index spoken audio on the Web, which have relied on either adjacent text, metadata, or hand supplied transcripts and close captions, SpeechBot uses automatic speech recognition technology to transcribe and index documents that do not have transcripts or other content information. The use of speech recognition permits the efficient and cost-effective indexing of thousands of hours of audio content, which were previously inaccessible. Because of this indexing, SpeechBot allows users to quickly search for relevant content in long audio documents and yields a high precision on first page-retrieved items.

SpeechBot indexes streaming media files based on their content, much as conventional search sites index ordinary Web pages by their text content. Like conventional search sites, SpeechBot does not store or serve the multimedia files themselves, but rather provides users with links. SpeechBot’s current index has over 3200 shows, 3500 hours of audio and 20 million words. The index is continually updated using SpeechBot’s highly scalable architecture.

Source: SpeechBot White Paper

SpeechBot was designed, in principle, to dynamically index streaming audio and other multimedia files that otherwise lack text transcripts. Unlike traditional text documents, audio and other multimedia documents have an additional time dimension to account for. The interesting thing about SpeechBot is not just that it generates textual transcripts from streaming media sources but that it also indexes time- and format-specific metadata into a separate database. Keyword searches then use both databases to pinpoint the location of a reference in the stream itself.
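
To make the two-database idea concrete, here is a minimal sketch in Python. It is not SpeechBot's actual implementation, and every name in it (TimedIndex, MediaInfo, the example URL) is hypothetical. It assumes a speech recognizer has already produced (word, offset in seconds) pairs for each stream, and it keeps one store for word positions and a second for per-stream metadata such as the link and format. A keyword search joins the two to return a link plus the point in the stream where the word was spoken.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MediaInfo:
    """Per-stream metadata (hypothetical fields, for illustration only)."""
    url: str
    media_format: str
    duration_s: float


class TimedIndex:
    """Toy index: one store for word positions, one for stream metadata."""

    def __init__(self):
        # "Database" 1: word -> list of (doc_id, offset in seconds)
        self._postings = defaultdict(list)
        # "Database" 2: doc_id -> MediaInfo
        self._media = {}

    def add_document(self, doc_id, info, timed_words):
        """Register a stream and the recognizer's (word, offset_s) pairs."""
        self._media[doc_id] = info
        for word, offset_s in timed_words:
            self._postings[word.lower()].append((doc_id, offset_s))

    def search(self, keyword):
        """Return (url, format, offset_s) for every spoken occurrence."""
        hits = []
        for doc_id, offset_s in self._postings.get(keyword.lower(), []):
            info = self._media[doc_id]
            hits.append((info.url, info.media_format, offset_s))
        return hits


# Example: one made-up show with three recognized words.
index = TimedIndex()
index.add_document(
    "show-1",
    MediaInfo("http://example.com/show1.rm", "realaudio", 1800.0),
    [("privacy", 12.5), ("policy", 13.1), ("privacy", 904.0)],
)
hits = index.search("Privacy")
```

In this sketch the index stores only links and offsets, never the media itself, which matches the white paper's description of SpeechBot as a site that provides links rather than serving files. A real system would also have to cope with recognition errors and ranking, which this toy version ignores.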

This type of technology, however, is most interesting when used in a quite different context. The current assumption is that the greatest value lies in searching published media. Realistically, though, there seems to be an even greater opportunity on the horizon. Consider two fast-approaching trends: 1) the migration of both consumer and business phone services to IP-based technologies and 2) the growth of real-time communication tools such as IM, video conferencing, and application sharing. Both trends generate "streams" of content-rich media, though these are usually consumed immediately rather than persisted: "runtime media", so to speak.

Imagine applying this instead to searching your voicemail by keyword or, better yet, your online conversations with co-workers or friends. Unfortunately, the resources required to support SpeechBot are extensive, and any usage in this scenario would require not only deep pockets but overwhelming public trust. This is probably not such a problem for now, as privacy advocates would still have a field day with this application. Just think -- as packets are to Carnivore, our real-time, online engagements are to SpeechBot. And we know how much everyone loves Carnivore.