The MALACH project deals with automatic speech recognition, machine translation, and information retrieval. I work on support for searching and exploring the project's collection.
The spoken documents we work with are video interviews of Holocaust survivors collected by the Shoah Foundation. To give an idea of the size of the collection: there are 52,000 interviews, each about 2.5 hours long, recorded in 32 languages.
Challenges:
1. Linear and non-scannable nature of speech.
When you search a collection of text (e.g., the Web), it is easy to scan through the returned documents (read the titles and snippets returned by the search engine, or even skim the full text) to estimate the relevance of the material. Speech is different: you cannot scan it as easily as text. How do you provide a content overview that is both effective and efficient? What would an overview look like that lets a user quickly focus on the relevant interviews and then zoom in on the interesting parts of each interview? The situation is made more difficult by the following challenge.
2. Poor quality of the text transcripts.
One possible solution to the previous challenge is to perform the search on a collection of interview transcripts, effectively reducing the problem of searching speech to the problem of searching text. However, our collection has no human-generated transcripts, and producing them is not feasible given the sheer volume of the material. We are presently working with automatically generated transcripts, which contain a significant number of incorrectly recognized words. The quality of these transcripts is sufficient to allow search, but not acceptable for smooth reading. How do we use the “word soup” generated by the ASR to provide a content description?
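To make this concrete, below is a minimal sketch of what searching the ASR “word soup” could look like: an inverted index over transcript segments, ranked with a plain TF-IDF score. The segment records, field names, and scoring are illustrative assumptions, not our actual data model or retrieval engine.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and keep letter runs; ASR output has little reliable punctuation anyway."""
    return re.findall(r"[a-z']+", text.lower())

class TranscriptIndex:
    """Toy inverted index over hypothetical ASR transcript segments."""

    def __init__(self, segments):
        # segments: dicts like {"interview": "id", "start": seconds, "text": "..."} (assumed format)
        self.segments = segments
        self.df = Counter()                 # in how many segments each term occurs
        self.postings = defaultdict(list)   # term -> [(segment index, term frequency)]
        for i, seg in enumerate(segments):
            for term, tf in Counter(tokenize(seg["text"])).items():
                self.df[term] += 1
                self.postings[term].append((i, tf))

    def search(self, query, k=5):
        """Rank segments by TF-IDF; misrecognized words hurt recall, but the
        correctly recognized terms still carry enough signal for ranking."""
        n = len(self.segments)
        scores = defaultdict(float)
        for term in tokenize(query):
            idf = math.log((n + 1) / (self.df[term] + 1))
            for i, tf in self.postings.get(term, []):
                scores[i] += (1 + math.log(tf)) * idf
        top = sorted(scores.items(), key=lambda item: -item[1])[:k]
        return [(self.segments[i], score) for i, score in top]
```

This is the intuition behind the observation above: ranking degrades gracefully with recognition errors, while reading does not.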
We are looking at automatic ways of labeling the media stream. Given a large ontology of labels and a set of training examples, we are building a system that will learn to assign the labels using the speech recognition output.
The label assignment is probabilistic: each label is active at each moment of the recording with a certain probability. We are exploring alternative interface designs to visualize these labels and to use the visualization for browsing and navigating the media stream; a rough sketch of such a labeler follows the reference below.
Reference: [Oard03-sss]
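To illustrate the kind of labeler we have in mind, here is a rough sketch that trains a one-vs-rest text classifier on (passage, labels) pairs and returns a probability for every label on a window of recognizer output. The training examples, label names, and windowing are invented for the illustration; only the overall shape (ontology labels assigned probabilistically per time window) reflects the description above.

```python
# Sketch only: scikit-learn one-vs-rest classifier over TF-IDF features of ASR text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical training examples: ASR text of a passage and the ontology labels assigned to it.
train_texts = [
    "we were put on the train and taken to the camp in the winter",
    "after liberation i searched for my brother through the red cross",
]
train_labels = [["deportation", "transport by rail"], ["liberation", "family tracing"]]

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(train_labels)          # one binary column per label

model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(train_texts, y)

def label_probabilities(asr_window):
    """Return {label: probability} for one time window of recognizer output."""
    probs = model.predict_proba([asr_window])[0]
    return dict(zip(binarizer.classes_, probs))

# Sliding this over the media stream yields a probability track per label,
# which is what the browsing interface would visualize.
print(label_probabilities("the train arrived at the camp at night"))
```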
3. Context interpretation
When a user asks a question of an information retrieval system, the system can point the user to the most relevant document and to the most relevant part of that document. Let us assume for a moment that the system made no mistake and that the returned paragraph of text is exactly the most relevant datum in the collection. The user may still have trouble assimilating the information if she takes the paragraph out of context and disregards the surrounding material. With a text document the user can quickly examine the context by reading the text that precedes and follows the returned paragraph; with a speech document this is far less feasible.
We are working on designing an artificial intermediary that will interpret the user’s question and the available speech material to make the connection more natural and “ease” the user into the context of the interview.
For example, imagine a user who asks, “What was it like to be on a train going from Bremen to Auschwitz when you were 11 years old?” Suppose the collection has no interviewees who describe exactly that experience, but there are very similar testimonies. The system interprets the user’s request, selects a similar testimony, and makes the connection by explicitly stating the differences between the request and the selected material.
An avatar appears and says something like: “Mike, you were about that young when you went to a labor camp, weren’t you? Please tell us about that morning when it all started.” The user then sees an interview fragment in which a man describes the departure. The interviewee may not describe the railroad journey itself, so the avatar appears again and introduces another fragment: “Mary, you were 12 when you traveled to Auschwitz from Warsaw, weren’t you? Please tell us what those train cars looked like.”
Our idea is a panel metaphor, in which the user interacts with a group of interviewees and the system provides the appropriate switching between the speakers.
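A bare-bones sketch of the switching and bridging logic might look like the following. The attribute matching is deliberately naive, and the speaker records, attribute names, and wording templates are invented; the point is only the explicit statement of differences before each clip is played.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    attributes: dict   # e.g. {"age": 12, "event": "train to Auschwitz", "origin": "Warsaw"}
    summary: str       # short description of what the clip covers

def similarity(request, attributes):
    """Count how many of the requested attributes the segment shares."""
    return sum(1 for key, value in request.items() if attributes.get(key) == value)

def bridge(request, segment):
    """State the differences between the user's request and the selected clip."""
    diffs = [
        f"{key}: you asked about {value}, this clip describes {segment.attributes.get(key, 'something else')}"
        for key, value in request.items()
        if segment.attributes.get(key) != value
    ]
    intro = f"{segment.speaker}, please tell us about that. ({segment.summary})"
    return intro if not diffs else intro + " Note the differences: " + "; ".join(diffs) + "."

def switch_speaker(request, panel):
    """Pick the panel member whose testimony best matches the request and build the bridge."""
    best = max(panel, key=lambda seg: similarity(request, seg.attributes))
    return best, bridge(request, best)

# Hypothetical usage based on the example in the text.
panel = [
    Segment("Mike", {"age": 11, "event": "departure to a labor camp"}, "the morning it all started"),
    Segment("Mary", {"age": 12, "event": "train to Auschwitz", "origin": "Warsaw"}, "what the train cars looked like"),
]
request = {"age": 11, "event": "train to Auschwitz", "origin": "Bremen"}
speaker, prompt = switch_speaker(request, panel)
print(speaker.speaker, "->", prompt)
```

In the real system the request interpretation, the selection, and the wording of the avatar’s bridge would come from the language-understanding and generation components rather than from fixed templates; the sketch only shows the comparison step made explicit.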