Q1: What is AbbotDemo?
Q2: Why was AbbotDemo released?
Q3: How do I install AbbotDemo?
Q4: How do I run AbbotDemo?
Q5: Selecting the input device
Q6: Troubleshooting
Q7: Upgrading AbbotDemo
Q8: How does AbbotDemo work?
Q9: What short-cuts have been made in this system?
Q10: Known bugs
Q11: Is this package supported?
Q12: Can I do phone recognition?
QN-2: Legalities
QN-1: Who is responsible for AbbotDemo?
QN: Where can I find out more?
AbbotDemo is a packaged demonstration of the Abbot connectionist/HMM continuous speech recognition system developed by the Connectionist Speech Group at Cambridge University. The system is designed to recognize British English and American English clearly spoken in a quiet acoustic environment.
This demonstration system has a vocabulary of 20,000 words - anything spoken outside this vocabulary cannot be recognised (and will instead be recognised as a different word or string of words). The vocabulary and grammar are based around the task of reading from a North American business newspaper (the word list is given in the file spring98-20k.lst).
To install, you need to extract the files using gzip and tar. Typically this will look something like:
unix$ gunzip -c AbbotDemo-0.6.tar.gz | tar xvf -
unix$ ./AbbotDemo -us
A window should appear, called abbotAudio, for controlling the recording of the speech. A sample session is described below.
1 THIS
1 SPEECH READING
1 SPEECH RECOGNITION THE
1 SPEECH RECOGNITION IS A B.
1 SPEECH RECOGNITION IS A PIECE OF THE
1 SPEECH RECOGNITION IS A PIECE OF CAKE
The script prints out the best guess at the word string as the recognition proceeds, and the final recognised word string at the end. Recognition should take about 20 Mbyte of memory and run in a few times real time on a Pentium or faster processor.
Alternatively, if you do not have X or have problems associated with abbotAudio, you can send prerecorded files through the recogniser by specifying the names of the audio files on the command line. These files should be of speech sampled at 16 kHz with 16 bits/sample in the natural byte order and with no header. For example:
unix$ srec -t 3 -s16000 -b16 test.raw
Speed 16000 Hz (mono)
16 bits per sample
unix$ ~/AbbotDemo-0.4/AbbotDemo test.raw
1 SPEECH RECOGNITION IS A PIECE OF CAKE
The file test.raw is included as an example in the 'etc' directory.
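If you cannot record audio, a file in the required format (16 kHz, 16 bits/sample, native byte order, no header) can be synthesised in a few lines. The sketch below is illustrative only - the filename tone.raw and the 440 Hz test tone are arbitrary choices, and a pure tone will of course not be recognised as speech:

```python
import math
import struct

SAMPLE_RATE = 16000   # 16 kHz, as AbbotDemo expects
DURATION_S = 1.0
AMPLITUDE = 10000     # well inside the 16-bit signed range

# Write a one-second 440 Hz tone as headerless 16-bit PCM.
# struct format "=h" means: native byte order, 16-bit signed integer.
with open("tone.raw", "wb") as f:
    for n in range(int(SAMPLE_RATE * DURATION_S)):
        sample = int(AMPLITUDE * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
        f.write(struct.pack("=h", sample))
```

The resulting file can be passed to AbbotDemo on the command line exactly like test.raw above.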
To select where the audio is recorded from and played back to, click on the "Audio" menu-button on abbotAudio, and highlight the appropriate option. Both input and output channels are controlled from this one menu.
The system can be upgraded to use a larger, and therefore more accurate, language model by fetching the file spring98-20k-8-8.bin.gz from the AbbotDemo FTP directory (see Q3). You will need the GNU gzip utility to uncompress this file. To run this system we recommend 64 Mbyte RAM. Please note that this file is 22 Mbyte when compressed, which is why it is not included in the core distribution.
To run AbbotDemo with the new language model simply uncompress and specify the file name on the command line with the -lm flag like this:
unix$ AbbotDemo -uk -lm spring98-20k-8-8.bin
In addition, we produced a language model for the EuroSpeech conference based on the EuroSpeech93 proceedings and some speech papers that were available locally. The amount of training data is much less than for the standard North American business news domain, and hence the language model quality is not as good, but it does show the use of speech recognition in another domain. Pick up a copy of the ICASSP, EuroSpeech or ICSLP proceedings and read to both systems to see the effect of having an appropriate language model.
The files are available from host svr-ftp.eng.cam.ac.uk in directory pub/comp.speech/recognition/AbbotDemo/ as euro16k-00.bin.gz and euro16k.dict.gz. Only British English pronunciations are supported. The system is run as:
AbbotDemo -uk -dictionary euro16k.dict -lm euro16k-00.bin
euro16k.dict contains pronunciations for 17,372 words and the FTP size is 4.5 Mbyte. 32-64 Mbyte of RAM is recommended.
AbbotDemo uses one recurrent network to estimate phone probabilities. We find that using four networks and combining the outputs can result in 20% fewer errors.
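One common way to combine the outputs of several such networks (not necessarily the exact scheme used in Abbot) is to average the per-frame phone posteriors in the log domain and renormalise. A minimal Python sketch with invented numbers:

```python
import math

def combine_networks(posterior_rows):
    """Geometric mean of per-network phone posterior vectors, renormalised.

    posterior_rows: one probability vector per network, all the same length.
    """
    n = len(posterior_rows)
    combined = [
        math.exp(sum(math.log(row[i]) for row in posterior_rows) / n)
        for i in range(len(posterior_rows[0]))
    ]
    total = sum(combined)
    return [c / total for c in combined]

# Hypothetical posteriors over three phones from two networks:
net_a = [0.7, 0.2, 0.1]
net_b = [0.6, 0.3, 0.1]
merged = combine_networks([net_a, net_b])
print(merged)
```

The merged estimate is typically more reliable than any single network because uncorrelated errors tend to cancel.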
AbbotDemo uses context independent phone models. Using context dependent phone models approximately doubles the number of parameters (and therefore the CPU required to run the network in real-time) but does result in about 15% fewer errors.
The vocabulary size of AbbotDemo is 20,000 words. If you pick up a copy of the Wall Street Journal, chances are this will cover only about 97% of the words in a given passage of text, and therefore there will be at least 3% errors as the system cannot recognise words that are not in the vocabulary.
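This out-of-vocabulary error floor is easy to compute for any passage; the vocabulary and sentence below are invented for illustration:

```python
# Every out-of-vocabulary (OOV) word is guaranteed to be misrecognised,
# so the OOV fraction is a lower bound on the word error rate.
vocab = {"speech", "recognition", "is", "a", "piece", "of", "cake"}
text = "speech recognition of kumquats is a piece of cake".split()

oov = [w for w in text if w not in vocab]
print(f"OOV rate: {len(oov) / len(text):.1%}")  # "kumquats" can never be right
```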
The size of the language models was constrained in order to allow ease of FTP access and reasonable disk usage. This results in a significant increase in the perplexity (the average number of words that are considered as the next word). The resulting increase in word error rate has been measured at 16%.
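For reference, perplexity is the exponential of the average negative log probability the model assigns to each test word - the geometric mean of the inverse word probabilities. A minimal sketch (the probabilities here are made up):

```python
import math

def perplexity(word_probs):
    """Perplexity of a model over a test sequence, given the probability
    the model assigned to each word in turn."""
    avg_neg_logp = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_neg_logp)

# A model that assigns every word probability 1/100 behaves as if it were
# choosing uniformly among 100 candidate next words: perplexity 100.
print(perplexity([0.01] * 5))
```

A smaller language model spreads probability over more plausible continuations, raising perplexity and, in turn, the word error rate.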
The decoder, noway, uses tighter pruning than our "evaluation quality" decodes in order to run faster. The pruning is adjustable with the -beam, -state_beam, -n_hyps and -prob_min options at the end of the AbbotDemo script. Our tests show that the faster options supplied with this demo result in about a 10% increase in word error rate.
For the US English version of AbbotDemo, the "evaluation quality" pronunciation dictionary has been replaced by the publicly available CMU lexicon. This causes a mismatch between training and testing pronunciations and results in an additional 20-30% increase in word error rate for this system (please note that the increased error rate is due to the mismatch between lexica and is not necessarily due to the quality of the CMU pronunciations).
Totalling all these numbers for a 20k system suggests that AbbotDemo's error rate is about two to three times what might otherwise be achieved (we have not yet measured the combined degradation). Our typical word error rate on clean read speech is about 11-17%.
Much of the funding for the recent development of this system was provided by the ESPRIT Wernicke Project with partners:
CUED Cambridge University Engineering Department, UK
ICSI International Computer Science Institute, USA
INESC Instituto de Engenharia de Sistemas e Computadores, Portugal
LHS Lernout Hauspie SpeechSystems, Belgium
and associates:
SU Sheffield University, UK
FPMs Faculte Polytechnique de Mons, Belgium
Dedicated hardware for training the recurrent networks and system software for that hardware were provided by ICSI.
The acoustic and language models for AbbotDemo were derived from materials distributed by the Linguistic Data Consortium. ftp://ftp.cis.upenn.edu/pub/ldc
The CMU statistical language modelling toolkit was used to generate the trigram language model.
The BEEP dictionary was used for British English pronunciations. ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/data/beep-0.7.tar.gz
The CMU dictionary was used for American English pronunciations. ftp://ftp.cs.cmu.edu/project/fgdata/dict/cmudict.0.4.Z
The CMU phone set was expanded using code provided by ICSI.
The X-windows interface for speech capture was derived from speech processing software developed at the Laboratory for Engineering Man/Machine Systems (LEMS) at Brown University.
The ABBOT home page is
Please direct enquiries to AbbotDemo@softsound.com