Healthcare is one of the areas that has looked favorably at integrating speech recognition, since eliminating transcription time and cost matches the industry's drive to improve productivity in the face of managed care plans. Many areas of medicine generate a large number of reports, such as postoperative reports, pathology, cardiology, progress notes, emergency medicine and radiology. The challenge we undertook in designing and implementing the MedSpeak system was to create a highly usable speech system that would be accepted by a large number of doctors, eliminate the transcription step, and not take up significantly more of the doctors' time.
With the MedSpeak project, we seek to understand to what degree the limitations of using speech can be resolved by improved technology, and whether designing for errors in the interface can compensate for imperfect accuracy. To this end, we conducted four field observations lasting four weeks each, monitoring doctors using continuous speech recognition in a production environment.
Additionally, because we believed that certain doctors would initially be averse to changing their current work flow, which is to dictate in batch and then review and sign in batch at a quieter moment in the day, a secondary usage scenario for MedSpeak is to dictate, save the report as preliminary, and then move on to the next film. When the doctor goes back to sign the report, the audio portion is still available to her, so if she is unsure of what she said, she can listen to a section of the report or to the entire report. Alternatively, she may have a second party listen to the original audio and correct the report. A second-party editor is restricted from certain features, such as the ability to electronically sign the report.
Figure 1. The MedSpeak dictation window in advanced user mode, prior to dictation.
Figure 2. The dictation window in Primary user mode, after a report has been dictated. Bubble help for the Begin/Stop Dictation icon is visible in the upper left.

There were four phases to the project, each culminating in a four-week period during which new versions of the software were installed at the client sites. The first two were alpha phases, where we delivered limited function. In alpha we concentrated on establishing a framework for the UI that the radiologists were comfortable with, as well as refining the requirements for additional functions. The last two phases were a beta and an early product phase. At beta we had all the functions but lacked certain support features, such as online help. Finally, the early product version gave us the opportunity to ensure the product was robust enough for the general radiology public, and to do a field evaluation of our printed guides. These sessions with the radiologists in their environment resulted in valuable feedback for the design team. They provided an open forum to discuss and probe function and usability issues. Not only were we able to iterate on the design of the user interface and application function, but we became aware of important issues regarding the speech engine itself, which we were able to address mid-project.
A dictation application comes with a default vocabulary, language model and set of pronunciations for dictation. Each of these can be extended for individual users. A dictation vocabulary is a set of words that the recognition engine uses to translate dictated speech to text. As part of the decoding process, the engine matches the acoustics from the speech input to words in the vocabulary. Therefore, only words in the vocabulary are capable of being recognized. If a user wishes to dictate a word that is not in the dictation vocabulary, it must first be added.
A language model is a domain-specific database of sequences of words in the vocabulary, along with the probabilities of the words occurring in a specific order. The language model assists the recognizer in decoding dictated speech by biasing the output of the speech system towards high probability word sequences. If the user speaks a sequence of words that has a high probability within the language model, the recognition engine has a better chance of correctly decoding the speech than if the user speaks an unusual sequence of words. Within a language domain such as radiology or general English, perplexity is a measurement of the number of equally likely word choices given a sequence of words [10]. In a high perplexity domain such as general English, it is more difficult to predict a word given its preceding words in a sentence, due to the large number of equally likely words that may follow. The radiology domain has a lower perplexity than general English, which leads to higher accuracy.
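For concreteness, perplexity can be written in terms of the language model probabilities; the formula below is the standard definition from the speech literature (it is not given in the original text). A uniform choice among k equally likely words yields a perplexity of k.

```latex
% Perplexity of a word sequence w_1 ... w_N under a language model P.
% A uniform choice among k equally likely words gives PP = k.
\mathrm{PP} = P(w_1, \ldots, w_N)^{-1/N}
            = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 P(w_i \mid w_1, \ldots, w_{i-1})}
```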
Each word in the vocabulary has one or more pronunciations (for example, po-ta-to and po-tah-to). If a user's pronunciation differs from the pronunciation provided with the system, he can add his own pronunciation for a particular word.
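As an illustration of how these three components fit together, the Python sketch below models a per-user extension to the vocabulary and pronunciation set. The class and method names are hypothetical and do not reflect the actual recognition engine API.

```python
# Hypothetical sketch (not the actual recognition engine API): a per-user
# extension to the shipped dictation vocabulary and pronunciation table.
from dataclasses import dataclass, field

@dataclass
class UserVocabulary:
    base_words: set                                      # words shipped with the system
    added_words: set = field(default_factory=set)        # words added by this user
    pronunciations: dict = field(default_factory=dict)   # word -> list of pronunciations

    def can_recognize(self, word):
        # Only words present in the vocabulary can ever be decoded.
        return word in self.base_words or word in self.added_words

    def add_word(self, word, sounds_like=None):
        # A word must be added before it can be dictated; an optional
        # "sounds-like" spelling stands in for a recorded pronunciation.
        self.added_words.add(word)
        if sounds_like is not None:
            self.pronunciations.setdefault(word, []).append(sounds_like)

    def add_pronunciation(self, word, pronunciation):
        # Users whose pronunciation differs from the default can add their own.
        self.pronunciations.setdefault(word, []).append(pronunciation)
```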
Philips Dictation Systems [9] offers accurate continuous speech recognition technology with its core engine product SpeechMagic. The radiology prototype works by having the doctor dictate the entire report prior to submitting the audio for transcription. Once transcribed, the text is available for editing, correction and playback.
StoryWriter [2] is a speech-oriented editor that was developed for use by reporters who suffer from Repetitive Stress Injury (RSI). While this system required discrete speech, it was favorably received by the user group because it allowed them to do their job in spite of the RSI.
As part of our decision to have functions directly available, we decided against a menu bar with the standard, hierarchical Windows function groupings (File, Edit, etc.). We only have toolbars with icon buttons, where the icons correspond to functions in the application. Each function available on the toolbar can be activated by either clicking on the icon or saying the associated voice command. The use of a bar code reader to enter the requisite patient record number further streamlines the process. We went through several iterations on both the placement and content of the function toolbars. By beta, we had implemented Primary user and Advanced user modes, which correspond to the number and complexity of functions that are visible to the user. It was important to set up the different levels in such a way that users could gain experience with the Primary level, gain confidence and feel curious enough to seek out additional functionality, and carry the experience gained at the Primary level seamlessly to the next level. The field trial for beta and early product was not long enough to follow radiologists in Primary mode and determine how many moved to Advanced mode, or why they chose to do so.
To achieve high throughput, it is critical that physicians not be slowed down by incorrectly recognized voice commands. We attempt to prevent these types of errors by using the accepted practice of dynamically enabling vocabularies that are limited to the set of legal commands that the user can use, based on the state of the system. As the application moves from one state to another, we constantly match the active vocabulary to what can be said by the user. For example, when MedSpeak presents the user with a modal dialog with three buttons on it, we disable all vocabularies except one that has the words Continue, Cancel, and Help in it. These three words become the only words the engine can recognize. The perplexity is equal to three [10] and the number of possible incorrectly recognized words is greatly reduced.
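The Python sketch below illustrates this state-based approach; the state names, command lists, and class are illustrative assumptions, not MedSpeak's actual implementation.

```python
# Illustrative sketch (not MedSpeak's actual code): the active command
# vocabulary is kept in lock-step with the application state, so only the
# commands that are legal in the current state can ever be recognized.
ACTIVE_COMMANDS = {
    "dictation_window": ["begin dictation", "stop dictation",
                         "dictate new report", "sign report", "what can i say"],
    "three_button_dialog": ["continue", "cancel", "help"],   # perplexity of three
}

class CommandRecognizer:
    def __init__(self, initial_state="dictation_window"):
        self.state = initial_state

    def enter_state(self, state):
        # On every state transition the previous vocabulary is disabled and
        # only the commands legal in the new state remain active.
        self.state = state

    def decode(self, utterance):
        # A real engine matches acoustics; an exact string match stands in here.
        candidates = ACTIVE_COMMANDS[self.state]
        return utterance if utterance.lower() in candidates else None  # None = unrecognized

recognizer = CommandRecognizer()
recognizer.enter_state("three_button_dialog")
assert recognizer.decode("Continue") == "Continue"      # legal in this dialog
assert recognizer.decode("Begin Dictation") is None     # not active, cannot be recognized
```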
The greatest challenge to achieving report creation times comparable to the existing process is that the editing and correction of reports that were previously done by the transcriptionist are now done by the radiologist. Streamlining these functions was an important success factor in our design. Correcting recognition errors that occur during dictation is an involved task, and we changed our original approach as a result of field observations. There are two goals in correcting recognition errors. The first is to have a correct report. The second is to take advantage of the correction to increase the accuracy of subsequent dictations.
Recognition errors occur for many reasons. Some causes, such as ambient noise or the speaker stumbling on a word, are difficult to adjust for. Other causes, such as a word that is not in the vocabulary, a speaker who pronounces a word differently, or one who uses words in unusual sequences, can be addressed. When a speaker corrects a recognition error, we can capitalize on the action by adding new words to the vocabulary and updating the language model with the context of the corrected word. Updating the pronunciation of the corrected word presents a unique problem with continuous speech. Since the original pronunciation of the word is corrupted by coarticulation, a new, discrete pronunciation has to be provided by the user. This led to an error correction control panel in our original interface, shown in Figure 3.

Figure 3. Error correction controls.
To correct a recognition error, the user highlights a word in the text of the report. The word appears in the correction text edit field, and the alternate word list is populated with other likely word choices. The user either types the correct word or chooses it from the alternate word list. The final (and optional) step prior to applying the correction is the recording of a new pronunciation. The recording is done by pressing a button with the mouse, speaking the word, and pressing the button a second time to terminate the recording. If the acoustics of the recording cannot be matched to the spelling of the word, the user is prompted to provide a "sounds-like" spelling to help the recognition engine build an internal representation of the pronunciation.
We observed two major problems with this procedure. First, all users, including the most computer literate, found this a difficult procedure to learn and remember, and a cumbersome and lengthy process to follow when correcting a simple error. The second problem stems from the fact that this correction procedure is intended to be used only to correct errors made by the recognition engine. In practice, other types of errors appear in a report. For example, the user may dictate a sentence and then change his mind about what he wants to say. In this case, we instruct the user to re-dictate the sentence, or to use the keyboard and mouse directly in the report, rather than correcting the text with the specialized error correction control. We observed considerable confusion from users trying to decide which technique to use when correcting errors.
The final resolution to the correction issue was to completely remove the error correction control panel. Users now make all corrections in the body of the report. Corrections can be made with keyboard and mouse, or they may re-dictate portions of the report by placing the cursor at an appropriate spot in the text and dictating. Most radiologists found this approach much easier to deal with. Words are added to the vocabulary and the language model is updated when the user electronically signs the report. In practice new words are rarely used, and the entire procedure has the appearance and behavior of a typical editing session to the user. When a new word is detected, the user is prompted to provide a pronunciation for it. A separate dialog is provided to allow the user to optionally give the system a pronunciation of an existing word, if the recognition engine consistently mis-recognizes it.
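A minimal sketch of this sign-time update follows, under stated assumptions: the function names and flow are hypothetical, not the shipped code. At electronic signing, words outside the vocabulary are detected, the user is prompted for a pronunciation of each new word, and the language model is updated with the corrected word sequences.

```python
# Hypothetical sketch of the sign-time update described above.
def sign_report(report_text, vocabulary, language_model, prompt_pronunciation):
    words = report_text.lower().split()
    for word in set(words):
        if word not in vocabulary:
            vocabulary.add(word)            # new words are rare in practice
            prompt_pronunciation(word)      # e.g. record the word or type a "sounds-like"
    language_model.update(words)            # bias future decoding toward these sequences
```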
When dictating, radiologists focus intently on the film being read, and are unwilling to shift their attention between the film and MedSpeak. We support this requirement for "eyes free" dictation by giving the radiologist the ability to correct errors after dictating the entire report. We also provide audio and visual feedback, in the form of beeps and significant screen color changes to indicate that the voice commands to start and stop dictation have been recognized by MedSpeak.
In addition to getting the patient reports back out to the referring doctors much more quickly, one can speculate that using MedSpeak improves patient care in another way. If a radiologist can sign a report while the film is still in front of him, there is less chance of a report going out that does not accurately reflect the findings.
Radiologists were very enthusiastic about the sense of closure that they got using MedSpeak. Many told us they dreaded coming in on Mondays to a long list of reports, dictated on Friday, awaiting signature.
Not all users were dissatisfied with the current process. For many radiologists, the legacy system was not truly perceived as a problem. The impact of the long turnaround time on reports was felt more by the administrative side of the hospital than by the radiologists. In user questionnaires distributed at our alpha/beta sites, several radiologists mentioned the problem of delays in transcription and long turnaround time for reports. However, when these same users were asked to quantify how satisfied they were with their current system, they indicated that they were either satisfied or very satisfied.
Some users were computer averse. Some of the doctors were comfortable speaking to a computer and using a mouse. This group included many of the residents and those who had recently completed medical school, having been exposed to the use of computers in school. Others, especially those who had spent much of their career creating reports with either a dedicated typist at their side or by dictating into a tape recorder, were reluctant to start using a computer. Even some of the doctors who were comfortable with computers found that they were uncomfortable relying on speech recognition. One of the residents we worked with said that he knew how to use computers and he knew how to dictate reports, but he "felt strange speaking to the computer" and it made him "feel like he had to talk like a robot".
The new system entailed a change in roles. It is not uncommon for the introduction of new technology to change roles in a way that does not always benefit the person using it. Radiologists objected to assuming the additional editing and correction tasks that had previously been the responsibility of the transcriptionist. One doctor told us: "when I put my hands on the keyboard I am doing an administrative task and no longer functioning as a physician."
The following findings are associated with using speech to create reports.
The variability of the accuracy was disquieting to some. Those who are familiar with speech recognition know that the acoustics of a given word are rarely identical, and therefore the word can be decoded correctly on some occasions and incorrectly on others. This is due to coarticulation and normal variance in the user's speech. Background noise used to play a significant role in accuracy but no longer does, thanks to the great progress in noise-canceling microphones. However, highly accurate microphones are also sensitive: proximity to the mouth, angle, and the customized volume setting of the microphone can all cause a variance in accuracy. Doctors were oblivious to the microphone volume indicator once they turned their attention to the film; an audio signal may have been useful here. We also saw a temporary decrease in accuracy when the microphone was left on while not in use, since the system adapts automatically and continuously to speaker volume. This lack of predictability in the accuracy made it difficult for users to build and maintain a useful conceptual model of the application [8].
The difficulty of remembering spoken commands. Most new users received an average training time of one hour. With this amount of training, users would function well with some coaching, but if left to their own devices would forget what the sequence of events was and what the required spoken command was. We noticed this in our first alpha phase, and built in the possibility of using alternate commands. For example, the system accepts either Begin Dictation or Start Dictation, and New Report is an alternate for Dictate New Report. Still, we saw users saying New Dictation when they meant New Report. We added a help feature that displays the entire list of currently active commands when the user says "What Can I Say". Additionally, hints can be obtained by positioning the cursor over an icon button. We found that the doctors made little use of What Can I Say, and used the hints only as a last resort. Most radiologists who had not learned a path through the system relied on a pocket reference card for the two or three critical commands they needed.
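The sketch below illustrates how alternate commands and the "What Can I Say" list might fit together; the synonym table and function names are illustrative assumptions, not the actual MedSpeak command set.

```python
# Illustrative sketch only: several spoken forms map onto one canonical
# command, and "What Can I Say" lists everything currently accepted.
COMMAND_SYNONYMS = {
    "begin dictation": "begin dictation",
    "start dictation": "begin dictation",
    "dictate new report": "dictate new report",
    "new report": "dictate new report",
}

def resolve_command(utterance):
    # Returns the canonical command, or None if the phrase is not accepted.
    return COMMAND_SYNONYMS.get(utterance.lower())

def what_can_i_say():
    # Displayed when the user says "What Can I Say".
    return sorted(COMMAND_SYNONYMS)

assert resolve_command("Start Dictation") == "begin dictation"
assert resolve_command("New Dictation") is None   # the mistaken phrase we observed
```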
MedSpeak took more of the radiologists' time. Table 1 shows the average time (in seconds) it took for a radiologist to create a report using both their current system and MedSpeak (MS).
| | Current System | MS with < 5 hours usage | MS with > 5 hours usage |
|---|---|---|---|
| Control | 72 | 146 | 114 |
| Production | 81 | N/A | 109 |
The times include dictation, editing and signature of the reports on each system. We timed 7 users to obtain the control numbers, and 4 of those 7 users for the production timings. We divided the users between those who had more than 5 hours of experience with MedSpeak and those who had less than 5 hours, to reflect time that was due solely to the newness of the system. In all cases, the users had less than 15 hours of experience with MedSpeak. This compares with their current reporting system, where their levels of experience varied between 7 and 15 years.
The control times are the results of a controlled study in which each user was timed reading a set of 10 films on their current system and 10 using MedSpeak. Both sets consisted of highly comparable films with regard to film type (X-rays, MRIs, etc.) and difficulty (degrees of normality or abnormality). The order of the sets was alternated, so that one user would use Set 1 with MedSpeak and Set 2 with their current system, and the next user would do the opposite, to account for order effects. While it was clear from the doctors' comments that they were moving through the films much more quickly than they normally would because they knew the films did not correspond to real patients, we believe that this factor had the same effect on MedSpeak as on the current system.
The production times are the average of timings taken while the doctors were interpreting real patients' films. Seventy films of varying type and difficulty were timed. The control times show that users with less than 5 hours of usage on MedSpeak took twice as long to create a report using MedSpeak. Users with at least 5 hours of experience took 1.6 times as long, and in production mode this ratio dropped to 1.35. We believe this reflects the fact that when it came time to use MedSpeak in the high-pressure mode of what is referred to as online reading, radiologists may have compensated for what they knew to be the additional time requirements of using a speech system by shortening their reports, or by relaxing their standards for polish of the text. We observed physicians signing MedSpeak reports with minor recognition errors that did not change the meaning of the text.
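These ratios follow directly from the times in Table 1:

```latex
% Ratios of MedSpeak to current-system report creation times, from Table 1.
\frac{146}{72} \approx 2.0 \ (\text{control, less than 5 h}), \qquad
\frac{114}{72} \approx 1.6 \ (\text{control, more than 5 h}), \qquad
\frac{109}{81} \approx 1.35 \ (\text{production, more than 5 h})
```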
The reports took longer to create using MedSpeak because the radiologists were spending more time editing and correcting their reports. This is tied to two things. First, more corrections were required when using MedSpeak. When using a Dictaphone to record a report that will be transcribed by a person who has in many cases learned the particular style of a given radiologist, the doctor has the freedom to change his mind two words into a sentence, and the listener will understand that the following words are to supersede what has just preceded. With intonation and pace, he can correct himself if he says "left..RIGHT" and the transcriber will know he meant right. Very often, a transcriptionist will intercede with human understanding and make corrections as required. For example, if the doctor has referenced a 3 millimeter mass in several places in the body of the report, and then in conclusion states that the patient has a 3 centimeter mass, the transcriptionist can take appropriate corrective action. Lastly, humans are superior to speech recognition systems at filtering out the troublesome aspects of spontaneous speech such as "Umms", "Ahhhs", and other disfluencies. The second reason reports took longer was the difference in the nature of the errors made. Errors made by the transcriptionist were usually spelling mistakes that did not change the meaning of the text. If the transcriptionist cannot hear what the doctor said, he places a blank line or a series of XXXs in the report. These can be scanned for quickly by the doctor. By contrast, speech recognition errors can change the meaning of the report. For example, in the case of a deletion error, if a doctor says "there is no sign of cancer", it can be decoded as "there is sign of cancer". The radiologist therefore has to carefully read each word, rather than quickly scanning the report.
| | User 1 | User 2 | User 3 | Avg. |
|---|---|---|---|---|
| Alpha | 7.56% | 5.51% | 9.79% | 7.62% |
| Product | 3.02% | 2.91% | N/A | 2.97% |
Table 3 shows average navigation error levels for 3 users dictating between 70 and 86 commands each. We differentiated between errors in which the wrong command was recognized (misrecognized) and those in which the engine was unable to find a command match at all, so that the user was asked to try again (unrecognized). Unrecognized commands sometimes caused users to sit and wait for the command to be executed. This may have been due to the subtlety of the feedback given to users when a command was unrecognized.
| | User 1 | User 2 | User 3 | Avg. |
|---|---|---|---|---|
| Misrecognized | 1.3% | 4.3% | 4.7% | 3.4% |
| Unrecognized | 0% | 2.9% | 1.2% | 1.4% |
| Total | 1.3% | 7.2% | 5.9% | 4.8% |
We observed confusion and hesitation when commands were not recognized. MedSpeak could be improved by providing rapid and clear feedback to the user when a command is not recognized. The use of color in the command history window and/or sound (in the form of a beep) could be explored in future usability studies.
Additional work is required to move MedSpeak into the domain of a "hands-free" application. Currently, not all functions are controllable by voice. List manipulation, form filling, setting of radio buttons and form navigation need voice control added. The number recognition and word spelling functions need improvement. We did provide voice-controlled features to allow cursor placement and selection of text within the dictated text. We have fast forward and reverse controls that select text on a word or line basis. The selection can be cumulative, or the highlighted word can follow the cursor. This feature is rarely used, and further work is required to make "hands-free" voice-controlled error correction and editing truly useful.
A common source of recognition errors is the improper use of the microphone. Problems include microphone gain settings, and incorrect microphone positioning by the user. A microphone wizard that detects these problems and interactively helps the user through the process of correcting them would be useful.
A true hands-free interface to our application would be a significant improvement. Fully speech enabling the application with efficient means of editing and correcting text by voice would allow the radiologist to better focus on the task of interpreting films, instead of having to be aware of how to interact with the MedSpeak application.
While MedSpeak was very well received by many radiologists, it was not met with universal acceptance. However, many people who had refused to use discrete speech applications did find MedSpeak to be highly useful. One radiologist said, "We have been following speech recognition technology for 10 years, mostly with great disappointment, until our experience with the MedSpeak/Radiology system". The primary objection of the radiologists who did not embrace the system is that it is not 100% accurate. As our recognition accuracy continues to improve beyond 97%, we expect the vast majority of objections to speech recognition technology to fall by the wayside.
2. Danis, C. et al. StoryWriter: A Speech Oriented Editor. CHI '94 Conference Companion, pp. 277-278.
3. Danis, C. and Karat, J. Technology-Driven Design of Speech Recognition Systems. In Proceedings of DIS 1995.
4. Grasso, M. Automated Speech Recognition in Medical Applications. MD Computing, v12, n1, Jan. 1995, pp. 16-23.
5. Hemphill, C.T. Surfing the Web by Voice. In Proceedings of Multimedia '95, Addison-Wesley, Reading, Mass., 1995, pp. 215-222.
6. Lerner, E.J. Talking to Your Computer. IBM Research Magazine, Number 3, 1994, pp. 8-14.
7. Martin, Crabb, Adams, Baatz, and Yankelovich. SpeechActs: A Spoken-Language Framework. Computer, IEEE Computer Society, July 1996, pp. 33-40.
8. Yankelovich, N. Designing SpeechActs: Issues in Speech User Interfaces. CHI '95, pp. 369-376.
9. Philips Dictation Systems. Available at http://www.speech.be.philips.com/products.htm
10. Schmandt, C. Voice Communication with Computers. Van Nostrand Reinhold, New York, 1994.
MedSpeak and MedSpeak/Radiology are registered trademarks belonging to the IBM Corporation.