Voice Recognition and Medical Transcription

Bryan Bergeron, MD

Technology Update

Computer-based voice recognition is often thought of as the stuff of science fiction, a future in which computers and robots converse with their human masters. In reality, voice recognition predates the development of the digital computer.[1] The most significant developments in the past 50 years have been in vocabulary size, the ability to handle natural language as opposed to single-word recognition, recognition accuracy, and integration with telephony and other technologies.

For medical transcription, the most important advance has been the availability of affordable, large-vocabulary, natural language recognition systems. The most common incarnation of this technology is a local, PC-based voice-recognition engine that generates reports, often using macros and templates to make the process more efficient and reduce recognition errors. Figure 1 shows a common configuration for an interactive voice-recognition medical transcription system.

Figure 1. PC-based, interactive dictation. Note the local voice profile.

As shown in Figure 1, the clinician interacts with the voice-recognition engine on a local PC until the report is to his or her liking. The report is generated, the clinician signs off on the report, and the process is complete. The advantage of this scenario is immediacy; the clinician works with the system in real time to correct recognition errors (typically in the 2% to 10% range), and the report is ready for signing as soon as it's printed. Faster turn-around means faster reimbursement, a good thing in large medical groups.
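To put the 2% to 10% error range in concrete terms, here is a rough sketch of how many corrections a clinician might face per report. It treats the range as a per-word error rate, and the 300-word report length is a hypothetical figure chosen for illustration; the article states neither.

```python
def expected_corrections(word_count: int, error_rate: float) -> float:
    """Return the expected number of misrecognized words in a report.

    Assumes errors are counted per word, which is one reading of the
    2% to 10% range quoted in the text.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return word_count * error_rate


# Hypothetical report length; the article does not give one.
REPORT_WORDS = 300

low = expected_corrections(REPORT_WORDS, 0.02)   # best case from the text
high = expected_corrections(REPORT_WORDS, 0.10)  # worst case from the text

print(f"{REPORT_WORDS}-word report: roughly {low:.0f} to {high:.0f} corrections")
```

Under these assumptions, a dozen such reports could mean anywhere from several dozen to a few hundred corrections in a single sitting.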

There are several major limitations of this approach. The greatest is that it demands much more time from clinicians than the alternative it typically replaces -- dictating into a telephone or a recorder. Instead of spending perhaps a minute dictating notes into a recorder or over a cell phone on the way home from the office, the clinician is stuck in front of a PC, correcting misrecognized words, a task once relegated to transcriptionists. Many clinicians argue that they would be better off financially by squeezing another patient into their schedule rather than spending the time it would take to edit a dozen reports.

There is also the issue of a local voice profile, which is a large file that defines how the clinician's particular voice qualities map onto (usually) the English language. It's the voice profile that makes it possible for a general-purpose voice-recognition engine to work with a clinician from Texas as well as it does with a clinician in Massachusetts. The voice profile is modified with each use of the program, allowing it to "learn" pronunciation subtleties and increase the recognition accuracy. But because the profile sits on one PC, the clinician is limited to one machine -- often a major limitation unless the machine is a laptop.

A partial solution to the portability and mobility issue is to use blind dictation and a digital recorder, as shown in Figure 2. In this scenario, the clinician has the freedom to dictate a report from anywhere and at any time. After the reports are dictated, the data from the recorder are downloaded to a PC running the voice-recognition software, a report is generated, and then it is printed for sign-off. There is still a limitation of one PC because of the voice profile.

Figure 2. PC-based, blind dictation with a digital recorder.

This freedom of mobility and time comes at a price. One cost is the lack of immediacy. Reports may not be available for a day or two after the clinic visit, depending on when the clinician downloads the data. In addition, because the download is often performed by an assistant, there is the cost of an office worker's time.

A larger issue is accuracy. Because dictation isn't interactive, the clinician can't correct the report in real time. The report is likely to have many more errors, meaning that the clinician will have more corrections to make during sign-off. One work-around is to have an assistant monitor the recognition process and make changes on the PC before report generation. In fact, a common scenario is to have trained transcriptionists work with the raw documents, making corrections by listening to the dictated audio while reading the electronic report. This approach can save money because transcriptionists spend their time editing, not typing reports from scratch. But the scenario in Figure 2 still suffers from the limitation of a single voice profile and the need for someone to download the digital dictation file.

A natural extension of the "invisible" voice recognition system shown in Figure 2 has been the relatively recent introduction of server-side recognition engines, similar to the automated voice response systems used by credit card and flight reservation companies. As shown in Figure 3, this model approximates the traditional dictation approach of calling in a dictation over a telephone and then reviewing and signing off on the report a day or more later. The same limitations and features of the model shown in Figure 2 apply -- including the need for someone to verify and edit the document before it is printed -- with the major exception that the clinician's voice profile is no longer limited to a single PC used by one transcriptionist. In addition, the use of a telephone frees up the clinician and obviates the arduous process of downloading digital dictation files, returning memory sticks or chips, dealing with download cables, and the like.

Figure 3. Server-side, blind dictation. Note the server-based voice profile. Accuracy can be improved, and the demand on the physician's time lessened, by adding a transcriptionist before report generation.

The downside of server-side systems is slightly lower recognition accuracy, in part because of the limitations of the telephone system. Most telephones are limited to a bandwidth of about 3000 Hertz, whereas a stand-alone voice-recognition system with a good microphone has an audio bandwidth approaching 15,000 Hertz. That is, there is simply more voice data to work with when a local microphone is used. Still, server-side dictation solves many of the issues that plague the stand-alone approach -- and it's invisible to the clinician.
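The bandwidth gap can be made concrete with the Nyquist criterion, which says a signal must be sampled at no less than twice its highest frequency to be captured without loss. The sketch below applies that rule to the two figures quoted above. It assumes each quoted bandwidth is also the highest frequency carried, which is a simplification, and the function name is illustrative only.

```python
def nyquist_rate_hz(bandwidth_hz: float) -> float:
    """Minimum sampling rate (samples per second) for a given bandwidth.

    Assumes the bandwidth figure is also the highest frequency present.
    """
    if bandwidth_hz <= 0:
        raise ValueError("bandwidth_hz must be positive")
    return 2.0 * bandwidth_hz


# Approximate figures quoted in the text.
TELEPHONE_BANDWIDTH_HZ = 3_000
MICROPHONE_BANDWIDTH_HZ = 15_000

phone_rate = nyquist_rate_hz(TELEPHONE_BANDWIDTH_HZ)
mic_rate = nyquist_rate_hz(MICROPHONE_BANDWIDTH_HZ)
ratio = mic_rate / phone_rate

print(f"Telephone: at least {phone_rate:,.0f} samples per second")
print(f"Local microphone: at least {mic_rate:,.0f} samples per second")
print(f"The microphone signal carries about {ratio:.0f} times the spectral range")
```

By this estimate, the local microphone gives the recognition engine about five times the frequency range of a telephone line. That is one way to read the remark that there is simply more voice data to work with.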

