* This blog post is a summary of this video.

Building a Voice-to-Email App with Whisper and GPT-3

Author: Nextgrid
Time: 2024-02-02 21:10:00

Introducing the Whisper Speech Recognition Model

Whisper is an automatic speech recognition system from OpenAI, trained on about 680,000 hours of diverse audio collected from the web. It transcribes speech in multiple languages, and its code and model weights are open source, so it can serve as a foundation for building useful voice applications.

The model comes in several sizes that trade accuracy for speed. For this project, we'll use the base model, one of the smaller and faster options. Whisper also supports many languages besides English.

Whisper Model Details and Resources

You can find specifics on the architecture, training data, and benchmarks for Whisper at the official OpenAI page. The model paper is also available, providing full details on the techniques used to develop Whisper. For code samples and tutorials using Whisper, check out the boilerplates available on the Anthropic Labs platform and on GitHub.

Whisper Model Sizes and Languages

There are 9 pre-trained Whisper models available, some focused specifically on English transcription and the others supporting 50+ languages. The models range from about 39 million parameters (tiny) up to about 1.55 billion parameters (large-v2). We'll use the base model, which has about 74 million parameters, in this tutorial.
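
To make the size trade-off concrete, here is a small sketch that lists the multilingual model sizes with the approximate parameter counts given in the Whisper README, and picks the largest one that fits a parameter budget. The helper `pick_model` and the idea of a "budget" are illustrative assumptions, not part of the Whisper library.

```python
# Approximate parameter counts in millions, as listed in the Whisper README.
WHISPER_SIZES = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}

# Per the README, the four smaller sizes also ship English-only ".en" variants.
ENGLISH_ONLY = ["tiny.en", "base.en", "small.en", "medium.en"]


def pick_model(max_params_millions):
    """Return the largest model that fits the budget (hypothetical helper)."""
    fitting = {name: p for name, p in WHISPER_SIZES.items()
               if p <= max_params_millions}
    if not fitting:
        raise ValueError("no Whisper model fits this budget")
    return max(fitting, key=fitting.get)
```

For example, `pick_model(100)` selects the base model, since the next size up (small) has roughly 244 million parameters.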

Building a Voice-to-Email Python App

We'll build a simple Python application that records audio, transcribes it with Whisper, and then uses the GPT-3 API to generate a formal email based on the transcript.

Key steps include:

  • Recording audio from the microphone
  • Transcribing the recording with Whisper
  • Feeding the transcript text into a GPT-3 prompt
  • Printing and auto-copying the generated email
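
The steps above can be wired together roughly as follows. This is only a structural sketch: the function name `voice_to_email` and its four callable parameters are hypothetical, chosen so that each step can be swapped out or tested separately. It is not the code from the video.

```python
from typing import Callable


def voice_to_email(
    record: Callable[[], str],
    transcribe: Callable[[str], str],
    generate: Callable[[str], str],
    copy: Callable[[str], None],
) -> str:
    """Run the four steps in order and return the generated email."""
    audio_path = record()                # 1. record audio, get a file path
    transcript = transcribe(audio_path)  # 2. turn speech into text
    email = generate(transcript)         # 3. turn the transcript into an email
    print(email)                         # 4. show the result...
    copy(email)                          #    ...and copy it to the clipboard
    return email
```

Passing the steps in as functions keeps the control flow independent of any particular library, so the recording, transcription, and generation code can each be developed on its own.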

Recording and Transcribing Audio with Whisper

We use the sounddevice, scipy, and whisper libraries to record 8 seconds of audio from the microphone and save it to an MP3 file. Then we load the Whisper base model and detect the spoken language. After some preprocessing, we transcribe the audio file and print the resulting text.
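
A minimal sketch of this step is shown below. It follows the lower-level usage pattern from the Whisper README (load, pad or trim, log-Mel spectrogram, language detection, decode). The file name, the 44.1 kHz sample rate, and the single channel are assumptions; the sketch also uses a `.wav` extension, because `scipy.io.wavfile.write` produces WAV data. The third-party imports sit inside the functions so the file can be loaded without a microphone or a model download, and `whisper.load_audio` needs ffmpeg installed.

```python
def frame_count(seconds, samplerate):
    """Number of audio frames needed for a recording of this length."""
    return int(seconds * samplerate)


def record_audio(path="output.wav", seconds=8, samplerate=44100):
    """Record from the default microphone and save the result as a WAV file."""
    import sounddevice as sd
    from scipy.io.wavfile import write

    audio = sd.rec(frame_count(seconds, samplerate),
                   samplerate=samplerate, channels=1)
    sd.wait()  # block until the recording has finished
    write(path, samplerate, audio)
    return path


def transcribe(path, model_name="base"):
    """Return (detected_language, text) for the audio file at `path`."""
    import whisper

    model = whisper.load_model(model_name)

    # Preprocess: load, fit to Whisper's 30-second window, compute log-Mel.
    audio = whisper.load_audio(path)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # Detect the spoken language from the spectrogram.
    _, probs = model.detect_language(mel)
    language = max(probs, key=probs.get)

    # Decode; fp16=False avoids half precision, which is not supported on CPU.
    result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
    return language, result.text
```

Calling `transcribe(record_audio())` would record 8 seconds and return the detected language code together with the transcript.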

Generating Emails from Transcripts with GPT-3

We feed the transcribed text into a GPT-3 prompt formatted to generate a formal email response. The model returns the email text, which we print and automatically copy to the clipboard via pyperclip.
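
One way this could look is sketched below. The prompt wording, the helper names, the model name `gpt-3.5-turbo-instruct`, and the `max_tokens` and `temperature` values are all assumptions, since the post does not give the original prompt or settings. The sketch uses the completions endpoint of the `openai` Python package (version 1.0 or later), which reads the API key from the `OPENAI_API_KEY` environment variable.

```python
def build_email_prompt(transcript):
    """Wrap the transcript in an instruction asking for a formal email."""
    return (
        "Write a formal email based on the following spoken note.\n\n"
        f"Note: {transcript.strip()}\n\n"
        "Email:"
    )


def generate_email(transcript, model="gpt-3.5-turbo-instruct"):
    """Ask the model for an email, print it, and copy it to the clipboard."""
    from openai import OpenAI  # imported lazily; needs openai>=1.0
    import pyperclip

    client = OpenAI()
    response = client.completions.create(
        model=model,  # assumed model name, swap in whichever you have access to
        prompt=build_email_prompt(transcript),
        max_tokens=256,
        temperature=0.7,
    )
    email = response.choices[0].text.strip()
    print(email)
    pyperclip.copy(email)
    return email
```

Keeping the prompt construction in its own function makes it easy to iterate on the wording without touching the API call.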

Testing the Completed Voice-to-Email App

With the core functionality built, we test our Python script end-to-end. The app records audio, transcribes it reasonably well with the Whisper base model, and uses GPT-3 to generate a formal email from the transcript.

The auto-copy to clipboard makes it easy to paste the finished email into an email client.

Sample Outputs and Results

Some examples of audio transcriptions and resulting emails:

  • "I'm really sick and I won't be coming to work today" -> Apology email to boss
  • "I won a billion dollars yesterday" -> Email saying the sender won't come to work tomorrow

Optimizations and Next Steps

While the app is usable, Whisper does make some transcription mistakes, and these propagate into the final email. Next steps:

  • Try larger Whisper models for better accuracy
  • Improve the GPT-3 prompt for more realistic emails
  • Add user validation of transcription before email generation
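
As an illustration of the last item, a validation step could be as simple as the sketch below. The function `confirm_transcript` and its prompt text are hypothetical; the `ask` parameter defaults to the built-in `input` and exists only so the function can be exercised without a terminal.

```python
def confirm_transcript(transcript, ask=input):
    """Show the transcript and let the user accept, correct, or cancel it.

    Returns the text to use for email generation, or None if cancelled.
    """
    print("Transcript:", transcript)
    answer = ask(
        "Press Enter to accept, type a correction, or 'q' to cancel: "
    ).strip()
    if answer.lower() == "q":
        return None
    # An empty answer means the transcript was accepted unchanged.
    return answer or transcript
```

The email would then be generated only when the function returns text, so a bad transcription never reaches the GPT-3 prompt.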


Q: What is the Whisper speech recognition model?
A: Whisper is an open-source automatic speech recognition system trained on about 680,000 hours of diverse data from the web. It enables high-quality transcription in multiple languages.

Q: How do you build a voice-to-email app with Whisper and GPT-3?
A: You can record audio, use Whisper to transcribe it to text, then feed the text to GPT-3 to generate a formal email body which can be copied and pasted.