* This blog post is a summary of this video.

Testing OpenAI's New Features: GPT Vision, DALL-E 3, Text to Speech and More

Author: All About AITime: 2024-01-28 22:15:06

Table of Contents

GPT-4 Turbo with Visual Interpretation

One of the most exciting announcements from OpenAI's recent demo day event is the introduction of a new GPT-4 Turbo model with visual interpretation capabilities. This allows the language model to analyze and describe images, opening up many new potential use cases.

As shown in the demo, using the new GPT-4 Vision API is incredibly simple. You just need to pass in an image URL along with your prompt querying what's in the image. In under a second, you get back a detailed textual description of the contents and context of the photo.

Interpreting a Historical Image

For example, I passed in a Wall Street Journal front page image from a pivotal day during the 2008 financial crisis. GPT-4 Vision immediately recognized it showed a significant economic event, describing the graphs showing declines for major companies like Lehman Brothers and AIG. It also knew this specific newspaper issue captured the early stages of the crisis. The visual interpretation abilities open up many possibilities for automatically analyzing images and video to understand their significance.

DALL-E 3 Image Generation API

In addition to analysis, OpenAI also unveiled APIs for advanced image generation through DALL-E 3. Similar to the text APIs, getting up and running is straightforward. You pick a model, write a text prompt describing the image you want generated, set some basic parameters like image size and number of images, and call the API. In seconds, intricate images are created from the text descriptions.

For example, I prompted DALL-E to generate a 'faded image of a 90s hacker setup with CRT screen with code, mysterious vibe'. It produced a retro-futuristic image that looked straight out of a hacker movie from the early internet era. I'm excited to couple these generation capabilities with the GPT-4 visual interpretation model in creative workflows.

Generating a Custom Hacker Image

As a test, I asked DALL-E 3 to generate a 'faded image of a '90s hacker setup with CRT screen with code, mysterious vibe'. It produced an intricate, retro-futuristic image that looked straight out of a hacker movie from the early internet era. The ability to turn any text prompt into a custom generated image unlocks a lot of creative potential.

Text to Speech API

On top of text and images, OpenAI introduced a new text-to-speech API for converting language into lifelike audio narrations. The setup is almost identical to DALL-E - pick a model, input your text, choose from a variety of voices and speeds, and generate an audio file.

I created a short sample that said 'Hello, I'm really excited about the new GPT Vision API. I also wish you all a great day.' The voice sounded quite clear and natural. While it didn't sound fully human, it was still solid quality compared to traditional text-to-speech. I'm interested in exploring using it to make conversations with AI assistants feel more natural.

Converting Text to Audio File

As a test, I inputted the text 'Hello, I'm really excited about the new GPT Vision API. I also wish you all a great day.' The API generated an audio file that played back the sentence in a clear, natural-sounding voice. While not completely human-like, it was very solid quality compared to traditional text-to-speech systems. This will be useful for creating more conversational voice interfaces.

128k Context Window for GPT-4 Turbo

One of the most significant OpenAI updates for text generation is boosting GPT-4 Turbo to support much larger context windows up to 128,000 tokens (equivalent to about 300 pages of text). This allows prompting complex, multi-step tasks that depend on lengthy background knowledge.

As an example, I fed in the 30+ page transcripts of Tesla and Meta's latest earnings calls as context, then asked GPT-4 Turbo to summarize the key points from both calls. In around 50 seconds, it returned a concise yet detailed summary touching on financial metrics, new product launches like Meta's Quest 3 headset, and challenges like economic conditions. Condensing thousands of words into just the crucial details demonstrates the powerful capabilities unlocked by expanded context size.

Summarizing Earnings Call Transcripts

To test the larger 128,000 token context capacity, I provided full 30+ page transcripts of the latest earnings calls for Tesla and Meta. I then used this as context and asked GPT-4 Turbo to summarize just the key points from the lengthy transcripts. In around 50 seconds, it returned a concise yet detailed summary touching on financial metrics, new products like Meta's Quest 3 headset, and challenges like economic conditions for each company. The ability to condense thousands of words down to just the most crucial details demonstrates the enhanced understanding capabilities enabled by substantially larger context sizes.

Assistant Capabilities

Beyond standalone models, OpenAI introduced revamped Assistant capabilities that combine features like text generation, image generation, audio narration, and information retrieval together into a single workflow. For example, I uploaded Tesla's full earnings call transcript as a document, then gave my finance assistant instructions to read through it and compile an earnings report.

After processing the document, I was able to have a natural conversation asking followup questions about key points and sentiment from the call. The assistant gave thoughtful answers pulling relevant snippets of information from the document on demand. Packaging multiple AI functions into an easy to use virtual assistant unlocks new possibilities for automated document analysis.

Analyzing Sentiment of Earnings Call

As a demonstration, I uploaded Tesla's full Q3 earnings call transcript as a document for my customized finance assistant bot. After ingesting the transcript, I was able to ask natural followup questions about the overall sentiment on the call and get back clear, useful answers citing relevant passages from the document as needed. This streamlined workflow for document ingestion, analysis, and query exemplifies the future possibilities from AI assistants.

Conclusion and Next Steps

OpenAI's latest offerings provide exciting glimpses into the future of AI. The new models make interacting with AI radically easier and more powerful - just provide a text prompt, image, or document, and advanced AI capabilities handle the rest to generate insightful results.

While the offerings are still new with room for improvement, they provide a strong foundation to build upon. Some particular areas I'm interested in exploring further include using the visual AI models for automated video analysis, leveraging the Assistant for document search and reporting, and pushing creative boundaries with the image generation capabilities. As the models continue rapidly advancing, this feels like just the start of tapping into AI's vast potential.


Q: How to use the new GPT Vision API?
A: The GPT Vision API is easy to use. Simply pass an image URL, API key, and question about the image. The API will return a detailed interpretation.

Q: What was shown with the DALL-E 3 example?
A: The DALL-E 3 example showed generating a custom hacker image based on a text prompt. The image vividly matched the prompt specifications.

Q: What audio formats does the Text to Speech API support?
A: The Text to Speech API allows saving generated audio as MP3 files. It also supports streaming real-time audio.

Q: How was the 128k context utilized?
A: The 128k context window was leveraged to summarize lengthy earnings call transcripts into a concise overview.

Q: What was the purpose of the Assistant demo?
A: The Assistant demo analyzed sentiment and created a detailed earnings report by interpreting an uploaded earnings call transcript.

Q: What new OpenAI features might be covered next?
A: Some next features to cover are model fine-tuning, custom models, GPT-4 browser exploration, and testing the new Whisper speech recognition model.

Q: What was the overall sentiment of the earnings call?
A: The analysis found the earnings call sentiment was cautiously optimistic despite acknowledging future economic and production challenges.

Q: How can the Assistant be customized?
A: The Assistant can be customized by naming it, providing initial instructions, selecting a base model, adding code and retrieval, plus integrating custom functions.

Q: What made analyzing the earnings call straightforward?
A: Uploading the full earnings call transcript PDF enabled easy automated analysis versus manually processing lengthy audio content.

Q: What was the benefit of the 128k context window?
A: The expanded context window enabled directly processing lengthy documents without pre-processing or summarization.