May 21, 20243 min readUpdated May 6, 2026

ChatGPT Goes Multimodal with Voice and Images

ChatGPT, the popular chatbot from OpenAI, is now capable of understanding and responding to voice and image inputs. This new multimodal capability opens up new dimensions of conversation, allowing users to interact with

In an impressive stride towards pushing the boundaries of conversational AI, OpenAI has unveiled groundbreaking multimodal capabilities for ChatGPT. This game-changing breakthrough empowers the chatbot to not only comprehend images but also comprehend speech and engage in spoken interactions with users. By enabling ChatGPT to see, hear, and speak, OpenAI has ushered in a new era of chatbot interaction that promises to revolutionize the way we communicate and interact with AI assistants.

Image source: OpenAI

Speak with ChatGPT: A Revolution in Voice-based Conversations

The integration of Whisper, OpenAI's advanced text-to-speech technology, has unlocked the potential for users to engage in dynamic voice-based conversations with ChatGPT. Leveraging the power of Whisper, users can now communicate with the chatbot using their voice, leading to a more intuitive and natural dialogue.

Through a collaboration with professional voice actors, OpenAI has meticulously crafted five distinct voice options for chat interactions. This enhancement brings an unprecedented level of personalization and immersion, allowing users to truly converse with ChatGPT as if they were interacting with another human.

‍

Speak with ChatGPT:

You can now use voice to engage in a back-and-forth conversation with your assistant.

The hyper-realistic text-to-speech model allows you to choose from five different voices.

On mobile, opt-in to voice in Settings → New Features on the mobile app. pic.twitter.com/8VwiLxghfP

— Rowan Cheung (@rowancheung) September 25, 2023

‍Chat with Images: Expanding the Language Reasoning Horizons

With the introduction of multimodal capabilities, ChatGPT now possesses the remarkable ability to comprehend and reason about images, photographs, screenshots, and even text documents. This breakthrough enhancement enables users to seamlessly incorporate visual content into their conversations with ChatGPT, opening up a myriad of possibilities for collaboration, assistance, and creativity.

Users can discuss and reference multiple images within the same conversation, expanding the scope of topics that can be explored. Additionally, OpenAI has introduced a new drawing tool that allows users to guide the chatbot's understanding through visual cues.

Reimagining Conversational AI: Additional Notes and Future Outlook

The impact of ChatGPT's multimodal transformation extends beyond its standalone features. OpenAI's text-to-speech model, powered by Whisper, has already found practical utility in Spotify's Voice Translation feature pilot, effectively translating podcast audio for a wider audience. Looking ahead, OpenAI has announced a phased rollout of voice and image capabilities over the next two weeks, initially available to Plus and Enterprise users.

This inclusiveness is exemplified by the future plans to bring voice functionality to both iOS and Android platforms, ensuring a seamless experience across different devices. Similarly, the availability of image capabilities on all platforms amplifies the accessibility and applicability of ChatGPT's multimodal prowess.

Why it Matters: A Leap Forward in LLMs and Interactive AI Assistants

OpenAI's achievement of adding multimodal capabilities to ChatGPT represents a significant advancement in the field of Large Language Models (LLMs). By surpassing Google's anticipated launch of Gemini, OpenAI underscores its leadership in innovating conversational AI technologies. Furthermore, the integration of voice and image capabilities places ChatGPT on the path to becoming the virtual assistant we have all envisioned.

The convergence of natural language processing, computer vision, and speech recognition heralds a future where AI assistants can truly understand and engage with users in a seamless and human-like manner. OpenAI's multimodal breakthrough brings us closer to realizing the potential of interactive AI assistants, fulfilling the long-standing desire for a sophisticated virtual companion akin to the widely known Siri.

The awe-inspiring multimodal capabilities introduced to ChatGPT by OpenAI have set a new benchmark in conversational AI. With the ability to see, hear, and speak, ChatGPT will undoubtedly redefine the way we interact with AI assistants. The inclusion of voice-based conversations and image comprehension open up exciting avenues for seamless communication and collaboration between humans and AI. As the gradual rollout for Plus and Enterprise users commences, the future looks promising as OpenAI continues to push the envelope of what conversational AI can achieve. With ChatGPT's multimodal revolution, the dream of an intelligent virtual assistant that understands and assists us through multiple modalities is finally becoming a reality.

Speak with ChatGPT: A Revolution in Voice-based Conversations

‍Chat with Images: Expanding the Language Reasoning Horizons

Reimagining Conversational AI: Additional Notes and Future Outlook

Why it Matters: A Leap Forward in LLMs and Interactive AI Assistants

Keep reading

Build vs Buy for Healthcare AI Voice Agents: A Decision Framework for Scheduling, Intake, and Follow-Up

Build vs Buy for AI Workflow Automation: A Decision Framework for Operations Teams

How to Deploy AI Voice Agents for Healthcare Scheduling Without Breaking HIPAA