Poly Goes Multi-Modal: Voice, Vision, and Canvas
Real-time voice conversations, image understanding, and an infinite canvas workspace. Poly now speaks, sees, and creates with you.
Poly is no longer just text. Starting today, you can speak to Poly, show it images, and arrange your ideas on an infinite canvas — all within the same workspace. We've added three major capabilities that fundamentally expand what you can do with AI.
Voice
Voice brings real-time speech-to-text and text-to-speech to every conversation. Choose from dozens of AI voices, set your preferred language, and have natural back-and-forth conversations with AI agents.
Voice calls support:
- Live transcription — see what's being said in real time
- Interruption handling — speak over the AI naturally and it adapts
- Context-aware responses — the AI remembers the full conversation history
- Voice selection — choose from 40+ voices across 20+ languages
Perfect for brainstorming sessions, language practice, or hands-free productivity.
Vision
Vision lets you share images directly in chat. Poly understands what's in the image — text, objects, charts, code — and can analyze, describe, or build on it.
- Paste a screenshot of a UI bug → ask Poly to write the fix
- Show it a whiteboard sketch → get a structured document
- Upload a chart → ask for statistical insights
- Drop in a photo of handwritten notes → get a clean transcript
The Canvas
The Canvas is where everything comes together. It's an infinite spatial workspace where you can place chat streams, documents, images, code editors, and more — arrange them how you think, not how tabs dictate.
Draw connections between related items, group by project, zoom out for the big picture. Your AI workspace, finally spatial.
All three multi-modal features are available on all plans starting today. Voice requires microphone access; Vision supports JPEG, PNG, GIF, and WebP images up to 20 MB.
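If you are preparing images in bulk, a quick local check against the stated format and size limits can save a failed upload. The sketch below is illustrative only and is not part of Poly: the function name `check_image` is hypothetical, the extension list is an assumed mapping of the four announced formats, and "20MB" is assumed to mean 20,000,000 bytes.

```python
from pathlib import Path

# Limits taken from the announcement. Whether "20MB" means 20 * 10**6 or
# 20 * 2**20 bytes is not stated, so the stricter decimal value is assumed.
MAX_IMAGE_BYTES = 20 * 1000 * 1000

# Assumed file extensions for the announced formats (JPEG, PNG, GIF, WebP).
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def check_image(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_IMAGE_BYTES}")
    return problems
```

Calling `check_image("bug.png", 1024)` returns an empty list, while a `.tiff` file or a 25 MB JPEG each come back with one problem. The check looks only at the file name and size, not the actual image contents, so treat it as a first-pass filter.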