Announcement | May 20, 2026 | 4 min read

Poly Goes Multi-Modal: Voice, Vision, and Canvas

Real-time voice conversations, image understanding, and an infinite canvas workspace. Poly now speaks, sees, and creates with you.

Poly is no longer just text. Starting today, you can speak to Poly, show it images, and arrange your ideas on an infinite canvas, all within the same workspace. This release adds three capabilities: Voice, Vision, and the Canvas.

Voice

Voice brings real-time speech-to-text and text-to-speech to every conversation. Choose from more than 40 AI voices, set your preferred language, and have natural back-and-forth conversations with AI agents.

Voice calls support:

Live transcription: see what's being said in real time
Interruption handling: speak over the AI naturally and it adapts
Context-aware responses: the AI remembers the full conversation history
Voice selection: choose from 40+ voices across 20+ languages

Perfect for brainstorming sessions, language practice, or hands-free productivity.

Vision

Vision lets you share images directly in chat. Poly understands what's in the image — text, objects, charts, code — and can analyze, describe, or build on it.

Paste a screenshot of a UI bug → ask Poly to write the fix
Show it a whiteboard sketch → get a structured document
Upload a chart → ask for statistical insights
Drop in a photo of handwritten notes → get a clean transcript

The Canvas

The Canvas is where everything comes together. It's an infinite spatial workspace where you can place chat streams, documents, images, code editors, and more — arrange them how you think, not how tabs dictate.

Draw connections between related items, group by project, zoom out for the big picture. Your AI workspace, finally spatial.

Multi-modal is available on all plans starting today. Voice requires microphone access; vision supports JPEG, PNG, GIF, and WebP up to 20 MB.
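If you prepare images with a script before sharing them, the format and size limits above can be checked locally first. The snippet below is a minimal sketch, not part of Poly: `check_image` is a hypothetical helper, it judges format by file extension only, and it assumes the size limit means 20 × 1024 × 1024 bytes, which this announcement does not specify.

```python
from pathlib import Path

# Formats listed in the announcement, matched by file extension only.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}

# Assumption: the stated limit is read as 20 * 1024 * 1024 bytes.
MAX_BYTES = 20 * 1024 * 1024


def check_image(filename: str, size_bytes: int) -> list[str]:
    """Hypothetical helper: return a list of problems (empty means acceptable)."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or 'no extension'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

Calling `check_image("screenshot.png", 350_000)` returns an empty list, while a `.tiff` file or an oversized file returns one message per problem found.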