Multimodal: vision, audio, video, image gen

8 links across all digests

From Week 14, 2026

Qwen (Alibaba) · 5 min read

Qwen3.5-Omni processes text, images, audio, and video natively

Alibaba's Qwen team released an omni-modal model that handles text, images, audio, and video in a single architecture, with support for 113 languages and dialects.

Models Multi

From Week 13, 2026

Google Blog · 6 min read

Gemini 3.1 Flash Live brings real-time voice AI across Google products

Google launched its real-time voice model, offering low-latency conversational AI across APIs and products and positioning it as infrastructure for voice-first agent experiences.

Models Multi

From Week 13, 2026

Mistral AI · 4 min read

Voxtral TTS: Mistral's multilingual text-to-speech model for edge deployment

Mistral released a 4B-parameter multilingual TTS model that supports nine languages, is optimized for on-device deployment, and needs only minimal audio samples for voice cloning.

Models Multi OSS

From Week 12, 2026

Google · 4 min read

Google Stitch launches as an AI-native design canvas

Google's Stitch is a vibe design tool with an infinite canvas, voice input, Gemini-powered UI generation, and Design.md for portable design systems. Figma's stock dropped 8% on the news.

UX Tools Multi

From Week 11, 2026

Claude Blog · 4 min read

Claude now creates interactive charts, diagrams, and visualizations

Claude Chat gains the ability to generate interactive charts and diagrams directly in conversation, making data exploration more visual without leaving the thread.

Tools Multi

From Week 11, 2026

GitHub · 5 min read

RCLI: voice-controlled AI for macOS applications

Open-source voice AI tool that lets you control macOS apps by speaking. Small project, surprisingly polished, and a glimpse at ambient voice interfaces beyond Siri.

Tools Multi

From Week 10, 2026

TechCrunch · 6 min read

Luma Launches Creative AI Agents Powered by Unified Intelligence Models

Luma released multimodal agents built on its Uni-1 model that coordinate text, image, video, and audio generation end-to-end, launching with enterprise partners Publicis Groupe and Serviceplan Group.

Agents Multi

From Week 10, 2026

ntik.me · 12 min read

Building a Sub-500ms Voice Agent by Rethinking the Streaming Pipeline

Nick Tikhonov built a voice agent with roughly 400ms end-to-end latency by orchestrating STT, LLM, and TTS with optimized streaming, making it about twice as fast as Vapi's equivalent setup.

Dev Agents Multi
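
To make the streaming idea in the entry above concrete, here is a minimal sketch of why streaming lowers time to first audio: text is handed to TTS one sentence at a time, so speech can start before the LLM has finished generating. This is not the linked article's code. The names `fake_llm`, `fake_tts`, `sentences`, and `run_pipeline` are hypothetical stand-ins, and the stages are simulated rather than calling any real STT, LLM, or TTS service.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def fake_llm(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM: yields one token at a time."""
    for token in ["Sure", ".", " It", " is", " sunny", "."]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group tokens into sentences so TTS can start on the first one early."""
    buf = ""
    async for tok in tokens:
        buf += tok
        if tok.endswith((".", "!", "?")):
            yield buf.strip()
            buf = ""
    if buf.strip():  # flush any trailing text without end punctuation
        yield buf.strip()


async def fake_tts(text: str) -> bytes:
    """Hypothetical stand-in for a TTS call: returns placeholder audio bytes."""
    await asyncio.sleep(0)
    return text.encode("utf-8")


async def run_pipeline(prompt: str, log: List[str]) -> List[bytes]:
    """Stream LLM tokens into sentence-level TTS, recording event order in log."""

    async def logged_llm() -> AsyncIterator[str]:
        async for tok in fake_llm(prompt):
            log.append(f"llm:{tok}")
            yield tok
        log.append("llm:done")

    audio: List[bytes] = []
    async for sentence in sentences(logged_llm()):
        audio.append(await fake_tts(sentence))
        log.append(f"tts:{sentence}")
    return audio


def demo() -> Tuple[List[str], List[bytes]]:
    log: List[str] = []
    audio = asyncio.run(run_pipeline("What is the weather?", log))
    return log, audio


if __name__ == "__main__":
    for event in demo()[0]:
        print(event)
```

In the printed event log, the first `tts:` entry appears before `llm:done`, which is the property a streaming pipeline relies on. The sketch is pull-based, so each TTS call pauses token consumption; a production pipeline would more likely run the stages concurrently (for example with queues) and stream audio within each sentence as well.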