Eerily realistic AI voice demo sparks amazement and discomfort online


An example argument with Sesame’s CSM created by Gavin Purcell.

Gavin Purcell, co-host of the AI for Humans podcast, posted an example video on Reddit in which the human pretends to be an embezzler and argues with a boss. The exchange is so dynamic that it’s difficult to tell which speaker is the human and which is the AI model. Judging by our own demo, the model is entirely capable of what you see in the video.

“Near-human quality”

Under the hood, Sesame’s CSM achieves its realism by using two AI models working together (a backbone and a decoder), both based on Meta’s Llama architecture, that process interleaved text and audio. Sesame trained three AI model sizes, with the largest using 8.3 billion parameters (an 8-billion-parameter backbone model plus a 300-million-parameter decoder) on approximately 1 million hours of primarily English audio.

Sesame’s CSM doesn’t follow the traditional two-stage approach used by many earlier text-to-speech systems. Instead of generating semantic tokens (high-level speech representations) and acoustic details (fine-grained audio features) in two separate stages, Sesame’s CSM integrates both into a single-stage, multimodal transformer-based model that jointly processes interleaved text and audio tokens to produce speech. OpenAI’s voice model uses a similar multimodal approach.
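
To make "interleaved text and audio tokens" more concrete, here is a minimal sketch of how a conversation might be flattened into one sequence for a single-stage model. The article does not describe Sesame's actual token layout, so the `Token` class, the `interleave_turns` helper, and the "text first, then audio, turn by turn" ordering are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Token:
    """Hypothetical token record; not Sesame's real data structure."""
    modality: str  # "text" or "audio"
    speaker: int   # which participant produced the turn
    value: int     # token id within that modality's vocabulary


# One conversational turn: (speaker id, text token ids, audio token ids).
Turn = Tuple[int, Sequence[int], Sequence[int]]


def interleave_turns(turns: Sequence[Turn]) -> List[Token]:
    """Flatten turns into one sequence, assuming each turn's text
    tokens are followed directly by that same turn's audio tokens."""
    sequence: List[Token] = []
    for speaker, text_ids, audio_ids in turns:
        sequence.extend(Token("text", speaker, t) for t in text_ids)
        sequence.extend(Token("audio", speaker, a) for a in audio_ids)
    return sequence


if __name__ == "__main__":
    # Made-up token ids, purely for illustration.
    demo = interleave_turns([(0, [1, 2], [10, 11, 12]), (1, [3], [20])])
    print([tok.modality for tok in demo])
```

The point of the sketch is only that both modalities share one sequence, so a single transformer can attend across text and audio together rather than handing off between two separate stages.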

In blind tests without conversational context, human evaluators showed no clear preference between CSM-generated speech and real human recordings, suggesting the model achieves near-human quality for isolated speech samples. However, when provided with conversational context, evaluators still consistently preferred real human speech, indicating a gap remains in fully contextual speech generation.
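
As a rough illustration of what "no clear preference" means in a two-way blind comparison, the sketch below computes the share of votes that favored the model and an exact two-sided binomial p-value against a 50/50 split. The article does not say how Sesame analyzed its evaluations, so the function names and the choice of test are assumptions, and the votes in the example are invented.

```python
from math import comb
from typing import Iterable


def win_rate(votes: Iterable[str]) -> float:
    """Fraction of votes that preferred the model ("model" vs. "human")."""
    votes = list(votes)
    if not votes:
        raise ValueError("need at least one vote")
    return sum(v == "model" for v in votes) / len(votes)


def binom_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial p-value for k model wins out of n votes,
    under the null hypothesis that each vote is a fair coin flip."""
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    observed = comb(n, k)
    # Sum the probability of every outcome at most as likely as the observed one.
    tail = sum(comb(n, i) for i in range(n + 1) if comb(n, i) <= observed)
    return tail / 2 ** n


if __name__ == "__main__":
    # Invented votes, purely for illustration.
    votes = ["model", "human", "human", "model"]
    print(win_rate(votes), binom_two_sided_p(votes.count("model"), len(votes)))
```

A win rate near 0.5 with a large p-value is consistent with evaluators being unable to tell the two apart; a win rate well below 0.5 with a small p-value would match the reported preference for human speech once conversational context is included.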

Sesame co-founder Brendan Iribe acknowledged current limitations in a comment on Hacker News, noting that the system is “still too eager and often inappropriate in its tone, prosody and pacing” and has issues with interruptions, timing, and conversation flow. “Today, we’re firmly in the valley, but we’re optimistic we can climb out,” he wrote.
