A Journey I Wish Started Sooner: From Flat Notes to AI's Vocal Highs with ElevenLabs
Companies discussed: ElevenLabs, Amazon Polly, Speechify, iSpeech
Please note: The insights presented in this article are derived from confidential consultations our team has conducted with clients across private equity, hedge funds, startups, and a leading investment bank, facilitated through specialized expert networks. Due to our agreements with these networks, we cannot reveal specific names or delve into detailed topics from these discussions. Therefore, we offer a summarized version of these insights, ensuring valuable content while upholding our confidentiality commitments.
Hey everyone! Recall Polly the Parrot from Peppa Pig and her repetitive, monotone chatter? Imagine if she existed in today's AI-driven world. Today, we're diving into text-to-speech technology and pondering how our dear Polly would sound if AI had a say!
As we engage with our clients, many are asking about ElevenLabs. ElevenLabs' Series A round brought in $19 million. An inside source told TechCrunch that the startup's post-money valuation, after the seed round, was around $99 million. Overall, we've noticed growing curiosity about this domain. Many are seeking insights into the key players and the unique offerings they bring to the table. In this article, we will go through market potential and CAGR; the players (ElevenLabs, Amazon Polly, Speechify, iSpeech); use cases; and potential risks.
Personally, I recall using a computer-generated voice during what I playfully term my "online course teenage phase." Remember those days when piecing together your debut online course felt as puzzling as solving a Rubik's Cube? I found myself constantly re-recording and fiddling with tools like Final Cut Pro to perfect the audio. In moments like these, any helping hand felt like a blessing from the tech heavens.
But my initial experience was jarring. The AI voice was cold, robotic, devoid of any emotion or warmth. It felt like an alien entity, trying to mimic human speech but falling short. It lacked the nuances, the rises and falls, and the warmth of human intonation. I was left wondering if this was the future of digital assistance.
Fast forward to today, and the landscape has transformed dramatically. We have many co-pilots to help us now, and today's human-like AI voices are hard to distinguish from the real thing. These aren't the monotonous, mechanical voices of yesteryear; they carry emotion, depth, and an ability to resonate with the listener. As I reflect on this journey, I'm filled with excitement for what the future holds for us in this incredible confluence of technology and human expression.
Market Overview
The text-to-speech market was valued at USD 1.94 billion in 2020 and is projected to reach USD 5.61 billion by 2028, growing at a CAGR of 14.21% from 2021 to 2028. The technology is becoming more popular as demand for AI, automation, and convenience grows.
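For anyone who wants to sanity-check those numbers, here is a minimal Python sketch of the CAGR arithmetic. It assumes 2020 is the base year and that growth compounds over eight yearly periods through 2028; that period count is our reading of the forecast, not something the forecast states outright. The function names are our own, for illustration only.

```python
def cagr(start_value, end_value, periods):
    """Compound annual growth rate implied by a start value, an end value, and a period count."""
    return (end_value / start_value) ** (1 / periods) - 1


def project(start_value, rate, periods):
    """Value reached after compounding start_value at `rate` for `periods` years."""
    return start_value * (1 + rate) ** periods


# Figures quoted above, in USD billions.
start, end = 1.94, 5.61
years = 2028 - 2020  # assumption: eight compounding periods from the 2020 base

implied_rate = cagr(start, end, years)
projected_2028 = project(start, 0.1421, years)

print(f"Implied CAGR: {implied_rate:.2%}")
print(f"2028 value at 14.21% CAGR: USD {projected_2028:.2f} billion")
```

Under that assumption, the implied rate comes out at roughly 14.2%, so the quoted start value, end value, and CAGR are consistent with each other.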
How Do Voice Computing and Text-to-Speech Tech Work?
At its core, a voice system built around text-to-speech follows a few key steps:
The system first listens to human voices, turning these sounds into digital data and then into text. This is known as automatic speech recognition (ASR).
Next, it tries to understand what these words mean, a step called natural-language understanding (NLU).
Thanks to AI, the system can then craft a unique response, a step called natural-language generation (NLG).
After deciding what to say, the system figures out how to say it. This is the text-to-speech step proper: it breaks words down into phonemes (distinct sounds) and applies the right tone and emphasis.
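To make the phoneme step a little more concrete, here is a toy Python sketch. It is not how any of the products in this article work internally; it simply cleans up a sentence and looks each word up in a tiny hand-written lexicon (ARPAbet-style symbols, two words only). The names `LEXICON`, `normalize`, and `to_phonemes` are hypothetical, chosen for illustration. Real systems use far larger dictionaries plus learned models for unknown words and for tone.

```python
import re

# Tiny hand-written pronunciation lexicon (illustrative only).
# Symbols follow the ARPAbet convention; digits mark vowel stress.
LEXICON = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}


def normalize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def to_phonemes(text):
    """Map each word to its phonemes; words missing from the lexicon become '<unk>'."""
    phonemes = []
    for word in normalize(text):
        phonemes.extend(LEXICON.get(word, ["<unk>"]))
    return phonemes


print(to_phonemes("Hello, world!"))
```

A speech synthesizer would then turn that phoneme sequence into audio, deciding pitch, rhythm, and emphasis along the way.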
TTS vs Voice Cloning technology
Text-to-speech is the process of converting written text into audible speech using synthetic voices. It's like giving a computer a script and having it read it out loud, often in a voice that, while human-like, is clearly machine-generated.
On the other hand, voice cloning dives a level deeper. It involves capturing the unique tonal qualities, nuances, and idiosyncrasies of a specific individual's voice. Once a voice model is created, it can generate speech that sounds eerily similar to the original person, even if they've never uttered those words.
While TTS provides a general voice output, voice cloning offers a personalized audio experience, blurring the lines between man and machine.
Players in the space
1. ElevenLabs:
I read a recent Twitter post about classics…
Wow! That's a huge leap from what was available a few years ago!
Check out the video below and tune in around the 20-second mark. Can you catch how she articulates, "This is just the beginning"? The richness in that tone is simply captivating!
ElevenLabs specializes in voice cloning technology. It offers a platform that can generate a unique voice from just a few minutes of audio data. This technology can be used for various applications, including virtual assistants, video game characters, and more.
You may be thinking, "What's the buzz about ElevenLabs that has many of our clients talking?" Well, let me share our findings and a bit of a personal story. ElevenLabs was founded in 2022 by two passionate individuals, Piotr Dabkowski and Mati Staniszewski. Both Piotr and Mati hail from Poland, and they often shared their childhood frustrations about the subpar dubbing of American movies in their homeland. It's this personal connection to language barriers that fueled their mission. With their combined expertise from Google and Palantir, they set out with a vision: to harness the power of AI and make voice universally accessible. The best startups are a reflection of a founder's vision. Investing in them transcends mere technology; it's an intimate endeavor.
2. Speechify:
Speechify is a text-to-speech solution designed to help people with dyslexia, ADHD, and other learning differences. It can convert text from books, articles, and other sources into natural-sounding speech.
Let’s hear it from a Speechify customer: Endeavor’s earnings call.
This is huge!!!
On Feb 28, 2023, Endeavor (NYSE: EDR) made history by delivering its annual earnings call using an AI voiceover from Speechify.
Ari Emanuel, chief executive of talent agency and UFC owner Endeavor, has taken that idea to a literal extreme. A synthetic version of Emanuel’s voice delivered the opening remarks on Endeavor’s fourth-quarter earnings call, in place of Emanuel himself.
“We used a recording of Ari’s voice and our generative AI system to create a synthesized version of Ari’s voice,” said Cliff Weitzman, CEO of Speechify, in a statement.
3. Amazon Polly:
Coming back to Polly the parrot! Let’s see what the industry is saying.
Alvin Hung, founder and CEO of GoAnimate, says, “I studied and tested numerous vendors before choosing Amazon Polly. The voices are the most natural ones and the API is also great and easy to implement.”
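To give a feel for what "easy to implement" can look like, here is a minimal Python sketch built around the `synthesize_speech` call from boto3, the AWS SDK for Python. The helper name `synthesize_to_file`, the sample text, and the choice of the "Joanna" voice are our own assumptions for illustration; this is not GoAnimate's integration. Running it for real requires an AWS account with credentials and a region configured.

```python
def synthesize_to_file(polly_client, text, path, voice_id="Joanna"):
    """Ask Polly to speak `text` and save the returned MP3 bytes to `path`.

    `polly_client` is expected to be a boto3 Polly client, for example
    boto3.client("polly"). Returns the number of audio bytes written.
    """
    response = polly_client.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    # The audio comes back as a stream in the "AudioStream" field.
    audio = response["AudioStream"].read()
    with open(path, "wb") as audio_file:
        audio_file.write(audio)
    return len(audio)


# Example usage (assumes boto3 is installed and AWS credentials are set up):
#
#   import boto3
#   polly = boto3.client("polly")
#   synthesize_to_file(polly, "Hello from Polly the parrot!", "polly.mp3")
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before pointing it at a live AWS account.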