The Booming Voice AI Landscape: A VC Perspective

Voice AI applications will unlock $10B of new software TAM over the next five years

Bessemer Venture Partners

Remember when talking to machines felt like science fiction? Those of you old enough to remember the ‘Google Duplex’ demo (whose authenticity was later questioned) might recall the astonishment that tech could sound that natural. Well, that future is now knocking on our door. ChatGPT’s Advanced Voice Mode and ElevenLabs are setting new benchmarks in conversational AI by enhancing voice quality and realism, NotebookLM’s natural-sounding voice podcasts took the Internet by storm, and new open-source technologies are making high-quality voice cloning easier than ever.

Like many tech breakthroughs, voice AI is bringing unprecedented opportunities for startups. As a VC watching this space, I’m seeing a perfect storm brewing: massive investment, breakthrough technologies, and untapped markets ripe for disruption. But it’s also not free of challenges, from powerful incumbents to questions about the dark side of these technologies.

In this post I have tried to collate the best thinking about Voice AI, standing on the shoulders of research published by Lightspeed, A16Z, Bessemer and others, and bringing examples that I found compelling. If you get the chance, watch some of the videos to get a sense of how far the technology has come. Let’s dive in!

The State of Play: Voice AI in 2024

In 2024, about a third of all venture capital funding has been going into AI companies. Most of that investment, in dollar terms, has gone to companies building foundational AI models, which raised over $23 billion, with voice technology being a key beneficiary. This includes OpenAI’s latest round of $6.6 billion (the largest VC round in history). But substantial investments are also being deployed into emerging startups, particularly in vertical applications. This trend is evident in the success of companies like DeepL (translation), Speak (language learning), and Retell AI (call centres). Sierra AI, founded by Bret Taylor (former co-CEO of Salesforce, former CTO of Facebook and current chairman of OpenAI), is currently raising hundreds of millions of dollars at a $4 billion valuation, just a year or so after launch, having unlocked AI voice agents for companies.

For the second quarter in a row, AI was the top sector by venture dollars invested. And funding to AI companies has grown this year not just in terms of absolute dollars invested, but also proportion. (source: Crunchbase)

But what’s more interesting is how the technology is being deployed. First, it’s worth taking a look at the most up-to-date landscapes before diving into the trends.

The latest landscape in the Voice AI space was published by Lightspeed. It provides a comprehensive overview of the current state of voice technology and how it developed over time.

Another deep dive on Voice AI was recently published by A16Z, with a particular focus on voice AI agents and the desire to automate or reinvent the phone call. It’s particularly interesting to think about voice AI in terms of the tech stack needed to build the voice engines, but note that the application layer (for both B2B and B2C apps) sits on top of the tech stack, so application builders don’t need to build the full infrastructure.

Voice AI tech stack – by A16Z

The landscape is still relatively small, but growing. On the B2B side, business voice applications have progressed significantly, from rudimentary interactive voice response (IVR) systems in the 1970s to sophisticated conversational AI systems powered by LLMs. Large players entering the AI agent space are starting to acquire companies in it (or build their own solutions). For example, Israeli startup Tenyx, which appears in the landscape below, was recently acquired by Salesforce for an undisclosed sum.

On the B2C side, with advancements in real-time conversational AI, businesses can now deliver seamless, interactive voice experiences that feel increasingly natural and personalised. For example, Speak and Praktika, which use voice AI for language learning, grew very quickly to over $20M in revenue in the last 12 months.



Bessemer makes a bold prediction that Voice AI applications will drive $10 billion in new software TAM over the next five years. While early Voice AI companies focused on Automatic Speech Recognition (ASR), a new generation is emerging with conversational voice solutions that handle repetitive tasks. These advancements enable professionals in sales, recruiting, customer support, and administrative roles to concentrate on more strategic, high-value activities.

State of the cloud 2024 by Bessemer

Real-time AI audio agents and live conversations: this trend coincided with the launch of OpenAI’s Advanced Voice Mode, which lets users hold a real-time voice conversation with the chatbot, and even get it to sing. I’ve yet to try it personally, but the demos I’ve seen online have been very impressive. Another example is Bland AI, a startup that can handle sales and customer service calls.

Bland AI is a startup that can handle sales and customer support calls
OpenAI rolls out Advanced Voice Mode for ChatGPT | TechCrunch

Google is building a real-time voice assistant called Project Astra, which aims to deliver real-time multimodal user interaction by seeing the world and communicating with the user in natural language. Imagine if Siri and Alexa could do this.

Project Astra: Our vision for the future of AI assistants

Multi-Modal Innovation: The integration of voice with other AI capabilities is creating new possibilities. OpenAI’s voice mode isn’t just about speech; it’s about natural, contextual conversations. Google’s Illuminate and NotebookLM are great examples of taking content that is primarily text and turning it into a human-sounding podcast-style conversation between two people.

AI Podcast Hosts Discover They’re AI, Not Human - NotebookLM

Democratisation of Voice Tech Tools: ElevenLabs, the leader in the space, is pushing boundaries in voice synthesis, making AI characters sound increasingly human and available to any developer via API. The company is two years old and is reportedly at $80M ARR, according to TechCrunch.

Another example is Cartesia AI. It enables creating real-time, multi-modal AI systems that can function independently of cloud connectivity, thereby enhancing privacy and reducing latency.

What once required massive resources can now be accomplished with open-source tools and modest computing power. Case in point: Ethan Mollick recently shared a thread on how he cloned his voice using e2-f5-tts running locally (using Pinokio) with only 10 seconds of original voice recording. This democratisation is driving innovation at the edges. Think about the products and services people can come up with next.

The ElevenLabs Reader App. Listen to any article, PDF, ePub, or any text on the go with the highest quality AI voices.

ElevenLabs Reader

Vertical Applications Taking Off. A large portion of the funding and innovation in voice AI is concentrated on applications for specific industry verticals.

  • Healthcare (remote patient monitoring, mental health support), such as Suki, which raised $70M earlier this month
  • Education (language learning, personalised tutoring), such as Speak, which raised a Series B-3 round in July at a $500 million valuation
  • Customer Service (intelligent voice agents), such as Ada
  • Entertainment (gaming, interactive content), such as Volley, which creates AI voice games and recently raised a $55M Series C, or Respeecher AI, which can change voices for AI filmmaking or help you license celebrity voices
Respeecher AI Audio Test | Changing Voices with AI Tools

Opportunities for Startups: Focusing on Niche Solutions

Despite the dominance of giants like OpenAI and Google, startups have ample room to innovate by focusing on niches. Here’s where startups can find room to grow:

  • Industry Specialisation: Vertical AI applications are transforming industries by leveraging domain-specific data and AI models to address specialised use cases. This includes a wide range of verticals like in-car entertainment, hospitality, commerce, personal health, financial services etc.
  • Agentic Automation for Enterprise Functions: Generative AI agents are being deployed to automate complex business processes across various functions. As A16Z pointed out, there’s a huge opportunity in automating phone calls, especially those that have a predictable flow. This can include customer service (although this space is getting very crowded), sales and marketing, IT helpdesk, meeting management etc. In effect, virtual employees for hire.
  • Consumer Cloud Applications: Bessemer forecasts that AI-driven content, including voice, will dominate by 2030. AI is revitalising the consumer cloud market, creating opportunities for startups building applications that leverage voice and other modalities. From voice-enabled content creation to social media or education, users are willing to pay for high-quality interactions, whether to reduce loneliness or to be entertained. Google paid $2.6 billion to re-hire the founders of Character.ai, and I could see a voice-enabled version of that platform coming in the near future. Would you pay $1 to have a phone call with a virtual Elon Musk? Napoleon? Mahatma Gandhi?
  • Innovating On-Device: On-device processing requires balancing performance with power consumption and device resources. As mentioned in the example of Cartesia, enabling users to access voice AI applications via the phone is crucial, as it’s the natural way consumers use voice and has the widest availability. That said, there are also opportunities in other connected devices like home assistants, TVs, watches, car entertainment etc.

Ethical Challenges and Market Considerations

The rapid growth of voice AI presents notable challenges:

  • Competition from AI Giants: Startups face competition from large, well-funded companies like OpenAI, Google, and Microsoft, which are developing sophisticated voice and translation models and have vast amounts of data and distribution advantages.
  • Technical Hurdles: Ensuring the accuracy of speech recognition and language understanding is essential for reliable performance. Another component of this technical challenge is voice quality: AI voices that sound ‘robotic’ can be disappointing for users.
  • Latency and Cost: Training and deploying sophisticated voice models can be computationally expensive. Current architectures often involve multiple steps (speech to text, text processing, text to speech) that can introduce delays and make voice interactions costly. Reducing latency to sub-250 milliseconds is crucial for natural-sounding conversations.
  • Ethical and IP Concerns: With the proliferation of voice cloning and tokenised speech, startups must address ethical concerns proactively to ensure responsible development and deployment. There’s a fairly good chance that bad actors are using the latest voice technology for malicious purposes.
  • Data Privacy and Security: Voice data is highly sensitive and subject to regulations like GDPR. Startups need to prioritise data security and privacy to maintain user trust and comply with legal requirements.
  • Managing Human-AI Interaction: Voice AI applications need to be designed to seamlessly hand off to human agents when necessary, for example in the case of health or customer service. It’s important to keep a human in the loop and maintain high quality control.
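
To make the latency point above concrete, here is a minimal Python sketch of a latency budget for the cascaded pipeline mentioned in the list (speech to text, text processing, text to speech). The stage timings are made-up illustrative numbers, not measurements of any real system, and the `Stage` class and helper names are hypothetical; only the 250 ms target comes from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    """One step of a cascaded voice pipeline (hypothetical helper type)."""
    name: str
    latency_ms: float  # assumed, illustrative figure; not a benchmark


# Target taken from the text: under 250 ms for natural-sounding conversation.
TARGET_MS = 250.0


def total_latency_ms(stages: List[Stage]) -> float:
    """Stages run one after another, so their latencies add up."""
    return sum(stage.latency_ms for stage in stages)


def within_budget(stages: List[Stage], budget_ms: float = TARGET_MS) -> bool:
    """True if the end-to-end latency is strictly below the budget."""
    return total_latency_ms(stages) < budget_ms


def slowest_stage(stages: List[Stage]) -> Stage:
    """The stage contributing the most latency, i.e. the first place to optimise."""
    return max(stages, key=lambda stage: stage.latency_ms)


# Illustrative, invented numbers for a three-step cascaded pipeline.
cascaded = [
    Stage("speech_to_text", 120.0),
    Stage("text_processing", 200.0),
    Stage("text_to_speech", 90.0),
]

if __name__ == "__main__":
    print(f"total: {total_latency_ms(cascaded):.0f} ms")
    print(f"within {TARGET_MS:.0f} ms budget: {within_budget(cascaded)}")
    print(f"slowest stage: {slowest_stage(cascaded).name}")
```

With these assumed numbers, each stage is individually under 250 ms but the chain as a whole is not. That is the arithmetic behind the point that multi-step architectures introduce delays: every extra hop eats into the same small budget.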

A Call to Action: Innovating in Voice AI

The voice AI revolution is unfolding, and startups operating at the application layer can benefit from a more robust infrastructure they can build on. This is a pivotal moment for startups to innovate, collaborate, and shape the future of voice technology.

At Remagine Ventures, we invest in pre-seed startups in Israel and the UK. If you’re a founder building the future of AI Voice applications/agents, we’d love to hear from you.

Co Founder and Managing Partner at Remagine Ventures
Eze is managing partner of Remagine Ventures, a seed fund investing in ambitious founders at the intersection of tech, entertainment, gaming and commerce with a spotlight on Israel.

I'm a former general partner at Google Ventures, head of Google for Entrepreneurs in Europe and founding head of Campus London, Google's first physical hub for startups.


I'm also the founder of Techbikers, a non-profit bringing together the startup ecosystem on cycling challenges in support of Room to Read. Since inception in 2012 we've built 11 schools and 50 libraries in the developing world.
Eze Vidra