This article is co-authored by Ugo Pradère and David Haüet
How hard can it be to transcribe an interview? You feed the audio to an AI model, wait a few minutes, and boom: perfect transcript, right? Well… not quite.
When it comes to accurately transcribing long audio interviews, even more so when the spoken language is not English, things get a lot more complicated. You need high-quality transcription with reliable speaker identification and precise timestamps, all at an affordable price. Not so simple after all.
In this article, we take you behind the scenes of our journey to build a scalable, production-ready transcription pipeline using Google’s Vertex AI and Gemini models. From unexpected model limitations to budget evaluation and timestamp drift disasters, we’ll walk you through the real challenges and how we solved them.
Whether you are building your own audio processing tool or just curious about what happens “under the hood” of a robust transcription system built on a multimodal model, you will find practical insights, clever workarounds, and lessons learned that should be worth your time.
Context of the project and constraints
At the beginning of 2025, we started an interview transcription project with a clear goal: to build a system capable of transcribing interviews in French, typically involving a journalist and a guest but not restricted to that situation, and lasting from a few minutes to over an hour. The final output was expected to be just a raw transcript, but it had to reflect the natural spoken exchange, rendered as a “book-like” dialogue, ensuring both a faithful transcription of the original audio content and good readability.
Before diving into development, we conducted a short market review of existing solutions, but the outcomes were never satisfactory: the quality was often disappointing, the pricing definitely too high for intensive usage, and in most cases both at once. At that point, we realized a custom pipeline would be necessary.
Because our organization is engaged in the Google ecosystem, we were required to use Google Vertex AI services. Google Vertex AI offers a variety of Speech-to-Text (S2T) models for audio transcription, including specialized ones such as “Chirp,” “Latestlong,” or “Phone call,” whose names already hint at their intended use cases. However, producing a complete transcription of an interview that combines high accuracy, speaker diarization, and precise timestamping, especially for long recordings, remains a real technical and operational challenge.
First attempts and limitations
We initiated our project by evaluating all of those models on our use case. However, after extensive testing, we quickly came to the following conclusion: no Vertex AI service fully met the complete set of requirements or would allow us to achieve our goal in a simple and effective manner. There was always at least one missing capability, usually timestamping or diarization.
Google’s documentation, it must be said, is terrible, and it cost us a significant amount of time during this preliminary research. This prompted us to request a meeting with a Google Cloud Machine Learning Specialist to try to find a solution to our problem. After a quick video call, our discussion with the Google rep confirmed our conclusions: what we aimed to achieve was not as simple as it seemed at first. The entire set of requirements could not be fulfilled by a single Google service, and a custom implementation of a Vertex AI S2T service had to be developed.
We presented our preliminary work and decided to continue exploring two strategies:
In parallel with these investigations, we also had to consider the financial aspect. The tool would be used for hundreds of hours of transcription per month. Unlike text, which is generally cheap enough that you never have to think about it, audio can be quite costly. We therefore included this parameter from the beginning of our exploration, to avoid ending up with a solution that worked but was too expensive to run in production.
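To make that constraint concrete, a quick back-of-the-envelope estimate is enough to see how fast audio costs scale. In the sketch below, both the monthly volume and the per-minute rates are illustrative assumptions, not actual Google pricing:

```python
# Back-of-the-envelope monthly cost estimate for audio transcription.
# All volumes and rates below are illustrative assumptions, NOT real prices.
HOURS_PER_MONTH = 300                 # "hundreds of hours per month"
RATE_DEDICATED_S2T = 0.016            # assumed $/minute for a dedicated S2T model
RATE_MULTIMODAL_LLM = 0.002           # assumed $/minute for a multimodal model on audio

minutes = HOURS_PER_MONTH * 60
print(f"Dedicated S2T:  ${minutes * RATE_DEDICATED_S2T:>8,.2f} / month")
print(f"Multimodal LLM: ${minutes * RATE_MULTIMODAL_LLM:>8,.2f} / month")
```

At this volume, even a fraction of a cent per minute separates a viable tool from an unaffordable one, which is why pricing was tracked as a first-class constraint throughout the exploration.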
Deep dive into transcription with Chirp2
We began with a deeper investigation of the Chirp2 model, since it is considered the “best in class” Google S2T service. A straightforward application of the documentation provided the expected result. The model turned out to be quite effective, offering good transcription with word-by-word timestamping, returned as JSON output along the following lines:
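As a minimal sketch of such a call (assuming the google-cloud-speech v2 Python client, a hypothetical project and region, and "chirp_2" as the model identifier), a request enabling word-level timestamps looks roughly like this:

```python
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

# Hypothetical identifiers: adapt to your own project and a region serving Chirp2.
PROJECT_ID = "my-gcp-project"
REGION = "us-central1"

client = SpeechClient(
    client_options={"api_endpoint": f"{REGION}-speech.googleapis.com"}
)

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    model="chirp_2",                    # assumed model identifier
    language_codes=["fr-FR"],           # French interviews
    features=cloud_speech.RecognitionFeatures(
        enable_word_time_offsets=True,  # word-by-word timestamps
    ),
)

# The synchronous API only accepts short audio; long interviews must go
# through batch_recognize with a Cloud Storage URI instead.
with open("interview_snippet.wav", "rb") as f:
    request = cloud_speech.RecognizeRequest(
        recognizer=f"projects/{PROJECT_ID}/locations/{REGION}/recognizers/_",
        config=config,
        content=f.read(),
    )

response = client.recognize(request=request)
for result in response.results:
    for word in result.alternatives[0].words:
        print(word.word, word.start_offset, word.end_offset)
```

The word-level part of the JSON response then has roughly this shape (the transcript and offsets are invented for illustration):

```json
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "Bonjour et bienvenue",
          "confidence": 0.94,
          "words": [
            {"word": "Bonjour", "startOffset": "0.160s", "endOffset": "0.720s"},
            {"word": "et", "startOffset": "0.720s", "endOffset": "0.880s"},
            {"word": "bienvenue", "startOffset": "0.880s", "endOffset": "1.440s"}
          ]
        }
      ]
    }
  ]
}
```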
However, a new requirement was added to the project by the operational team: the transcription must be as faithful as possible to the original audio content and include the small filler words, interjections, onomatopoeia, or even mumbling that can add meaning to a conversation, and that typically come from the non-speaking participant, either at the same time as the speaker or toward the end of their sentence. We’re talking about words like “oui oui” or “en effet,” but also simple expressions (hmm, ah, etc.), so typical of the French language! It’s actually not uncommon to validate or, more rarely, oppose someone’s point with a simple “Hmm Hmm.” Upon analyzing Chirp2’s transcription, we noticed that while some of these small words were present, many others were missing.