This article is co-authored by Ugo Pradère and David Haüet
How hard can it be to transcribe an interview? You feed the audio to an AI model, wait a few minutes, and boom: perfect transcript, right? Well… not quite.
When it comes to accurately transcribing long audio interviews, especially when the spoken language is not English, things get a lot more complicated. You need high-quality transcription with reliable speaker identification and precise timestamps, all at an affordable price. Not so simple after all.
In this article, we take you behind the scenes of our journey to build a scalable and production-ready transcription pipeline using Google’s Vertex AI and Gemini models. From unexpected model limitations to budget evaluation and timestamp drift disasters, we’ll walk you through the real challenges, and how we solved them.
Whether you are building your own audio processing tool or just curious about what happens “under the hood” of a robust transcription system built on a multimodal model, you will find practical insights, clever workarounds, and lessons learned that should be worth your time.
Context of the project and constraints
At the beginning of 2025, we started an interview transcription project with a clear goal: to build a system capable of transcribing interviews in French, typically involving a journalist and a guest, but not restricted to that setup, and lasting from a few minutes to over an hour. The final output was expected to be a raw transcript, but one written as a “book-like” dialogue, ensuring both a faithful transcription of the original audio content and good readability.
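For illustration, here is a purely hypothetical fragment of the kind of “book-like” output we were aiming for; the speaker labels, timestamps, and wording are invented for the example:

```
[00:00:02] Journalist: Bonjour et bienvenue dans ce nouvel épisode.
[00:00:06] Guest: Merci beaucoup, je suis ravi d'être là.
```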
Before diving into development, we conducted a short market review of existing solutions, but the outcomes were never satisfactory: the quality was often disappointing, the pricing definitely too high for intensive usage, and in most cases both at once. At that point, we realized a custom pipeline would be necessary.
Because our organization is committed to the Google ecosystem, we were required to use Google Vertex AI services. Google Vertex AI offers a variety of Speech-to-Text (S2T) models for audio transcription, including specialized ones such as “Chirp,” “Latest Long,” or “Phone Call,” whose names already hint at their intended use cases. However, producing a complete transcription of an interview that combines high accuracy, speaker diarization, and precise timestamping, especially for long recordings, remains a real technical and operational challenge.
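To give a sense of what these services look like in practice, here is a minimal sketch of a synchronous recognition call with the Speech-to-Text v2 Python client. The project ID and file name are placeholders, and note that synchronous recognition only accepts short clips (roughly a minute of audio), so long interviews would go through the batch API instead:

```python
# Minimal sketch of a Speech-to-Text v2 call (placeholder project/file values).
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

PROJECT_ID = "your-gcp-project"  # placeholder

client = SpeechClient()
config = cloud_speech.RecognitionConfig(
    # Let the API detect the audio encoding from the file header.
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["fr-FR"],
    model="latest_long",  # or "chirp", "phone_call", depending on the use case
)

with open("interview_clip.wav", "rb") as f:
    audio_bytes = f.read()

response = client.recognize(
    request=cloud_speech.RecognizeRequest(
        recognizer=f"projects/{PROJECT_ID}/locations/global/recognizers/_",
        config=config,
        content=audio_bytes,
    )
)

for result in response.results:
    print(result.alternatives[0].transcript)
```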
First attempts and limitations
We initiated the project by evaluating all of those models on our use case. However, after extensive testing, we quickly came to the following conclusion: no single Vertex AI service fully met our set of requirements in a simple and effective manner. There was always at least one missing specification, usually on timestamping or diarization.
The terrible Google documentation, this must be said, cost us a significant amount of time during this preliminary research. This prompted us to request a meeting with a Google Cloud Machine Learning Specialist to try to find a solution to our problem. The video call with the Google rep quickly confirmed our conclusions: what we aimed to achieve was not as simple as it first seemed. The entire set of requirements could not be fulfilled by a single Google service, and a custom implementation around the Vertex AI S2T services had to be developed.
We presented our preliminary work and decided to continue exploring two strategies:

- transcription with Chirp2, Google’s dedicated S2T model, which we investigate first below;
- transcription with the multimodal Gemini models.
In parallel with these investigations, we also had to consider the financial aspect. The tool would be used for hundreds of hours of transcription per month. Unlike text, which is generally cheap enough that you don’t have to think about it, audio can be quite costly. We therefore factored this parameter in from the very beginning of our exploration, to avoid ending up with a solution that worked but was too expensive to run in production.
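A back-of-the-envelope calculation shows why. The per-minute rate below is a deliberately hypothetical placeholder, not a quoted Google price, but the order of magnitude is what drove our attention:

```python
# Back-of-the-envelope monthly cost estimate for audio transcription.
# PRICE_PER_MINUTE is a hypothetical placeholder, not an actual Google rate.
HOURS_PER_MONTH = 300        # "hundreds of hours per month"
PRICE_PER_MINUTE = 0.016     # USD, illustrative assumption

monthly_cost = HOURS_PER_MONTH * 60 * PRICE_PER_MINUTE
print(f"~${monthly_cost:,.0f} per month")  # ~$288/month under these assumptions
```

At text-generation prices, a budget like this would barely register; for audio, it scales linearly with every hour of interviews and quickly becomes a line item worth optimizing.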
Deep dive into transcription with Chirp2
We began with a deeper investigation of the Chirp2 model, since it is considered the “best in class” Google S2T service. A straightforward application of the documentation produced the expected result. The model turned out to be quite effective, offering good transcription with word-by-word timestamping, as in the following JSON output:
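The original sample is not reproduced here, so the snippet below is an illustrative reconstruction of the shape of that output, based on the word-level timestamp fields (`startOffset`, `endOffset`) returned by the Speech-to-Text v2 API; the transcript text, confidence, and offsets are invented:

```json
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "bonjour et bienvenue dans ce nouvel épisode",
          "confidence": 0.94,
          "words": [
            { "word": "bonjour", "startOffset": "0.320s", "endOffset": "0.880s" },
            { "word": "et", "startOffset": "0.960s", "endOffset": "1.040s" },
            { "word": "bienvenue", "startOffset": "1.120s", "endOffset": "1.680s" }
          ]
        }
      ],
      "languageCode": "fr-FR"
    }
  ]
}
```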
However, a new requirement was added along the way by the operational team: the transcription must be as faithful as possible to the original audio content and include the small filler words, interjections, onomatopoeia, or even mumbling that can add meaning to a conversation, and that typically come from the non-speaking participant, either at the same time as or toward the end of the speaker’s sentence. We are talking about words like “oui oui” or “en effet,” but also simple vocalizations (hmm, ah, etc.) so typical of the French language! It is actually not uncommon to validate or, more rarely, oppose someone’s point with a simple “Hmm Hmm.” Upon analyzing Chirp2’s transcriptions, we noticed that while some of these small words were present, many others were simply missing.