This article is co-authored by Ugo Pradère and David Haüet
How hard can it be to transcribe an interview? You feed the audio to an AI model, wait a few minutes, and boom: perfect transcript, right? Well… not quite.
When it comes to accurately transcribing long audio interviews, even more so when the spoken language is not English, things get a lot more complicated. You need high-quality transcription with reliable speaker identification and precise timestamps, and all of that at an affordable price. Not so simple after all.
In this article, we take you behind the scenes of our journey to build a scalable and production-ready transcription pipeline using Google’s Vertex AI and Gemini models. From unexpected model limitations to budget evaluation and timestamp drift disasters, we’ll walk you through the real challenges, and how we solved them.
Whether you are building your own Audio Processing tool or just curious about what happens “under the hood” of a robust transcription system using a multimodal model, you will find practical insights, clever workarounds, and lessons learned that should be worth your time.
Context of the project and constraints
At the beginning of 2025, we started an interview transcription project with a clear goal: to build a system capable of transcribing interviews in French, typically involving a journalist and a guest but not restricted to that setup, and lasting from a few minutes to over an hour. The final output was expected to be just a raw transcript, yet it had to reflect the natural spoken exchange in a “book-like” dialogue form, ensuring both a faithful transcription of the original audio content and good readability.
Before diving into development, we conducted a short market review of existing solutions, but the outcomes were never satisfactory: the quality was often disappointing, the pricing definitely too high for intensive usage, and in most cases both at once. At that point, we realized a custom pipeline would be necessary.
Because our organization is engaged in the Google ecosystem, we were required to use Google Vertex AI services. Google Vertex AI offers a variety of Speech-to-Text (S2T) models for audio transcription, including specialized ones such as “Chirp,” “Latestlong,” or “Phone call,” whose names already hint at their intended use cases. However, producing a complete transcription of an interview that combines high accuracy, speaker diarization, and precise timestamping, especially for long recordings, remains a real technical and operational challenge.
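For reference, those requirements map onto configuration knobs in the Speech-to-Text V2 API. The sketch below is a minimal illustration of that mapping, not our final setup; model names and feature support vary by API version and region, and, as discussed next, no single combination satisfied us.

```python
from google.cloud.speech_v2.types import cloud_speech

# Minimal illustration (not our final setup): the Speech-to-Text V2 knobs
# that map to our requirements. Feature support varies per model and region.
features = cloud_speech.RecognitionFeatures(
    enable_word_time_offsets=True,  # precise timestamps
    enable_automatic_punctuation=True,
    diarization_config=cloud_speech.SpeakerDiarizationConfig(
        min_speaker_count=2,  # typically a journalist and a guest
        max_speaker_count=2,
    ),
)
config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["fr-FR"],
    model="latest_long",  # or "chirp", "telephony", ... depending on the use case
    features=features,
)
```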
First attempts and limitations
We initiated our project by evaluating all of those models on our use case. After extensive testing, however, we quickly came to the following conclusion: no Vertex AI service fully met the complete set of requirements or allowed us to achieve our goal in a simple and effective manner. There was always at least one missing specification, usually timestamping or diarization.
The terrible Google documentation, it must be said, cost us a significant amount of time during this preliminary research. This prompted us to ask Google for a meeting with a Google Cloud Machine Learning Specialist to try to find a solution to our problem. After a quick video call, our discussion with the Google rep confirmed our conclusions: what we aimed to achieve was not as simple as it first seemed. The entire set of requirements could not be fulfilled by a single Google service, and a custom implementation of a Vertex AI S2T service had to be developed.
We presented our preliminary work and decided to continue exploring two strategies: a custom transcription pipeline built on Vertex AI's dedicated S2T models, starting with Chirp2, and a multimodal approach relying on the Gemini models.
In parallel with these investigations, we also had to consider the financial aspect. The tool would be used for hundreds of hours of transcription per month. Unlike text, which is generally cheap enough that cost barely needs a thought, audio can be quite costly. We therefore included this parameter from the beginning of our exploration, to avoid ending up with a solution that worked but was too expensive to exploit in production.
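As an illustration of that budget constraint, here is the kind of back-of-the-envelope arithmetic we mean; the per-minute rates below are purely hypothetical placeholders, not actual Google Cloud pricing:

```python
# Back-of-the-envelope monthly cost estimate for a transcription service.
# Rates are hypothetical placeholders; substitute current pricing for your region.
HOURS_PER_MONTH = 300  # hundreds of hours of transcription per month

candidates = {
    "dedicated_s2t_model": 0.016,  # hypothetical $ per minute of audio
    "multimodal_llm": 0.007,       # hypothetical $ per minute of audio
}

for name, rate_per_minute in candidates.items():
    monthly_cost = HOURS_PER_MONTH * 60 * rate_per_minute
    print(f"{name}: ${monthly_cost:,.2f} per month")
# At 300 h/month, even a few cents per minute adds up to hundreds of dollars.
```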
Deep dive into transcription with Chirp2
We began with a deeper investigation of the Chirp2 model, since it is considered the “best in class” Google S2T service. A straightforward application of the documentation produced the expected result. The model turned out to be quite effective, offering good transcription with word-by-word timestamping, returned as output in JSON format:
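The shape of that output, and a minimal way to request it, are sketched below. This assumes the Speech-to-Text V2 Python client with word time offsets enabled; the project ID, region, and file name are placeholders, and the sample values in the comment are invented for illustration.

```python
from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

PROJECT_ID = "your-project-id"  # placeholder
REGION = "us-central1"          # Chirp2 availability varies by region

# Chirp2 is served from regional endpoints.
client = SpeechClient(
    client_options=ClientOptions(api_endpoint=f"{REGION}-speech.googleapis.com")
)

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["fr-FR"],
    model="chirp_2",
    features=cloud_speech.RecognitionFeatures(enable_word_time_offsets=True),
)

with open("interview.wav", "rb") as audio_file:
    request = cloud_speech.RecognizeRequest(
        recognizer=f"projects/{PROJECT_ID}/locations/{REGION}/recognizers/_",
        config=config,
        content=audio_file.read(),
    )

response = client.recognize(request=request)

# Each recognized word carries its own offsets, serialized in JSON as e.g.:
#   {"word": "bonjour", "startOffset": "0.320s", "endOffset": "0.710s"}
for result in response.results:
    for word in result.alternatives[0].words:
        print(word.word, word.start_offset, word.end_offset)
```

Note that the synchronous recognize call only accepts short clips; recordings as long as our interviews would go through the batch recognition endpoint instead.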
However, a new requirement was added to the project by the operational team: the transcription must be as faithful as possible to the original audio content and include the small filler words, interjections, onomatopoeia, and even mumbling that can add meaning to a conversation, and which typically come from the non-speaking participant, either at the same time as or toward the end of a sentence from the speaking one. We are talking about words like “oui oui” or “en effet,” but also simple interjections (hmm, ah, etc.) so typical of the French language! It is actually not uncommon to validate or, more rarely, oppose someone's point with a simple “Hmm Hmm.” Upon analyzing Chirp2's transcriptions, we noticed that while some of these small words were present, a large share of them was missing.
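To quantify that impression rather than eyeballing transcripts, a quick hypothetical helper along these lines can count how many of the expected fillers survive in a model's output (the filler list is ours and far from exhaustive):

```python
import re

# A few common French fillers/interjections; extend as needed.
FILLERS = ["oui oui", "en effet", "hmm", "ah", "euh"]

def filler_coverage(reference: str, transcript: str) -> dict[str, tuple[int, int]]:
    """For each filler, return (occurrences in reference, occurrences in transcript)."""
    coverage = {}
    for filler in FILLERS:
        pattern = re.compile(rf"\b{re.escape(filler)}\b", re.IGNORECASE)
        coverage[filler] = (
            len(pattern.findall(reference)),
            len(pattern.findall(transcript)),
        )
    return coverage

# Example: a manually corrected reference vs. a model transcript.
ref = "Hmm hmm. Oui oui, en effet, c'est exactement ça. Ah, euh, je crois."
hyp = "Oui oui, en effet, c'est exactement ça. Je crois."
print(filler_coverage(ref, hyp))
```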