Speech Recognition and Generation Service (SRS) supports speech recognition in several widely spoken languages: English, Spanish, French, German, Portuguese, Italian, and Russian. The end user must be able to switch from the currently recognized language to another, translated language during an RCV meeting or PBX call session.
1.1 Problem statement
Both the spoken (source) language and the target language can change multiple times during a meeting/call. Despite that, the end-user experience must stay consistent with the target language selected for translation. The translation should keep high accuracy and should not introduce high delay.
Closed Captions are received through the Closed Captions protocol;
Live Transcript is received through the Live Transcripts protocol;
Participants can control their own language in the Live Transcripts protocol;
Participants can't control their own language in the Closed Captions protocol;
Participants can't control the target language of the Live Transcripts;
Participants can't control the target language of the Closed Captions.
PBX:
Live Transcript is received through the Closed Captions protocol.
Common:
There is no real-time language detection for participants; each participant's own language must be set manually.
3. Requirements
Functional requirements
Support translation from English, Spanish, French, German, Portuguese, Italian, and Russian (source language) to another language (target language) from this list;
In addition to these languages, the user can choose to receive Closed Captions and Live Transcript as is, without translation;
The source language can be changed during the meeting/call by the participants:
The source language can change during the meeting/call and be detected automatically.
The target language can be changed multiple times per meeting/call per end user;
The whole available transcript should be translated into the target language;
The target languages for Closed Captions and Live Transcript should be selectable independently.
Support translation from the recognized languages (English, Spanish, and German) to Spanish, French, German, Portuguese, Italian, Russian, and Ukrainian (CC, LT);
The source language CAN'T be changed multiple times per meeting/call by the participants (CC, LT);
The target language can be changed multiple times per meeting/call per end user (CC, LT).
PBX:
Migration from the Closed Captions protocol to the Live Transcripts protocol for the Live Transcripts.
Phase 1, scope
The whole available transcript should be translated into the target language (LT).
Phase 2, scope
The source language CAN be changed multiple times per meeting/call (Language Identification is required).
3. Current state
AS-IS:
TO-BE:
4. Implementation details
4.1. General flow extension
Change language settings and options:
participants can change the language of the Closed Captions with the ClosedCaptionLanguage option in the Closed Captions protocol;
participants can change the language of the Live Transcript with the LiveTranscriptLanguage option in the Live Transcript protocol;
participants can change their own (source) language with the ParticipantLanguage option in the Closed Captions or Live Transcript (!sic) protocol. The option is meant to be shared between the Closed Captions and Live Transcript protocols, and it affects speech recognition for that participant.
4.2. Closed Captions service extension
Authentication must be synchronous, i.e. one request, one response;
Meeting settings (list of supported languages, permissions, settings, and various real-time options) must be sent to the participant after authentication is finished;
Meeting settings must be sent to the participant right after they are changed (e.g. when ParticipantLanguage is changed from the Live Transcript protocol);
Participants must explicitly specify ParticipantLanguage and ClosedCaptionsLanguage;
Participants can change ParticipantLanguage and ClosedCaptionsLanguage multiple times per meeting/call.
For backward compatibility:
if the ParticipantLanguage message is absent:
the default value English should be used;
if the ClosedCaptionsLanguage message is absent:
the default value None should be used (i.e. no translation is applied to the provided transcripts).
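The backward-compatibility defaults above could be resolved as in the following minimal sketch. It is only an illustration: the ClosedCaptionsSettings class, the resolve_settings function, and the dictionary used to model received option messages are hypothetical and are not part of the Closed Captions protocol.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Defaults taken from the backward-compatibility rules in this section.
DEFAULT_PARTICIPANT_LANGUAGE = "English"
DEFAULT_CLOSED_CAPTIONS_LANGUAGE: Optional[str] = None  # None means "no translation"


@dataclass(frozen=True)
class ClosedCaptionsSettings:
    """Effective per-participant settings (hypothetical container)."""

    participant_language: str
    closed_captions_language: Optional[str]

    @property
    def translation_enabled(self) -> bool:
        # Transcripts are passed through untouched when no target language is set.
        return self.closed_captions_language is not None


def resolve_settings(messages: Dict[str, str]) -> ClosedCaptionsSettings:
    """Build the effective settings from the option messages received so far.

    `messages` maps an option name to its value; a missing key models an
    absent message from a legacy client.
    """
    return ClosedCaptionsSettings(
        participant_language=messages.get(
            "ParticipantLanguage", DEFAULT_PARTICIPANT_LANGUAGE
        ),
        closed_captions_language=messages.get(
            "ClosedCaptionsLanguage", DEFAULT_CLOSED_CAPTIONS_LANGUAGE
        ),
    )
```

A legacy client that sends neither message ends up with English recognition and untranslated captions, which matches the behavior it had before the extension.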
4.4. Live Transcript service extension
Originally, the live transcripts consisted of two parts:
the real-time data, which comes from Kafka;
the historical data, stored in transient in-memory storage and identified by a top-level-id (for RCV it could be meeting_id+conference_id, for PBX it could be telephony_session_id).
To store historical data per target language, the transient in-memory storage should be identified by top-level-id+target-language and created on demand. When a new storage is created, the historical data from the oldest transient in-memory storage matching top-level-id+* is translated by the translation subsystem (based on the original language of each message) and saved into the newly created storage. The real-time data is translated on the fly by the translation subsystem into the subset of target languages (again based on the original language of each message), saved into the top-level-id+target-language storage as historical data, and published to the Live Transcript clients via the Live Transcript protocol.
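The storage scheme described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the TranscriptStorages and Entry names, the translator signature (text, source language, target language), and the choice to keep the original text in every entry so that backfilling never translates an already translated text. A target language of None is used to model the untranslated storage.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical translator signature: (text, source_language, target_language) -> text
Translator = Callable[[str, str, str], str]
StorageKey = Tuple[str, Optional[str]]  # (top-level-id, target-language)


@dataclass(frozen=True)
class Entry:
    original_text: str
    original_language: str
    text: str  # text rendered in the storage's target language


class TranscriptStorages:
    def __init__(self, translate: Translator) -> None:
        self._translate = translate
        # dicts keep insertion order, so the first key found for a
        # top-level-id belongs to its oldest storage.
        self._storages: Dict[StorageKey, List[Entry]] = {}

    def _render(self, text: str, language: str, target: Optional[str]) -> Entry:
        if target is None or target == language:
            return Entry(text, language, text)  # nothing to translate
        return Entry(text, language, self._translate(text, language, target))

    def ensure_storage(self, top_level_id: str, target: Optional[str]) -> List[Entry]:
        """Create the storage on demand and backfill it from the oldest one."""
        key = (top_level_id, target)
        if key not in self._storages:
            oldest = next(
                (entries for (tid, _), entries in self._storages.items()
                 if tid == top_level_id),
                [],
            )
            self._storages[key] = [
                self._render(e.original_text, e.original_language, target)
                for e in oldest
            ]
        return self._storages[key]

    def on_realtime(self, top_level_id: str, text: str,
                    language: str) -> Dict[Optional[str], Entry]:
        """Translate a real-time message into every active target language,
        save it as history, and return what should be published per target."""
        self.ensure_storage(top_level_id, None)  # always keep the original
        published: Dict[Optional[str], Entry] = {}
        for (tid, target), entries in self._storages.items():
            if tid == top_level_id:
                entry = self._render(text, language, target)
                entries.append(entry)
                published[target] = entry
        return published
```

In this sketch a storage for a new target language is created the first time a client asks for that language, so translation cost is only paid for languages that are actually in use.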
4.5. Live Transcript protocol extension
Authentication must be synchronous, i.e. one request, one response;
Meeting settings (list of supported languages, permissions, settings, and various real-time options) must be sent to the participant after authentication is finished;
Meeting settings must be sent to the participant right after they are changed (e.g. when ParticipantLanguage is changed from the Closed Captions (!sic) protocol);
Participants must explicitly specify ParticipantLanguage and LiveTranscriptionLanguage;
Participants can change ParticipantLanguage and LiveTranscriptionLanguage multiple times per meeting/call.
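Because ParticipantLanguage is shared between the two protocols, a change made through one connection has to reach the other. A minimal sketch of that propagation is shown below; the SharedParticipantSettings class and its attach/update methods are hypothetical, each listener stands in for one authenticated connection of the same participant, and the English default is the backward-compatibility default from section 4.2.

```python
from typing import Callable, Dict, List

SettingsListener = Callable[[Dict[str, str]], None]


class SharedParticipantSettings:
    """Settings of one participant, shared by all of their connections."""

    def __init__(self) -> None:
        self._values: Dict[str, str] = {"ParticipantLanguage": "English"}
        self._listeners: List[SettingsListener] = []

    def attach(self, listener: SettingsListener) -> None:
        # Called once authentication has finished: the connection
        # immediately receives the current settings.
        self._listeners.append(listener)
        listener(dict(self._values))

    def update(self, option: str, value: str) -> None:
        # Called when any connection changes an option.
        if self._values.get(option) == value:
            return  # nothing changed, nothing to send
        self._values[option] = value
        for listener in self._listeners:
            # Every connection gets its own copy of the new settings,
            # including the one that requested the change.
            listener(dict(self._values))
```

With a Closed Captions connection and a Live Transcript connection both attached, an update of ParticipantLanguage from either side results in one settings message on each connection.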
History and search:
History should be translated to LiveTranscriptionLanguage (the client needs to issue an additional getHistory request to display it correctly in the UI);
Search should be performed in the current LiveTranscriptionLanguage;
If LiveTranscriptionLanguage is None, history should be provided in the original language.
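The history and search rules above might look like this in code. This is a sketch under assumptions: HistoryEntry, get_history, and search are illustrative names rather than protocol messages, search is modeled as a case-insensitive substring match, and falling back to the original text when a translation is missing is a choice made here, not something the requirements specify.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HistoryEntry:
    original_text: str
    # Translations that are already available, keyed by target language.
    translations: Dict[str, str] = field(default_factory=dict)

    def text_in(self, language: Optional[str]) -> str:
        if language is None:
            return self.original_text  # None: history in the original language
        # Assumption: fall back to the original if no translation exists yet.
        return self.translations.get(language, self.original_text)


def get_history(entries: List[HistoryEntry], language: Optional[str]) -> List[str]:
    """Return the history rendered in the given LiveTranscriptionLanguage."""
    return [entry.text_in(language) for entry in entries]


def search(entries: List[HistoryEntry], language: Optional[str], query: str) -> List[str]:
    """Search the history as the participant currently sees it."""
    needle = query.casefold()
    return [text for text in get_history(entries, language)
            if needle in text.casefold()]
```

Searching the rendered text, rather than the original, means a participant only finds phrases in the language they are actually reading.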
5. Quality
Quality (BLEU);
Stability;
Latency.
max(Quality, Stability, -1 * Latency)
6. Tech aspect
6. Open questions
Translation accuracy for Closed Captions and Live Transcript:
What is the minimum duration of a sentence (i.e. the buffer) needed for translation?
Should a sentence be translated only after it has the final flag?
Should the sentence be enriched with conversational context (the current participant's conversational context OR the whole dialog's conversational context)?
How to handle the situation when the client has the wrong language value in ParticipantLanguage, for example:
ParticipantLanguage is English, but the participant speaks Russian: the speech is not recognized correctly, and the text is garbled but labeled as English (maybe pay attention to the confidence);
ParticipantLanguage is English, but the participant speaks Spanish: the speech is recognized almost correctly, e.g. "Hola! Buenos dias!", but is labeled as English.
3 Comments
Alexander Petrovsky
https://cloud.google.com/translate/docs/overview
Mikhail Pokhikhilov
Alexander Petrovsky Client-initiated "Ping-Pong" communication cannot be removed from the protocol. Discussed that with Dmitrii Zlygin: the browser's WebSocket constructor (https://developer.mozilla.org/en-US/docs/Web/API/WebSocket/WebSocket) does not provide a way to configure ping-pong.
However, server-initiated "Heartbeat" messaging could indeed be removed.
Joe Hong
Alexander Petrovsky The diagrams help a lot in understanding the RCV flow. Can you do the same and add the flows for phone calls?