借助离散神经音频编解码器的能力,大型语言模型(Large language model,LLM)已被广泛认为是一种零样本语音合成(Text-to-Speech,TTS)的潜在方法。然而,基于采样的解码策略虽然能够为语音生成带来丰富的多样性,但同时也引入了诸如拼写错...借助离散神经音频编解码器的能力,大型语言模型(Large language model,LLM)已被广泛认为是一种零样本语音合成(Text-to-Speech,TTS)的潜在方法。然而,基于采样的解码策略虽然能够为语音生成带来丰富的多样性,但同时也引入了诸如拼写错误、遗漏和重复等鲁棒性问题。为了解决上述问题,我们提出了VALL-E R,一个鲁棒且高效的零样本TTS系统,并以VALL-E为基础进行构建。具体而言,我们引入了一种音素单调对齐策略,通过约束声学标记与其对应的音素严格匹配,增强了音素与声学序列之间的映射关系,从而确保更精确的对齐。此外,我们采用编解码器合并的方法,在浅层量化层对离散码进行降采样,以减少解码计算量,同时保持语音输出的高质量。受益于这些策略,VALL-E R在音素可控性方面取得了显著提升,并通过逼近真实语音的词错误率展现了卓越的鲁棒性。此外,该系统仅需较少的自回归推理步骤,推理时间降低超过60%,极大提升了推理效率。展开更多
QIM(Quantization Index Modulation,量化索引调制)隐写在标量或矢量量化时嵌入机密信息,可在语音压缩编码过程中进行高隐蔽性的信息隐藏,文中试图对该种隐写进行检测.文中发现该种隐写将导致压缩语音流中的音素分布特性发生改变,提出...QIM(Quantization Index Modulation,量化索引调制)隐写在标量或矢量量化时嵌入机密信息,可在语音压缩编码过程中进行高隐蔽性的信息隐藏,文中试图对该种隐写进行检测.文中发现该种隐写将导致压缩语音流中的音素分布特性发生改变,提出了音素向量空间模型和音素状态转移模型对音素分布特性进行了量化表示.基于所得量化特征并结合SVM(Support Vector Machine,支持向量机)构建了隐写检测器.针对典型的低速率语音编码标准G.729以及G.723.1的实验表明,文中方法性能远优于现有检测方法,实现了对QIM隐写的快速准确检测.展开更多
Steganography based on bits-modification of speech frames is a kind of commonly used method, which targets at RTP payloads and offers covert communications over voice-over-IP(Vo IP). However, direct modification on fr...Steganography based on bits-modification of speech frames is a kind of commonly used method, which targets at RTP payloads and offers covert communications over voice-over-IP(Vo IP). However, direct modification on frames is often independent of the inherent speech features, which may lead to great degradation of speech quality. A novel frame-bitrate-change based steganography is proposed in this work, which discovers a novel covert channel for Vo IP and introduces less distortion. This method exploits the feature of multi-rate speech codecs that the practical bitrate of speech frame is identified only by speech decoder at receiving end. Based on this characteristic, two steganography strategies called bitrate downgrading(BD) and bitrate switching(BS)are provided. The first strategy substitutes high bit-rate speech frames with lower ones to embed secret message, which introduces very low distortion in practice, and much less than other bits-modification based methods with the same embedding capacity. The second one encodes secret message bits into different types of speech frames, which is an alternative choice for supplement. The two strategies are implemented and tested on our covert communication system Steg Vo IP. The experiment results show that our proposed method is effective and fulfills the real-time requirement of Vo IP communication.展开更多
文摘借助离散神经音频编解码器的能力,大型语言模型(Large language model,LLM)已被广泛认为是一种零样本语音合成(Text-to-Speech,TTS)的潜在方法。然而,基于采样的解码策略虽然能够为语音生成带来丰富的多样性,但同时也引入了诸如拼写错误、遗漏和重复等鲁棒性问题。为了解决上述问题,我们提出了VALL-E R,一个鲁棒且高效的零样本TTS系统,并以VALL-E为基础进行构建。具体而言,我们引入了一种音素单调对齐策略,通过约束声学标记与其对应的音素严格匹配,增强了音素与声学序列之间的映射关系,从而确保更精确的对齐。此外,我们采用编解码器合并的方法,在浅层量化层对离散码进行降采样,以减少解码计算量,同时保持语音输出的高质量。受益于这些策略,VALL-E R在音素可控性方面取得了显著提升,并通过逼近真实语音的词错误率展现了卓越的鲁棒性。此外,该系统仅需较少的自回归推理步骤,推理时间降低超过60%,极大提升了推理效率。
文摘QIM(Quantization Index Modulation,量化索引调制)隐写在标量或矢量量化时嵌入机密信息,可在语音压缩编码过程中进行高隐蔽性的信息隐藏,文中试图对该种隐写进行检测.文中发现该种隐写将导致压缩语音流中的音素分布特性发生改变,提出了音素向量空间模型和音素状态转移模型对音素分布特性进行了量化表示.基于所得量化特征并结合SVM(Support Vector Machine,支持向量机)构建了隐写检测器.针对典型的低速率语音编码标准G.729以及G.723.1的实验表明,文中方法性能远优于现有检测方法,实现了对QIM隐写的快速准确检测.
基金Project(2011CB302305)supported by National Basic Research Program(973 Program)of ChinaProjects(61232004,61302094)supported by National Natural Science Foundation of China+2 种基金Project(ZQN-PY115)supported by Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University,ChinaProject(JA13012)supported by Education Science Research Program for Young and Middle-aged Teacher of Fujian Province of ChinaProject(2014J01238)supported by Natural Science Foundation of Fujian Province of China
文摘Steganography based on bits-modification of speech frames is a kind of commonly used method, which targets at RTP payloads and offers covert communications over voice-over-IP(Vo IP). However, direct modification on frames is often independent of the inherent speech features, which may lead to great degradation of speech quality. A novel frame-bitrate-change based steganography is proposed in this work, which discovers a novel covert channel for Vo IP and introduces less distortion. This method exploits the feature of multi-rate speech codecs that the practical bitrate of speech frame is identified only by speech decoder at receiving end. Based on this characteristic, two steganography strategies called bitrate downgrading(BD) and bitrate switching(BS)are provided. The first strategy substitutes high bit-rate speech frames with lower ones to embed secret message, which introduces very low distortion in practice, and much less than other bits-modification based methods with the same embedding capacity. The second one encodes secret message bits into different types of speech frames, which is an alternative choice for supplement. The two strategies are implemented and tested on our covert communication system Steg Vo IP. The experiment results show that our proposed method is effective and fulfills the real-time requirement of Vo IP communication.