Abstract
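Distributed fiber-optic acoustic sensing (DAS) can be used for personnel search and rescue and for locating human voice signals after tunnel collapse accidents. In DAS-based voice activity detection (VAD), however, extracting speech from real data collected outdoors faces the following problem: because of the noisy on-site environment and the limited means of signal acquisition, the collected speech is easily corrupted by complex, strong noise, and clean speech data for supervised training cannot be obtained. To address this, this paper proposes a VAD algorithm based on short-term autocorrelation features (ST-ACF). It combines pitch information with the autocorrelation function to detect the harmonic features of speech frames, so that the algorithm can still extract all valid human speech in DAS environments with an extremely low signal-to-noise ratio (below −10 dB). ST-ACF consists of a pre-denoising stage and a speech detection stage. In the pre-denoising stage, based on a study of the periodicity of speech pitch information, a dual-channel time window is designed to pre-denoise two typical types of noise. In the speech detection stage, an improved autocorrelation function is proposed that considers two dimensions, the feature value and its magnitude of variation, and uses their product to maximize the distance between speech and noise, which improves the algorithm's handling of borderline data. The improved algorithm yields the optimal spectral window matched to the frequency at which the feature occurs; this window is used to find local harmonics, and analysis of the local harmonics separates speech from non-speech. Experiments use real DAS data and six noise types from the NOISEX-92 dataset, with the frame error rate as the evaluation metric. The results show that ST-ACF performs well in high-energy noise, with a frame error rate of only 19.74%, an improvement of 5.91% over the baseline algorithm; ST-ACF also achieves the best performance on the DAS dataset. Overall, through the improvements to the time window and the autocorrelation function, ST-ACF performs well on DAS speech data, detects reliably across different noise environments, shows potential for a range of complex scenarios, and extends research on speech signal processing based on distributed optical fiber sensing.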
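The frame error rate used as the evaluation metric can be illustrated with a minimal sketch. The abstract does not give the paper's exact formula, so this assumes the conventional definition: the fraction of frames whose predicted speech/non-speech label differs from the reference label. The function name `frame_error_rate` and the toy label sequences are hypothetical.

```python
def frame_error_rate(reference, predicted):
    """Fraction of frames whose predicted label differs from the reference.

    Assumed conventional definition; labels are truthy for speech frames
    and falsy for non-speech frames.
    """
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("label sequences must not be empty")
    errors = sum(1 for r, p in zip(reference, predicted) if bool(r) != bool(p))
    return errors / len(reference)


# Toy example: one false alarm (frame 1) and one missed speech frame (frame 5).
reference = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
fer = frame_error_rate(reference, predicted)
print(f"FER = {fer:.2%}")
```

Under this definition, false alarms and missed speech frames count equally, so a reported value such as 19.74% would mean roughly one frame in five is mislabeled.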
Objective: Distributed acoustic sensing (DAS) systems can be applied to personnel search and rescue and to voice signal localization after tunnel collapse accidents. However, existing voice activity detection (VAD) algorithms, which form the front end of a speech signal processing system, do not yield satisfactory results when detecting human voices in DAS speech data. Experiments in a real tunnel environment face two challenges: (1) extensive DAS speech data cannot be annotated manually, which makes it difficult to obtain labeled data for supervised training; and (2) because of the noisy on-site environment and limited signal acquisition methods, DAS-collected speech signals are accompanied by substantial, complex high-energy noise, which causes some VAD algorithms to lose robustness. This study therefore proposes a robust VAD algorithm, ST-ACF, based on short-term autocorrelation features.

Methods: The algorithm investigates the acoustic characteristics of DAS speech and combines pitch information with the autocorrelation function to detect the harmonic features of speech frames. This enables it to extract all actual human voices even in DAS environments with an extremely low signal-to-noise ratio (SNR) of less than −10 dB. Because strong noise interferes significantly with DAS speech, ST-ACF consists of a denoising channel and a speech detection channel. DAS noise consists primarily of continuous high-frequency noise and sudden high-energy noise. In the denoising channel, based on a study of the periodicity of pitch information in speech, a dual-channel time window is designed to remove these two typical types of noise. Feature analysis of DAS speech data shows that speech and the two noise types exhibit distinct patterns of pitch periodicity and non-stationarity. Continuous high-frequency noise lacks harmonic properties and presents stable, non-pitched characteristics. Sudden high-energy noise is periodic, and its pitch shows a traceable trend at the moment of onset. Speech resembles the latter, but because human speech is continuous, its pitch period changes are shorter. ST-ACF therefore uses spectral flatness (SFT) as an indicator of whether pitch is present in a frame, and the dual-channel time window is designed to capture short-term pitch changes. Within a continuous time window, the SFT curve of DAS speech is fitted with a cosine function; speech segments show multiple "valleys", which indicate multiple pitched frames and are not characteristic of noise. Incorporating the time window allows ST-ACF to detect different noise types more accurately and eliminates strong-noise interference in VAD. In the speech detection channel, ST-ACF improves the spectral local harmonicity (SLH) feature. SLH is stable under low-SNR conditions, but its overall performance is suboptimal. Because SLH feature values in speech have larger absolute values and greater variability than in noise, ST-ACF considers two dimensions of each frame, the SLH feature value and its variability, and multiplies them to maximize the distinction between speech and noise. Because variability scales differ significantly between sound types, all variability values are normalized before the product is computed. The improved feature thus accounts for both the magnitude of the SLH values and their trend, which improves the algorithm's handling of borderline data and its accuracy in distinguishing noise from speech onset.

Results and Discussions: VAD performance was evaluated using the frame error rate (FER). The test dataset comprises two parts: (1) real speech signals collected by the DAS system in the Ya'an Shuikou Tunnel, with an average SNR of −10.3 dB; and (2) simulated data generated by combining noise from the NOISEX-92 dataset with human speech from the TIMIT database. The results indicate that ST-ACF is only minimally affected by high-energy noise and remains robust even at −10 dB, with an FER of only 19.74%; compared with the −5 dB environment, the FER fluctuates by approximately 2%. After optimization, ST-ACF achieves a 5.91% performance improvement over SLH. The enhancement is also evident on the DAS dataset, where ST-ACF attains its best performance, a 21.11% improvement. ST-ACF maintains robust performance across different noise sets, which shows that it can handle complex environments. The comparison across noise sets shows that the time window strategy, which assesses the invariance of audio features over a period, effectively eliminates stationary noise. LTSV, which follows a similar concept, performs well in high-frequency stationary noise. Owing to the design of the proposed time window, ST-ACF can also handle non-stationary noise: even under the most challenging gunshot noise, it maintains an FER below 25%. The time window is crucial to ST-ACF's performance, contributing a 5.07% improvement on the DAS dataset, whereas optimizing the autocorrelation function yields a modest 1.89% improvement. This is primarily because the time window removes a significant portion of the noise and prevents complex noise from interfering with VAD, while the optimized autocorrelation function preserves the integrity of the extracted speech.

Conclusions: By integrating speech algorithm principles with an analysis of DAS speech data, the maximum correlation between DAS noise and speech can be identified across multiple dimensions, which facilitates targeted detection of each. This research approach addresses certain shortcomings in VAD algorithms. ST-ACF achieves its intended objectives: it fully extracts effective speech from DAS data while preserving the integrity of the speech signal. ST-ACF performs well in low-SNR environments, which highlights its potential for application in diverse and complex scenarios and paves the way for future research on DAS-based speech signal processing.
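The role of spectral flatness as a pitch indicator can be illustrated with a small standard-library sketch. It uses the textbook definition (geometric mean of the power spectrum divided by its arithmetic mean); the abstract does not state the paper's exact SFT formula, frame length, or any decision threshold, so the function names, the 64-sample frame, and the test signals below are all assumptions for illustration only.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive O(N^2) DFT power spectrum for bins 0..N/2 (fine for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_flatness(frame, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum, in (0, 1]."""
    power = [p + eps for p in power_spectrum(frame)]  # eps avoids log(0)
    log_mean = sum(math.log(p) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power))


# A pure tone concentrates its energy in one bin, so its flatness is near 0;
# a unit impulse has a perfectly flat spectrum, so its flatness is near 1.
N = 64  # hypothetical frame length
tone = [math.sin(2 * math.pi * 8 * t / N) for t in range(N)]
impulse = [1.0] + [0.0] * (N - 1)
tone_sft = spectral_flatness(tone)
impulse_sft = spectral_flatness(impulse)
print(f"tone: {tone_sft:.3e}, impulse: {impulse_sft:.3f}")
```

Pitched (harmonic) frames behave like the tone and produce low values, which is consistent with the "valleys" the abstract describes in the SFT curve of speech segments, while noise-like frames sit closer to 1. A practical implementation would likely replace the naive DFT with an FFT.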
Authors
ZHANG Chensi (张晨思)
WANG Maoning (王茂宁)
ZHONG Yuzhong (钟羽中)
ZHANG Jianwei (张建伟)
LIU Yancai (刘严才)
YAN Haiwei (闫海卫)
WANG Wei (王伟)
YAN Shiwei (晏世伟)
(School of Computer Science, Sichuan University, Chengdu 610065, China; School of Electrical Engineering, Sichuan University, Chengdu 610065, China; National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu 610065, China; Sichuan Dehui Expressway Limited Liability Company, Liangshan Yi Autonomous Prefecture 615100, China)
Source
Advanced Engineering Sciences (《工程科学与技术》)
Peking University Core Journal
2025, No. 2, pp. 29-39 (11 pages)
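Funding
Sichuan Science and Technology Program (2022YFG0084)
Key Scientific Research Project of the Transportation Industry, Ministry of Transport (2020-MS5-146)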
Keywords
distributed acoustic sensing (分布式光纤声波传感)
voice activity detection (语音端点检测)
low signal-to-noise ratio (低信噪比)
pitch information (音高信息)
autocorrelation function (自相关函数)
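About the authors
ZHANG Chensi (张晨思, b. 1999), male, master's student; research interest: speech signal processing. E-mail: zhangcs99@foxmail.com. Corresponding author: ZHANG Jianwei (张建伟), professor. E-mail: zhangjianwei@scu.edu.cn.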