Abstract
Acoustic scene classification is one of the most active topics in machine listening. Compared with computer vision, collecting and annotating audio data for specific scenes is relatively expensive, so obtaining high classification accuracy from limited acoustic scene audio has become a central research question. Using deep learning, this paper combines the lightweight MobileNetV2 network with Mel-spectrogram features and conducts a comprehensive ablation study of three data augmentation techniques, SpecAugment, Mixup and CutMix, on the UrbanSound8K urban scene classification dataset. The results show that CutMix improves the baseline by 0.71%, while Mixup and SpecAugment applied alone degrade classification performance; combining SpecAugment with CutMix yields the best test result, with a classification accuracy of 97.097%. Furthermore, comparing the per-class F1 scores under the optimal scheme with the T-SNE dimensionality-reduction plots reveals a close correspondence between the two, indicating that T-SNE is well suited to dimensionality reduction and distribution visualization of Mel-spectrogram features.
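The paper does not include source code; the following is a minimal NumPy sketch of the three augmentation techniques the abstract compares, applied to a 2-D Mel spectrogram of shape (frequency bins, time frames). Function names, mask sizes, and Beta-distribution parameters are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def spec_augment(spec, freq_mask=8, time_mask=16, rng=None):
    """SpecAugment (masking variant): zero out one random frequency
    band and one random time band of the spectrogram."""
    rng = rng if rng is not None else np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    f0 = rng.integers(0, n_freq - freq_mask)
    t0 = rng.integers(0, n_time - time_mask)
    out[f0:f0 + freq_mask, :] = 0.0   # frequency mask
    out[:, t0:t0 + time_mask] = 0.0   # time mask
    return out

def mixup(spec_a, spec_b, label_a, label_b, alpha=0.2, rng=None):
    """Mixup: convex combination of two spectrograms and their
    one-hot labels, with weight drawn from Beta(alpha, alpha)."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return (lam * spec_a + (1 - lam) * spec_b,
            lam * label_a + (1 - lam) * label_b)

def cutmix(spec_a, spec_b, label_a, label_b, alpha=1.0, rng=None):
    """CutMix: paste a random rectangle of spec_b into spec_a and
    mix the labels in proportion to the pasted area."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    n_freq, n_time = spec_a.shape
    cut_f = int(n_freq * np.sqrt(1 - lam))   # rectangle height
    cut_t = int(n_time * np.sqrt(1 - lam))   # rectangle width
    f0 = rng.integers(0, n_freq - cut_f + 1)
    t0 = rng.integers(0, n_time - cut_t + 1)
    out = spec_a.copy()
    out[f0:f0 + cut_f, t0:t0 + cut_t] = spec_b[f0:f0 + cut_f, t0:t0 + cut_t]
    lam_adj = 1.0 - (cut_f * cut_t) / (n_freq * n_time)  # actual area ratio
    return out, lam_adj * label_a + (1 - lam_adj) * label_b
```

All three operate on log-Mel spectrograms rather than raw waveforms, which is why they compose cleanly: for example, the paper's best scheme corresponds to applying `cutmix` to a pair of samples and then `spec_augment` to the mixed result.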
Authors
LI Yuan; MA Cheng-nan; LI Guan-fang; WANG Qiang; ZHANG Wen-wu (Navy Marine Equipment Project Management Center, Beijing 100071, China; Jiangsu Automation Research Institute, Lianyungang 222061, China)
Source
Command Control & Simulation (《指挥控制与仿真》), 2021, No. 1, pp. 60-64 (5 pages)
About the Authors
LI Yuan (b. 1983), male, from Benxi, Liaoning; engineer; research interests: intelligent command and control. MA Cheng-nan (b. 1993), male, M.S., assistant engineer.