Abstract
As Internet applications become more widespread, short texts are increasingly important as electronic evidence in forensic science, and courts urgently need to attribute authorship of large volumes of online chat content. Traditional machine learning methods are highly sensitive to feature selection; because features that accurately capture an author's writing habits are hard to extract in practice, these methods often perform poorly in real use. To address the short length, sparse features, and difficult feature extraction of such texts, this paper proposes a multi-attribute fusion neural network method for Chinese short-text authorship attribution. First, the structural features, semantic features, sending time, sending location, sending frequency, and other attributes of a text are fused into the text sequence, which is then represented as word vectors. A convolutional layer and a Bi-LSTM layer automatically extract local features and contextual features, an attention mechanism dynamically adjusts the feature weights, and a Softmax classifier outputs the author. In comparative experiments against a maximum entropy model, the results show that the convolutional and Bi-LSTM layers can "learn" the contextual features of short texts, that the attention mechanism can "learn" more of the key features at different positions in the text sequence, and that the multi-attribute fusion neural network method improves authorship attribution accuracy by about 5% over the traditional model.
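Two steps of the pipeline described above, attribute fusion and attention weighting, can be illustrated with a minimal sketch. The abstract does not say how attributes are encoded or which attention form is used, so everything here is an assumption: the pseudo-token format, the frequency thresholds, the length token standing in for a structural feature, and the dot-product scoring are all hypothetical. The function names `fuse_attributes` and `attention_pool` are illustrative, and `hidden_states` stands in for the output of the convolutional and Bi-LSTM layers, which are not implemented here.

```python
import math


def fuse_attributes(tokens, hour, location, msgs_per_hour):
    """Prepend message metadata as pseudo-tokens (assumed encoding, not the paper's)."""
    # Bucket the sending frequency; the thresholds are illustrative only.
    if msgs_per_hour < 5:
        freq = "low"
    elif msgs_per_hour < 20:
        freq = "mid"
    else:
        freq = "high"
    attrs = [
        f"<hour_{hour:02d}>",    # sending time
        f"<loc_{location}>",     # sending location
        f"<freq_{freq}>",        # sending frequency
        f"<len_{len(tokens)}>",  # a simple structural feature: message length
    ]
    return attrs + list(tokens)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Dot-product attention: score each position against `query`, then average."""
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(query)
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return pooled, weights


# Usage: the tokens are stand-ins for segmented Chinese words.
sequence = fuse_attributes(["tonight", "usual", "place"], hour=21,
                           location="shenyang", msgs_per_hour=12)
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], query=[1.0, 0.0])
```

In a full model the fused sequence would be mapped to word vectors and passed through the convolutional and Bi-LSTM layers before attention, and the pooled vector would feed a Softmax classifier over candidate authors.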
Authors
LI Menglin (李孟林)
LUO Wenhua (罗文华)
LI Shaoming (李绍鸣)
Department of Cyber Crime Investigation, Criminal Investigation Police University of China, Shenyang 110854, China; Human-computer Intelligence Research Center, Shenyang Aerospace University, Shenyang 110136, China
Source
Journal of People’s Public Security University of China (Science and Technology) (《中国人民公安大学学报(自然科学版)》)
2020, No. 2, pp. 61-67 (7 pages)
Keywords
short text
multi-attribute
Bi-LSTM
maximum entropy
authorship attribution
About the authors
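LI Menglin (b. 1994), male, from Yichang, Hubei; master's student; research interests: cyber security law enforcement technology and natural language processing. Corresponding author: LUO Wenhua (b. 1977), male, professor. E-mail: luowenhua770404@126.com.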