Abstract
The Web has become the world's largest public data source, and extracting specific information from massive, heterogeneous, semi-structured Web pages is becoming an important data mining task. Research on information extraction is increasingly shifting toward deep learning. This paper proposes a deep neural network model based on a bidirectional GRU (Gated Recurrent Unit) for Web information extraction. The model addresses the long-range dependency problem in sequences, combines word embeddings with character embeddings to strengthen semantic representation and reduce interference from redundant text, and uses its bidirectional structure to make full use of textual context, so that specific information can be extracted from an input sequence quickly and accurately.
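The abstract names the building blocks but gives no equations, so the following is only a minimal, untrained sketch of how they might fit together: each token is represented by a word vector concatenated with a character-level vector, and a GRU is run over the sequence in both directions. Everything here is an assumption made for illustration rather than the paper's implementation: the names `Embedder`, `GRUCell`, `run_gru`, and `bidirectional_gru`, the averaging of character vectors, the omission of bias terms, the random weights, and the particular gate formulation (one common variant of the GRU).

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class Embedder:
    """Hypothetical lookup: word vector concatenated with an averaged character vector."""

    def __init__(self, word_dim, char_dim, rng):
        self.word_dim, self.char_dim, self.rng = word_dim, char_dim, rng
        self.words, self.chars = {}, {}

    def _lookup(self, table, key, dim):
        # Create a random vector the first time a key is seen, then reuse it.
        if key not in table:
            table[key] = [self.rng.uniform(-0.1, 0.1) for _ in range(dim)]
        return table[key]

    def embed(self, token):
        word_vec = self._lookup(self.words, token, self.word_dim)
        char_vecs = [self._lookup(self.chars, c, self.char_dim) for c in token]
        # Assumption: character vectors are averaged; the source does not say how.
        char_vec = [sum(col) / len(char_vecs) for col in zip(*char_vecs)]
        return word_vec + char_vec  # list concatenation


class GRUCell:
    """A single GRU cell with random weights; biases are omitted for brevity."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # W* act on the input, U* on the previous hidden state.
        self.Wz = rand_matrix(hidden_size, input_size, rng)
        self.Uz = rand_matrix(hidden_size, hidden_size, rng)
        self.Wr = rand_matrix(hidden_size, input_size, rng)
        self.Ur = rand_matrix(hidden_size, hidden_size, rng)
        self.Wn = rand_matrix(hidden_size, input_size, rng)
        self.Un = rand_matrix(hidden_size, hidden_size, rng)

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.Wn, x), matvec(self.Un, reset_h))]
        # The update gate z blends the previous state with the candidate state n.
        return [zi * hi + (1.0 - zi) * ni for zi, hi, ni in zip(z, h, n)]


def run_gru(cell, inputs):
    h = [0.0] * cell.hidden_size
    states = []
    for x in inputs:
        h = cell.step(x, h)
        states.append(h)
    return states


def bidirectional_gru(inputs, fwd_cell, bwd_cell):
    forward = run_gru(fwd_cell, inputs)
    # Process the reversed sequence, then restore the original token order.
    backward = run_gru(bwd_cell, inputs[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


rng = random.Random(0)
embedder = Embedder(word_dim=4, char_dim=3, rng=rng)
fwd_cell = GRUCell(input_size=7, hidden_size=5, rng=rng)
bwd_cell = GRUCell(input_size=7, hidden_size=5, rng=rng)

tokens = ["Price", ":", "$", "42"]
features = bidirectional_gru([embedder.embed(t) for t in tokens], fwd_cell, bwd_cell)
print(len(features), len(features[0]))  # 4 10
```

Each token ends up with a feature vector that joins a left-to-right state and a right-to-left state, which is what lets a per-token classifier see context on both sides. Such a classifier (for example, a softmax layer that marks which tokens belong to the target field) and the training procedure are left out of this sketch.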
Source
《信息技术》
2018, No. 3, pp. 1-5, 9 (6 pages in total)
Information Technology
Funding
Supported by the National Key Research and Development Program of China (2017YFB0802202)
Keywords
information extraction
gated recurrent unit neural network
word embedding
About the author
Li Xiao (born 1993), male, master's student. His research interests are data mining and deep learning.