Abstract
To assess objectively how the main existing Chinese named entity recognition (NER) systems compare with open-source systems, this paper builds a Chinese NER model from a character-based bidirectional long short-term memory network (BiLSTM) followed by a conditional random field (CRF), trained on the MSRA dataset from Microsoft Research Asia. The MSRA test data is then used to compare the self-built model with Harbin Institute of Technology's Language Technology Platform (LTP) and Stanford University's CoreNLP natural language processing toolkit. Experiments show that the BiLSTM model recognizes location names best; for organization names, unlike location and person names, its performance is on par with the open-source tools. The experiments leave room for improvement in corpus size and experimental design. Future work will focus on the experimental model, combining domain-specific entities with the sequence labeling problem.
Authors
祖木然提古丽·库尔班
艾山·吾买尔
Zumurantiguli Kuerban; Aishan Wumaier (School of Information Science and Engineering, Xinjiang University, Urumqi 830046; Xinjiang Laboratory of Multi-Language Information Technology, Urumqi 830046)
Source
Modern Computer (《现代计算机》)
2019, No. 14, pp. 3-7 (5 pages)
Funding
National Natural Science Foundation of China (No. 61662077, No. 61262060)
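Author biographies
Zumurantiguli Kuerban (b. 1992), female, from Aksu, Xinjiang; master's degree; research interests: natural language processing and machine translation. Corresponding author: Aishan Wumaier (b. 1981), male, from Urumqi, Xinjiang; Ph.D., associate professor; research interests: natural language processing and machine translation. E-mail: hasan1479@xju.edu.cn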