Abstract
In recent years, advances in artificial intelligence and deep learning have driven substantial progress in machine translation through neural machine translation (NMT). However, NMT depends on large amounts of high-quality bilingual corpora as training data. On rich-resource language pairs such as English-French, NMT performance can almost match that of human translators. For low-resource languages, the shortage of bilingual data causes NMT models to overfit, which degrades translation quality. Taking low-resource Chinese-Dai translation as an example, this paper addresses the poor performance of NMT on that language pair through the following work: (1) an initialization model based on word vectors, in which aligned Dai and Chinese word vector spaces are used to initialize the word embedding layer of the neural translation model and so improve translation performance; (2) an alignment method for the Dai and Chinese word vector spaces; (3) an NMT framework based on word alignment. Experiments on Chinese-to-Dai and Dai-to-Chinese translation show that the method improves BLEU by 2.38 and 0.43 points, respectively.
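The abstract does not say which alignment technique the authors used, so the following is only a minimal sketch of one common approach, under stated assumptions: an orthogonal Procrustes mapping is learned from a seed dictionary of paired Dai and Chinese word vectors, and the mapped vectors are then copied into the initial embedding matrix of a translation model. The function names `procrustes_align` and `init_embedding_matrix`, the random fallback for uncovered words, and the use of NumPy are all illustrative choices, not details taken from the paper.

```python
import numpy as np


def procrustes_align(src_vecs, tgt_vecs):
    """Return the orthogonal matrix W that minimizes ||src_vecs @ W - tgt_vecs||_F.

    src_vecs, tgt_vecs: arrays of shape (n_pairs, dim); row i of each array
    holds the vectors of one translation pair from a seed dictionary.
    """
    # SVD of the cross-covariance between the paired vectors.
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    # The product of the two orthogonal factors is the optimal rotation.
    return u @ vt


def init_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Build an initial embedding matrix for a model vocabulary.

    vocab: list of tokens, in the order of the model's embedding rows.
    pretrained: dict mapping token -> vector of length dim (already aligned).
    Tokens without a pretrained vector keep a small random initialization
    (an assumption made for this sketch).
    """
    rng = np.random.default_rng(seed)
    matrix = rng.normal(0.0, 0.1, size=(len(vocab), dim))
    for idx, token in enumerate(vocab):
        if token in pretrained:
            matrix[idx] = pretrained[token]
    return matrix
```

In this sketch, Dai vectors would be multiplied by the learned `W` before being passed as `pretrained`, so that source and target embeddings start in a shared space. Constraining `W` to be orthogonal preserves distances within the Dai space, which is one reason this mapping is a common choice when the seed dictionary is small.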
Authors
高翊
付莎
胡泽林
李淼
冯韬
麻之润
GAO Yi; FU Sha; HU Zelin; LI Miao; FENG Tao; MA Zhirun (Yunnan Minority Language Working Committee, Kunming 650499, China; Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei 230031, China)
Source
《昆明理工大学学报(自然科学版)》
CAS
PKU Core Journal (北大核心)
2020, Issue 4, pp. 57-63 (7 pages)
Journal of Kunming University of Science and Technology (Natural Science)
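Funding
National Natural Science Foundation of China (61572462)
Chinese Academy of Sciences Informatization Special Project, sub-project (XXH13505-03-203)
Yunnan Provincial Ethnic Affairs Commission Agricultural Informatization Project (SZKM201835035)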
Keywords
low-resource neural machine translation
model initialization
word vector alignment
attention mechanism
About the authors
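GAO Yi (高翊, b. 1970), male, senior engineer. Main research interests: minority-language information processing, databases, and agricultural knowledge engineering. E-mail: 498898209@qq.com. Corresponding author: FU Sha (付莎, b. 1982), female, M.S., engineer. Main research interests: minority-language information processing, databases, and agricultural knowledge engineering. E-mail: 1769816@qq.com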