Abstract
Text matching is one of the key techniques in natural language understanding; its task is to determine the degree of similarity between two texts. In recent years, with the development of pre-trained models, text matching techniques based on pre-trained language models have been widely used. However, these text matching models still face two challenges: poor generalization in a particular domain and weak robustness in semantic matching. To address this, this study proposes incremental pre-training and adversarial training methods targeting low-frequency words to improve text matching models. Incremental pre-training on in-domain low-frequency words helps the model transfer to the target domain and enhances its generalization ability. In addition, several adversarial training methods targeting low-frequency words are explored to improve the model's adaptability to word-level perturbations and thereby its robustness. Experimental results on the LCQMC dataset and a text matching dataset in the real estate domain show that incremental pre-training, adversarial training, and the combination of the two approaches all clearly improve text matching results.
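The abstract gives no implementation details, but the core idea of adversarial training restricted to low-frequency words can be sketched. The following is an illustrative sketch only, assuming an FGM-style perturbation (epsilon-scaled, gradient-normalized) applied solely to the embedding rows of low-frequency tokens; the function names, the frequency threshold, and the use of NumPy in place of a deep learning framework are all assumptions, not the paper's actual method.

```python
import numpy as np
from collections import Counter


def low_freq_vocab(corpus_tokens, threshold=2):
    """Return the set of tokens whose corpus frequency is at most `threshold`."""
    counts = Counter(corpus_tokens)
    return {tok for tok, c in counts.items() if c <= threshold}


def fgm_perturb(embeddings, grads, low_freq_ids, epsilon=0.5):
    """FGM-style adversarial perturbation applied only to low-frequency rows.

    embeddings: (vocab_size, dim) embedding matrix.
    grads:      gradient of the loss w.r.t. the embeddings, same shape.
    low_freq_ids: row indices of low-frequency tokens to perturb.

    Each selected row is moved by epsilon along its normalized gradient,
    r = epsilon * g / ||g||; all other rows are left unchanged.
    """
    perturbed = embeddings.copy()
    for i in low_freq_ids:
        norm = np.linalg.norm(grads[i])
        if norm > 0:  # skip zero-gradient rows to avoid division by zero
            perturbed[i] = embeddings[i] + epsilon * grads[i] / norm
    return perturbed
```

In a real training loop, the forward/backward pass would be rerun on the perturbed embeddings and the two losses combined, as in standard FGM adversarial training; here only the word-level selection and perturbation step is shown.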
Authors
司志博文
李少博
单丽莉
孙承杰
刘秉权
SI Zhi-Bo-Wen; LI Shao-Bo; SHAN Li-Li; SUN Cheng-Jie; LIU Bing-Quan (State Key Laboratory of Communication Content Cognition, People's Daily Online, Beijing 100733, China; Faculty of Computing, Harbin Institute of Technology, Harbin 150001, China)
Source
《计算机系统应用》
2022, No. 11, pp. 349-357 (9 pages)
Computer Systems & Applications
Funding
National Natural Science Foundation of China (62176074)
Keywords
text matching
pre-trained model
incremental pre-training
adversarial training
low-frequency word
deep learning
natural language processing (NLP)
Author information
Corresponding author: LIU Bing-Quan, E-mail: liubq@hit.edu.cn