Abstract
                
                    Supervised learning for Chinese NL2SQL (natural language to SQL) requires large amounts of annotated data. To address this, the paper proposes DualSQL, a dual-learning NL2SQL model that is trained with weak supervision on a small amount of labeled training data to generate SQL statements from Chinese queries. Two dual tasks are trained simultaneously, natural language to SQL and SQL back to natural language, so that the model learns the duality constraints between the tasks and captures more of the relevant semantic information. To verify the effectiveness of dual learning on the NL2SQL parsing task, the model is also trained with different proportions of unlabeled data. Experiments on the Chinese and English datasets ATIS, GEO, and TableQA show that, compared with baseline models such as Seq2Seq, Seq2Tree, Seq2SQL, SQLNet, and -dual, the proposed model improves accuracy by at least 2.1%, and that dual learning improves execution accuracy on the Chinese TableQA dataset by at least 5.3%. Using only 60% of the labeled data achieves results similar to supervised learning with 90% of the labeled data.
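The abstract reports execution accuracy without defining it. As a rough illustration only, the metric is commonly understood as the fraction of predicted SQL queries whose execution results match those of the reference queries. The sketch below assumes that definition and uses Python's standard `sqlite3` module; the `product` table, its rows, the queries, and the helper names `run_query` and `execution_accuracy` are all hypothetical and are not taken from the paper or from TableQA.

```python
import sqlite3


def run_query(conn, sql):
    """Execute sql and return its rows in a canonical order, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort so that row order does not affect the comparison.
    return sorted(rows, key=repr)


def execution_accuracy(conn, predicted, gold):
    """Fraction of predicted queries whose results equal the gold results."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = 0
    for pred_sql, gold_sql in zip(predicted, gold):
        pred_rows = run_query(conn, pred_sql)
        gold_rows = run_query(conn, gold_sql)
        # A query that fails to execute never counts as correct.
        if pred_rows is not None and pred_rows == gold_rows:
            correct += 1
    return correct / len(gold)


# Hypothetical toy database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [("pen", 3), ("book", 20), ("lamp", 45)],
)

gold = [
    "SELECT name FROM product WHERE price > 10",
    "SELECT COUNT(*) FROM product",
]
predicted = [
    # Different SQL text, same result set: counted as correct.
    "SELECT name FROM product WHERE price >= 11",
    # Executes, but returns a different count: counted as wrong.
    "SELECT COUNT(*) FROM product WHERE price > 10",
]

score = execution_accuracy(conn, predicted, gold)
print(score)
```

Comparing result sets rather than SQL strings means that two differently written but equivalent queries are both accepted, which is why execution accuracy can differ from exact-match accuracy.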
    
    
                Authors
                ZHAO Zhichao (赵志超); YOU Jinguo (游进国); HE Peilei (何培蕾); LI Xiaowu (李晓武)
                Affiliations: Kunming University of Science and Technology, Kunming, Yunnan 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, Yunnan 650500, China
     
    
    
                Source
                Journal of Chinese Information Processing (《中文信息学报》)
                Indexed in: CSCD; Peking University Core Journals (北大核心)
                2023, Issue 3, pp. 164-172 (9 pages)
     
            
                Funding
                    National Natural Science Foundation of China (62062046)
            
    
    
    
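                Author biographies
ZHAO Zhichao (b. 1994), M.S.; main research interests: natural language processing and time series forecasting. E-mail: zhaozhichao_study@stu.kust.edu.cn. Corresponding author: YOU Jinguo (b. 1977), Ph.D., professor; main research interests: big data analysis and data mining. E-mail: jgyou@126.com. HE Peilei (b. 1999), M.S.; main research interests: data mining and data cubes. E-mail: 20212104068@stu.kust.edu.cn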