The existing software bug localization models treat the source file as natural language, which leads to the loss of syntactical and structure information of the source file. A bug localization model based on syntactic...The existing software bug localization models treat the source file as natural language, which leads to the loss of syntactical and structure information of the source file. A bug localization model based on syntactical and semantic information of source code is proposed. Firstly, abstract syntax tree(AST) is divided based on node category to obtain statement sequence. The statement tree is encoded into vectors to capture lexical and syntactical knowledge at the statement level.Secondly, the source code is transformed into vector representation by the sequence naturalness of the statement. Therefore,the problem of gradient vanishing and explosion caused by a large AST size is obviated when using AST to the represent source code. Finally, the correlation between bug reports and source files are comprehensively analyzed from three aspects of syntax, semantics and text to locate the buggy code. Experiments show that compared with other standard models, the proposed model improves the performance of bug localization, and it has good advantages in mean reciprocal rank(MRR), mean average precision(MAP) and Top N Rank.展开更多
最长公共子序列(longest common subsequence,LCS)是一种衡量代码相似度的可行指标.然而,经典LCS算法的时间复杂度较高,难以应对大型数据集,并且,由于代码文本序列中的词(token)本质为一种基于离散表示的编码,直接使用LCS算法无法有效...最长公共子序列(longest common subsequence,LCS)是一种衡量代码相似度的可行指标.然而,经典LCS算法的时间复杂度较高,难以应对大型数据集,并且,由于代码文本序列中的词(token)本质为一种基于离散表示的编码,直接使用LCS算法无法有效识别文本不同但语义相似的代码片段中的关键语义.针对这两方面的不足,提出一种面向LCS的嵌入方法,将代码间的LCS计算转换为代码低维稠密嵌入向量间的数值运算,并可以利用近似最近邻算法进一步加速其计算.为此,设计了一个可嵌入的基于LCS的距离度量方法,实验证明这种代码度量在提取函数关键语义的表现上优于对比嵌入工具使用的基于文本的距离或基于树的距离.同时,为了在嵌入过程中有重点地保留代码的关键语义,构建了两种损失函数和相应的训练集,识别文本上不同但语义上相似的代码元素,使模型在检测复杂代码克隆时有更好的表现.实验证明了该方法拥有很强的可扩展性,且其对复杂克隆的检测能力也保持在很高水平.将该技术应用于相似缺陷的识别,上报了23个未知缺陷,这些缺陷已被开发人员在实际项目中确认,其中有些复杂缺陷是难以被基于文本的LCS算法检出的.展开更多
基金supported by the National Key R&D Program of China (2018YFB1702700)。
文摘The existing software bug localization models treat the source file as natural language, which leads to the loss of syntactical and structure information of the source file. A bug localization model based on syntactical and semantic information of source code is proposed. Firstly, abstract syntax tree(AST) is divided based on node category to obtain statement sequence. The statement tree is encoded into vectors to capture lexical and syntactical knowledge at the statement level.Secondly, the source code is transformed into vector representation by the sequence naturalness of the statement. Therefore,the problem of gradient vanishing and explosion caused by a large AST size is obviated when using AST to the represent source code. Finally, the correlation between bug reports and source files are comprehensively analyzed from three aspects of syntax, semantics and text to locate the buggy code. Experiments show that compared with other standard models, the proposed model improves the performance of bug localization, and it has good advantages in mean reciprocal rank(MRR), mean average precision(MAP) and Top N Rank.
文摘最长公共子序列(longest common subsequence,LCS)是一种衡量代码相似度的可行指标.然而,经典LCS算法的时间复杂度较高,难以应对大型数据集,并且,由于代码文本序列中的词(token)本质为一种基于离散表示的编码,直接使用LCS算法无法有效识别文本不同但语义相似的代码片段中的关键语义.针对这两方面的不足,提出一种面向LCS的嵌入方法,将代码间的LCS计算转换为代码低维稠密嵌入向量间的数值运算,并可以利用近似最近邻算法进一步加速其计算.为此,设计了一个可嵌入的基于LCS的距离度量方法,实验证明这种代码度量在提取函数关键语义的表现上优于对比嵌入工具使用的基于文本的距离或基于树的距离.同时,为了在嵌入过程中有重点地保留代码的关键语义,构建了两种损失函数和相应的训练集,识别文本上不同但语义上相似的代码元素,使模型在检测复杂代码克隆时有更好的表现.实验证明了该方法拥有很强的可扩展性,且其对复杂克隆的检测能力也保持在很高水平.将该技术应用于相似缺陷的识别,上报了23个未知缺陷,这些缺陷已被开发人员在实际项目中确认,其中有些复杂缺陷是难以被基于文本的LCS算法检出的.