Abstract
Software defect report repositories often receive large numbers of duplicate defect reports. To address this problem, this paper proposes a duplicate defect report detection model based on LDA-BERT. First, the output vector of the latent Dirichlet allocation (LDA) model is concatenated with the output vector of the BERT (Bidirectional Encoder Representations from Transformers) model to form a new model vector, combining the strength of the LDA topic model in topic identification with the strength of BERT in capturing contextual semantics. Second, to shorten detection time while preserving detection precision, a two-stage feature vector re-detection method is proposed, which extracts feature vectors a second time to balance precision against time. Finally, defect report repositories from large open-source projects are used as the experimental dataset to compare the proposed model with similar models. On the TOP-2000 metric of the experimental dataset, the model reaches a recall of 61.35% and a precision of 47.34%, improvements of 4.3% and 5.2%, respectively, over similar models. These results indicate that the proposed model is effective for duplicate defect report detection compared with existing methods.
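The abstract gives only a high-level description of the vector fusion and the two-stage re-detection, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes that each report already has a topic vector (standing in for LDA output) and a contextual vector (standing in for BERT output), that the two are L2-normalized before concatenation, that similarity is cosine similarity, and that the first stage shortlists candidates using only the topic part before the second stage re-ranks them with the full fused vector. The names `fuse`, `two_stage_search`, `coarse_dims`, and `shortlist_size` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(topic_vec, context_vec):
    """Concatenate a topic vector and a contextual vector into one vector.

    Assumption: both parts are normalized first so that neither dominates.
    """
    return l2_normalize(topic_vec) + l2_normalize(context_vec)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_search(query, corpus, coarse_dims, shortlist_size, top_k):
    """Return indices of the top_k most similar corpus vectors.

    Stage 1 (assumed): cheap ranking on the first `coarse_dims` components.
    Stage 2 (assumed): re-rank only the shortlist with the full vectors.
    """
    shortlist = sorted(
        range(len(corpus)),
        key=lambda i: cosine(query[:coarse_dims], corpus[i][:coarse_dims]),
        reverse=True,
    )[:shortlist_size]
    reranked = sorted(shortlist, key=lambda i: cosine(query, corpus[i]), reverse=True)
    return reranked[:top_k]


# Toy stand-in vectors; real ones would come from trained LDA and BERT models.
topic_vectors = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.4]]
context_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
corpus = [fuse(t, c) for t, c in zip(topic_vectors, context_vectors)]
query = fuse([0.85, 0.15], [0.95, 0.05, 0.0])
candidates = two_stage_search(query, corpus, coarse_dims=2, shortlist_size=2, top_k=1)
```

In this sketch the shortlist size controls the trade-off the abstract mentions: a smaller shortlist means fewer full-vector comparisons, at the risk of discarding a true duplicate in the first stage.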
Authors
CUI Meng-tian; YANG Shan-kuang; YUAN Qi-hang (School of Computer Science and Engineering, Southwest Minzu University, Chengdu 610041, China)
Source
Journal of Southwest Minzu University (Natural Science Edition) (《西南民族大学学报(自然科学版)》)
CAS
2023, No. 4, pp. 414-423 (10 pages)
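Funding
Sichuan Science and Technology Program (2023YFH0057)
High-End Foreign Experts Recruitment Program of the Ministry of Science and Technology (G2022186003L).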