Abstract
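[Objective] To identify topics in users' historical Q&A texts in a way that accounts for contextual semantic information, and thereby improve the accuracy of expert recommendation in Q&A communities. [Methods] We build the BERT-LLDA model, which combines the BERT model with the Labeled-LDA topic model and makes full use of label information to vectorize users' historical Q&A texts. Dimension reduction and topic clustering then yield context-aware topic identification and each user's probability distribution over topic interests. Based on the topic interest mining results, we construct a Topic-Sensitive PageRank algorithm (TSPR) and add user quality weights to compute users' domain authority iteratively. On this basis we obtain TIDARank, an expert recommendation algorithm for Q&A communities that considers both topic interest and domain authority and recommends potential expert answerers for new questions. [Results] On the Stack Exchange public dataset, the BERT-LLDA model after topic clustering achieves a higher silhouette coefficient (0.5756) and topic coherence (0.4766) than the TF-IDF, BERT, and BERT-LDA baseline models. TIDARank achieves a best-answerer hit rate ACC@20 of 0.5807 and a mean reciprocal rank MRR@20 of 0.2430, improvements of 0.145 and 0.081, respectively, over the best-performing baseline, BiLSTM+TSPR. [Limitations] User activity is not considered in the link analysis. [Conclusions] The BERT-LLDA model not only improves topic clustering but also helps improve the performance of expert recommendation in Q&A communities.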
[Objective] This paper aims to enhance the accuracy of expert recommendations in Q&A communities based on topics of users’ historical Q&A texts and contextual information. [Methods] First, we combined the BERT model with the Labeled-LDA model. Then, we utilized the label information to vectorize users’ historical Q&A texts. Third, we identified contextual topics with dimension reduction and topic clustering. We also obtained the probability distribution of each user’s topic interests. Fourth, based on the results of topic interest mining, we constructed the Topic-Sensitive PageRank algorithm (TSPR). We used the users’ quality weight to calculate their domain authority iteratively. From this, we proposed the TIDARank algorithm for expert recommendation. [Results] Based on the Stack Exchange public dataset, the BERT-LLDA model outperformed the TF-IDF, BERT, and BERT-LDA models on silhouette coefficient (0.5756) and topic coherence (0.4766). The ACC@20 and MRR@20 of TIDARank reached 0.5807 and 0.2430, respectively, improvements of 0.145 and 0.081 over the best-performing baseline algorithm, Bi-LSTM+TSPR. [Limitations] We did not consider user activity in link analysis. [Conclusions] The BERT-LLDA model could optimize topic clustering for question-answering texts and improve the performance of expert recommendations in Q&A communities.
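For readers unfamiliar with the two ranking metrics reported above, the following is a minimal sketch of how ACC@k and MRR@k are commonly computed for best-answerer recommendation. It assumes the usual definitions (a hit when the best answerer appears in the top k, and a reciprocal rank of zero when it does not); the abstract does not spell out the exact formulas, and the function names `acc_at_k` and `mrr_at_k` and the sample user IDs are hypothetical.

```python
from typing import Hashable, Sequence


def acc_at_k(rankings: Sequence[Sequence[Hashable]],
             best_answerers: Sequence[Hashable],
             k: int = 20) -> float:
    """Fraction of questions whose best answerer appears in the top-k list."""
    if not rankings or len(rankings) != len(best_answerers):
        raise ValueError("rankings and best_answerers must be non-empty and aligned")
    hits = sum(1 for ranked, best in zip(rankings, best_answerers)
               if best in ranked[:k])
    return hits / len(rankings)


def mrr_at_k(rankings: Sequence[Sequence[Hashable]],
             best_answerers: Sequence[Hashable],
             k: int = 20) -> float:
    """Mean of 1/rank of the best answerer, counting 0 when outside the top k."""
    if not rankings or len(rankings) != len(best_answerers):
        raise ValueError("rankings and best_answerers must be non-empty and aligned")
    total = 0.0
    for ranked, best in zip(rankings, best_answerers):
        top_k = list(ranked[:k])
        if best in top_k:
            total += 1.0 / (top_k.index(best) + 1)  # ranks are 1-based
    return total / len(rankings)


# Hypothetical recommendation lists for three questions (assumed data).
rankings = [["u1", "u2", "u3"], ["u4", "u5", "u6"], ["u7", "u8", "u9"]]
best = ["u2", "u4", "ux"]
print(acc_at_k(rankings, best, k=2), mrr_at_k(rankings, best, k=2))
```

Under these definitions, MRR@k can never exceed ACC@k, because each hit contributes at most 1 to the reciprocal-rank sum; this is consistent with the reported values of 0.2430 and 0.5807.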
Authors
李明珠
米传民
苟小义
肖琳
Li Mingzhu; Mi Chuanmin; Gou Xiaoyi; Xiao Lin (College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China)
Source
《数据分析与知识发现》
EI
CSCD
PKU Core (Peking University Core Journals)
2024, Issue 5, pp. 68-79 (12 pages)
Data Analysis and Knowledge Discovery
Funding
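One of the research outcomes of a Humanities and Social Sciences Fund project of the Ministry of Education of China (Grant No. 20YJC630163).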