Abstract
Combining the structural information of a document with the semantic information of external words, this paper proposes a keyword extraction method that fuses Bidirectional Encoder Representations from Transformers (BERT) word vectors with TextRank. Building on the graph-based TextRank method, it introduces semantic difference and uses BERT word-vector weighting to optimize the computation of the TextRank transition probability matrix. The words in the document are then ranked by their overall influence scores through iterative computation, and the Top N highest-scoring words are extracted as keywords. Experimental results show that, when the Top 3, Top 5, Top 7 and Top 10 keywords are selected, the average F value of the proposed method is 2.5% higher than that of a keyword extraction method based on word-vector clustering centroids and TextRank weighting, and keyword extraction is more efficient.
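The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that "semantic difference" can be approximated by 1 minus the cosine similarity of two word vectors, that edges come from a sliding co-occurrence window, and that the vectors are supplied by the caller (in practice they would come from a BERT model; here they are random placeholders). The names `transition_matrix` and `rank_words` and the parameters `window`, `damping`, `tol` and `max_iter` are hypothetical.

```python
import numpy as np


def transition_matrix(words, vectors, window=2):
    """Build a column-stochastic transition matrix from co-occurrence counts
    scaled by semantic difference (ASSUMED to be 1 - cosine similarity)."""
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)

    # Undirected co-occurrence counts within a sliding window.
    cooc = np.zeros((n, n))
    for pos, word in enumerate(words):
        for other in words[pos + 1 : pos + 1 + window]:
            a, b = index[word], index[other]
            if a != b:
                cooc[a, b] += 1.0
                cooc[b, a] += 1.0

    # Cosine similarity between word vectors, clipped to [-1, 1].
    vecs = np.array([vectors[w] for w in vocab], dtype=float)
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    unit = vecs / np.clip(norms, 1e-12, None)
    similarity = np.clip(unit @ unit.T, -1.0, 1.0)

    # Assumption: edge weight = co-occurrence count * semantic difference.
    weights = cooc * (1.0 - similarity)

    # Row-normalize; words with no weighted edges jump uniformly.
    row_sums = weights.sum(axis=1, keepdims=True)
    safe = np.where(row_sums > 0, row_sums, 1.0)
    probs = np.where(row_sums > 0, weights / safe, 1.0 / n)
    return vocab, probs.T  # transpose so that each column sums to 1


def rank_words(words, vectors, top_n=5, damping=0.85, tol=1e-8, max_iter=200):
    """Iterate the damped TextRank update and return the top_n (word, score) pairs."""
    vocab, matrix = transition_matrix(words, vectors)
    n = len(vocab)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = (1.0 - damping) / n + damping * (matrix @ scores)
        converged = np.abs(updated - scores).sum() < tol
        scores = updated
        if converged:
            break
    order = np.argsort(-scores)
    return [(vocab[i], float(scores[i])) for i in order[:top_n]]


# Toy usage: random vectors stand in for BERT word vectors.
tokens = "keyword extraction graph ranking keyword semantic vector graph".split()
rng = np.random.default_rng(0)
toy_vectors = {w: rng.normal(size=8) for w in sorted(set(tokens))}
top_keywords = rank_words(tokens, toy_vectors, top_n=3)
```

Because each column of the matrix sums to 1, the scores stay a probability distribution across iterations, which makes the Top N cut-off comparable between documents of different lengths.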
Authors
LI Jun; Lü Xueqiang (Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China)
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
Peking University Core Journals (北大核心)
2020, Issue 9, pp. 89-94 (6 pages)
Funding
National Natural Science Foundation of China (61671070)
Key Research Project of the State Language Commission of China (ZDI135-53)